Skip to content

Fix pdf_oxide parser text extraction issues with preprocessing#31

Open
rager306 wants to merge 1 commit intomainfrom
pdfoxide-fix-review-3365727040001916273
Open

Fix pdf_oxide parser text extraction issues with preprocessing#31
rager306 wants to merge 1 commit intomainfrom
pdfoxide-fix-review-3365727040001916273

Conversation

@rager306
Copy link
Copy Markdown
Owner

@rager306 rager306 commented Feb 6, 2026

Implemented text preprocessing for pdf_oxide parser to fix formatting issues causing extraction failures. Added unit tests for the new preprocessing logic.


PR created automatically by Jules for task 3365727040001916273 started by @rager306

Summary by CodeRabbit

  • New Features

    • Enhanced PDF text parsing with improved handling of timestamps, monetary amounts, and user identifiers for more accurate data extraction.
    • Better automatic formatting of metadata headers in parsed documents.
  • Tests

    • Added comprehensive test coverage for PDF text preprocessing functionality.

During review, it was identified that the `pdf_oxide` parser had a 0% success rate in benchmarks due to "smashed" text output (missing newlines between fields).

This change implements `preprocess_text` in `pdfparser/pdfoxide_parser.py` to:
1.  Separate timestamps from descriptions.
2.  Fix smashed amounts (e.g., `100.00200.00` -> `100.00\n200.00`).
3.  Separate User IDs from descriptions.
4.  Ensure metadata headers start on new lines.

Added comprehensive tests in `tests/test_pdfoxide_preprocessing.py`.
Verified that existing tests pass and linters are clean.

Co-authored-by: rager306 <248269686+rager306@users.noreply.github.com>
@google-labs-jules
Copy link
Copy Markdown

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 6, 2026

Walkthrough

Added a new preprocess_text() utility function in the pdfoxide parser module to handle multi-step text preprocessing including timestamp separation, monetary amount splitting, user ID isolation, and metadata header formatting. The function is integrated into the existing PDF parsing workflow and covered by comprehensive unit tests.

Changes

Cohort / File(s) Summary
Text Preprocessing Implementation
pdfparser/pdfoxide_parser.py
Added preprocess_text() module-level function with regex-based text transformations. Imported re module at file level. Integrated preprocessing into parse_pdf_pdfoxide() to apply transformations to header text and page text before extraction. Removed redundant local re import from fallback path.
Preprocessing Test Suite
tests/test_pdfoxide_preprocessing.py
New test module with TestPdfoxidePreprocessing class containing seven test methods validating timestamp separation, smashed amount splitting, user ID isolation, metadata header formatting, combined scenarios, empty input handling, and idempotent behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding text preprocessing to fix extraction issues in the pdf_oxide parser, which aligns with the core objective of implementing a preprocess_text function.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pdfoxide-fix-review-3365727040001916273

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@pdfparser/pdfoxide_parser.py`:
- Around line 47-63: The alternation in pdfoxide_parser.py uses headers list so
shorter substrings like "Unit Kerja" match before longer ones like "Alamat Unit
Kerja"; modify the code that builds headers_pattern (working with the headers
variable used to create headers_pattern, pattern, and text) to sort headers by
length descending (longer headers first) before escaping and joining so longer
headers are tried before their substrings, preventing mid-header newline
insertion.

In `@tests/test_pdfoxide_preprocessing.py`:
- Around line 40-49: Add a unit test named
test_separates_alamat_unit_kerja_header that exercises preprocess_text to ensure
the full header "Alamat Unit Kerja" is kept together and prefixed by a newline
rather than being split by the shorter alternation "Unit Kerja"; assert that the
processed string contains "\nAlamat Unit Kerja" and does NOT contain the broken
form "Alamat \nUnit Kerja". This will expose the alternation-order bug and
validate the regex behavior in preprocess_text.
🧹 Nitpick comments (3)
pdfparser/pdfoxide_parser.py (2)

30-32: Timestamp regex: \s* collapses all trailing whitespace (including newlines) into a single \n.

This is fine for the stated preprocessing goal, but be aware that if a timestamp is followed by, say, a blank line (\n\n), that blank line will be collapsed. If blank lines are meaningful delimiters downstream, this could cause subtle issues.


34-38: Two passes handles chains of up to 3 smashed amounts; longer chains would need more passes.

For bank statement rows (typically debit + credit + balance = 3 amounts) this is sufficient. If you ever need to handle longer chains, consider a while-loop approach:

prev = None
while prev != text:
    prev = text
    text = re.sub(amount_separator, r"\1\n\2", text)

Not urgent given the known data shape.

tests/test_pdfoxide_preprocessing.py (1)

75-78: Idempotence test is valuable but could be more thorough.

The current test uses an already-correctly-formatted string, which is great. Consider adding a round-trip assertion: preprocessing the output of test_combined_smashed_transaction should produce no further changes.

Comment on lines +47 to +63
headers = [
"Unit Kerja",
"Business Unit",
"Nama Produk",
"Product Name",
"Statement Date",
"Valuta",
"Currency",
"Periode Transaksi",
"Transaction Period",
"Alamat Unit Kerja",
"Business Unit Address",
]
# Escape headers for regex safety (though these are simple strings)
headers_pattern = "|".join(re.escape(h) for h in headers)
pattern = r"(?i)(?<!\n)(" + headers_pattern + ")"
text = re.sub(pattern, r"\n\1", text)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Alternation order causes "Alamat Unit Kerja" to be split incorrectly.

"Unit Kerja" appears before "Alamat Unit Kerja" in the alternation. When the input contains "Alamat Unit Kerja", the regex matches the shorter "Unit Kerja" substring first and inserts a newline inside the longer header, producing "Alamat \nUnit Kerja" instead of "\nAlamat Unit Kerja".

The same pattern applies to "Business Unit" / "Business Unit Address", although that one happens to work by coincidence (the newline lands at the right spot). Relying on this is fragile.

Fix: sort the headers list so longer strings come first, ensuring longer matches are tried before their substrings.

🐛 Proposed fix — order longer headers first
     headers = [
-        "Unit Kerja",
-        "Business Unit",
-        "Nama Produk",
-        "Product Name",
-        "Statement Date",
-        "Valuta",
-        "Currency",
-        "Periode Transaksi",
-        "Transaction Period",
         "Alamat Unit Kerja",
         "Business Unit Address",
+        "Unit Kerja",
+        "Business Unit",
+        "Periode Transaksi",
+        "Transaction Period",
+        "Statement Date",
+        "Nama Produk",
+        "Product Name",
+        "Valuta",
+        "Currency",
     ]

Alternatively, sort programmatically so this stays correct as headers are added:

headers.sort(key=len, reverse=True)
🤖 Prompt for AI Agents
In `@pdfparser/pdfoxide_parser.py` around lines 47 - 63, The alternation in
pdfoxide_parser.py uses headers list so shorter substrings like "Unit Kerja"
match before longer ones like "Alamat Unit Kerja"; modify the code that builds
headers_pattern (working with the headers variable used to create
headers_pattern, pattern, and text) to sort headers by length descending (longer
headers first) before escaping and joining so longer headers are tried before
their substrings, preventing mid-header newline insertion.

Comment on lines +40 to +49
def test_separates_metadata_headers(self):
"""Verify metadata headers are forced to new lines."""
raw = "Product NameBusiness Unit Address Valuta: IDR"
expected = "\nProduct Name\nBusiness Unit Address \nValuta: IDR"
# The regex inserts newline before header.
# "Product Name" -> "\nProduct Name"
# "Business Unit Address" -> "\nBusiness Unit Address"
# "Valuta" -> "\nValuta"

assert preprocess_text(raw) == expected
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing test case for "Alamat Unit Kerja" header — would expose the alternation-order bug.

This test only covers "Business Unit Address" which coincidentally works despite "Business Unit" being earlier in the alternation (since the newline lands at the right spot). Adding a test for "Alamat Unit Kerja" would reveal the split issue flagged in the parser review.

def test_separates_alamat_unit_kerja_header(self):
    """Verify 'Alamat Unit Kerja' is not split by 'Unit Kerja' match."""
    raw = "SomeText Alamat Unit Kerja: Jl. Example"
    processed = preprocess_text(raw)
    # Should NOT produce "SomeText Alamat \nUnit Kerja: Jl. Example"
    assert "Alamat \nUnit Kerja" not in processed
    assert "\nAlamat Unit Kerja" in processed
🤖 Prompt for AI Agents
In `@tests/test_pdfoxide_preprocessing.py` around lines 40 - 49, Add a unit test
named test_separates_alamat_unit_kerja_header that exercises preprocess_text to
ensure the full header "Alamat Unit Kerja" is kept together and prefixed by a
newline rather than being split by the shorter alternation "Unit Kerja"; assert
that the processed string contains "\nAlamat Unit Kerja" and does NOT contain
the broken form "Alamat \nUnit Kerja". This will expose the alternation-order
bug and validate the regex behavior in preprocess_text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant