Fix pdf_oxide parser text extraction issues with preprocessing by rager306 · Pull Request #31 · rager306/b-pdf-parser

rager306 · 2026-02-06T14:58:12Z

Implemented text preprocessing for pdf_oxide parser to fix formatting issues causing extraction failures. Added unit tests for the new preprocessing logic.

PR created automatically by Jules for task 3365727040001916273 started by @rager306

Summary by CodeRabbit

New Features
- Enhanced PDF text parsing with improved handling of timestamps, monetary amounts, and user identifiers for more accurate data extraction.
- Better automatic formatting of metadata headers in parsed documents.
Tests
- Added comprehensive test coverage for PDF text preprocessing functionality.

During review, it was identified that the `pdf_oxide` parser had a 0% success rate in benchmarks due to "smashed" text output (missing newlines between fields). This change implements `preprocess_text` in `pdfparser/pdfoxide_parser.py` to:

1. Separate timestamps from descriptions.
2. Fix smashed amounts (e.g., `100.00200.00` -> `100.00\n200.00`).
3. Separate User IDs from descriptions.
4. Ensure metadata headers start on new lines.

Added comprehensive tests in `tests/test_pdfoxide_preprocessing.py`. Verified that existing tests pass and linters are clean.

Co-authored-by: rager306 <248269686+rager306@users.noreply.github.com>
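The smashed-amount step described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the pattern and the name `split_smashed_amounts` are assumptions, since the actual regex in `preprocess_text` is not quoted in this thread.

```python
import re

# Hypothetical pattern: two decimal amounts (two fraction digits each) with
# nothing between them. The real pattern used by the PR is not shown here.
SMASHED_AMOUNTS = re.compile(r"(\d+\.\d{2})(\d+\.\d{2})")


def split_smashed_amounts(text: str) -> str:
    """Insert a newline between two amounts that were extracted back to back."""
    return SMASHED_AMOUNTS.sub(r"\1\n\2", text)


if __name__ == "__main__":
    # Example input taken from the commit message.
    print(repr(split_smashed_amounts("100.00200.00")))
```

The sketch only covers plain `digits.dd` amounts; thousands separators or other locale formats would need a different pattern.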

google-labs-jules · 2026-02-06T14:58:14Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-02-06T14:58:33Z

Walkthrough

Added a new preprocess_text() utility function in the pdfoxide parser module to handle multi-step text preprocessing including timestamp separation, monetary amount splitting, user ID isolation, and metadata header formatting. The function is integrated into the existing PDF parsing workflow and covered by comprehensive unit tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Text Preprocessing Implementation `pdfparser/pdfoxide_parser.py` | Added `preprocess_text()` module-level function with regex-based text transformations. Imported `re` module at file level. Integrated preprocessing into `parse_pdf_pdfoxide()` to apply transformations to header text and page text before extraction. Removed redundant local `re` import from fallback path. |
| Preprocessing Test Suite `tests/test_pdfoxide_preprocessing.py` | New test module with `TestPdfoxidePreprocessing` class containing seven test methods validating timestamp separation, smashed amount splitting, user ID isolation, metadata header formatting, combined scenarios, empty input handling, and idempotent behavior. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding text preprocessing to fix extraction issues in the pdf_oxide parser, which aligns with the core objective of implementing a preprocess_text function. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@pdfparser/pdfoxide_parser.py`:
- Around line 47-63: The alternation in pdfoxide_parser.py uses headers list so
shorter substrings like "Unit Kerja" match before longer ones like "Alamat Unit
Kerja"; modify the code that builds headers_pattern (working with the headers
variable used to create headers_pattern, pattern, and text) to sort headers by
length descending (longer headers first) before escaping and joining so longer
headers are tried before their substrings, preventing mid-header newline
insertion.

In `@tests/test_pdfoxide_preprocessing.py`:
- Around line 40-49: Add a unit test named
test_separates_alamat_unit_kerja_header that exercises preprocess_text to ensure
the full header "Alamat Unit Kerja" is kept together and prefixed by a newline
rather than being split by the shorter alternation "Unit Kerja"; assert that the
processed string contains "\nAlamat Unit Kerja" and does NOT contain the broken
form "Alamat \nUnit Kerja". This will expose the alternation-order bug and
validate the regex behavior in preprocess_text.

🧹 Nitpick comments (3)

pdfparser/pdfoxide_parser.py (2)
30-32: Timestamp regex: `\s*` collapses all trailing whitespace (including newlines) into a single `\n`.

This is fine for the stated preprocessing goal, but be aware that if a timestamp is followed by, say, a blank line (\n\n), that blank line will be collapsed. If blank lines are meaningful delimiters downstream, this could cause subtle issues.
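The collapsing effect is easy to reproduce in isolation. The sketch below assumes a `DD/MM/YY HH:MM:SS` timestamp shape and a helper named `separate_timestamp`; both are illustrative, because the PR's actual timestamp regex is not quoted in this comment.

```python
import re

# Hypothetical timestamp shape; only the trailing \s* mirrors the behaviour
# discussed in the review comment.
TIMESTAMP = re.compile(r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s*")


def separate_timestamp(text: str) -> str:
    """Replace any whitespace run after a timestamp with exactly one newline."""
    return TIMESTAMP.sub(r"\1\n", text)


if __name__ == "__main__":
    smashed = "06/02/26 14:58:12Transfer"
    with_blank_line = "06/02/26 14:58:12\n\nTransfer"
    # Both inputs end up with a single newline after the timestamp, so the
    # blank line in the second input does not survive.
    print(repr(separate_timestamp(smashed)))
    print(repr(separate_timestamp(with_blank_line)))
```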

34-38: Two passes handle chains of up to 3 smashed amounts; longer chains would need more passes.

For bank statement rows (typically debit + credit + balance = 3 amounts) this is sufficient. If you ever need to handle longer chains, consider a while-loop approach:
prev = None
while prev != text:
    prev = text
    text = re.sub(amount_separator, r"\1\n\2", text)
Not urgent given the known data shape.
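A self-contained version of that loop might look like the sketch below. The pattern bound to `AMOUNT_SEPARATOR` and the function name `split_amount_chain` are assumptions made for illustration; how many passes a given chain needs depends on the real pattern in the PR, which is not shown in this thread.

```python
import re

# Hypothetical stand-in for the PR's amount_separator pattern.
AMOUNT_SEPARATOR = re.compile(r"(\d+\.\d{2})(\d+\.\d{2})")


def split_amount_chain(text: str) -> str:
    """Re-apply the substitution until a pass leaves the text unchanged."""
    prev = None
    while prev != text:
        prev = text
        text = AMOUNT_SEPARATOR.sub(r"\1\n\2", text)
    return text


if __name__ == "__main__":
    # Four amounts extracted with no separators between them.
    print(repr(split_amount_chain("1.002.003.004.00")))
```

Running to a fixed point removes the need to reason about a pass count, at the cost of one extra pass that only confirms nothing changed.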
tests/test_pdfoxide_preprocessing.py (1)

75-78: Idempotence test is valuable but could be more thorough.

The current test uses an already-correctly-formatted string, which is great. Consider adding a round-trip assertion: preprocessing the output of test_combined_smashed_transaction should produce no further changes.
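One possible shape for such a round-trip check is a small helper that applies a transform twice and compares the results. The helper name `assert_idempotent` is hypothetical; in the test module it would presumably be called with `preprocess_text` and the raw input from `test_combined_smashed_transaction`.

```python
from typing import Callable


def assert_idempotent(transform: Callable[[str], str], raw: str) -> str:
    """Apply transform twice and fail if the second pass changes the output."""
    once = transform(raw)
    twice = transform(once)
    assert once == twice, f"second pass changed output: {once!r} -> {twice!r}"
    return once


if __name__ == "__main__":
    # str.strip is used only as a stand-in for preprocess_text.
    print(repr(assert_idempotent(str.strip, "  already clean  ")))
```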

coderabbitai · 2026-02-06T15:02:15Z

+    headers = [
+        "Unit Kerja",
+        "Business Unit",
+        "Nama Produk",
+        "Product Name",
+        "Statement Date",
+        "Valuta",
+        "Currency",
+        "Periode Transaksi",
+        "Transaction Period",
+        "Alamat Unit Kerja",
+        "Business Unit Address",
+    ]
+    # Escape headers for regex safety (though these are simple strings)
+    headers_pattern = "|".join(re.escape(h) for h in headers)
+    pattern = r"(?i)(?<!\n)(" + headers_pattern + ")"
+    text = re.sub(pattern, r"\n\1", text)


⚠️ Potential issue | 🟠 Major

Alternation order causes "Alamat Unit Kerja" to be split incorrectly.

"Unit Kerja" appears before "Alamat Unit Kerja" in the alternation. When the input contains "Alamat Unit Kerja", the regex matches the shorter "Unit Kerja" substring first and inserts a newline inside the longer header, producing "Alamat \nUnit Kerja" instead of "\nAlamat Unit Kerja".

The same pattern applies to "Business Unit" / "Business Unit Address", although that one happens to work by coincidence (the newline lands at the right spot). Relying on this is fragile.

Fix: sort the headers list so longer strings come first, ensuring longer matches are tried before their substrings.

🐛 Proposed fix — order longer headers first

 headers = [
-    "Unit Kerja",
-    "Business Unit",
-    "Nama Produk",
-    "Product Name",
-    "Statement Date",
-    "Valuta",
-    "Currency",
-    "Periode Transaksi",
-    "Transaction Period",
     "Alamat Unit Kerja",
     "Business Unit Address",
+    "Unit Kerja",
+    "Business Unit",
+    "Periode Transaksi",
+    "Transaction Period",
+    "Statement Date",
+    "Nama Produk",
+    "Product Name",
+    "Valuta",
+    "Currency",
 ]

Alternatively, sort programmatically so this stays correct as headers are added:

headers.sort(key=len, reverse=True)
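To check the behaviour outside the parser, the header step can be rebuilt in isolation. The sketch below copies the header list and pattern from the quoted diff and applies the longest-first sort; the names `HEADERS`, `HEADER_PATTERN` and `break_before_headers` are illustrative and not taken from the PR.

```python
import re

# Header list copied from the quoted diff, sorted longest-first.
HEADERS = sorted(
    [
        "Unit Kerja",
        "Business Unit",
        "Nama Produk",
        "Product Name",
        "Statement Date",
        "Valuta",
        "Currency",
        "Periode Transaksi",
        "Transaction Period",
        "Alamat Unit Kerja",
        "Business Unit Address",
    ],
    key=len,
    reverse=True,
)
HEADER_PATTERN = re.compile(
    r"(?i)(?<!\n)(" + "|".join(re.escape(h) for h in HEADERS) + ")"
)


def break_before_headers(text: str) -> str:
    """Insert a newline before each known header not already preceded by one."""
    return HEADER_PATTERN.sub(r"\n\1", text)


if __name__ == "__main__":
    print(repr(break_before_headers("SomeText Alamat Unit Kerja: Jl. Example")))
    # Header that already starts a line.
    print(repr(break_before_headers("Header\nAlamat Unit Kerja: X")))
```

Python's `re` scans left to right and takes the first position where any alternative matches, so the ordering decides the winner only among headers that can start at the same position (for example "Business Unit" and "Business Unit Address"). In this sketch the second input still comes out as `"Header\nAlamat \nUnit Kerja: X"`: the lookbehind rejects a match at "Alamat" because a newline precedes it, and the scan then reaches the inner "Unit Kerja". That case may deserve its own test, independent of the sort.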


coderabbitai · 2026-02-06T15:02:15Z

+    def test_separates_metadata_headers(self):
+        """Verify metadata headers are forced to new lines."""
+        raw = "Product NameBusiness Unit Address Valuta: IDR"
+        expected = "\nProduct Name\nBusiness Unit Address \nValuta: IDR"
+        # The regex inserts newline before header.
+        # "Product Name" -> "\nProduct Name"
+        # "Business Unit Address" -> "\nBusiness Unit Address"
+        # "Valuta" -> "\nValuta"
+
+        assert preprocess_text(raw) == expected


⚠️ Potential issue | 🟠 Major

Missing test case for "Alamat Unit Kerja" header — would expose the alternation-order bug.

This test only covers "Business Unit Address" which coincidentally works despite "Business Unit" being earlier in the alternation (since the newline lands at the right spot). Adding a test for "Alamat Unit Kerja" would reveal the split issue flagged in the parser review.

    def test_separates_alamat_unit_kerja_header(self):
        """Verify 'Alamat Unit Kerja' is not split by 'Unit Kerja' match."""
        raw = "SomeText Alamat Unit Kerja: Jl. Example"
        processed = preprocess_text(raw)
        # Should NOT produce "SomeText Alamat \nUnit Kerja: Jl. Example"
        assert "Alamat \nUnit Kerja" not in processed
        assert "\nAlamat Unit Kerja" in processed


rager306 wants to merge 1 commit into main from pdfoxide-fix-review-3365727040001916273
