Fix pdf_oxide parser text extraction issues with preprocessing#31
Fix pdf_oxide parser text extraction issues with preprocessing#31
Conversation
During review, it was identified that the `pdf_oxide` parser had a 0% success rate in benchmarks due to "smashed" text output (missing newlines between fields). This change implements `preprocess_text` in `pdfparser/pdfoxide_parser.py` to: 1. Separate timestamps from descriptions. 2. Fix smashed amounts (e.g., `100.00200.00` -> `100.00\n200.00`). 3. Separate User IDs from descriptions. 4. Ensure metadata headers start on new lines. Added comprehensive tests in `tests/test_pdfoxide_preprocessing.py`. Verified that existing tests pass and linters are clean. Co-authored-by: rager306 <248269686+rager306@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
WalkthroughAdded a new Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@pdfparser/pdfoxide_parser.py`:
- Around line 47-63: The alternation in pdfoxide_parser.py uses headers list so
shorter substrings like "Unit Kerja" match before longer ones like "Alamat Unit
Kerja"; modify the code that builds headers_pattern (working with the headers
variable used to create headers_pattern, pattern, and text) to sort headers by
length descending (longer headers first) before escaping and joining so longer
headers are tried before their substrings, preventing mid-header newline
insertion.
In `@tests/test_pdfoxide_preprocessing.py`:
- Around line 40-49: Add a unit test named
test_separates_alamat_unit_kerja_header that exercises preprocess_text to ensure
the full header "Alamat Unit Kerja" is kept together and prefixed by a newline
rather than being split by the shorter alternation "Unit Kerja"; assert that the
processed string contains "\nAlamat Unit Kerja" and does NOT contain the broken
form "Alamat \nUnit Kerja". This will expose the alternation-order bug and
validate the regex behavior in preprocess_text.
🧹 Nitpick comments (3)
pdfparser/pdfoxide_parser.py (2)
30-32: Timestamp regex:\s*collapses all trailing whitespace (including newlines) into a single\n.This is fine for the stated preprocessing goal, but be aware that if a timestamp is followed by, say, a blank line (
\n\n), that blank line will be collapsed. If blank lines are meaningful delimiters downstream, this could cause subtle issues.
34-38: Two passes handles chains of up to 3 smashed amounts; longer chains would need more passes.For bank statement rows (typically debit + credit + balance = 3 amounts) this is sufficient. If you ever need to handle longer chains, consider a
while-loop approach:prev = None while prev != text: prev = text text = re.sub(amount_separator, r"\1\n\2", text)Not urgent given the known data shape.
tests/test_pdfoxide_preprocessing.py (1)
75-78: Idempotence test is valuable but could be more thorough.The current test uses an already-correctly-formatted string, which is great. Consider adding a round-trip assertion: preprocessing the output of
test_combined_smashed_transactionshould produce no further changes.
| headers = [ | ||
| "Unit Kerja", | ||
| "Business Unit", | ||
| "Nama Produk", | ||
| "Product Name", | ||
| "Statement Date", | ||
| "Valuta", | ||
| "Currency", | ||
| "Periode Transaksi", | ||
| "Transaction Period", | ||
| "Alamat Unit Kerja", | ||
| "Business Unit Address", | ||
| ] | ||
| # Escape headers for regex safety (though these are simple strings) | ||
| headers_pattern = "|".join(re.escape(h) for h in headers) | ||
| pattern = r"(?i)(?<!\n)(" + headers_pattern + ")" | ||
| text = re.sub(pattern, r"\n\1", text) |
There was a problem hiding this comment.
Alternation order causes "Alamat Unit Kerja" to be split incorrectly.
"Unit Kerja" appears before "Alamat Unit Kerja" in the alternation. When the input contains "Alamat Unit Kerja", the regex matches the shorter "Unit Kerja" substring first and inserts a newline inside the longer header, producing "Alamat \nUnit Kerja" instead of "\nAlamat Unit Kerja".
The same pattern applies to "Business Unit" / "Business Unit Address", although that one happens to work by coincidence (the newline lands at the right spot). Relying on this is fragile.
Fix: sort the headers list so longer strings come first, ensuring longer matches are tried before their substrings.
🐛 Proposed fix — order longer headers first
headers = [
- "Unit Kerja",
- "Business Unit",
- "Nama Produk",
- "Product Name",
- "Statement Date",
- "Valuta",
- "Currency",
- "Periode Transaksi",
- "Transaction Period",
"Alamat Unit Kerja",
"Business Unit Address",
+ "Unit Kerja",
+ "Business Unit",
+ "Periode Transaksi",
+ "Transaction Period",
+ "Statement Date",
+ "Nama Produk",
+ "Product Name",
+ "Valuta",
+ "Currency",
]Alternatively, sort programmatically so this stays correct as headers are added:
headers.sort(key=len, reverse=True)🤖 Prompt for AI Agents
In `@pdfparser/pdfoxide_parser.py` around lines 47 - 63, The alternation in
pdfoxide_parser.py uses headers list so shorter substrings like "Unit Kerja"
match before longer ones like "Alamat Unit Kerja"; modify the code that builds
headers_pattern (working with the headers variable used to create
headers_pattern, pattern, and text) to sort headers by length descending (longer
headers first) before escaping and joining so longer headers are tried before
their substrings, preventing mid-header newline insertion.
| def test_separates_metadata_headers(self): | ||
| """Verify metadata headers are forced to new lines.""" | ||
| raw = "Product NameBusiness Unit Address Valuta: IDR" | ||
| expected = "\nProduct Name\nBusiness Unit Address \nValuta: IDR" | ||
| # The regex inserts newline before header. | ||
| # "Product Name" -> "\nProduct Name" | ||
| # "Business Unit Address" -> "\nBusiness Unit Address" | ||
| # "Valuta" -> "\nValuta" | ||
|
|
||
| assert preprocess_text(raw) == expected |
There was a problem hiding this comment.
Missing test case for "Alamat Unit Kerja" header — would expose the alternation-order bug.
This test only covers "Business Unit Address" which coincidentally works despite "Business Unit" being earlier in the alternation (since the newline lands at the right spot). Adding a test for "Alamat Unit Kerja" would reveal the split issue flagged in the parser review.
def test_separates_alamat_unit_kerja_header(self):
"""Verify 'Alamat Unit Kerja' is not split by 'Unit Kerja' match."""
raw = "SomeText Alamat Unit Kerja: Jl. Example"
processed = preprocess_text(raw)
# Should NOT produce "SomeText Alamat \nUnit Kerja: Jl. Example"
assert "Alamat \nUnit Kerja" not in processed
assert "\nAlamat Unit Kerja" in processed🤖 Prompt for AI Agents
In `@tests/test_pdfoxide_preprocessing.py` around lines 40 - 49, Add a unit test
named test_separates_alamat_unit_kerja_header that exercises preprocess_text to
ensure the full header "Alamat Unit Kerja" is kept together and prefixed by a
newline rather than being split by the shorter alternation "Unit Kerja"; assert
that the processed string contains "\nAlamat Unit Kerja" and does NOT contain
the broken form "Alamat \nUnit Kerja". This will expose the alternation-order
bug and validate the regex behavior in preprocess_text.
Implemented text preprocessing for pdf_oxide parser to fix formatting issues causing extraction failures. Added unit tests for the new preprocessing logic.
PR created automatically by Jules for task 3365727040001916273 started by @rager306
Summary by CodeRabbit
New Features
Tests