Releases: AryanBV/pdf-edit-engine
v0.2.0 — editing-depth (encrypted PDFs · CID-keyed CFF · CJK · honesty taxonomy)
Editing-depth release. v0.2.0 widens the set of PDFs the engine can edit faithfully — password-protected documents, CID-keyed CFF/Type1C fonts, and CJK text — and hardens the honesty contract: anything the engine can't do cleanly is now surfaced as a typed Degradation (or refused) instead of silently corrupting output. DegradationKinds: 13 → 30. The public API is backward-compatible.
📦 pip install pdf-edit-engine==0.2.0 — https://pypi.org/project/pdf-edit-engine/0.2.0/
Added
- Edit encrypted PDFs end-to-end: opened, edited, and re-saved with encryption preserved (new password= kwarg); honest encryption_dropped fallback when re-encryption isn't possible.
- CID-keyed CFF / Type1C in-place glyph injection: Identity-H CIDFonts with CFF outlines (not just TrueType glyf) can now be extended; collision-free CID == GID, pre-existing CIDs never renumbered.
- CJK / UAX#14 line-break: a spaceless CJK paragraph wider than its column now wraps at ideograph boundaries instead of silently overflowing; the Latin path is byte-identical.
- ToUnicode recovery: Type0/Identity-H fonts with no /ToUnicode map get their CID→Unicode recovered from the embedded cmap so text is locatable.
- Opt-in shrink-to-fit (fit="shrink") to fit text into a fixed-height region.
- Honesty-taxonomy DX surface: root-level exports (Degradation, DegradationKind, FONT_AFFECTING_KINDS, DEGRADATION_KINDS), FidelityReport.summary() / .is_clean / .max_severity / .warnings(), and EditResult.to_dict().
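To illustrate what wrapping "at ideograph boundaries" means, here is a minimal sketch of a greedy line breaker that treats every CJK Unified Ideograph as its own break opportunity. It is not the engine's implementation and covers only a sliver of UAX #14 (no punctuation rules, no kana classes). The function names and the one-character-equals-one-width-unit measure are assumptions made for the example.

```python
def is_ideograph(ch: str) -> bool:
    # Assumption: only the CJK Unified Ideographs block counts here.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_units(text: str) -> list[str]:
    """Split text into unbreakable units: single ideographs, or Latin
    runs that end at (and include) a trailing space."""
    units: list[str] = []
    current = ""
    for ch in text:
        if is_ideograph(ch):
            if current:
                units.append(current)
                current = ""
            units.append(ch)  # a break is allowed before and after it
        elif ch == " ":
            units.append(current + ch)
            current = ""
        else:
            current += ch
    if current:
        units.append(current)
    return units


def wrap(text: str, max_width: int) -> list[str]:
    """Greedy wrap; width is measured in characters (an assumption)."""
    lines: list[str] = []
    line = ""
    for unit in split_units(text):
        if line and len((line + unit).rstrip()) > max_width:
            lines.append(line.rstrip())
            line = unit
        else:
            line += unit
    if line:
        lines.append(line.rstrip())
    return lines
```

With these definitions, wrap("日本語文章", 2) yields three lines even though the input contains no spaces, while a Latin string still breaks only at spaces.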
Fixed — silent corruptions now refused or surfaced
Multi-match same-operator collisions, content deletion (no more eaten neighbours), ligature integrity, rotated-text flattening, and reflow byte-stability — plus color, indent, and declared leading preserved through reflow.
Security
Graphics-state (q/Q) depth cap, font/CMap decompression-bomb guard, and linearization (Fast Web View) preservation on save.
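A decompression-bomb guard of the kind mentioned above can be sketched with the standard library's zlib module, which lets the caller cap how many bytes a single decompress call may produce. This is a generic illustration, not the engine's code: the limit value, the exception name, and the function name are all hypothetical.

```python
import zlib

MAX_DECODED_BYTES = 1_000_000  # hypothetical cap, not the engine's value


class DecompressionBombError(ValueError):
    """Raised when a stream inflates past the configured limit."""


def bounded_inflate(data: bytes, limit: int = MAX_DECODED_BYTES) -> bytes:
    decompressor = zlib.decompressobj()
    # Ask for at most limit + 1 bytes: one byte over is enough to prove
    # the stream is too large without inflating the rest of it.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise DecompressionBombError(f"stream inflates past {limit} bytes")
    out += decompressor.flush()
    if len(out) > limit:
        raise DecompressionBombError(f"stream inflates past {limit} bytes")
    return out
```

The point of the max_length argument is that the oversized stream is rejected after producing roughly limit bytes, instead of after the whole payload has been materialised in memory.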
Full details in the CHANGELOG.
v0.1.2 — post-audit hardening, invariant suite, CI polish
Mostly an internal hardening release. The public API is unchanged from 0.1.1, but the engine underneath went through two passes of audit and root-fixing.
The first pass was the Ultimate Audit Charter (docs/ultimate-audit-charter.md), an invariant-driven release-gate process I ran in a fresh session against this branch. It produced 75 invariant probes and surfaced 9 bugs across overflow signalling, stale TextMatch handling, pikepdf exception translation, orphan annotations, and locator extraction. Each one was fixed structurally rather than per-callsite, and the probes now run as part of make test so they regression-guard the contracts they cover.
The second pass was a comprehensive review (docs/comprehensive-audit-2026-05-02.md) — five parallel sub-agents looking at different lenses, then verifying everything against actual source. It found 6 more issues: a missing OSError catch in open_pdf, a doc/code drift in CLAUDE.md, a dry_run overclaim in the README, a missing substitution_log thread on the bbox replace path, a sequential-mode batch_replace_block regression that mispositioned the second section after a failed first one, and a few CI-hygiene gaps (no coverage gate, no macOS in matrix, no Dependabot).
Bigger architectural changes worth calling out:
- _pathutil.open_pdf is the single canonical PDF-open entry now. Sixteen scattered pikepdf.Pdf.open call sites collapsed into one translator; raw pikepdf exceptions stopped leaking.
- EditResult.__post_init__ enforces the overflow-warning contract at the dataclass boundary. Future code paths that flip overflow_detected=True automatically get a caller-visible warning by construction.
- Module-level FontResolverCache is gone (ARY-283). Every public entry constructs fresh per-call caches threaded through internal helpers as explicit parameters, so a future caller weaving surgeon and structural in one transaction won't see stale resolver state.
- Reflow now carves vertical room via _shift_content_below_inplace when the rewritten paragraph is taller, instead of silently overlapping the paragraph below.
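The idea of enforcing a contract at the dataclass boundary can be sketched as follows. Only the overflow_detected field and the use of __post_init__ come from the notes above; the class name, the warnings field, and the warning text are hypothetical stand-ins, and the real EditResult may look quite different.

```python
from dataclasses import dataclass, field

OVERFLOW_WARNING = "replacement text overflowed its region"  # hypothetical text


@dataclass
class EditResultSketch:
    # Illustrative fields; only overflow_detected is named in the notes.
    success: bool = True
    overflow_detected: bool = False
    warnings: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Runs on every construction, so no call site can report an
        # overflow without also carrying a caller-visible warning.
        if self.overflow_detected and OVERFLOW_WARNING not in self.warnings:
            self.warnings.append(OVERFLOW_WARNING)
```

Constructing EditResultSketch(overflow_detected=True) produces an object whose warnings list already contains the overflow message, which is the "by construction" guarantee: the check lives in one place instead of at each call site.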
Security got a real audit too. The headline finding was a path-traversal hole in validate_output_path where Windows directory junctions slipped past the original symlink check (junctions carry a different reparse-point tag than NTFS symlinks). Found by empirical exploit, root-fixed by replacing the resolve-then-walk approach with a realpath vs abspath comparison that catches POSIX symlinks and Windows junctions on both platforms. Two CVE floors pinned: lxml >= 6.1.0 for CVE-2026-41066 (XXE through pikepdf's transitive lxml) and pytest >= 9.0.3 for CVE-2025-71176 (dev-only).
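The realpath vs abspath comparison can be sketched in a few lines. This is a simplified illustration of the idea rather than the body of validate_output_path; the function names are hypothetical, and the demo creates a POSIX-style symlink, so it may need extra privileges on Windows.

```python
import os
import tempfile


def has_link_component(path: str) -> bool:
    """True when resolving links changes the path, i.e. some component
    is a symlink (or, on Windows, a junction)."""
    absolute = os.path.normcase(os.path.abspath(path))
    resolved = os.path.normcase(os.path.realpath(path))
    return absolute != resolved


def demo() -> tuple[bool, bool]:
    # Resolve the temp root first: on some systems the temp directory
    # itself sits behind a symlink, which would skew the comparison.
    root = os.path.realpath(tempfile.mkdtemp())
    real_dir = os.path.join(root, "real")
    link_dir = os.path.join(root, "link")
    os.mkdir(real_dir)
    os.symlink(real_dir, link_dir, target_is_directory=True)
    return (
        has_link_component(os.path.join(real_dir, "out.pdf")),
        has_link_component(os.path.join(link_dir, "out.pdf")),
    )
```

On a POSIX system demo() is expected to return (False, True): the direct path is unchanged by resolution, while the path through the link resolves somewhere else. Because the comparison does not inspect reparse-point tags, it does not need to know which kind of link it has found.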
CI is now ubuntu + windows × Python 3.12 + 3.13, with a separate pip-audit security job per PR. macOS was attempted and surfaced 50+ macOS-specific issues (most macOS system fonts are .ttc TrueType Collections, not single-font .ttf, and the test setup assumed the latter), so it's deferred to 0.1.3 to fix properly rather than band-aid.
741 tests passing, 88% line coverage, mypy strict, ruff clean.
Line-by-line detail in CHANGELOG.md under [0.1.2].
pip install pdf-edit-engine==0.1.2
MCP server (TypeScript wrapper exposing the engine to AI agents): npx -y @aryanbv/pdf-edit-mcp
v0.1.1 — ARY-276 + ARY-278 bugfix release
Bugfix release — fixes three classes of CIDFont/Identity-H text-replacement failures discovered on real-world Chrome and Word PDFs. Fully backwards-compatible: public API unchanged.
[0.1.1] — 2026-04-15
Fixed
- ARY-276: Identity-H CIDFont replacement on large-font titles with per-glyph Tm+Tj emission (Word and Chrome generators) no longer garbles spacing. The operator merge logic now has an all-narrow anchor fallback that collapses chains of narrow Tm+Tj operators into a single anchor, so replacement text flows past the original operator boundaries as the PDF spec allows (surgeon.py F0 fallback, commit f2b4aad).
- ARY-278: Narrow Identity-H subsets (e.g., Chrome's 179-glyph ArialMT) now extend via in-place glyph injection. Missing glyphs are appended to the embedded font at fresh GIDs, preserving every pre-existing CID→GID mapping. The previous Tier 2 subset-and-replace approach renumbered CIDs and corrupted unrelated content-stream text (the 1ova ,ndustries Mode 2 symptom); it has been replaced entirely (fonts.py _extend_tier2, commits 4c262d4..77d3912).
- Cross-font resolver pollution in replace_all: _apply_single_replacement now always re-fetches the resolver from match.characters[0].font_name, discarding any stale resolver passed in by the caller. Previously, replace_all's per-page loop reused one pre-fetched resolver across every match on the page. When matches used different fonts, the stale resolver validated can_encode against the wrong font, extension was skipped, and content-stream operators were encoded with the wrong font's CIDs. Symptom on real Chrome PDFs with multiple Identity-H fonts per page: "ova ndustries" extraction because the emitted CIDs only mapped to N/I in the other font's ToUnicode CMap. Pre-existing bug, surfaced during 0.1.1 real-PDF validation.
- FontResolverCache: now evicts by font-dict object generation number, so pages that share a font via indirect reference are invalidated together after font mutation (encoding.py, commit 8acbd49).
- /W and /ToUnicode dedup entries on repeat extend_subset calls to prevent bloat (fonts.py, commit 60a1697).
- mypy strict: resolved 15 pre-existing strict-mode errors in structural.py and reflow.py. The CI mypy step is now blocking (previously had || true).
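The difference between in-place glyph injection and subset-and-replace can be shown on a toy model. The sketch below is not taken from fonts.py: the SubsetFont class, its fields, and extend_in_place are hypothetical, and real fonts carry outlines, widths, and CMaps that are ignored here. It only shows the invariant that new glyphs get identifiers past the end of the existing table, so nothing already referenced by a content stream changes meaning.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubsetFont:
    """Toy stand-in for an Identity-H subset where CID == GID == list index."""
    glyphs: list[str]
    to_unicode: dict[int, str] = field(default_factory=dict)

    def cid_for(self, ch: str) -> Optional[int]:
        for cid, mapped in self.to_unicode.items():
            if mapped == ch:
                return cid
        return None


def extend_in_place(font: SubsetFont, text: str) -> list[int]:
    """Return CIDs for text, appending any missing glyph at a fresh ID."""
    cids: list[int] = []
    for ch in text:
        cid = font.cid_for(ch)
        if cid is None:
            cid = len(font.glyphs)  # fresh ID past every existing glyph
            font.glyphs.append(ch)
            font.to_unicode[cid] = ch
        cids.append(cid)
    return cids
```

For a font that already maps CIDs 1, 2, 3 to "o", "v", "a", encoding "Nova" adds a single entry at CID 4 and reuses the other three, so text elsewhere in the document that was written with the old CIDs still decodes the same way.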
Verified
- Tested against real-world Chrome (Skia/PDF m147) and Microsoft Word PDFs that reproduced the original ARY-276 garble. Both round-trip cleanly with no Mode-1 or Mode-2 garble tokens in extracted text and no silent font substitutions.
- 636 tests passing (up from 628), mypy strict clean on all 16 source files, ruff clean.
Known scope limits
- CFF / Type1 embedded fonts still raise FontNotFoundError with a clear message when the engine needs to inject glyphs into them. Tier 1.5 handles TrueType only; CFF support is tracked in ARY-279 for 0.2.0.
v0.1.0
[0.1.0] — 2026-04-07
Initial release — format-preserving PDF text editing.
- Text search, replacement, and batch editing at the content stream operator level
- Two-tier font subset extension (CMap-only fast path + full re-embed)
- FidelityReport on every edit — programmatic quality verification
- 15 PDF wrapper operations (merge, split, rotate, encrypt, etc.)
- Paragraph detection and greedy line-breaking reflow
- 628 tests, 85% coverage
- Zero external binaries, zero API keys, zero network calls