Releases: AryanBV/pdf-edit-engine

v0.2.0 — editing-depth (encrypted PDFs · CID-keyed CFF · CJK · honesty taxonomy)

02 Jun 07:05

Editing-depth release. v0.2.0 widens the set of PDFs the engine can edit faithfully — password-protected documents, CID-keyed CFF/Type1C fonts, and CJK text — and hardens the honesty contract: anything the engine can't do cleanly is now surfaced as a typed Degradation (or refused) instead of silently corrupting output. DegradationKinds: 13 → 30. The public API is backward-compatible.

📦 pip install pdf-edit-engine==0.2.0

PyPI: https://pypi.org/project/pdf-edit-engine/0.2.0/

Added

  • Edit encrypted PDFs end-to-end — opened, edited, and re-saved with encryption preserved (new password= kwarg); honest encryption_dropped fallback when re-encryption isn't possible.
  • CID-keyed CFF / Type1C in-place glyph injection — Identity-H CIDFonts with CFF outlines (not just TrueType glyf) can now be extended; collision-free CID == GID, pre-existing CIDs never renumbered.
  • CJK / UAX#14 line-break — a spaceless CJK paragraph wider than its column now wraps at ideograph boundaries instead of silently overflowing; the Latin path is byte-identical.
  • ToUnicode recovery — Type0/Identity-H fonts with no /ToUnicode map get their CID→Unicode recovered from the embedded cmap so text is locatable.
  • Opt-in shrink-to-fit (fit="shrink") to fit text into a fixed-height region.
  • Honesty-taxonomy DX surface — root-level exports (Degradation, DegradationKind, FONT_AFFECTING_KINDS, DEGRADATION_KINDS), FidelityReport.summary() / .is_clean / .max_severity / .warnings(), and EditResult.to_dict().
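
To make the CJK line-break item above concrete, here is a minimal sketch of greedy wrapping that allows a break between adjacent ideographs as well as at spaces. It is not the engine's code: `is_ideograph`, `split_units`, and `wrap` are hypothetical names, width is counted in characters rather than glyph advances, and the UAX#14 rules that forbid breaks before closing punctuation are left out.

```python
def is_ideograph(ch: str) -> bool:
    """Rough check: CJK Unified Ideographs, Hiragana, and Katakana only."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF


def split_units(text: str) -> list[str]:
    """Split text into unbreakable units: Latin words, spaces, single ideographs."""
    units: list[str] = []
    word = ""
    for ch in text:
        if ch == " " or is_ideograph(ch):
            if word:
                units.append(word)
                word = ""
            units.append(ch)
        else:
            word += ch
    if word:
        units.append(word)
    return units


def wrap(text: str, max_width: int) -> list[str]:
    """Greedy wrap; width is a character count (an assumption of this sketch)."""
    lines: list[str] = []
    current = ""
    for unit in split_units(text):
        if unit == " " and not current:
            continue  # never start a line with a space
        if current and len(current) + len(unit) > max_width:
            lines.append(current.rstrip())
            current = "" if unit == " " else unit
        else:
            current += unit
    if current.strip():
        lines.append(current.rstrip())
    return lines


print(wrap("日本語のテキスト", 3))   # spaceless CJK still wraps
print(wrap("the quick fox", 9))      # Latin text breaks only at spaces
```

Treating each ideograph as its own unit is what lets a spaceless paragraph wrap at all; Latin words stay whole because they only become units at spaces.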

Fixed — silent corruptions now refused or surfaced

Multi-match same-operator collisions, content deletion (no more eaten neighbours), ligature integrity, rotated-text flattening, and reflow byte-stability — plus color, indent, and declared leading preserved through reflow.

Security

Graphics-state (q/Q) depth cap, font/CMap decompression-bomb guard, and linearization (Fast Web View) preservation on save.
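
As a rough illustration of the first two guards, the sketch below caps q/Q nesting depth and bounds the inflated size of a compressed stream. It only shows the general technique: the function names, the exception type, and the limits are assumptions, and the operator list is assumed to be already tokenized.

```python
import zlib


class DecompressionLimitExceeded(ValueError):
    """Raised when a stream inflates past the allowed size (hypothetical type)."""


def inflate_capped(data: bytes, limit: int) -> bytes:
    """Inflate zlib data, refusing to produce more than `limit` bytes."""
    decompressor = zlib.decompressobj()
    # Ask for one byte more than the limit so an overrun is detectable
    # without ever materializing the full output.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise DecompressionLimitExceeded(f"stream inflates past {limit} bytes")
    return out


def check_gstate_depth(operators: list[str], max_depth: int) -> int:
    """Walk content-stream operators and cap q/Q nesting; return final depth."""
    depth = 0
    for op in operators:
        if op == "q":
            depth += 1
            if depth > max_depth:
                raise ValueError(f"q/Q nesting exceeds {max_depth}")
        elif op == "Q":
            depth = max(0, depth - 1)  # tolerate an unbalanced Q
    return depth


print(inflate_capped(zlib.compress(b"hello"), 1024))
print(check_gstate_depth(["q", "q", "Q", "Q"], max_depth=2))
```

Passing `limit + 1` as `max_length` means a hostile stream costs at most one byte over the limit before it is rejected.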

Full details in the CHANGELOG.

v0.1.2 — post-audit hardening, invariant suite, CI polish

02 May 07:37

Mostly an internal hardening release. The public API is unchanged from 0.1.1, but the engine underneath went through two passes of audit and root-fixing.

The first pass was the Ultimate Audit Charter (docs/ultimate-audit-charter.md), an invariant-driven release-gate process I ran in a fresh session against this branch. It produced 75 invariant probes and surfaced 9 bugs across overflow signalling, stale TextMatch handling, pikepdf exception translation, orphan annotations, and locator extraction. Each one was fixed structurally rather than per-callsite, and the probes now run as part of make test so they regression-guard the contracts they cover.

The second pass was a comprehensive review (docs/comprehensive-audit-2026-05-02.md): five parallel sub-agents, each reviewing through a different lens, with every finding then verified against the actual source. It found 6 more issues: a missing OSError catch in open_pdf, a doc/code drift in CLAUDE.md, a dry_run overclaim in the README, a missing substitution_log thread on the bbox replace path, a sequential-mode batch_replace_block regression that mispositioned the second section after a failed first one, and a few CI-hygiene gaps (no coverage gate, no macOS in matrix, no Dependabot).

Bigger architectural changes worth calling out:

  • _pathutil.open_pdf is the single canonical PDF-open entry now. Sixteen scattered pikepdf.Pdf.open call sites collapsed into one translator; raw pikepdf exceptions stopped leaking.
  • EditResult.__post_init__ enforces the overflow-warning contract at the dataclass boundary. Future code paths that flip overflow_detected=True automatically get a caller-visible warning by construction.
  • Module-level FontResolverCache is gone (ARY-283). Every public entry constructs fresh per-call caches threaded through internal helpers as explicit parameters, so a future caller weaving surgeon and structural in one transaction won't see stale resolver state.
  • Reflow now carves vertical room via _shift_content_below_inplace when the rewritten paragraph is taller, instead of silently overlapping the paragraph below.
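
The "contract at the dataclass boundary" idea from the EditResult bullet can be sketched as follows. This is a simplified stand-in rather than the real class: only `overflow_detected` comes from the notes above, while `EditResultSketch`, the `warnings` field, and the warning text are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical warning text; the engine's actual message is not shown here.
OVERFLOW_WARNING = "replacement text overflowed its region"


@dataclass
class EditResultSketch:
    overflow_detected: bool = False
    warnings: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the invariant once, at construction time, so no call site
        # can report an overflow without a caller-visible warning.
        if self.overflow_detected and OVERFLOW_WARNING not in self.warnings:
            self.warnings.append(OVERFLOW_WARNING)


result = EditResultSketch(overflow_detected=True)
print(result.warnings)
```

Putting the check in `__post_init__` means every code path that constructs a result with the flag set gets the warning without having to remember to add it.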

Security got a real audit too. The headline finding was a path-traversal hole in validate_output_path where Windows directory junctions slipped past the original symlink check (junctions carry a different reparse-point tag than NTFS symlinks). Found by empirical exploit, root-fixed by replacing the resolve-then-walk approach with a realpath vs abspath comparison that catches POSIX symlinks and Windows junctions on both platforms. Two CVE floors pinned: lxml >= 6.1.0 for CVE-2026-41066 (XXE through pikepdf's transitive lxml) and pytest >= 9.0.3 for CVE-2025-71176 (dev-only).
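
For readers unfamiliar with the realpath vs abspath comparison, the sketch below shows the general shape of such a check using only the standard library. It is not the engine's validate_output_path: the function name and error message are hypothetical, and a real implementation may need to handle further cases (for example Windows short names or extended-length path prefixes).

```python
import os


def reject_redirected_path(path: str) -> str:
    """Return the absolute path, or raise if any component is redirected.

    abspath() only normalizes the string, while realpath() follows links.
    If the two disagree, something along the path points elsewhere.
    """
    absolute = os.path.abspath(path)
    real = os.path.realpath(path)
    # normcase() makes the comparison case-insensitive on Windows only.
    if os.path.normcase(real) != os.path.normcase(absolute):
        raise ValueError(f"{path!r} resolves through a link to {real!r}")
    return absolute
```

Comparing the two resolved strings avoids walking each component and checking its type, which is the step a junction could slip past in the earlier approach.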

CI is now ubuntu + windows × Python 3.12 + 3.13, with a separate pip-audit security job per PR. macOS was attempted and surfaced 50+ macOS-specific issues (most macOS system fonts are .ttc TrueType Collections, not single-font .ttf, and the test setup assumed the latter), so it's deferred to 0.1.3 to fix properly rather than band-aid.

741 tests passing, 88% line coverage, mypy strict, ruff clean.

Line-by-line detail in CHANGELOG.md under [0.1.2].


pip install pdf-edit-engine==0.1.2

MCP server (TypeScript wrapper exposing the engine to AI agents): npx -y @aryanbv/pdf-edit-mcp

v0.1.1 — ARY-276 + ARY-278 bugfix release

15 Apr 14:16

Bugfix release — fixes three classes of CIDFont/Identity-H text-replacement failures discovered on real-world Chrome and Word PDFs. Fully backwards-compatible: public API unchanged.

[0.1.1] — 2026-04-15

Fixed

  • ARY-276: Identity-H CIDFont replacement on large-font titles with per-glyph Tm+Tj emission (Word and Chrome generators) no longer garbles spacing. The operator merge logic now has an all-narrow anchor fallback that collapses chains of narrow Tm+Tj operators into a single anchor, so replacement text flows past the original operator boundaries as the PDF spec allows (surgeon.py F0 fallback, commit f2b4aad).
  • ARY-278: Narrow Identity-H subsets (e.g., Chrome's 179-glyph ArialMT) now extend via in-place glyph injection. Missing glyphs are appended to the embedded font at fresh GIDs, preserving every pre-existing CID→GID mapping. The previous Tier 2 subset-and-replace approach renumbered CIDs and corrupted unrelated content-stream text (the 1ova ,ndustries Mode 2 symptom) — replaced entirely (fonts.py _extend_tier2, commits 4c262d4..77d3912).
  • Cross-font resolver pollution in replace_all: _apply_single_replacement now always re-fetches the resolver from match.characters[0].font_name, discarding any stale resolver passed in by the caller. Previously, replace_all's per-page loop reused one pre-fetched resolver across every match on the page. When matches used different fonts, the stale resolver validated can_encode against the wrong font, extension was skipped, and content-stream operators were encoded with the wrong font's CIDs. Symptom on real Chrome PDFs with multiple Identity-H fonts per page: "ova ndustries" extraction because the emitted CIDs only mapped to N/I in the other font's ToUnicode CMap. Pre-existing bug, surfaced during 0.1.1 real-PDF validation.
  • FontResolverCache: now evicts by font-dict object generation number, so pages that share a font via indirect reference are invalidated together after font mutation (encoding.py, commit 8acbd49).
  • /W and /ToUnicode entries are now deduplicated on repeated extend_subset calls to prevent bloat (fonts.py, commit 60a1697).
  • mypy strict: resolved 15 pre-existing strict-mode errors in structural.py and reflow.py. The CI mypy step is now blocking (previously had || true).
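
The difference between in-place glyph injection and subset-and-replace (ARY-278 above) is easiest to see on a toy model. In the sketch below a font is just a list whose index is the GID, and each glyph is represented by the character it draws; `extend_in_place` is a hypothetical name and none of this touches real font tables.

```python
def extend_in_place(glyph_order: list[str], needed: str) -> dict[str, int]:
    """Append glyphs for missing characters; return char -> GID for `needed`.

    Existing entries are never moved, so any GID already referenced by a
    content stream keeps pointing at the same glyph.
    """
    index = {ch: gid for gid, ch in enumerate(glyph_order)}
    for ch in needed:
        if ch not in index:
            index[ch] = len(glyph_order)  # fresh GID past the current end
            glyph_order.append(ch)
    return {ch: index[ch] for ch in needed}


subset = ["N", "o", "v", "a"]             # a narrow embedded subset
mapping = extend_in_place(subset, "Inova")
print(subset)    # original four glyphs stay at GIDs 0..3
print(mapping)
```

A subset-and-replace strategy would instead rebuild the list from scratch, which is how pre-existing GIDs ended up renumbered under text that was never edited.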

Verified

  • Tested against real-world Chrome (Skia/PDF m147) and Microsoft Word PDFs that reproduced the original ARY-276 garble. Both round-trip cleanly with no Mode-1 or Mode-2 garble tokens in extracted text and no silent font substitutions.
  • 636 tests passing (up from 628), mypy strict clean on all 16 source files, ruff clean.

Known scope limits

  • CFF / Type1 embedded fonts still raise FontNotFoundError with a clear message when the engine needs to inject glyphs into them. Tier 1.5 handles TrueType only; CFF support is tracked in ARY-279 for 0.2.0.

v0.1.0

11 Apr 10:23

[0.1.0] — 2026-04-07

Initial release — format-preserving PDF text editing.

  • Text search, replacement, and batch editing at the content stream operator level
  • Two-tier font subset extension (CMap-only fast path + full re-embed)
  • FidelityReport on every edit — programmatic quality verification
  • 15 PDF wrapper operations (merge, split, rotate, encrypt, etc.)
  • Paragraph detection and greedy line-breaking reflow
  • 628 tests, 85% coverage
  • Zero external binaries, zero API keys, zero network calls

PyPI: https://pypi.org/project/pdf-edit-engine/0.1.0/