Skip to content

fix: render HTML tables as proper GFM pipe tables#16

Open
1broseidon wants to merge 2 commits into
mainfrom
fix/issue-14-table-rendering
Open

fix: render HTML tables as proper GFM pipe tables#16
1broseidon wants to merge 2 commits into
mainfrom
fix/issue-14-table-rendering

Conversation

@1broseidon

@1broseidon 1broseidon commented Jun 24, 2026

Copy link
Copy Markdown
Owner

Fixes #14

Issue #14 was not reliably fixed by adding the table plugin alone because readability can drop large, real data tables from the main extracted subtree (for example, Wikipedia GDP pages). This PR replaces that fallback with a re-injection flow based on deterministic table fingerprints.

Changes

  • Keep the shared markdownConverter with base, commonmark, and table plugins.
  • Before readability, parse raw HTML and collect data-table candidates using strict criteria:
    • rows >= 10, or
    • columns > 4, or
    • <thead> with <th> cells.
  • Exclude layout tables when role="presentation" or class-name contains any of: navbox, sidebar, toc, ambox (case-insensitive substring match).
  • For each candidate, compute a normalized fingerprint from cell text (lowercase, whitespace-collapsed, punctuation stripped) and keep outer table HTML.
  • Run readability extraction and convert to markdown as before.
  • Parse surviving markdown pipe tables, fingerprint them with the same normalizer, and identify missing (dropped) candidates.
  • Re-inject only missing tables by converting candidate HTML via markdownConverter.ConvertString() and appending to article markdown body.
  • Preserve readability title when available; retain original title fallback for raw extraction.

Behavior

  • Pages with no substantial tables: zero candidates, zero re-injection, existing path unchanged.
  • Data-heavy pages: readability still gets prose cleanup, while dropped tables are restored as GFM pipe tables.
  • Duplicate-safe dedupe by fingerprint so tables kept by readability are not re-appended.

Testing

  • go test ./... passed
  • Added unit coverage for:
    • readability-dropped large table re-injection,
    • no duplicate re-injection when table survives readability,
    • layout-only tables excluded from re-injection

Note: this PR intentionally uses append-only placement for re-injected tables; in-order reinsertion is deferred to a follow-up.

Replace the package-level md.ConvertString() with a converter that
explicitly registers the table plugin. Adds a fallback: when readability
strips table markup, the raw path is tried as a retry.

- Registers base, commonmark, and table plugins with sensible defaults
  (header promotion, skip empty rows, preserve newlines, minimal padding)
- Detects when readability loses table structure and falls back to raw
  conversion for table-heavy pages
- Adds unit tests for normal tables, nested cell content, and selector
  extraction

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee1a4351fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread extract/extract.go Outdated
Comment on lines +62 to +63
if hasHTMLTable(html) && !hasMarkdownPipeTable(markdown) {
if raw, rawErr := extractRaw(html); rawErr == nil && hasMarkdownPipeTable(raw.Markdown) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep unrelated tables from bypassing readability

When the original document contains a table outside the readable article (for example a sidebar, nav, footer, or pricing widget) and the article itself has no table, this condition still falls back to extractRaw as long as raw conversion can render that unrelated table. That replaces the cleaned readability output with the full noisy page, regressing scrape/search content for otherwise normal articles that happen to share a page with any table; the table check should be scoped to the rendered article HTML or otherwise verify the missing table was part of the extracted content.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Triage summary (automated daily run)

What it is: Fixes #14 by replacing the bare md.ConvertString() call with an explicit markdownConverter (base + commonmark + table plugins) and adding a pre/post-readability re-injection flow that rescues data tables dropped by the readability pass.

What it changes in output: Markdown body from ketch scrape will now contain GFM pipe tables where previously there were none (or plain text). Re-injected tables are appended at the end of the article body, not at their original DOM position (noted as a known limitation in the PR description).

Why it needs human review (gates that tripped):

  1. Large diff — 334 lines of new production logic in extract/extract.go. The triage policy requires human approval for anything beyond a tight, localized change.

  2. Broad hasTheadTH heuristic — any table containing a <thead><th> cell is classified as a data table and becomes a re-injection candidate, regardless of row count. A 2-row pricing table with a header row qualifies, as does a navigation table that lacks role=presentation and doesn't match the four hard-coded layout class names (navbox, sidebar, toc, ambox). This could silently re-inject tables on sites with unusual markup.

  3. No CI runs — the PR carries no automated check runs; the only test evidence is the author's manual go test ./.... Worth running make build && make test && make lint against the branch before merging.

  4. Append-only placement — re-injected tables always land after the article prose, which may confuse LLM consumers expecting in-order content.

Tests: Coverage for the intended scenarios is solid (re-injection, no-duplicate, layout-table exclusion, role=presentation exclusion, ExtractSelector rendering). No obvious gaps in the happy-path cases.

Suggested next steps for the human reviewer:

  • Run make build && make test && make lint on the fix/issue-14-table-rendering branch.
  • Confirm the hasTheadTH breadth is acceptable (or tighten with a minimum row/column threshold).
  • Consider whether append-only placement needs a follow-up issue now vs. later.

Generated by Claude Code

@1broseidon 1broseidon added triaged Processed by the daily triage loop needs-review Triage loop flagged this for human approval labels Jun 24, 2026 — with Claude
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-review Triage loop flagged this for human approval triaged Processed by the daily triage loop

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve Markdown rendering of tables

1 participant