fix: render HTML tables as proper GFM pipe tables #16
Replace the package-level md.ConvertString() with a converter that explicitly registers the table plugin. Adds a fallback: when readability strips table markup, conversion is retried on the raw HTML.

- Registers base, commonmark, and table plugins with sensible defaults (header promotion, skip empty rows, preserve newlines, minimal padding)
- Detects when readability loses table structure and falls back to raw conversion for table-heavy pages
- Adds unit tests for normal tables, nested cell content, and selector extraction
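To make the fallback trigger concrete, here is a minimal sketch of how a pipe-table check might work. The function name mirrors hasMarkdownPipeTable from the diff, but its body is not shown in this PR, so the implementation below (including the delimiterRow pattern) is an assumption, not the actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// delimiterRow matches a GFM delimiter row such as "|---|:--:|" or "--- | ---".
// Assumption: this pattern is illustrative, not taken from the PR.
var delimiterRow = regexp.MustCompile(`^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$`)

// hasMarkdownPipeTable reports whether md contains a line with a pipe
// that is immediately followed by a delimiter row that also has a pipe.
func hasMarkdownPipeTable(md string) bool {
	lines := strings.Split(md, "\n")
	for i := 1; i < len(lines); i++ {
		header, delim := lines[i-1], lines[i]
		if strings.Contains(header, "|") &&
			strings.Contains(delim, "|") &&
			delimiterRow.MatchString(delim) {
			return true
		}
	}
	return false
}

func main() {
	table := "| Country | GDP |\n| --- | --- |\n| A | 1 |"
	fmt.Println(hasMarkdownPipeTable(table))
	fmt.Println(hasMarkdownPipeTable("Heading\n---\nplain text"))
}
```

Requiring a pipe on both lines keeps a plain thematic break (---) from being mistaken for a table.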
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee1a4351fd
if hasHTMLTable(html) && !hasMarkdownPipeTable(markdown) {
	if raw, rawErr := extractRaw(html); rawErr == nil && hasMarkdownPipeTable(raw.Markdown) {
Keep unrelated tables from bypassing readability
When the original document contains a table outside the readable article (for example a sidebar, nav, footer, or pricing widget) and the article itself has no table, this condition still falls back to extractRaw as long as raw conversion can render that unrelated table. That replaces the cleaned readability output with the full noisy page, regressing scrape/search content for otherwise normal articles that happen to share a page with any table; the table check should be scoped to the rendered article HTML or otherwise verify the missing table was part of the extracted content.
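One way to read the first suggestion is to run the table check against the article HTML that readability kept, rather than the full page. The sketch below assumes a hypothetical shouldFallBackToRaw helper and deliberately crude detection; neither is from the PR. It does not cover the second case the review mentions, where readability dropped a table that belonged to the article.

```go
package main

import (
	"fmt"
	"strings"
)

// containsDelimiterRow is a crude stand-in for a pipe-table check:
// it looks for a line made up only of '|', '-', ':' and spaces.
func containsDelimiterRow(md string) bool {
	for _, line := range strings.Split(md, "\n") {
		t := strings.TrimSpace(line)
		if strings.Contains(t, "|") && strings.Contains(t, "-") &&
			strings.Trim(t, "|-: ") == "" {
			return true
		}
	}
	return false
}

// shouldFallBackToRaw (hypothetical) retries raw conversion only when the
// extracted article itself holds a table that the markdown failed to render.
// articleHTML is assumed to be readability's output, not the whole page.
func shouldFallBackToRaw(articleHTML, articleMarkdown string) bool {
	articleHasTable := strings.Contains(strings.ToLower(articleHTML), "<table")
	return articleHasTable && !containsDelimiterRow(articleMarkdown)
}

func main() {
	// A sidebar table elsewhere on the page is invisible here, because only
	// the article subtree is inspected.
	article := "<article><p>No tables in this article.</p></article>"
	fmt.Println(shouldFallBackToRaw(article, "No tables in this article."))
}
```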
Triage summary (automated daily run)
What it is: Fixes #14 by replacing the bare
What it changes in output: Markdown body from
Why it needs human review (gates that tripped):
Tests: Coverage for the intended scenarios is solid (re-injection, no-duplicate, layout-table exclusion,
Suggested next steps for the human reviewer:
Generated by Claude Code
Fixes #14
Issue #14 was not reliably fixed by adding the table plugin alone, because readability can drop large, real data tables from the main extracted subtree (for example, Wikipedia GDP pages). This PR replaces the raw-conversion fallback with a re-injection flow based on deterministic table fingerprints.
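The PR text does not say how a fingerprint is computed. As a rough illustration only, a deterministic fingerprint could hash the normalized cell text; tableFingerprint and its [][]string input are hypothetical names chosen for this sketch.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// tableFingerprint (hypothetical) returns a SHA-256 hex digest of a table's
// cell text. rows is assumed to hold text already extracted from the HTML.
func tableFingerprint(rows [][]string) string {
	h := sha256.New()
	for _, row := range rows {
		for _, cell := range row {
			// Collapse whitespace and case so formatting noise does not
			// change the fingerprint.
			norm := strings.ToLower(strings.Join(strings.Fields(cell), " "))
			h.Write([]byte(norm))
			h.Write([]byte{0x1f}) // separator between cells
		}
		h.Write([]byte{0x1e}) // separator between rows
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := [][]string{{"Country", "GDP"}, {"A", "1"}}
	b := [][]string{{" country ", "GDP\n"}, {"A", "1"}}
	fmt.Println(tableFingerprint(a) == tableFingerprint(b))
}
```

The cell and row separators keep a one-row, two-cell table from colliding with a two-row, one-cell table that has the same text.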
Changes
- markdownConverter with base, commonmark, and table plugins.
- rows >= 10, or columns > 4, or <thead> with <th> cells.
- role="presentation" or class-name contains any of: navbox, sidebar, toc, ambox (case-insensitive substring match).
- markdownConverter.ConvertString() and appending to article markdown body.

Behavior
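The table criteria listed under Changes could be sketched as two predicates. The thresholds and class hints come from the list above; the tableInfo struct, the function names, and the choice to let the exclusion rule win over the size rule are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tableInfo (hypothetical) holds facts already extracted from a <table>.
type tableInfo struct {
	Rows       int
	Columns    int
	HasTheadTh bool   // <thead> containing <th> cells
	Role       string // value of the role attribute
	Class      string // value of the class attribute
}

// Class-name hints from the PR description.
var layoutClassHints = []string{"navbox", "sidebar", "toc", "ambox"}

// isLayoutTable applies the exclusion rule: role="presentation" or a
// case-insensitive substring match on the class name.
func isLayoutTable(t tableInfo) bool {
	if t.Role == "presentation" {
		return true
	}
	class := strings.ToLower(t.Class)
	for _, hint := range layoutClassHints {
		if strings.Contains(class, hint) {
			return true
		}
	}
	return false
}

// isDataTable applies the inclusion rule to tables that were not excluded.
// Assumption: exclusion takes priority over the size thresholds.
func isDataTable(t tableInfo) bool {
	if isLayoutTable(t) {
		return false
	}
	return t.Rows >= 10 || t.Columns > 4 || t.HasTheadTh
}

func main() {
	gdp := tableInfo{Rows: 200, Columns: 6, Class: "wikitable sortable"}
	nav := tableInfo{Rows: 30, Columns: 2, Class: "NavBox"}
	fmt.Println(isDataTable(gdp), isDataTable(nav))
}
```

Because the class rule is a plain substring match, a class such as stock-prices would also match the toc hint, which may be worth a test case.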
Testing
go test ./... passed

Note: this PR intentionally uses append-only placement for re-injected tables; in-order reinsertion is deferred to a follow-up.
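Append-only placement can be illustrated with a small sketch. The helper name appendMissingTables is hypothetical, and it skips duplicates by exact substring match, which is a simplification: the PR describes fingerprint-based matching.

```go
package main

import (
	"fmt"
	"strings"
)

// appendMissingTables (hypothetical) appends each table's markdown to the end
// of body, separated by a blank line, unless that markdown is already present.
func appendMissingTables(body string, tables []string) string {
	out := strings.TrimRight(body, "\n")
	for _, tbl := range tables {
		tbl = strings.TrimSpace(tbl)
		if tbl == "" || strings.Contains(out, tbl) {
			continue // nothing to add, or already rendered in the body
		}
		if out != "" {
			out += "\n\n"
		}
		out += tbl
	}
	return out + "\n"
}

func main() {
	table := "| Country | GDP |\n| --- | --- |\n| A | 1 |"
	fmt.Print(appendMissingTables("Intro paragraph.\n", []string{table, table}))
}
```

Appending keeps the article text untouched, at the cost of tables appearing after the prose rather than at their original position.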