@ai-man-codes

Pull Request

Description

Add HTMLtoMarkdown class for converting HTML to clean Markdown output, suitable for LLM ingestion and dataset creation. The converter uses Python's built-in HTMLParser to handle common HTML elements and produce readable Markdown.

Supports:

  • Headings (h1-h6) → # syntax
  • Paragraphs and line breaks
  • Bold/italic formatting (strong, b, em, i)
  • Links → [text](url)
  • Images → ![alt](src) (with optional skip)
  • Ordered and unordered lists (including nested)
  • Tables → GitHub Flavored Markdown
  • Code blocks → fenced with configurable delimiter
  • Blockquotes

Navigation elements (nav, aside, header, footer) are skipped by default; this is controlled by the skip_nav parameter.
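
To illustrate the approach described above, here is a minimal sketch of an HTMLParser-based converter. It is not the code added in this PR: the class name MiniHTMLToMarkdown and the convert() method are hypothetical, and only headings, paragraphs, line breaks, bold/italic, links, and navigation skipping are handled. Lists, tables, images, code blocks, and blockquotes are left out for brevity.

```python
from html.parser import HTMLParser


class MiniHTMLToMarkdown(HTMLParser):
    """Hypothetical, simplified converter; NOT the class added in this PR."""

    SKIPPED = {"nav", "aside", "header", "footer"}
    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self, skip_nav=True):
        super().__init__(convert_charrefs=True)
        self.skip_nav = skip_nav  # mirrors the skip_nav idea from the description
        self._skip_depth = 0      # > 0 while inside a skipped element
        self._parts = []          # output fragments, joined at the end
        self._href = None         # URL of the currently open link, if any

    def _skipping(self, tag):
        return self.skip_nav and tag in self.SKIPPED

    def handle_starttag(self, tag, attrs):
        if self._skipping(tag):
            self._skip_depth += 1
        if self._skip_depth:
            return  # ignore everything nested inside a skipped element
        if tag in self.HEADINGS:
            self._parts.append("\n\n" + "#" * self.HEADINGS[tag] + " ")
        elif tag == "p":
            self._parts.append("\n\n")
        elif tag == "br":
            self._parts.append("\n")
        elif tag in self.INLINE:
            self._parts.append(self.INLINE[tag])
        elif tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._parts.append("[")

    def handle_endtag(self, tag):
        if self._skipping(tag):
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if self._skip_depth:
            return
        if tag in self.INLINE:
            self._parts.append(self.INLINE[tag])
        elif tag == "a" and self._href is not None:
            self._parts.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def convert(self, html):
        # Reset both our state and the parser state so the instance is reusable.
        self._parts, self._skip_depth, self._href = [], 0, None
        self.reset()
        self.feed(html)
        self.close()
        return "".join(self._parts).strip()


sample = (
    '<nav><a href="/home">Home</a></nav>'
    "<h1>Title</h1>"
    '<p>Some <strong>bold</strong> and <em>italic</em> text '
    'with a <a href="https://example.com">link</a>.</p>'
)
print(MiniHTMLToMarkdown().convert(sample))
```

With the default skip_nav=True, the nav block is dropped and the output starts at the heading. Passing skip_nav=False keeps the navigation link as [Home](/home). A skip-depth counter is used rather than a boolean so that nested skipped elements (for example, a nav inside a header) are handled correctly.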

Related Issue(s)

Closes #243

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes, no API changes)
  • Performance improvement
  • Tests (adding missing tests or correcting existing tests)
  • Build or CI/CD related changes

How Has This Been Tested?

poetry run task test
poetry run task lint
poetry run task format

Add HTMLtoMarkdown class for converting HTML to clean Markdown output,
suitable for LLM ingestion and dataset creation. Supports headings,
paragraphs, formatting, links, images, lists, tables, and code blocks.
Navigation elements are optionally skipped.

Closes autoscrape-labs#243
@codecov

codecov bot commented Dec 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Cover previously untested code paths:
- Tab.to_markdown() with all parameter combinations
- Error handling for invalid result structures
- Image skip behavior and empty table edge cases
@ai-man-codes
Author

@thalissonvs can you review this PR and add the feature?

@thalissonvs
Member

Hello @ai-man-codes, thanks! I'll take a look as soon as possible.

Development

Successfully merging this pull request may close these issues.

[Feature]: Export HTML to Markdown

2 participants