Convert HTML to clean, structured Markdown or plain text. Perfect for extracting readable content from web pages with robust boilerplate removal and language-aware processing.
- Smart Cleaning: Automatically removes navigation, footers, ads, and other boilerplate
- Flexible Output: Convert to Markdown or plain text
- Language-Aware: Special support for Bengali and English with automatic language detection
- Link Control: Choose to keep or remove links and images
- Multiple Input Sources: Process HTML strings, files, or URLs
- CLI & Python API: Use from the command line or integrate into your Python projects
- Minimal Dependencies: Modern, lightweight dependency stack
pip install html2cleantext
Or install from source:
git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e .
import html2cleantext
# From HTML string
html = "<h1>Hello World</h1><p>This is a test with a <a href='https://example.com'>link</a>.</p>"
markdown = html2cleantext.to_markdown(html) # Output: ... [link](https://example.com) ...
text = html2cleantext.to_text(html, keep_links=True) # Output: ... link [Link:https://example.com] ...
# From file
markdown = html2cleantext.to_markdown("page.html")
# From URL
markdown = html2cleantext.to_markdown("https://example.com")
# With options
clean_text = html2cleantext.to_text(
html,
keep_links=True, # Use [Link:URL] format in plain text
keep_images=False,
remove_boilerplate=True
)
# Convert to Markdown (default, links as [text](URL))
html2cleantext input.html
# Convert to plain text (links as [Link:URL])
html2cleantext input.html --mode text --keep-links
# From URL
html2cleantext https://example.com --output clean.md
# Remove links and images
html2cleantext input.html --no-links --no-images
# Keep all content (no boilerplate removal)
html2cleantext input.html --no-remove_boilerplate
to_markdown(): Convert HTML to clean Markdown format.
Parameters:
- html_input (str | Path): HTML string, file path, or URL
- keep_links (bool): Preserve links (default: True)
- keep_images (bool): Preserve images (default: True)
- remove_boilerplate (bool): Remove boilerplate content (default: True)
- normalize_lang (bool): Apply language normalization (default: True)
- language (str, optional): Language code for normalization (auto-detected if None)
Returns: Clean Markdown text (str)
to_text(): Convert HTML to clean plain text format.
Parameters:
- Same as to_markdown(), but with different defaults:
  - keep_links (bool): Default False
  - keep_images (bool): Default False
Returns: Clean plain text (str)
positional arguments:
  input                 HTML input: file path, URL, or raw HTML string

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --mode {markdown,text}, -m {markdown,text}
                        Output format (default: markdown)
  --output OUTPUT, -o OUTPUT
                        Output file path (default: stdout)
  --keep-links          Preserve links in the output
  --no-links            Remove links from the output
  --keep-images         Preserve images in the output
  --no-images           Remove images from the output
  --remove_boilerplate  Remove navigation, footers, and boilerplate content
  --no-remove_boilerplate
                        Keep all content including navigation and footers
  --language LANGUAGE, -l LANGUAGE
                        Language code for normalization
  --no-normalize        Skip language-specific normalization
  --verbose, -v         Enable verbose logging
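The paired flags in the help text (--keep-links / --no-links, --keep-images / --no-images) can be declared in argparse so that each pair writes to one destination. The following is only a sketch of how such a parser could be built, not the package's actual source: the build_parser name and the choice of None as "use the mode's default" are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of part of the CLI from its help text."""
    parser = argparse.ArgumentParser(prog="html2cleantext")
    parser.add_argument("input", help="HTML input: file path, URL, or raw HTML string")
    parser.add_argument("--mode", "-m", choices=["markdown", "text"],
                        default="markdown", help="Output format (default: markdown)")
    parser.add_argument("--output", "-o", default=None,
                        help="Output file path (default: stdout)")

    # Each pair shares one destination. None means neither flag was given,
    # so the caller can fall back to the default of the selected mode.
    parser.add_argument("--keep-links", dest="keep_links",
                        action="store_true", default=None)
    parser.add_argument("--no-links", dest="keep_links",
                        action="store_false", default=None)
    parser.add_argument("--keep-images", dest="keep_images",
                        action="store_true", default=None)
    parser.add_argument("--no-images", dest="keep_images",
                        action="store_false", default=None)
    return parser


args = build_parser().parse_args(["input.html", "--mode", "text", "--keep-links"])
print(args.mode, args.keep_links, args.keep_images)  # text True None
```

Leaving the value as None when no flag is passed lets one parser serve both modes, whose link and image defaults differ.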
- Markdown output: Links are converted to the standard Markdown format [text](URL) for compatibility with Markdown renderers.
- Plain text output (Python API, or CLI with --mode text): When links are kept, they are written in [Link:URL] format (e.g., My Link [Link:https://example.com]) for easy parsing and clear distinction from other text.
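Because the plain-text marker has a fixed shape, downstream code can recover or drop the URLs with a regular expression. This is a minimal sketch using only the standard library; the helper names extract_links and strip_links are hypothetical and not part of html2cleantext, and the pattern assumes URLs contain no whitespace or closing bracket.

```python
import re

# Assumed marker shape, taken from the example above: "[Link:" + URL + "]".
# The leading \s* also consumes the space that separates the marker from its text.
LINK_MARKER = re.compile(r"\s*\[Link:([^\]\s]+)\]")


def extract_links(text: str) -> list:
    """Return every URL found in a [Link:URL] marker, in order."""
    return LINK_MARKER.findall(text)


def strip_links(text: str) -> str:
    """Remove the markers and keep the surrounding text."""
    return LINK_MARKER.sub("", text)


sample = "My Link [Link:https://example.com] and more text."
print(extract_links(sample))  # ['https://example.com']
print(strip_links(sample))    # My Link and more text.
```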
import html2cleantext
# Simple HTML to Markdown
html = """
<html>
<head><title>Test Page</title></head>
<body>
<nav>Navigation menu</nav>
<main>
<h1>Main Title</h1>
<p>This is the main content with a <a href="https://example.com">link</a>.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</main>
<footer>Footer content</footer>
</body>
</html>
"""
result_md = html2cleantext.to_markdown(html)
print(result_md)
# Output:
# Main Title
#
# This is the main content with a [link](https://example.com).
#
# * Item 1
# * Item 2
result_txt = html2cleantext.to_text(html, keep_links=True)
print(result_txt)
# Output:
# Main Title
#
# This is the main content with a link [Link:https://example.com].
#
# Item 1
# Item 2
# Basic conversion (Markdown, links as [text](URL))
html2cleantext index.html > clean.md
# Plain text with links as [Link:URL]
html2cleantext index.html --mode text --keep-links > clean.txt
html2cleantext provides enhanced support for:
- English: Smart quote normalization, punctuation cleanup
- Bengali: Unicode normalization, punctuation handling
- Auto-detection: Automatically detects language when not specified
Additional languages can be easily added by extending the normalization functions.
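As an illustration of what extending the normalization functions could look like, here is a sketch built around a registry keyed by language code. Everything in it is an assumption: the NORMALIZERS registry, the function names, and the specific rules are hypothetical and do not describe the package's internals.

```python
import unicodedata

# Curly quotes mapped to their ASCII equivalents.
_SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_english(text: str) -> str:
    """Smart quote normalization (one possible English rule)."""
    return text.translate(_SMART_QUOTES)


def normalize_bengali(text: str) -> str:
    """Unicode normalization: NFC composes split vowel signs into one code point."""
    return unicodedata.normalize("NFC", text)


# Hypothetical registry: language code -> normalization function.
NORMALIZERS = {"en": normalize_english, "bn": normalize_bengali}


def normalize(text: str, language: str) -> str:
    """Apply the registered normalizer, or return the text unchanged."""
    func = NORMALIZERS.get(language)
    return func(text) if func else text


# Adding a language is then a single registration (illustrative rule only).
def normalize_german(text: str) -> str:
    return text.replace("\u201e", '"').replace("\u201c", '"')


NORMALIZERS["de"] = normalize_german

print(normalize("\u201cHello\u201d", "en"))  # "Hello"
```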
The package follows a clean pipeline architecture:
- Input Processing: Handles HTML strings, files, or URLs
- HTML Parsing: Uses BeautifulSoup with lxml parser
- Cleaning: Removes scripts, styles, and unwanted attributes
- Boilerplate Removal: Strips navigation, footers, ads using readability-lxml or manual rules
- Language Detection: Auto-detects content language
- Conversion: Converts to Markdown using markdownify or extracts plain text
- Normalization: Applies language-specific text cleanup
- Output: Returns clean text or writes to file
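The cleaning, boilerplate removal, and text extraction stages above can be sketched in a few lines. Note the swap: this example uses the standard library's html.parser with a fixed list of tags to skip, not BeautifulSoup, readability-lxml, or markdownify, so it only approximates the manual-rules path. The tag sets and the names TextExtractor and to_plain_text are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumed rule set: everything inside these tags is treated as boilerplate.
SKIPPED_TAGS = {"head", "script", "style", "nav", "footer", "aside"}
# Tags that start or end a line in the plain-text output.
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text while ignoring content nested in skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Trim each line and drop the empty ones left by block boundaries.
        lines = (line.strip() for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<nav>Menu</nav><main><h1>Main Title</h1>"
        "<p>Body text.</p></main><footer>Footer</footer>")
print(to_plain_text(page))
# Main Title
# Body text.
```

A fixed tag list is far cruder than a readability-style content scorer, but it makes the order of the stages easy to follow.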
- beautifulsoup4 - HTML parsing
- lxml - Fast XML/HTML parser
- markdownify - HTML to Markdown conversion
- readability-lxml - Content extraction and boilerplate removal
- langdetect - Language detection
- requests - HTTP requests for URL fetching
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e .[dev] # Install with development dependencies
# OR
pip install -e . # Install package only
pip install -r requirements-dev.txt # Install dev dependencies separately
python -m pytest tests/
This project is licensed under the MIT License - see the LICENSE file for details.
- Initial release
- Core HTML to Markdown/text conversion
- Boilerplate removal using readability-lxml
- Language-aware normalization for Bengali and English
- Command-line interface
- Support for HTML strings, files, and URLs