feat(sources): add RSS/Atom feed adapter with content extraction #24
Merged
foundatron merged 1 commit into `main` on Mar 11, 2026
Add `extract_content` option to `RSSAdapter` that fetches linked article pages after parsing the feed and strips their HTML. Introduces `_HTMLToTextParser` (stdlib `html.parser`) to extract visible text while skipping script/style/nav/header/footer elements, and `_fetch_content` with charset detection, bounded reads, and full error isolation. Adds `RSSSourceConfig` dataclass mirroring the pattern of other source configs, and wires it through config loading and the CLI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
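For reviewers who want a feel for the parser described in this commit message, here is a minimal sketch of an `html.parser`-based text extractor. It is not the merged code: the names `HTMLToTextSketch`, `html_to_text`, and `_SKIP_TAGS` are hypothetical, and the depth-counter approach is an assumption about how skipped elements could be tracked.

```python
from html.parser import HTMLParser

# Elements whose contents are treated as non-visible (list taken from the PR text).
_SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class HTMLToTextSketch(HTMLParser):
    """Illustrative stand-in for `_HTMLToTextParser`; names are hypothetical."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside any skipped element
        self._chunks = []     # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Ignore stray closing tags so the counter never goes negative.
        if tag in _SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join fragments with spaces, then collapse every whitespace run.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = HTMLToTextSketch()
    parser.feed(html)
    parser.close()
    return parser.text()
```

In this sketch an unclosed skipped element (for example a `<nav>` with no closing tag) suppresses all following text, which errs on the side of dropping content instead of leaking it. Joining fragments with a space also means a word split by inline markup, such as `wor<b>ld</b>`, comes out as two tokens; whether that trade-off is acceptable depends on the feeds being ingested.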
Closes #5
Changes
1. `tentacle/sources/rss.py`: Add HTML-to-text parser and optional content extraction
   - Add `_HTMLToTextParser(html.parser.HTMLParser)` class (stdlib) to strip HTML tags and extract plain text. Skip content inside `<script>`, `<style>`, `<nav>`, `<header>`, `<footer>`. Collapse whitespace.
   - Add `_fetch_content(url: str, timeout: int, max_bytes: int) -> str | None` module-level function:
     - Uses `urllib.request.urlopen` with the given timeout.
     - Reads at most `max_bytes` via `resp.read(max_bytes)`.
     - Detects the charset from the `Content-Type` header, falls back to UTF-8 with `errors="replace"`.
     - Strips HTML via `_HTMLToTextParser`.
     - Returns `None` on failure.
   - Add `__init__(self, extract_content: bool = False, content_timeout: int = 30, content_max_bytes: int = 1_048_576)` to `RSSAdapter`.
   - In `_parse_rss` and `_parse_atom`, after building each `Article`: if `self._extract_content` is enabled and the article has a URL, call `_fetch_content` and set `article.full_text`.
2. `tentacle/config.py`: Add `RSSSourceConfig` with `extract_content` field
   - Add `RSSSourceConfig(SourceConfig)` dataclass with `extract_content: bool = False`.
   - Change `Config.rss` field type from `SourceConfig` to `RSSSourceConfig` (default factory `RSSSourceConfig`).
   - Update `_apply_toml()` to read `extract_content` from the `[sources.rss]` TOML dict and apply it to the config, following the same pattern used for HN's `min_points`, arXiv's `days_back`, etc.
   - Update `DEFAULT_CONFIG_TEMPLATE` to add a `# extract_content = false` comment under `[sources.rss]`.
3. `tentacle/cli.py` (~line 85–86): Pass `extract_content` to `RSSAdapter`
   - Change `RSSAdapter()` to `RSSAdapter(extract_content=config.rss.extract_content)`.
4. `config.example.toml`: Add `extract_content` comment under `[sources.rss]`
   - Add a `# extract_content = false` line.

Review Findings
The most actionable fix is #1 (log the exception). Items #2 and #3 are worth addressing before this ships to production — #2 for operational sanity and #3 to avoid leaking script content on malformed pages.
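To make the "log the exception" suggestion concrete, here is a hedged sketch of a bounded, error-isolated fetch that records the failure before returning `None`. It is an illustration under assumptions, not the code in this PR: `fetch_html`, `decode_body`, and the log message are hypothetical names, and the HTML-stripping step performed by the real `_fetch_content` is left out to keep the example short.

```python
import logging
import urllib.request
from typing import Optional

logger = logging.getLogger(__name__)


def decode_body(raw: bytes, charset: Optional[str]) -> str:
    """Decode with the declared charset, falling back to UTF-8 with replacement."""
    try:
        return raw.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # The server declared a charset name Python does not recognise.
        return raw.decode("utf-8", errors="replace")


def fetch_html(url: str, timeout: int = 30, max_bytes: int = 1_048_576) -> Optional[str]:
    """Hypothetical stand-in for `_fetch_content`, minus the HTML stripping."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read(max_bytes)  # bounded read: at most max_bytes
            charset = resp.headers.get_content_charset()  # parsed from Content-Type
        return decode_body(raw, charset)
    except Exception:
        # Error isolation: one bad article should not abort the whole feed,
        # but the failure is logged with its traceback instead of vanishing.
        logger.warning("content fetch failed for %s", url, exc_info=True)
        return None
```

Logging at warning level with `exc_info=True` keeps the adapter's "never raise" contract while still leaving a traceback for whoever has to work out why `full_text` came back empty.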