feat(sources): add RSS/Atom feed adapter with content extraction by foundatron · Pull Request #24 · foundatron/tentacle

foundatron · 2026-03-11T02:14:10Z

Closes #5

Changes

1. tentacle/sources/rss.py — Add HTML-to-text parser and optional content extraction

Add _HTMLToTextParser(html.parser.HTMLParser) class (stdlib) to strip HTML tags and extract plain text. Skip content inside <script>, <style>, <nav>, <header>, <footer>. Collapse whitespace.
Add _fetch_content(url: str, timeout: int, max_bytes: int) -> str | None module-level function:
- Fetches via urllib.request.urlopen with the given timeout.
- Reads up to max_bytes via resp.read(max_bytes).
- Detects charset from Content-Type header, falls back to UTF-8 with errors="replace".
- Parses HTML to text via _HTMLToTextParser.
- Catches all exceptions, logs warning, returns None on failure.
Add __init__(self, extract_content: bool = False, content_timeout: int = 30, content_max_bytes: int = 1_048_576) to RSSAdapter.
In _parse_rss and _parse_atom, after building each Article: if self._extract_content is enabled and the article has a URL, call _fetch_content and set article.full_text.

2. tentacle/config.py — Add RSSSourceConfig with extract_content field

Create RSSSourceConfig(SourceConfig) dataclass with extract_content: bool = False.
Change Config.rss field type from SourceConfig to RSSSourceConfig (default factory RSSSourceConfig).
Update _apply_toml() to read extract_content from the [sources.rss] TOML dict and apply it to the config, following the same pattern used for HN's min_points, arXiv's days_back, etc.
Update DEFAULT_CONFIG_TEMPLATE to add # extract_content = false comment under [sources.rss].

3. tentacle/cli.py (~line 85–86) — Pass extract_content to RSSAdapter

Change RSSAdapter() to RSSAdapter(extract_content=config.rss.extract_content).
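The path from the flag to `article.full_text` might look like the sketch below. The `Article` dataclass, the `_maybe_extract` helper name, and the offline `_fetch_content` stub are all assumptions for illustration; per the description, the PR performs this step inline in `_parse_rss` and `_parse_atom`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical stand-in for tentacle's Article model.
    title: str
    url: str | None = None
    full_text: str | None = None


def _fetch_content(url: str, timeout: int, max_bytes: int) -> str | None:
    # Stub so the sketch runs offline; the PR's version performs the HTTP fetch.
    return f"text of {url}"


class RSSAdapter:
    def __init__(
        self,
        extract_content: bool = False,
        content_timeout: int = 30,
        content_max_bytes: int = 1_048_576,
    ) -> None:
        self._extract_content = extract_content
        self._content_timeout = content_timeout
        self._content_max_bytes = content_max_bytes

    def _maybe_extract(self, article: Article) -> Article:
        # Assumed helper name: fetch only when enabled and a URL is present.
        if self._extract_content and article.url:
            article.full_text = _fetch_content(
                article.url, self._content_timeout, self._content_max_bytes
            )
        return article


# In cli.py the flag would come from config.rss.extract_content.
adapter = RSSAdapter(extract_content=True)
article = adapter._maybe_extract(Article("Example", "https://example.com/post"))
print(article.full_text)
```

Because the default is `extract_content=False`, callers that still construct `RSSAdapter()` with no arguments make no extra HTTP requests.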

4. config.example.toml — Add extract_content comment under [sources.rss]

Add # extract_content = false line.

Review Findings

Errors: 0
Warnings: 3
Nits: 4
Assessment: NEEDS CHANGES

The most actionable fix is #1 (log the exception). Items #2 and #3 are worth addressing before this ships to production — #2 for operational sanity and #3 to avoid leaking script content on malformed pages.

Add `extract_content` option to `RSSAdapter` that fetches and strips HTML from linked article pages after parsing the feed. Introduces `_HTMLToTextParser` (stdlib html.parser) to extract visible text while skipping script/style/nav/header/footer elements, and `_fetch_content` with charset detection, bounded reads, and full error isolation. Adds `RSSSourceConfig` dataclass mirroring the pattern of other source configs, wires it through config loading and CLI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

foundatron merged commit 4417f7b into main Mar 11, 2026
1 check passed

foundatron deleted the issue-5 branch March 11, 2026 02:14
