feat(sources): add RSS/Atom feed adapter with content extraction#24

Merged
foundatron merged 1 commit into main from issue-5
Mar 11, 2026

@foundatron
Owner

Closes #5

Changes

1. tentacle/sources/rss.py — Add HTML-to-text parser and optional content extraction

  • Add _HTMLToTextParser(html.parser.HTMLParser) class (stdlib) to strip HTML tags and extract plain text. Skip content inside <script>, <style>, <nav>, <header>, <footer>. Collapse whitespace.
  • Add _fetch_content(url: str, timeout: int, max_bytes: int) -> str | None module-level function:
    • Fetches via urllib.request.urlopen with the given timeout.
    • Reads up to max_bytes via resp.read(max_bytes).
    • Detects charset from Content-Type header, falls back to UTF-8 with errors="replace".
    • Parses HTML to text via _HTMLToTextParser.
    • Catches all exceptions, logs warning, returns None on failure.
  • Add __init__(self, extract_content: bool = False, content_timeout: int = 30, content_max_bytes: int = 1_048_576) to RSSAdapter.
  • In _parse_rss and _parse_atom, after building each Article: if self._extract_content is enabled and the article has a URL, call _fetch_content and set article.full_text.

2. tentacle/config.py — Add RSSSourceConfig with extract_content field

  • Create RSSSourceConfig(SourceConfig) dataclass with extract_content: bool = False.
  • Change Config.rss field type from SourceConfig to RSSSourceConfig (default factory RSSSourceConfig).
  • Update _apply_toml() to read extract_content from the [sources.rss] TOML dict and apply it to the config, following the same pattern used for HN's min_points, arXiv's days_back, etc.
  • Update DEFAULT_CONFIG_TEMPLATE to add # extract_content = false comment under [sources.rss].

3. tentacle/cli.py (~line 85–86) — Pass extract_content to RSSAdapter

  • Change RSSAdapter() to RSSAdapter(extract_content=config.rss.extract_content).
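
The path from the config flag to `article.full_text` can be sketched as follows. This is a simplified stand-in, not the adapter from the PR: `Article` is reduced to three fields, `_maybe_extract` is a hypothetical helper name for the step that `_parse_rss` and `_parse_atom` perform, and the `fetcher` parameter is an assumption added only so the example runs without network access (the PR calls the module-level `_fetch_content` directly).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Article:
    # Hypothetical minimal shape; the real Article has more fields.
    title: str
    url: str | None = None
    full_text: str | None = None


# (url, timeout, max_bytes) -> extracted text, or None on failure
Fetcher = Callable[[str, int, int], Optional[str]]


class RSSAdapter:
    def __init__(
        self,
        extract_content: bool = False,
        content_timeout: int = 30,
        content_max_bytes: int = 1_048_576,
        fetcher: Fetcher | None = None,  # assumption: injected for testability
    ) -> None:
        self._extract_content = extract_content
        self._content_timeout = content_timeout
        self._content_max_bytes = content_max_bytes
        self._fetcher = fetcher

    def _maybe_extract(self, article: Article) -> Article:
        # Only fetch when the feature is on and the entry actually has a link.
        if self._extract_content and article.url and self._fetcher is not None:
            article.full_text = self._fetcher(
                article.url, self._content_timeout, self._content_max_bytes
            )
        return article


# Usage: a fake fetcher stands in for the real HTTP call.
def fake_fetch(url: str, timeout: int, max_bytes: int) -> Optional[str]:
    return f"text of {url}"


adapter = RSSAdapter(extract_content=True, fetcher=fake_fetch)
article = adapter._maybe_extract(Article("Example", "https://example.com/a"))
```

In the CLI, the equivalent of `extract_content=True` here is the value read from `config.rss.extract_content`.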

4. config.example.toml — Add extract_content comment under [sources.rss]

  • Add # extract_content = false line.

Review Findings

  • Errors: 0
  • Warnings: 3
  • Nits: 4
  • Assessment: NEEDS CHANGES

The most actionable fix is #1 (log the exception). Items #2 and #3 are worth addressing before this ships to production — #2 for operational sanity and #3 to avoid leaking script content on malformed pages.

Add `extract_content` option to `RSSAdapter` that fetches and strips HTML
from linked article pages after parsing the feed. Introduces `_HTMLToTextParser`
(stdlib html.parser) to extract visible text while skipping script/style/nav/
header/footer elements, and `_fetch_content` with charset detection, bounded
reads, and full error isolation. Adds `RSSSourceConfig` dataclass mirroring
the pattern of other source configs, wires it through config loading and CLI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
foundatron merged commit 4417f7b into main on Mar 11, 2026
1 check passed
foundatron deleted the issue-5 branch on March 11, 2026 02:14
