feat: Internet Archive Collector #589

Merged
maxachis merged 6 commits into dev from 379-internet-archive-crawler
Feb 19, 2026

@labradorite-dev
Collaborator

Summary

Resolves #379 by creating an Internet Archive collector. Tested via unit tests and a manual test script.

I tried my best to follow repo standards but if there's anything I can do better pls let me know!

Implement a new collector that uses the Internet Archive CDX API to
discover archived URLs on domains PDAP already knows about. Users provide
seed URLs, domains are extracted, and the Wayback Machine is searched for
all archived pages with filtering for mime types, URL patterns, and dedup.
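
A minimal sketch of the flow described above (seed URLs, domain extraction, CDX query, filtering, dedup), not the collector merged in this PR. The function names and the `exclude_patterns` parameter are hypothetical; only `mime_type_allowlist` is named in this PR. The query parameters (`url`, `matchType`, `output`, `fl`, `collapse`, `limit`) are the ones the public Wayback CDX server documents, and the sketch assumes the JSON output's first row is a header of field names.

```python
import re
from urllib.parse import urlencode, urlparse

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def extract_domains(seed_urls):
    """Return unique hostnames from seed URLs, in first-seen order.

    Hypothetical helper; URLs without a scheme have no hostname and are skipped.
    """
    domains = []
    for url in seed_urls:
        host = urlparse(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


def build_cdx_url(domain, limit=100):
    """Build a CDX query URL covering a domain and its subdomains."""
    params = {
        "url": domain,
        "matchType": "domain",
        "output": "json",
        "fl": "original,mimetype,timestamp",
        "collapse": "urlkey",  # server-side collapse of repeated captures
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def filter_rows(rows, mime_type_allowlist=("text/html",), exclude_patterns=()):
    """Filter parsed CDX JSON rows by mime type and URL pattern, then dedup.

    `rows` is the decoded JSON response: a header row followed by records.
    """
    if not rows:
        return []
    header, *records = rows
    url_idx = header.index("original")
    mime_idx = header.index("mimetype")
    seen = set()
    kept = []
    for record in records:
        url, mime = record[url_idx], record[mime_idx]
        if mime not in mime_type_allowlist:
            continue
        if any(re.search(pattern, url) for pattern in exclude_patterns):
            continue
        if url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

Keeping the filtering step separate from the HTTP call means it can be exercised with canned rows, which is roughly what a mocked integration test would do.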
…e crawler

Add mocked integration tests (happy path, empty domain, API error) and a
manual lifecycle test hitting the live CDX API. Also fix missing
'internet_archive' value in batch_strategy DB enum and SQLAlchemy model.

The mime_type_allowlist already filters out non-HTML content, making
the static asset file extension patterns unnecessary.

Add missing module, class, and method docstrings (D100-D107) and
type annotations (ANN101, ANN001, ANN201, ANN204) to all Internet
Archive collector files to satisfy flake8 linting requirements.

Update alembic migration down_revision to chain off latest dev head and
fix renamed get_access_info -> get_admin_access_info in IA route.

@maxachis requested review from maxachis and removed request for josh-chamberlain on February 19, 2026, 10:37
Collaborator

@maxachis left a comment

Approved! At long last!

@maxachis merged commit 657fd08 into dev on Feb 19, 2026
5 of 6 checks passed
@maxachis deleted the 379-internet-archive-crawler branch on February 19, 2026, 10:38