feat: Internet Archive Collector#585

Open
labradorite-dev wants to merge 5 commits into Police-Data-Accessibility-Project:dev from labradorite-dev:379-internet-archive-crawler

@labradorite-dev

Summary

Resolves #379 by creating an internet archive collector. Tested via unit tests and manual test script.

I tried my best to follow repo standards, but if there's anything I can do better, please let me know!

Implement a new collector that uses the Internet Archive CDX API to
discover archived URLs on domains PDAP already knows about. Users provide
seed URLs, domains are extracted, and the Wayback Machine is searched for
all archived pages with filtering for mime types, URL patterns, and dedup.
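The flow described above (seed URLs to domains, a CDX query per domain, then MIME-type filtering, URL-pattern filtering, and dedup) could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the `matchType=domain` choice, the default allowlist, and the default limit are all assumptions. The query parameters used (`url`, `matchType`, `output`, `fl`, `collapse`, `limit`) are standard CDX server parameters.

```python
import re
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def extract_domains(seed_urls):
    """Return the unique hostnames of the seed URLs, in first-seen order."""
    domains = []
    for url in seed_urls:
        host = urlsplit(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


def build_cdx_query(domain, limit=1000):
    """Build a CDX query URL listing captures under ``domain`` (assumed parameters)."""
    params = [
        ("url", domain),
        ("matchType", "domain"),      # assumption: include subdomains too
        ("output", "json"),           # JSON array of rows, header row first
        ("fl", "original,mimetype"),  # only the fields used below
        ("collapse", "urlkey"),       # server-side collapse of repeated captures
        ("limit", str(limit)),
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def select_urls(rows, mime_allowlist=("text/html",), exclude_patterns=()):
    """Filter decoded CDX JSON rows by MIME type and URL pattern, then dedup."""
    if not rows:
        return []
    header, *records = rows
    url_idx = header.index("original")
    mime_idx = header.index("mimetype")
    seen, kept = set(), []
    for record in records:
        url, mime = record[url_idx], record[mime_idx]
        if mime not in mime_allowlist:
            continue  # drop anything outside the MIME allowlist
        if any(re.search(pattern, url) for pattern in exclude_patterns):
            continue  # drop URLs matching an excluded pattern
        if url in seen:
            continue  # client-side dedup as a second line of defence
        seen.add(url)
        kept.append(url)
    return kept
```

Keeping the query builder and the row filter as pure functions means both can be unit-tested without touching the live API.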
…e crawler

Add mocked integration tests (happy path, empty domain, API error) and a
manual lifecycle test hitting the live CDX API. Also fix missing
'internet_archive' value in batch_strategy DB enum and SQLAlchemy model.
The mime_type_allowlist already filters out non-HTML content, making
the static asset file extension patterns unnecessary.
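For readers unfamiliar with the three mocked scenarios named above, a rough sketch of what such tests might look like is shown here. It assumes a hypothetical `fetch_archived_urls` helper that receives its HTTP getter as an argument, and a hypothetical `CDXError` exception; neither name comes from this PR, and "empty domain" is read here as a domain with no captures.

```python
import json
from unittest.mock import Mock


class CDXError(RuntimeError):
    """Raised when the CDX request fails (hypothetical name)."""


def fetch_archived_urls(domain, http_get):
    """Return the original URLs for ``domain`` via an injected HTTP getter.

    ``http_get`` takes a URL and returns a ``(status_code, body_text)`` tuple,
    which makes it easy to replace with a mock in tests.
    """
    url = (
        "https://web.archive.org/cdx/search/cdx"
        f"?url={domain}&matchType=domain&output=json&fl=original"
    )
    status, body = http_get(url)
    if status != 200:
        raise CDXError(f"CDX request failed with HTTP {status}")
    rows = json.loads(body) if body.strip() else []
    return [row[0] for row in rows[1:]]  # first row is the header


def test_happy_path():
    body = json.dumps([["original"], ["http://example.org/"]])
    http_get = Mock(return_value=(200, body))
    assert fetch_archived_urls("example.org", http_get) == ["http://example.org/"]
    http_get.assert_called_once()


def test_empty_domain():
    # A domain with no captures: the API answers with an empty JSON array.
    http_get = Mock(return_value=(200, "[]"))
    assert fetch_archived_urls("example.org", http_get) == []


def test_api_error():
    http_get = Mock(return_value=(503, ""))
    try:
        fetch_archived_urls("example.org", http_get)
    except CDXError:
        pass
    else:
        raise AssertionError("expected CDXError")
```

Injecting the getter avoids patching a module-level HTTP client; a real collector may well be structured differently.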
Collaborator

@maxachis maxachis left a comment

This is looking good! I'd just ask for three small things:

  1. Make sure it passes flake8 linting.
  2. Have it point to dev, rather than main, for the merge.
  3. When you run the manual test, just provide around 10 of the URLs it gathers.

Do all that, and this will be good to merge!



def upgrade() -> None:
    switch_enum_type(
Collaborator

Good call on this! It also reminded me that having an enum as a column is an antipattern, which I've noted for addressing here.

@labradorite-dev labradorite-dev changed the base branch from main to dev February 16, 2026 20:13
@labradorite-dev
Author

labradorite-dev commented Feb 16, 2026

Manual Test Results — Sample URLs

Ran the Internet Archive crawler against a couple of URLs on www.cityofchicago.org.

Domain: www.cityofchicago.org

| # | URL |
|---|-----|
| 1 | http://www.cityofchicago.org:80/ |
| 2 | http://www.cityofchicago.org:80/AboutTown.html |
| 3 | http://www.cityofchicago.org:80/AdminHearings/ |
| 4 | http://www.cityofchicago.org:80/AdminHearings/About.html |
| 5 | http://www.cityofchicago.org:80/AdminHearings/Contacts.html |
| 6 | http://www.cityofchicago.org:80/AdminHearings/Division.html |
| 7 | http://www.cityofchicago.org:80/AdminHearings/FAQ.html |
| 8 | http://www.cityofchicago.org:80/AdminHearings/Hearing.html |
| 9 | http://cityofchicago.org:80/AdminHearings/AdminLawOfficers.html |
| 10 | http://www.cityofchicago.org:80/AdminHearings/AHPubServIntern.html |

Add missing module, class, and method docstrings (D100-D107) and
type annotations (ANN101, ANN001, ANN201, ANN204) to all Internet
Archive collector files to satisfy flake8 linting requirements.
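As an illustration of what those rule codes ask for, here is a small hypothetical class (not taken from this PR) written so that the docstring checks (D1xx, from flake8-docstrings) and the annotation checks (ANN codes, from flake8-annotations) mentioned above would be satisfied. The class name and its fields are assumptions made for the example.

```python
"""Hypothetical module showing docstring and annotation conventions."""  # D100


class ArchiveQuery:
    """Hold the parameters of one archive lookup."""  # D101

    def __init__(self: "ArchiveQuery", domain: str, limit: int = 100) -> None:
        """Store the domain and the result limit."""  # D107, ANN001, ANN204
        self.domain = domain
        self.limit = limit

    def describe(self: "ArchiveQuery") -> str:
        """Return a one-line summary of the query."""  # D102, ANN201
        return f"{self.domain} (limit {self.limit})"
```

Annotating `self` is only needed where ANN101 is enabled; many projects choose to ignore that code in their flake8 configuration instead.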
@labradorite-dev
Copy link
Author

May I also suggest a hook using pre-commit to shift the lint check even further left?

Development

Successfully merging this pull request may close these issues.

New Collector: Internet Archive Crawler