feat: Internet Archive Collector by labradorite-dev · Pull Request #585 · Police-Data-Accessibility-Project/data-source-manager

labradorite-dev · 2026-02-16T04:17:05Z

Summary

Resolves #379 by creating an Internet Archive collector. Tested via unit tests and a manual test script.

I tried my best to follow repo standards, but if there's anything I can do better please let me know!

Implement a new collector that uses the Internet Archive CDX API to discover archived URLs on domains PDAP already knows about. Users provide seed URLs, domains are extracted from them, and the Wayback Machine is searched for all archived pages on those domains. Results are filtered by MIME type and URL pattern, then deduplicated.
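For readers unfamiliar with the CDX API, the flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the default allowlist, and the choice of `matchType=host` are assumptions. The endpoint and the `url`, `matchType`, `output`, `fl`, `collapse`, and `limit` query parameters are part of the public Wayback CDX server API.

```python
import re
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def extract_domains(seed_urls):
    """Return the unique hostnames of the seed URLs, in first-seen order."""
    domains = []
    for url in seed_urls:
        host = urlsplit(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


def build_cdx_query(domain, limit=1000):
    """Build a CDX query URL for one domain (hypothetical helper)."""
    params = {
        "url": domain,
        "matchType": "host",         # assumption: match every path on this host
        "output": "json",            # JSON rows; the first row is a header
        "fl": "original,mimetype",   # only the fields filtered on below
        "collapse": "urlkey",        # server-side collapse of repeat captures
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def filter_rows(rows, mime_allowlist=("text/html",), exclude_patterns=()):
    """Keep allowlisted, non-excluded URLs and drop exact duplicates.

    `rows` are (original_url, mimetype) pairs with the JSON header row
    already removed by the caller.
    """
    excludes = [re.compile(pattern) for pattern in exclude_patterns]
    seen = set()
    kept = []
    for original, mimetype in rows:
        if mimetype not in mime_allowlist:
            continue
        if any(pattern.search(original) for pattern in excludes):
            continue
        if original in seen:
            continue
        seen.add(original)
        kept.append(original)
    return kept
```

Fetching the query URL is left out so the sketch stays free of network calls; any HTTP client could be used to retrieve the JSON and pass the rows to `filter_rows`.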

…e crawler

Add mocked integration tests (happy path, empty domain, API error) and a manual lifecycle test hitting the live CDX API. Also fix missing 'internet_archive' value in batch_strategy DB enum and SQLAlchemy model.

…rchive collector

The mime_type_allowlist already filters out non-HTML content, making the static asset file extension patterns unnecessary.

maxachis

This is looking good! I have three small last requests:

Make sure it passes flake8 linting.
Have it point to dev, rather than main, for the merge.
When you run the manual test, just provide around 10 of the URLs it gathers.

Do all that, and this will be good to merge!

maxachis · 2026-02-16T11:09:01Z

alembic/versions/2026_02_15_1257-1fb2286a016c_add_internet_archive_to_batch_strategy_.py

+
+
+def upgrade() -> None:
+    switch_enum_type(


Good call on this! It also reminded me that having an enum as a column is an antipattern, which I've noted for addressing here.

labradorite-dev · 2026-02-16T20:15:02Z

Manual Test Results — Sample URLs

Ran the Internet Archive crawler against a couple of URLs on www.cityofchicago.org.

Domain: `www.cityofchicago.org`

#	URL
1	`http://www.cityofchicago.org:80/`
2	`http://www.cityofchicago.org:80/AboutTown.html`
3	`http://www.cityofchicago.org:80/AdminHearings/`
4	`http://www.cityofchicago.org:80/AdminHearings/About.html`
5	`http://www.cityofchicago.org:80/AdminHearings/Contacts.html`
6	`http://www.cityofchicago.org:80/AdminHearings/Division.html`
7	`http://www.cityofchicago.org:80/AdminHearings/FAQ.html`
8	`http://www.cityofchicago.org:80/AdminHearings/Hearing.html`
9	`http://cityofchicago.org:80/AdminHearings/AdminLawOfficers.html`
10	`http://www.cityofchicago.org:80/AdminHearings/AHPubServIntern.html`
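The sample above keeps the explicit `:80` port that the Wayback Machine records. If those URLs need to be compared against URLs stored without a default port, one option is a normalization step like the sketch below. This is an assumption about a possible follow-up and is not part of this PR; `normalize_url` and `DEFAULT_PORTS` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Lowercase the host and drop a default port, leaving the rest intact."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(parts.scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Keep scheme, path, and query; discard any fragment.
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))
```

The sketch deliberately leaves the `www.` prefix alone, since `www.cityofchicago.org` and `cityofchicago.org` are distinct hosts that may not serve the same content.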

Add missing module, class, and method docstrings (D100-D107) and type annotations (ANN101, ANN001, ANN201, ANN204) to all Internet Archive collector files to satisfy flake8 linting requirements.
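As a rough illustration of what those checks ask for, here is a small class that would satisfy the listed codes. The class and its fields are hypothetical and do not come from this PR; the code meanings in the docstrings follow the flake8-docstrings and flake8-annotations plugins.

```python
"""Hypothetical module showing the docstring and annotation checks (D100)."""


class ArchivedPage:
    """An archived URL and the mime type reported for it (D101)."""

    def __init__(self: "ArchivedPage", url: str, mime_type: str) -> None:
        """Store the URL and mime type (D107, ANN001, ANN101, ANN204)."""
        self.url = url
        self.mime_type = mime_type

    def is_html(self: "ArchivedPage") -> bool:
        """Return True when the page was archived as HTML (D102, ANN201)."""
        return self.mime_type == "text/html"
```

Annotating `self` is only needed because ANN101 is enabled; many projects choose to ignore that code instead.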

labradorite-dev · 2026-02-16T20:38:59Z

May I also suggest a hook using pre-commit to shift the lint check even further left?

labradorite-dev added 4 commits February 15, 2026 12:06

fix(collector): use structured logging instead of print in Internet A…

18c4f21

…rchive collector

refactor(collector): remove redundant static asset exclude patterns

6fcc6fd


labradorite-dev requested a review from josh-chamberlain as a code owner February 16, 2026 04:17

maxachis requested changes Feb 16, 2026


labradorite-dev changed the base branch from main to dev February 16, 2026 20:13

style(collector): add docstrings and type annotations to pass flake8

71aa2e1


labradorite-dev wants to merge 5 commits into Police-Data-Accessibility-Project:dev from labradorite-dev:379-internet-archive-crawler
