Scrape TACC Events — tacc.org/tacc/events #192

@linear

Description

Build a scraper for the Texas Advanced Computing Center (TACC) events page at tacc.org/tacc/events. TACC is lower volume than the other sources but a valuable source of research-oriented events, workshops, and training sessions.

Stack

  • Language: TypeScript
  • Runtime: Cloudflare Workers (scheduled cron trigger)
  • Storage: Cloudflare D1
  • Client surface: React Native app

Hi-Fi UI Requirements (what the design demands from the scraper)

Events from TACC render in the same flyer-first card system as all other sources. Many TACC events are multi-day workshops and often have no flyer, so the no-flyer fallback variant matters here.

  • Flyer image is the hero when available. Cards support vertical, square, horizontal, and no-flyer variants. We need:
    • image_url at highest resolution available (prefer original upload over thumbnail).
    • image_width, image_height, and image_aspect_ratio (vertical / square / horizontal / none).
    • image_mime_type and image_alt_text when present.
    • Expect a meaningful share of TACC events to resolve to image_aspect_ratio = "none" — the no-flyer card must still render cleanly with the data below.
  • "Posted by [Org]" is shown. host_organization defaults to "TACC" but should capture the sub-program (e.g., "Frontera", "Stampede3", "TACC Institute") when listed. Populate host_organization_slug for routing.
  • Card shows date + short time. Store start_datetime / end_datetime as ISO 8601 timestamps in the America/Chicago time zone. Multi-day workshops must preserve the full start and end; the client will render a range (e.g., Mon 3/2 – Fri 3/6).
  • Card shows a short location string (e.g., ACES 2.302 or Virtual). Store location_short (≤ 40 chars) and location_full.
  • Interest grouping. TACC events map to research-oriented interests (training, symposium, workshop). Capture categories accurately so they flow into the interests system.
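
The derived card fields above (image_aspect_ratio, host_organization_slug, location_short) could be computed with small pure helpers. The following is a minimal sketch, not the project's actual implementation: the function names, the 10% tolerance for "square", and the ellipsis truncation are all assumptions made for illustration.

```typescript
type AspectRatio = "vertical" | "square" | "horizontal" | "none";

// Assumption: an image within 10% of 1:1 is treated as "square".
const SQUARE_TOLERANCE = 0.1;

// Classify a flyer by its pixel dimensions; missing or invalid
// dimensions resolve to "none" (the no-flyer card variant).
function classifyAspectRatio(
  width?: number | null,
  height?: number | null,
): AspectRatio {
  if (!width || !height || width <= 0 || height <= 0) return "none";
  const ratio = width / height;
  if (Math.abs(ratio - 1) <= SQUARE_TOLERANCE) return "square";
  return ratio > 1 ? "horizontal" : "vertical";
}

// Hypothetical slug rule: lowercase, non-alphanumeric runs become "-".
function toSlug(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

const LOCATION_SHORT_MAX = 40;

// Collapse whitespace and cap the string at 40 characters,
// ending with an ellipsis when it had to be cut (assumed behavior).
function toLocationShort(full: string): string {
  const collapsed = full.replace(/\s+/g, " ").trim();
  if (collapsed.length <= LOCATION_SHORT_MAX) return collapsed;
  return collapsed.slice(0, LOCATION_SHORT_MAX - 1).trimEnd() + "…";
}

// Usage
classifyAspectRatio(1080, 1920); // "vertical"
classifyAspectRatio(undefined, undefined); // "none"
toSlug("TACC Institute"); // "tacc-institute"
toLocationShort("ACES 2.302"); // "ACES 2.302"
```

Keeping these helpers free of network and database access makes them easy to cover with the saved fixtures.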

Scope

  • Crawl the TACC events listing and paginate through all upcoming events.
  • Extract the following fields per event:
    • title
    • description
    • start_datetime / end_datetime (ISO 8601, America/Chicago — handle multi-day workshops)
    • location_short, location_full
    • host_organization (default: "TACC"; capture sub-program if listed), host_organization_slug
    • event_url
    • image_url, image_width, image_height, image_aspect_ratio, image_mime_type, image_alt_text
    • categories (e.g., training, symposium, workshop)
    • registration_url (if separate from event_url)
    • source = "tacc"
  • Map into the shared D1 events schema.
  • Deduplicate via a stable source_event_id.
  • Upsert into D1.
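
One way the dedup and upsert steps might fit together is sketched below. This is an illustration under stated assumptions: the shared D1 schema is not shown in this issue, so the `events` table name, the column subset, and a unique index on (source, source_event_id) are all hypothetical, as are the helper names. Deriving the ID from the event URL path (lowercased, without query string or trailing slash) is likewise an assumption about what is stable on the site. The `D1Like` interfaces are local stand-ins that mirror only the `prepare` / `bind` / `run` calls of the D1 binding.

```typescript
// Local structural stand-ins for the D1 calls used here, so the sketch
// type-checks without @cloudflare/workers-types.
interface D1StatementLike {
  bind(...values: unknown[]): D1StatementLike;
  run(): Promise<unknown>;
}
interface D1Like {
  prepare(sql: string): D1StatementLike;
}

// Hypothetical subset of the shared events schema.
interface TaccEventRow {
  source_event_id: string;
  title: string;
  start_datetime: string;
  end_datetime: string | null;
  event_url: string;
}

// Assumption: the URL path identifies an event across runs; query
// strings, fragments, trailing slashes, and case are ignored.
function sourceEventId(eventUrl: string): string {
  const path = new URL(eventUrl).pathname.replace(/\/+$/, "").toLowerCase();
  return `tacc:${path}`;
}

// Assumes a unique index on (source, source_event_id).
const UPSERT_SQL = `
INSERT INTO events
  (source, source_event_id, title, start_datetime, end_datetime, event_url)
VALUES ('tacc', ?1, ?2, ?3, ?4, ?5)
ON CONFLICT (source, source_event_id) DO UPDATE SET
  title = excluded.title,
  start_datetime = excluded.start_datetime,
  end_datetime = excluded.end_datetime,
  event_url = excluded.event_url`;

// Insert a new row or update the existing one for the same source_event_id.
async function upsertEvent(db: D1Like, row: TaccEventRow): Promise<void> {
  await db
    .prepare(UPSERT_SQL)
    .bind(
      row.source_event_id,
      row.title,
      row.start_datetime,
      row.end_datetime,
      row.event_url,
    )
    .run();
}
```

Because repeat runs hit the ON CONFLICT branch instead of inserting, the same listing scraped twice should leave a single row.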

Deliverables

  • scrapers/tacc.ts worker module exporting a run() entrypoint.
  • Unit tests with saved fixtures covering: single-day event, multi-day workshop, virtual event, event with no flyer (exercise no-flyer card variant).
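  • Dry-run script that runs the scraper and prints the parsed events without writing to D1.
  • Per-event error isolation: a failure while parsing one event must not abort the rest of the run.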
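
Per-event error isolation can be as simple as wrapping each event's parse step in its own try/catch and collecting failures for logging. The sketch below assumes a hypothetical `parseAll` helper and result shape; neither is prescribed by this issue.

```typescript
interface ParseOutcome<T> {
  parsed: T[];
  failures: { index: number; message: string }[];
}

// Run parseOne on every raw event. An exception from one event is
// recorded and the loop continues with the next event.
function parseAll<Raw, T>(
  rawEvents: Raw[],
  parseOne: (raw: Raw) => T,
): ParseOutcome<T> {
  const outcome: ParseOutcome<T> = { parsed: [], failures: [] };
  rawEvents.forEach((raw, index) => {
    try {
      outcome.parsed.push(parseOne(raw));
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcome.failures.push({ index, message });
    }
  });
  return outcome;
}

// Usage: the empty title fails, the other two events still come through.
const outcome = parseAll(["Intro to HPC", "", "GPU Workshop"], (title) => {
  if (!title) throw new Error("missing title");
  return { title };
});
// outcome.parsed.length === 2
// outcome.failures → [{ index: 1, message: "missing title" }]
```

The failures list also gives a dry run something concrete to print, and the ratio of parsed to total events is a cheap signal for the capture-rate criterion.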

Acceptance Criteria

  • ≥ 95% of listed TACC events are captured per run.
  • Multi-day events are stored with correct start/end datetimes and render as a date range on the client.
  • image_aspect_ratio correctly classified (including "none") for ≥ 95% of events.
  • location_short renders in ≤ 40 chars.
  • No duplicates across repeat runs.
  • D1 row count for source = "tacc" matches site listings (±5%).
  • CI passes lint, typecheck, and tests.
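
For the multi-day criterion, the scraper needs to turn a wall-clock date and time from the listing into an ISO 8601 string carrying the correct America/Chicago offset (CST or CDT). The sketch below does this with the built-in Intl.DateTimeFormat and no date library. The `chicagoIso` name and its numeric arguments are assumptions; parsing the site's date text into those numbers is out of scope here, and wall-clock times that fall inside a DST transition get no special handling.

```typescript
const TIME_ZONE = "America/Chicago";

// Offset, in minutes east of UTC, that the zone observes at a UTC instant.
function offsetMinutesAt(utcMs: number, timeZone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hourCycle: "h23",
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
    second: "2-digit",
  }).formatToParts(new Date(utcMs));
  const get = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value ?? NaN);
  // Re-read the zone's wall clock as if it were UTC; the difference
  // from the real instant is the zone offset.
  const asUtc = Date.UTC(
    get("year"), get("month") - 1, get("day"),
    get("hour"), get("minute"), get("second"),
  );
  return Math.round((asUtc - utcMs) / 60000);
}

// Format a Chicago wall-clock time as ISO 8601 with its UTC offset.
function chicagoIso(
  year: number, month: number, day: number, hour = 0, minute = 0,
): string {
  const wallAsUtc = Date.UTC(year, month - 1, day, hour, minute);
  // First guess the offset, then re-evaluate it at the corrected instant.
  const guess = offsetMinutesAt(wallAsUtc, TIME_ZONE);
  const offset = offsetMinutesAt(wallAsUtc - guess * 60000, TIME_ZONE);
  const pad = (n: number): string => String(n).padStart(2, "0");
  const sign = offset < 0 ? "-" : "+";
  const abs = Math.abs(offset);
  return (
    `${year}-${pad(month)}-${pad(day)}T${pad(hour)}:${pad(minute)}:00` +
    `${sign}${pad(Math.floor(abs / 60))}:${pad(abs % 60)}`
  );
}

// Usage: a Mon 3/2 – Fri 3/6 workshop in 2026 (before DST starts on 3/8).
const start = chicagoIso(2026, 3, 2, 9, 0); // "2026-03-02T09:00:00-06:00"
const end = chicagoIso(2026, 3, 6, 17, 0); // "2026-03-06T17:00:00-06:00"
```

Storing both endpoints with explicit offsets lets the client render the range without knowing the source's time zone.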

Out of Scope

  • UI work in the React Native app.
  • Integration with TACC account / allocation data.
  • Cross-source deduplication.
