Heavy Aggregator is a configurable scraping tool designed to collect comprehensive data on Scottish Heavy Athletics, making it well suited for archiving and analysis.
The program currently supports automated (and automatable) scanning and scraping of:

- nasgaweb.com (NASGA)
- Heavy Athlete
- Scottish Scores
It is designed so that additional scrapers can be added easily and inherit the shared configuration.
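As a rough illustration of that extension point, a new scraper might subclass a shared base that carries the common settings. The names below (`BaseScraper`, `ScraperConfig`, `scrape`) are hypothetical, not the repository's actual classes:

```python
from dataclasses import dataclass

# Hypothetical sketch of the extension pattern; names are illustrative,
# not the actual classes in this repository.
@dataclass
class ScraperConfig:
    proxy: str | None = None
    user_agent: str = "MyScraper/1.0"
    retry_count: int = 3
    concurrency: int = 5
    throttle_ms: int = 0

class BaseScraper:
    def __init__(self, config: ScraperConfig):
        self.config = config  # shared settings inherited by every scraper

    async def scrape(self) -> None:
        raise NotImplementedError

class MyNewSiteScraper(BaseScraper):
    async def scrape(self) -> None:
        # site-specific fetching and parsing goes here
        ...
```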
- Comprehensive Data Backup:
  - Games: Iterates through all available years to scrape full results for every Game event.
  - Athletes: Builds a master list of athletes and scrapes their detailed event history.
- Year-Based Output: Game results are organized into per-year directories (e.g., `output/nasga/2024/nasga_games.json`).
- Data Quality:
  - Hierarchical Schema: `Class -> Athlete -> Events` structure.
  - Strict Types: Integers for Places, Floats for Points/Distances (`20' 4"` -> `20.333`); see the parsing sketch after this list.
  - Cleaned Data: Handles nulls (`NT`, `DNS`) and removes scraping artifacts.
- Streaming Output: Writes data to disk in real-time to prevent data loss.
- Highly Configurable: Customize behavior via `settings.txt` or CLI arguments.
- Resilient: Built-in retry logic, error handling, and ModSecurity evasion.
- Search Index Generator: `build_search_index.py` produces a static search index for the companion website.
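To make the distance conversion concrete, here is a minimal sketch of parsing a feet-and-inches string into decimal feet. It illustrates the `20' 4"` -> `20.333` rule, not the project's actual parser:

```python
import re

def feet_inches_to_float(value: str) -> float | None:
    """Convert a distance like 20' 4" into decimal feet (20.333)."""
    match = re.match(r"""^\s*(\d+)'\s*([\d.]+)?"?\s*$""", value)
    if not match:
        return None  # markers like NT/DNS fall through to null
    feet = int(match.group(1))
    inches = float(match.group(2)) if match.group(2) else 0.0
    return round(feet + inches / 12, 3)

assert feet_inches_to_float("20' 4\"") == 20.333
```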
The scraper uses `asyncio` and `aiohttp` to fetch data in parallel, drastically reducing scrape time.
- Concurrency: Controls how many simultaneous requests are made. Default is 5.
- Throttle: Adds a delay (in ms) per worker.
Adjust concurrency via CLI (`--concurrency 10`) or `settings.txt`.
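The following is a minimal sketch of how concurrency and throttling can interact in an `asyncio`/`aiohttp` fetch loop; it illustrates the semaphore-plus-delay pattern rather than this project's exact internals:

```python
import asyncio
import aiohttp

CONCURRENCY = 5      # simultaneous requests (--concurrency)
THROTTLE_MS = 1000   # per-request delay (--throttle)

async def fetch_all(urls: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def fetch(session: aiohttp.ClientSession, url: str) -> str:
        async with semaphore:                        # cap parallel requests
            await asyncio.sleep(THROTTLE_MS / 1000)  # be polite
            async with session.get(url) as resp:
                return await resp.text()

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))
```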
The scraper automatically saves its progress to `checkpoint.json`. If you stop the script (Ctrl+C) or it crashes, run it again to resume exactly where you left off (Year/Month/Game).

- To reset progress, simply delete `checkpoint.json`.
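For reference, the checkpoint is just a small JSON file recording the current position. The keys below are illustrative guesses based on the Year/Month/Game resume granularity, not a documented schema:

```json
{
  "site": "nasga",
  "year": 2024,
  "month": 6,
  "game_id": "8588"
}
```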
- Clone the repository:

  ```bash
  git clone https://github.com/x029a/heavy-aggregator.git
  cd heavy-aggregator
  ```

- Create a virtual environment (recommended):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Simply run the script to start the interactive wizard:

```bash
python main.py
```

You'll be prompted to select a site (NASGA, Heavy Athlete, Scottish Scores, or All) and configure options like proxy, throttle, concurrency, and output format.
Run with arguments for automated or headless execution:
```bash
# Scrape a specific site
python main.py --site nasga --concurrency 10

# Scrape all sites
python main.py --site all --output-format json

# Scrape with throttling (be polite)
python main.py --site heavyathlete --throttle 1000 --concurrency 5
```

Available Arguments:

- `--site`: Target site (`nasga`, `heavyathlete`, `scottishscores`, or `all`).
- `--proxy`: HTTP/HTTPS proxy URL (e.g., `http://user:pass@host:port`).
- `--user-agent`: Custom User-Agent string.
- `--retry-count`: Number of retries for failed requests (default: 3).
- `--concurrency`: Number of parallel requests (default: 5). Increase for speed, decrease for stability.
- `--throttle`: Delay in milliseconds between requests (default: 0).
- `--output-format`: Output format (`json` or `csv`). Note: CSV support is experimental.
- `--upload`: Upload provider (`s3` or `webhook`).
- `--s3-bucket`: AWS S3 Bucket Name.
- `--s3-region`: AWS Region (e.g., `us-east-1`).
- `--webhook-url`: URL to POST output files to.
Run without installing Python dependencies:
```bash
# Build
docker-compose build

# Run interactively
docker-compose run scraper

# Run with arguments
docker-compose run scraper --site nasga --concurrency 10
```

Output files are saved to the local `output/` directory via volume mount. Edit `settings.txt` locally and it will be reflected in the container.
You can also configure the tool using `settings.txt`. This file allows you to set defaults so you don't have to pass arguments every time.

Example `settings.txt`:

```
proxy=http://127.0.0.1:8080
user_agent=MyScraper/1.0
retry_count=5
throttle=2000
concurrency=5
# --- Remote Upload ---
# upload_provider=S3
# s3_bucket=my-archive
# s3_region=us-east-1
# webhook_url=https://api.myapp.com/upload
```
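As the example suggests, the format is simple `key=value` pairs with `#` comments. Here is a minimal sketch of reading such a file; it is illustrative only, not the project's actual loader:

```python
def load_settings(path: str = "settings.txt") -> dict[str, str]:
    """Parse key=value lines, skipping blanks and # comments."""
    settings: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # commented-out options stay disabled
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings
```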
Scraped data is organized into year-based directories under `output/`:

```
output/
├── nasga/
│ ├── 2024/
│ │ └── nasga_games.json
│ ├── 2025/
│ │ └── nasga_games.json
│ ├── nasga_athletes.json
│ └── nasga_failed_retrievals.json
├── heavyathlete/
│ ├── 2024/
│ │ └── heavyathlete_games.json
│ ├── heavyathlete_athletes.json
│ └── heavyathlete_failed_retrievals.json
└── scottishscores/
├── 2024/
│ └── scottishscores_games.json
├── scottishscores_athletes.json
    └── scottishscores_failed_retrievals.json
```
Each game file contains an array of game objects with results organized by class:

```json
{
  "id": "8588",
  "name": "Highland Games 2024",
  "year": "2024",
  "date": "06/15/2024",
  "results": {
    "Amateur": [
      {
        "Athlete": { "firstName": "John", "lastName": "Smith" },
        "Place": 1,
        "GamesPoints": 12.0,
        "Braemar Stone": 32.5,
        "Open Stone": 42.167,
        "Heavy WFD": 35.0
      }
    ]
  }
}
```
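As a quick consumption example, here is a hedged sketch of loading a per-year games file and finding the best `Open Stone` throw; it assumes the file holds a JSON array of game objects shaped like the one above:

```python
import json

# Assumes the per-year games file holds a JSON array of game objects
# shaped like the example above.
with open("output/nasga/2024/nasga_games.json") as f:
    games = json.load(f)

best = 0.0
for game in games:
    for entries in game["results"].values():  # one result list per class
        for entry in entries:
            throw = entry.get("Open Stone")
            if isinstance(throw, (int, float)) and throw > best:
                best = throw

print(f"Best Open Stone throw: {best}")
```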
To generate a static search index for the companion website:

```bash
python build_search_index.py [output_directory]
```

This produces:

- `athletes.json` — lightweight name index (~1.8 MB for ~25,000 athletes)
- `athletes/<id>.json` — individual detail files with full game + event data
Use `cron_scrape.sh` for automated nightly scraping with backup rotation:

```bash
# Add to crontab
crontab -e
0 2 * * * /path/to/cron_scrape.sh >> /path/to/cron.log 2>&1
```

The script:

- Backs up existing `output/` as a timestamped `.tar.gz`
- Cleans old backups (keeps last 7)
- Runs all scrapers
- Rebuilds the search index
This tool is for educational and archival purposes. Please respect the terms of service of the websites you scrape. To cut down on the likelihood of being blocked, I suggest using a proxy; and since we want to be good stewards of the sites' bandwidth, use the `--throttle` option to avoid overwhelming their servers.

Additionally, the data from NASGA is not perfect (or, frankly, even good). There may be issues with the schema; treat this output as a starting point rather than a clean dataset.
NASGA Web has inherent issues both with its data and with the queries used to access that data. For that reason, 500 errors of the following form (or errors that are otherwise unrecoverable) cause the affected entries to be skipped; they will not be retried:

```
Microsoft JET Database Engine error '80040e07'
Syntax error in date in query expression '(Games.Gamesstart >= ## AND Games.Gamesstart <= ##) AND Athletes.Firstname='XXXX' AND Athletes.Lastname='XXXX''.
/dbase/resultsathlete3.asp, line 74
```
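One hedged sketch of how a response body could be classified as unrecoverable; the marker strings come from the error above, and the function name is illustrative rather than the scraper's actual code:

```python
UNRECOVERABLE_MARKERS = (
    "Microsoft JET Database Engine error",
    "Syntax error in date in query expression",
)

def is_unrecoverable(status: int, body: str) -> bool:
    """Skip (don't retry) 500s whose body matches a known JET engine error."""
    return status == 500 and any(m in body for m in UNRECOVERABLE_MARKERS)
```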
If a game or athlete fails to download (due to 500 errors, timeouts, or parsing issues), the scraper will skip it and log the failure to a JSON file in the output directory:
- `nasga_failed_retrievals.json`
- `heavyathlete_failed_retrievals.json`
- `scottishscores_failed_retrievals.json`
Check these files to see which items were missed and why.
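A small sketch for reviewing the failures; the exact record fields inside these files aren't documented here, so the snippet just pretty-prints whatever entries it finds, assuming each file holds a JSON array:

```python
import json
from pathlib import Path

# Pretty-print every failure log found under output/, assuming each
# file holds a JSON array of failure records.
for path in Path("output").rglob("*_failed_retrievals.json"):
    entries = json.loads(path.read_text())
    print(f"{path}: {len(entries)} failed item(s)")
    for entry in entries:
        print(json.dumps(entry, indent=2))
```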
If you have issues running this, try doing so in a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

If you are on macOS, you may see a `NotOpenSSLWarning`. This is normal and can be ignored.