A Go-based web scraping tool that crawls web pages, extracts content and images, normalizes URLs, and generates CSV reports.
- Crawl multiple pages concurrently (see the sketch after this list).
- Extract text and image URLs from HTML.
- Normalize and deduplicate URLs.
- Export extracted data to a CSV file.
- Unit tests included for core functions.
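The sketch below illustrates the concurrency pattern referenced in the first feature: a bounded pool of in-flight requests, sized by the `max_concurrency` argument described under the usage section. It is only a sketch under assumptions; `crawling.go` may structure its workers differently, and the `crawlAll` name and link-queueing details here are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// crawlAll fetches every URL in urls with at most maxConcurrency
// requests in flight at once, using a buffered channel as a semaphore.
// In the real crawler, newly discovered links would be fed back into
// the queue until the page limit is reached.
func crawlAll(urls []string, maxConcurrency int) {
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(pageURL string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			resp, err := http.Get(pageURL)
			if err != nil {
				fmt.Println("fetch failed:", pageURL, err)
				return
			}
			resp.Body.Close()
			fmt.Println("fetched", pageURL, resp.Status)
		}(u)
	}
	wg.Wait()
}

func main() {
	crawlAll([]string{"https://example.com"}, 5)
}
```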
```
.
├── crawling.go         # Core crawler logic
├── extract_content.go  # HTML content extraction
├── normalize_url.go    # URL normalization utilities
├── csv_report.go       # CSV export functionality
├── main.go             # Entry point
├── go.mod              # Go module configuration
└── *_test.go           # Unit tests
```
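`extract_content.go` is listed above as the HTML content extraction step. As an illustration only, the sketch below pulls text and image URLs out of a parsed page with the `golang.org/x/net/html` parser; the actual code may use a different parser and different function names.

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// extractImagesAndText walks the parsed HTML tree, collecting the src
// attribute of every <img> tag and every non-empty text node.
func extractImagesAndText(doc *html.Node) (images, text []string) {
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "img" {
			for _, attr := range n.Attr {
				if attr.Key == "src" {
					images = append(images, attr.Val)
				}
			}
		}
		if n.Type == html.TextNode {
			if t := strings.TrimSpace(n.Data); t != "" {
				text = append(text, t)
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return images, text
}

func main() {
	page := `<html><body><p>Hello</p><img src="/logo.png"></body></html>`
	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	images, text := extractImagesAndText(doc)
	fmt.Println(images, text)
}
```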
```
git clone https://github.com/SSSM0602/webscraper.git
cd webscraper
go mod tidy
```

Run the crawler with:

```
go run main.go <base_url> <max_concurrency> <page_limit>
```

- base_url – Starting URL for crawling
- max_concurrency – Number of concurrent workers
- page_limit – Maximum number of pages to crawl
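`main.go` consumes these three positional arguments before starting the crawl. The sketch below shows one way that argument handling could look using only `os.Args` and `strconv`; the real entry point may name and validate things differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	if len(os.Args) != 4 {
		fmt.Fprintln(os.Stderr, "usage: main <base_url> <max_concurrency> <page_limit>")
		os.Exit(1)
	}

	baseURL := os.Args[1]

	maxConcurrency, err := strconv.Atoi(os.Args[2])
	if err != nil || maxConcurrency < 1 {
		fmt.Fprintln(os.Stderr, "max_concurrency must be a positive integer")
		os.Exit(1)
	}

	pageLimit, err := strconv.Atoi(os.Args[3])
	if err != nil || pageLimit < 1 {
		fmt.Fprintln(os.Stderr, "page_limit must be a positive integer")
		os.Exit(1)
	}

	fmt.Printf("crawling %s with %d workers, up to %d pages\n",
		baseURL, maxConcurrency, pageLimit)
	// the crawl itself would start here
}
```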
Example:

```
go run main.go https://example.com 5 100
```

Output: a CSV file containing the extracted URLs, page titles, and image links.
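The report itself is a plain CSV, which can be produced with the standard `encoding/csv` package. The sketch below assumes a three-column layout (url, title, image_url) for illustration; `csv_report.go` may use different columns and a different record type.

```go
package main

import (
	"encoding/csv"
	"os"
)

// pageRecord is a hypothetical shape for one crawled page.
type pageRecord struct {
	URL      string
	Title    string
	ImageURL string
}

// writeReport writes a header row plus one row per page to path.
func writeReport(path string, pages []pageRecord) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	if err := w.Write([]string{"url", "title", "image_url"}); err != nil {
		return err
	}
	for _, p := range pages {
		if err := w.Write([]string{p.URL, p.Title, p.ImageURL}); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

func main() {
	_ = writeReport("report.csv", []pageRecord{
		{URL: "https://example.com", Title: "Example Domain", ImageURL: "/logo.png"},
	})
}
```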
```
go test ./...
```

Requirements:

- Go 1.20+
- Internet connection for crawling
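The `*_test.go` files cover core helpers such as URL normalization. Purely as an illustration of the kind of helper those tests exercise, here is a minimal normalization sketch built on `net/url`; the function in `normalize_url.go` may use a different name, signature, and set of rules.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL reduces a raw URL to a canonical "host/path" form so that
// different spellings of the same page deduplicate to a single key.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	host := strings.ToLower(u.Host)
	path := strings.TrimSuffix(u.Path, "/")
	return host + path, nil
}

func main() {
	for _, raw := range []string{
		"https://example.com/path/",
		"HTTP://EXAMPLE.com/path",
	} {
		n, err := normalizeURL(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(raw, "->", n) // both normalize to example.com/path
	}
}
```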
License: MIT