A Go (“golang”)-based web crawler and scraper that starts from a base URL, concurrently crawls multiple pages (with configurable worker concurrency and page-limit), extracts textual content and image links from HTML, normalises and deduplicates URLs, and outputs the results as a CSV file. Comes with unit tests for core modules.

SSSM0602/webscraper

Webscraper

A Go-based web scraping tool that crawls web pages, extracts content and images, normalizes URLs, and generates CSV reports.

Features

  • Crawl multiple pages concurrently.
  • Extract text and image URLs from HTML.
  • Normalize and deduplicate URLs.
  • Export extracted data to a CSV file.
  • Unit tests included for core functions.
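To make the normalize-and-deduplicate feature concrete, here is a minimal sketch using only the standard library. The names normalizeURL and dedupe are hypothetical, and the rules shown (lowercase host, drop scheme, query string and fragment, trim a trailing slash) are assumptions for illustration; the rules in normalize_url.go may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical helper. It keeps only the lowercased host
// and the path without a trailing slash, so links that differ only in
// scheme, query string, fragment, or trailing slash map to the same key.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", raw, err)
	}
	if u.Host == "" {
		return "", fmt.Errorf("no host in %q", raw)
	}
	path := strings.TrimSuffix(u.Path, "/")
	return strings.ToLower(u.Host) + path, nil
}

// dedupe returns the normalized form of each distinct URL in first-seen order.
func dedupe(urls []string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, raw := range urls {
		key, err := normalizeURL(raw)
		if err != nil {
			continue // skip links that cannot be normalized
		}
		if _, ok := seen[key]; ok {
			continue // already recorded under the same key
		}
		seen[key] = struct{}{}
		out = append(out, key)
	}
	return out
}

func main() {
	links := []string{
		"https://Example.com/path/",
		"http://example.com/path",
		"https://example.com/path#top",
		"https://example.com/other",
	}
	for _, u := range dedupe(links) {
		fmt.Println(u)
	}
}
```

The first three links collapse to a single key, so a crawler that tracks visited pages by this key would fetch that page only once.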

Project Structure

.
├── crawling.go           # Core crawler logic
├── extract_content.go    # HTML content extraction
├── normalize_url.go      # URL normalization utilities
├── csv_report.go         # CSV export functionality
├── main.go               # Entry point
├── go.mod                # Go module configuration
└── *_test.go             # Unit tests

Installation

git clone https://github.com/SSSM0602/webscraper.git
cd webscraper
go mod tidy

Usage

go run . <base_url> <max_concurrency> <page_limit>
  • base_url – Starting URL for crawling
  • max_concurrency – Number of concurrent workers
  • page_limit – Maximum number of pages to crawl

Example:

go run . https://example.com 5 100
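One common way to honor max_concurrency and page_limit together is a buffered channel used as a semaphore plus a mutex-guarded set of visited pages. The sketch below illustrates that pattern under stated assumptions; it is not taken from crawling.go. The crawler type, the run function, and fetchFunc are hypothetical names, and fetchFunc stands in for the real HTTP fetch and link extraction so the example runs offline.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc returns the links found on a page. It is a hypothetical stand-in
// for the real HTTP request and HTML parsing.
type fetchFunc func(pageURL string) []string

type crawler struct {
	mu        sync.Mutex
	visited   map[string]bool
	pageLimit int
	sem       chan struct{} // buffered channel used as a counting semaphore
	wg        sync.WaitGroup
	fetch     fetchFunc
}

// claim marks pageURL as visited and reports whether the caller should crawl
// it. It refuses pages already seen and any page beyond the page limit.
func (c *crawler) claim(pageURL string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.visited[pageURL] || len(c.visited) >= c.pageLimit {
		return false
	}
	c.visited[pageURL] = true
	return true
}

func (c *crawler) crawl(pageURL string) {
	defer c.wg.Done()
	if !c.claim(pageURL) {
		return
	}
	c.sem <- struct{}{} // acquire a slot; blocks while maxConcurrency fetches are running
	links := c.fetch(pageURL)
	<-c.sem // release the slot before spawning children
	for _, link := range links {
		c.wg.Add(1)
		go c.crawl(link)
	}
}

// run crawls from start and returns the set of pages that were visited.
func run(start string, maxConcurrency, pageLimit int, fetch fetchFunc) map[string]bool {
	if maxConcurrency < 1 {
		maxConcurrency = 1 // a zero-capacity semaphore would block forever
	}
	c := &crawler{
		visited:   make(map[string]bool),
		pageLimit: pageLimit,
		sem:       make(chan struct{}, maxConcurrency),
		fetch:     fetch,
	}
	c.wg.Add(1)
	go c.crawl(start)
	c.wg.Wait()
	return c.visited
}

func main() {
	// In-memory link graph used in place of real pages.
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"d"},
		"c": {"e"},
	}
	visited := run("a", 2, 3, func(u string) []string { return graph[u] })
	fmt.Println("pages crawled:", len(visited))
}
```

Releasing the semaphore before spawning child goroutines keeps a slot from being held while waiting on other pages, which avoids deadlock when max_concurrency is small.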

Output

  • CSV file containing extracted URLs, titles, and image links.
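A report like this can be produced with the standard encoding/csv package. The following is a minimal sketch, not the code in csv_report.go: the pageData type, the column names, and the choice to join multiple image links with a semicolon are all assumptions made for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strings"
)

// pageData is a hypothetical record for one crawled page.
type pageData struct {
	URL       string
	Title     string
	ImageURLs []string
}

// writeCSV writes a header row followed by one row per page.
// Image links are joined with ";" so each page fits on a single row.
func writeCSV(w io.Writer, pages []pageData) error {
	cw := csv.NewWriter(w)
	if err := cw.Write([]string{"url", "title", "image_urls"}); err != nil {
		return err
	}
	for _, p := range pages {
		row := []string{p.URL, p.Title, strings.Join(p.ImageURLs, ";")}
		if err := cw.Write(row); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error() // reports any error from the buffered writes or the flush
}

// csvString renders the report into a string, which is convenient in unit tests.
func csvString(pages []pageData) (string, error) {
	var sb strings.Builder
	if err := writeCSV(&sb, pages); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	pages := []pageData{
		{
			URL:       "https://example.com",
			Title:     "Home",
			ImageURLs: []string{"https://example.com/a.png", "https://example.com/b.png"},
		},
	}
	if err := writeCSV(os.Stdout, pages); err != nil {
		fmt.Fprintln(os.Stderr, "write csv:", err)
		os.Exit(1)
	}
}
```

Writing to an io.Writer rather than a file path keeps the function easy to test; the caller can pass a file opened with os.Create for real runs.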

Testing

go test ./...

Requirements

  • Go 1.20+
  • Internet connection for crawling

License

MIT
