Webscraper

A Go-based web scraping tool that crawls web pages, extracts content and images, normalizes URLs, and generates CSV reports.

Features

Crawl multiple pages concurrently.
Extract text and image URLs from HTML.
Normalize and deduplicate URLs.
Export extracted data to a CSV file.
Unit tests included for content extraction and URL normalization.
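
As a rough illustration of the normalize-and-deduplicate step, here is a minimal sketch using only the standard library. The helper names normalizeURL and dedupe are hypothetical, and the rules shown (lowercase the host, drop scheme, query, and fragment, trim a trailing slash) are assumptions; the actual rules in normalize_url.go may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical helper (not taken from the repository).
// It lowercases the host, drops the scheme, query, and fragment, and trims
// a trailing slash so that equivalent links compare equal.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", raw, err)
	}
	host := strings.ToLower(u.Host)
	path := strings.TrimSuffix(u.Path, "/")
	return host + path, nil
}

// dedupe returns the normalized form of each distinct URL in first-seen
// order. URLs that fail to parse are skipped.
func dedupe(raws []string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, raw := range raws {
		key, err := normalizeURL(raw)
		if err != nil {
			continue
		}
		if _, ok := seen[key]; ok {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, key)
	}
	return out
}

func main() {
	links := []string{
		"https://Example.com/blog/",
		"http://example.com/blog",
		"https://example.com/about#team",
	}
	for _, l := range dedupe(links) {
		fmt.Println(l)
	}
}
```

With these assumed rules, the first two links collapse to the same key, so only two entries are printed.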

Project Structure

.
├── crawling.go           # Core crawler logic
├── extract_content.go    # HTML content extraction
├── normalize_url.go      # URL normalization utilities
├── csv_report.go         # CSV export functionality
├── main.go               # Entry point
├── go.mod                # Go module configuration
└── *_test.go             # Unit tests

Installation

git clone https://github.com/SSSM0602/webscraper.git
cd webscraper
go mod tidy

Usage

go run . <base_url> <max_concurrency> <page_limit>

base_url – Starting URL for crawling
max_concurrency – Number of concurrent workers
page_limit – Maximum number of pages to crawl

Example:

go run . https://example.com 5 100
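
To show how the three arguments might interact, here is a sketch of argument parsing plus a bounded concurrent crawl. Everything in it is an assumption for illustration: the config and crawler types, parseArgs, and crawl are hypothetical names, and the fetch function is a stand-in for real HTTP fetching and link extraction. A buffered channel caps the number of simultaneous fetches at max_concurrency, and a mutex-guarded map enforces page_limit.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"sync"
)

// config mirrors the three command-line arguments (hypothetical type).
type config struct {
	baseURL        string
	maxConcurrency int
	pageLimit      int
}

// parseArgs validates <base_url> <max_concurrency> <page_limit>.
func parseArgs(args []string) (config, error) {
	if len(args) != 3 {
		return config{}, errors.New("usage: <base_url> <max_concurrency> <page_limit>")
	}
	workers, err := strconv.Atoi(args[1])
	if err != nil || workers < 1 {
		return config{}, fmt.Errorf("max_concurrency must be a positive integer, got %q", args[1])
	}
	limit, err := strconv.Atoi(args[2])
	if err != nil || limit < 1 {
		return config{}, fmt.Errorf("page_limit must be a positive integer, got %q", args[2])
	}
	return config{baseURL: args[0], maxConcurrency: workers, pageLimit: limit}, nil
}

// crawler holds shared crawl state. fetch returns the links found on a page.
type crawler struct {
	cfg   config
	fetch func(url string) []string
	mu    sync.Mutex
	seen  map[string]bool
	sem   chan struct{} // buffered to maxConcurrency; limits parallel fetches
	wg    sync.WaitGroup
}

// enqueue records url and starts a goroutine for it, unless it was already
// seen or the page limit has been reached.
func (c *crawler) enqueue(url string) {
	c.mu.Lock()
	if c.seen[url] || len(c.seen) >= c.cfg.pageLimit {
		c.mu.Unlock()
		return
	}
	c.seen[url] = true
	c.mu.Unlock()
	c.wg.Add(1)
	go c.visit(url)
}

// visit fetches one page while holding a semaphore slot, then enqueues its links.
func (c *crawler) visit(url string) {
	defer c.wg.Done()
	c.sem <- struct{}{}
	links := c.fetch(url)
	<-c.sem
	for _, l := range links {
		c.enqueue(l)
	}
}

// crawl runs until no work remains and returns the number of pages visited.
func crawl(cfg config, fetch func(string) []string) int {
	c := &crawler{
		cfg:   cfg,
		fetch: fetch,
		seen:  make(map[string]bool),
		sem:   make(chan struct{}, cfg.maxConcurrency),
	}
	c.enqueue(cfg.baseURL)
	c.wg.Wait()
	return len(c.seen)
}

func main() {
	cfg, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A tiny in-memory link graph stands in for the network.
	site := map[string][]string{
		cfg.baseURL:        {cfg.baseURL + "/a", cfg.baseURL + "/b"},
		cfg.baseURL + "/a": {cfg.baseURL + "/b", cfg.baseURL + "/c"},
	}
	pages := crawl(cfg, func(u string) []string { return site[u] })
	fmt.Printf("visited %d pages\n", pages)
}
```

Checking the limit under the same lock that records a URL keeps the visited count from ever exceeding page_limit, regardless of how the goroutines are scheduled.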

Output

The crawler writes a CSV file containing the URL, title, and image links extracted from each crawled page.

Testing

go test ./...

Requirements

Go 1.20+
Internet connection for crawling

License

MIT
