Rust Web Scraper CLI

A command-line web scraper built in Rust that provides CRUD operations for managing URLs and scraping their content. The scraper features concurrent processing, rate limiting, and detailed logging capabilities.

Features

  • URL Management (CRUD operations)
  • Concurrent web scraping
  • Rate limiting to prevent server overload
  • Progress indicators for long-running operations
  • Detailed logging system
  • URL validation and deduplication
  • Error handling with retries
  • Support for relative URL resolution
  • Special handling for tel: and mailto: links

Installation

Make sure you have Rust and Cargo installed on your system. Then:

  1. Clone the repository
  2. Navigate to the project directory
  3. Build the project:
cargo build --release
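
For example, assuming the repository is hosted at github.com/dayan-ulanov/rust-web-scraper:

git clone https://github.com/dayan-ulanov/rust-web-scraper.git
cd rust-web-scraper
cargo build --release

The optimized binary is placed under target/release/.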

Usage

The CLI provides the following commands:

Add a URL

cargo run -- add https://example.com

List all stored URLs

cargo run -- list

Remove a URL

cargo run -- remove https://example.com

Update a URL

cargo run -- update https://old-url.com https://new-url.com

Run the scraper

cargo run -- run
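
A minimal sketch of how these subcommands could be declared with clap's derive API (requires clap's "derive" feature); the names and fields below are illustrative rather than the project's actual definitions:

use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "rust-web-scraper")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Add a URL to the store
    Add { url: String },
    /// List all stored URLs
    List,
    /// Remove a stored URL
    Remove { url: String },
    /// Replace an existing URL with a new one
    Update { old_url: String, new_url: String },
    /// Scrape all stored URLs
    Run,
}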

Configuration

The scraper uses the following default configuration:

  • Maximum retries: 3
  • Request timeout: 10 seconds
  • Maximum concurrent requests: 5
  • Rate limit: 2 requests per second
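
Expressed in code, these defaults could look roughly like the following (the names are illustrative, not the project's actual types):

use std::time::Duration;

struct ScraperConfig {
    max_retries: u32,
    request_timeout: Duration,
    max_concurrent_requests: usize,
    requests_per_second: u32,
}

impl Default for ScraperConfig {
    fn default() -> Self {
        Self {
            max_retries: 3,
            request_timeout: Duration::from_secs(10),
            max_concurrent_requests: 5,
            requests_per_second: 2,
        }
    }
}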

URLs are stored in urls.json in the project root directory. Logs are written to scraper.log with detailed debug information.
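
Persistence could be as simple as serializing a list of URLs with serde_json (an assumption here; the actual on-disk format is defined by the project):

use std::fs;

// Hypothetical helpers; urls.json is assumed to hold a JSON array of strings.
fn load_urls(path: &str) -> Vec<String> {
    fs::read_to_string(path)
        .ok()
        .and_then(|contents| serde_json::from_str(&contents).ok())
        .unwrap_or_default()
}

fn save_urls(path: &str, urls: &[String]) -> std::io::Result<()> {
    let json = serde_json::to_string_pretty(urls).expect("URLs serialize to JSON");
    fs::write(path, json)
}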

Output

When scraping, the tool collects and reports:

  • Page titles (h1, h2, and h3 headings)
  • Links, with relative URLs resolved to absolute ones
  • Progress indication while the scrape is running
  • A summary of the results once scraping finishes
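
Relative links are resolved against the page's own URL, and tel:/mailto: links are given special treatment. A rough sketch of that step using the url crate (the function and its behaviour here are illustrative; the project's handling of tel:/mailto: may differ):

use url::Url;

fn resolve_link(base: &Url, href: &str) -> Option<Url> {
    // tel: and mailto: links are not regular hyperlinks; skip them here
    // (the real scraper gives them dedicated handling).
    if href.starts_with("tel:") || href.starts_with("mailto:") {
        return None;
    }
    // Resolve relative hrefs such as "/about" or "page.html" against the base URL.
    base.join(href).ok()
}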

Error Handling

The scraper handles various error cases:

  • Invalid URLs
  • Network failures with automatic retries
  • Rate limiting
  • Duplicate URL detection
  • File I/O errors
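
The retry behaviour could be implemented along these lines (a simplified sketch using reqwest and tokio from the dependency list; the actual retry and back-off logic may differ):

use std::time::Duration;

// Try a request up to `max_retries` times, backing off briefly between attempts.
async fn fetch_with_retries(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
) -> Result<String, reqwest::Error> {
    let mut last_err = None;
    for attempt in 1..=max_retries {
        match client.get(url).send().await {
            Ok(response) => return response.text().await,
            Err(err) => {
                last_err = Some(err);
                tokio::time::sleep(Duration::from_millis(500 * attempt as u64)).await;
            }
        }
    }
    Err(last_err.expect("max_retries is at least 1"))
}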

Dependencies

  • clap: Command line argument parsing
  • reqwest: HTTP client
  • tokio: Async runtime
  • serde: Serialization/deserialization
  • url: URL parsing and manipulation
  • tracing: Logging system
  • And more (see Cargo.toml for the full list)

CI/CD

This project uses GitHub Actions for continuous integration and deployment. The workflow includes:

Automated Checks

  • Code formatting verification (rustfmt)
  • Static code analysis (clippy)
  • Building the project
  • Running all tests
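
The same checks can be reproduced locally with standard Cargo commands (the workflow's exact flags may differ):

cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo build --release
cargo test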

Release Process

When you push a tag starting with 'v' (e.g., v1.0.0), the workflow will:

  1. Run all checks
  2. Build a release version
  3. Create a GitHub release
  4. Upload the compiled binary

To create a new release:

git tag v1.0.0
git push origin v1.0.0

Contributing

Feel free to submit issues and pull requests.

License

This project is open source and available under the MIT License.
