Rust Web Scraper CLI

A command-line web scraper built in Rust that provides CRUD operations for managing URLs and scraping their content. The scraper features concurrent processing, rate limiting, and detailed logging capabilities.

Features

  • URL Management (CRUD operations)
  • Concurrent web scraping
  • Rate limiting to prevent server overload
  • Progress indicators for long-running operations
  • Detailed logging system
  • URL validation and deduplication
  • Error handling with retries
  • Support for relative URL resolution
  • Special handling for tel: and mailto: links

Installation

Make sure you have Rust and Cargo installed on your system. Then:

  1. Clone the repository
  2. Navigate to the project directory
  3. Build the project:
cargo build --release
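
For example, assuming the repository is hosted at github.com/dayan-ulanov/rust-web-scraper:

git clone https://github.com/dayan-ulanov/rust-web-scraper.git
cd rust-web-scraper
cargo build --release

The optimized binary is placed under target/release/.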

Usage

The CLI provides the following commands:

Add a URL

cargo run -- add https://example.com

List all stored URLs

cargo run -- list

Remove a URL

cargo run -- remove https://example.com

Update a URL

cargo run -- update https://old-url.com https://new-url.com

Run the scraper

cargo run -- run
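
A minimal sketch of how these subcommands could be declared with clap's derive API (requires clap's "derive" feature); the names and fields below are illustrative rather than the project's actual definitions:

use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "rust-web-scraper")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Add a URL to the store
    Add { url: String },
    /// List all stored URLs
    List,
    /// Remove a stored URL
    Remove { url: String },
    /// Replace an existing URL with a new one
    Update { old_url: String, new_url: String },
    /// Scrape all stored URLs
    Run,
}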

Configuration

The scraper uses the following default configuration:

  • Maximum retries: 3
  • Request timeout: 10 seconds
  • Maximum concurrent requests: 5
  • Rate limit: 2 requests per second
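
Expressed in code, these defaults could look roughly like the following (the names are illustrative, not the project's actual types):

use std::time::Duration;

struct ScraperConfig {
    max_retries: u32,
    request_timeout: Duration,
    max_concurrent_requests: usize,
    requests_per_second: u32,
}

impl Default for ScraperConfig {
    fn default() -> Self {
        Self {
            max_retries: 3,
            request_timeout: Duration::from_secs(10),
            max_concurrent_requests: 5,
            requests_per_second: 2,
        }
    }
}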

URLs are stored in urls.json in the project root directory. Logs are written to scraper.log with detailed debug information.
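
Persistence could be as simple as serializing a list of URLs with serde_json (an assumption here; the actual on-disk format is defined by the project):

use std::fs;

// Hypothetical helpers; urls.json is assumed to hold a JSON array of strings.
fn load_urls(path: &str) -> Vec<String> {
    fs::read_to_string(path)
        .ok()
        .and_then(|contents| serde_json::from_str(&contents).ok())
        .unwrap_or_default()
}

fn save_urls(path: &str, urls: &[String]) -> std::io::Result<()> {
    let json = serde_json::to_string_pretty(urls).expect("URLs serialize to JSON");
    fs::write(path, json)
}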

Output

When scraping, the tool collects and reports:

  • Page titles (h1, h2, and h3 headings)
  • Links, with relative URLs resolved to absolute ones
  • Progress indication while the scrape is running
  • A summary of the results once scraping finishes
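
Relative links are resolved against the page's own URL, and tel:/mailto: links are given special treatment. A rough sketch of that step using the url crate (the function and its behaviour here are illustrative; the project's handling of tel:/mailto: may differ):

use url::Url;

fn resolve_link(base: &Url, href: &str) -> Option<Url> {
    // tel: and mailto: links are not regular hyperlinks; skip them here
    // (the real scraper gives them dedicated handling).
    if href.starts_with("tel:") || href.starts_with("mailto:") {
        return None;
    }
    // Resolve relative hrefs such as "/about" or "page.html" against the base URL.
    base.join(href).ok()
}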

Error Handling

The scraper handles various error cases:

  • Invalid URLs
  • Network failures with automatic retries
  • Rate limiting
  • Duplicate URL detection
  • File I/O errors
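
The retry behaviour could be implemented along these lines (a simplified sketch using reqwest and tokio from the dependency list; the actual retry and back-off logic may differ):

use std::time::Duration;

// Try a request up to `max_retries` times, backing off briefly between attempts.
async fn fetch_with_retries(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
) -> Result<String, reqwest::Error> {
    let mut last_err = None;
    for attempt in 1..=max_retries {
        match client.get(url).send().await {
            Ok(response) => return response.text().await,
            Err(err) => {
                last_err = Some(err);
                tokio::time::sleep(Duration::from_millis(500 * attempt as u64)).await;
            }
        }
    }
    Err(last_err.expect("max_retries is at least 1"))
}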

Dependencies

  • clap: Command line argument parsing
  • reqwest: HTTP client
  • tokio: Async runtime
  • serde: Serialization/deserialization
  • url: URL parsing and manipulation
  • tracing: Logging system
  • And more (see Cargo.toml for the full list)

CI/CD

This project uses GitHub Actions for continuous integration and deployment. The workflow includes:

Automated Checks

  • Code formatting verification (rustfmt)
  • Static code analysis (clippy)
  • Building the project
  • Running all tests
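
The same checks can be reproduced locally with standard Cargo commands (the workflow's exact flags may differ):

cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo build --release
cargo test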

Release Process

When you push a tag starting with 'v' (e.g., v1.0.0), the workflow will:

  1. Run all checks
  2. Build a release version
  3. Create a GitHub release
  4. Upload the compiled binary

To create a new release:

git tag v1.0.0
git push origin v1.0.0

Contributing

Feel free to submit issues and pull requests.

License

This project is open source and available under the MIT License.
