A command-line web scraper built in Rust that provides CRUD operations for managing URLs and scraping their content. The scraper features concurrent processing, rate limiting, and detailed logging capabilities.
- URL Management (CRUD operations)
- Concurrent web scraping
- Rate limiting to prevent server overload
- Progress indicators for long-running operations
- Detailed logging system
- URL validation and deduplication
- Error handling with retries
- Support for relative URL resolution
- Special handling for tel: and mailto: links
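As a rough illustration of the last few features, here is a minimal standard-library sketch of link classification and deduplication. The names `LinkKind`, `classify_link`, and `dedup_urls` are hypothetical, and the normalization rule (trim whitespace, drop one trailing slash) is an assumption made for this example; the project itself relies on the url crate and may behave differently.

```rust
use std::collections::HashSet;

/// Kinds of link a scraper might distinguish (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum LinkKind {
    Tel,
    Mailto,
    /// Everything else, including relative paths that still need resolving.
    Web,
}

/// Classify an href by its scheme prefix, ignoring ASCII case.
pub fn classify_link(href: &str) -> LinkKind {
    let lower = href.trim().to_ascii_lowercase();
    if lower.starts_with("tel:") {
        LinkKind::Tel
    } else if lower.starts_with("mailto:") {
        LinkKind::Mailto
    } else {
        LinkKind::Web
    }
}

/// Drop duplicates while keeping first-seen order.
/// Assumption: two URLs are the same if they match after trimming
/// whitespace and removing one trailing slash.
pub fn dedup_urls(urls: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for raw in urls {
        let trimmed = raw.trim();
        let key = trimmed.strip_suffix('/').unwrap_or(trimmed).to_string();
        // `insert` returns false when the key was already present.
        if seen.insert(key.clone()) {
            unique.push(key);
        }
    }
    unique
}
```

In a scraper of this kind, tel: and mailto: links would typically be recorded as-is rather than resolved against the page URL or fetched.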
Make sure you have Rust and Cargo installed on your system. Then:
- Clone the repository
- Navigate to the project directory
- Build the project:
cargo build --release
The CLI provides the following commands:
cargo run -- add https://example.com
cargo run -- list
cargo run -- remove https://example.com
cargo run -- update https://old-url.com https://new-url.com
cargo run -- run
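The shape of these commands can be sketched as a small enum. The project parses arguments with clap; the standard-library version below is only an illustration of the five subcommands and their arguments, and the names `Command` and `parse_command` are hypothetical.

```rust
/// The five subcommands listed above, with their positional arguments.
#[derive(Debug, PartialEq, Eq)]
pub enum Command {
    Add(String),
    List,
    Remove(String),
    Update { old_url: String, new_url: String },
    Run,
}

/// Parse the arguments that follow `cargo run --`.
/// Returns an error message for unknown or malformed input.
pub fn parse_command(args: &[&str]) -> Result<Command, String> {
    match args {
        ["add", url] => Ok(Command::Add(url.to_string())),
        ["list"] => Ok(Command::List),
        ["remove", url] => Ok(Command::Remove(url.to_string())),
        ["update", old_url, new_url] => Ok(Command::Update {
            old_url: old_url.to_string(),
            new_url: new_url.to_string(),
        }),
        ["run"] => Ok(Command::Run),
        _ => Err(format!("unrecognized arguments: {}", args.join(" "))),
    }
}
```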
The scraper uses the following default configuration:
- Maximum retries: 3
- Request timeout: 10 seconds
- Maximum concurrent requests: 5
- Rate limit: 2 requests per second
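One way to picture these defaults is as a configuration struct with a `Default` implementation. The struct and field names below are hypothetical and may not match the project's source; only the four values come from the list above.

```rust
use std::time::Duration;

/// Hypothetical configuration type holding the documented defaults.
#[derive(Debug, Clone, PartialEq)]
pub struct ScraperConfig {
    pub max_retries: u32,
    pub request_timeout: Duration,
    pub max_concurrent_requests: usize,
    pub requests_per_second: u32,
}

impl Default for ScraperConfig {
    fn default() -> Self {
        Self {
            max_retries: 3,
            request_timeout: Duration::from_secs(10),
            max_concurrent_requests: 5,
            requests_per_second: 2,
        }
    }
}

impl ScraperConfig {
    /// Minimum spacing between request starts implied by the rate limit.
    /// A rate of zero is treated as one request per second to avoid
    /// dividing by zero.
    pub fn min_request_interval(&self) -> Duration {
        Duration::from_secs(1) / self.requests_per_second.max(1)
    }
}
```

With the default of 2 requests per second, `min_request_interval` works out to 500 ms between request starts.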
URLs are stored in urls.json in the project root directory.
Logs are written to scraper.log with detailed debug information.
When scraping, the tool collects and reports:
- Page headings (h1, h2, h3 elements)
- Links (with proper URL resolution)
- Progress indication during scraping
- Summary of results
The scraper handles various error cases:
- Invalid URLs
- Network failures with automatic retries
- Rate limiting
- Duplicate URL detection
- File I/O errors
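The retry behaviour for network failures could look roughly like the helper below. This is a synchronous, standard-library sketch, whereas the project runs on tokio, so treat it as an illustration of the control flow only. The name `with_retries` is hypothetical, and it assumes that "maximum retries: 3" means one initial attempt plus up to three further attempts.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` once, then retry up to `max_retries` more times on failure,
/// sleeping `delay` between attempts. `op` receives the zero-based
/// attempt number. Returns the last error if every attempt fails.
pub fn with_retries<T, E>(
    max_retries: u32,
    delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: surface the most recent error.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                attempt += 1;
                sleep(delay);
            }
        }
    }
}
```

A fixed delay keeps the example short; a backoff that grows with the attempt number would be a natural variation.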
The project depends on the following crates:
- clap: Command line argument parsing
- reqwest: HTTP client
- tokio: Async runtime
- serde: Serialization/deserialization
- url: URL parsing and manipulation
- tracing: Logging system
- And more (see Cargo.toml for full list)
This project uses GitHub Actions for continuous integration and deployment. The workflow includes:
- Code formatting verification (rustfmt)
- Static code analysis (clippy)
- Building the project
- Running all tests
When you push a tag starting with 'v' (e.g., v1.0.0), the workflow will:
- Run all checks
- Build a release version
- Create a GitHub release
- Upload the compiled binary
To create a new release:
git tag v1.0.0
git push origin v1.0.0
Feel free to submit issues and pull requests.
This project is open source and available under the MIT License.