WebArchive Subdomain Extractor

Python 3.7+ · License: GPL · Status: Active

Usage Example

python WebArchive.py <domain>

Features

  • Advanced Extraction: Extract subdomains from Wayback Machine CDX API
  • Multiple Formats: Export results in TXT, JSON, and CSV formats
  • Smart Filtering: Filter subdomains using regex patterns, length constraints, and keyword exclusion
  • Progress Tracking: Visual progress bars and real-time status updates
  • Comprehensive Logging: Detailed logging with configurable levels
  • Retry Mechanism: Automatic retry for failed requests
  • Input Validation: Domain format validation and sanitization
  • Statistics: Detailed statistics and reporting
  • Configuration: Flexible configuration via INI files
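
As a rough illustration of the first feature above, the sketch below builds a Wayback Machine CDX query for a domain and pulls hostnames out of the returned URL lines. The function names `build_cdx_url` and `extract_subdomains` are hypothetical and are not taken from WebArchive.py; the query parameters (`url`, `fl`, `collapse`, `limit`) are standard CDX API parameters, but the exact set the tool sends is an assumption.

```python
from urllib.parse import urlencode, urlsplit

CDX_URL = "https://web.archive.org/cdx/search/cdx"  # api_url from config.ini


def build_cdx_url(domain, limit=10000):
    """Build a CDX query for every captured URL under *.domain (assumed parameters)."""
    params = {
        "url": f"*.{domain}",   # leading wildcard asks for all subdomains
        "fl": "original",       # return only the original URL of each capture
        "collapse": "urlkey",   # collapse repeated captures of the same URL
        "limit": limit,
    }
    return f"{CDX_URL}?{urlencode(params)}"


def extract_subdomains(lines, domain):
    """Return sorted, de-duplicated hostnames that are `domain` or sit below it."""
    domain = domain.lower()
    found = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # urlsplit only fills in the hostname when a scheme separator is present
        url = line if "://" in line else f"http://{line}"
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip malformed URLs instead of aborting the run
        # Keeping the apex domain itself is a choice made for this sketch
        if host and (host == domain or host.endswith("." + domain)):
            found.add(host)
    return sorted(found)
```

The response body of the query can be passed to `extract_subdomains` line by line, for example via `response.text.splitlines()` when using requests.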

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Install Dependencies

# Clone the repository
git clone https://github.com/cumakurt/WebArchive.git
cd WebArchive

# Install required packages
pip install -r requirements.txt

Requirements

requirements.txt should list the following pinned dependencies; create the file if it is missing (see the version pinning note below):

requests==2.31.0
prettytable==3.9.0
termcolor==2.4.0
tqdm==4.66.1

Quick Start

Basic Usage

python WebArchive.py <domain>
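
The feature list mentions that the domain argument is validated and sanitized before use. The tool's actual rules are not documented here, so the following is only a minimal sketch of one plausible approach; `sanitize_domain` is a hypothetical name, and the regular expression is a simplification that rejects, for example, punycode top-level domains.

```python
import re

# One DNS label: 1-63 letters, digits or hyphens, not starting or ending with a hyphen
_LABEL = r"(?!-)[a-z0-9-]{1,63}(?<!-)"
# One or more labels followed by an alphabetic top-level domain (simplified)
_DOMAIN_RE = re.compile(rf"^(?:{_LABEL}\.)+[a-z]{{2,63}}$")


def sanitize_domain(raw):
    """Normalize user input to a bare lowercase domain, or raise ValueError."""
    domain = raw.strip().lower()
    domain = re.sub(r"^[a-z][a-z0-9+.-]*://", "", domain)  # drop a scheme such as https://
    domain = domain.split("/", 1)[0]   # drop any path
    domain = domain.split(":", 1)[0]   # drop any port
    domain = domain.rstrip(".")        # drop a trailing root dot
    if len(domain) > 253 or not _DOMAIN_RE.match(domain):
        raise ValueError(f"invalid domain: {raw!r}")
    return domain
```

With this sketch, input such as `HTTPS://Example.com/path` is reduced to `example.com`, while input without a dot or with a label starting with a hyphen raises `ValueError`.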

Advanced Usage

python WebArchive.py <domain> --format txt json csv --output-dir results
python WebArchive.py <domain> --filter "test|dev" --exclude-words admin,test
python WebArchive.py <domain> --verbose

Usage Examples

1. Basic Subdomain Extraction

python WebArchive.py <domain>

2. Multiple Output Formats

python WebArchive.py <domain> --format txt json csv --output-dir results

3. Advanced Filtering

python WebArchive.py <domain> --filter "test|dev"
python WebArchive.py <domain> --exclude-words admin,test,staging
python WebArchive.py <domain> --min-length 10 --max-length 30
python WebArchive.py <domain> --filter "api|service" --exclude-words admin --min-length 8

4. Verbose Output with Statistics

python WebArchive.py <domain> --verbose

5. Custom Configuration

python WebArchive.py <domain> --config custom_config.ini
python WebArchive.py <domain> --log-level DEBUG

Configuration

Configuration File (config.ini)

[DEFAULT]
# API Settings
api_url = https://web.archive.org/cdx/search/cdx
output_format = txt
collapse = urlkey
max_results = 10000
timeout = 30
max_retries = 3
retry_delay = 1

# User Agent
user_agent = WebArchive-Subdomain-Extractor/1.0

# Default Filters
default_exclude_words = admin,test,dev,staging
default_min_length = 5
default_max_length = 50
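
A file in this shape can be read with Python's standard `configparser` module. The sketch below shows how the values might be loaded and converted to usable types; `load_settings` and the keys of the returned dictionary are hypothetical, and the tool's own loader may differ.

```python
import configparser

# A shortened copy of the config.ini shown above, used here as sample input
SAMPLE = """\
[DEFAULT]
# API Settings
api_url = https://web.archive.org/cdx/search/cdx
timeout = 30
max_retries = 3
retry_delay = 1
default_exclude_words = admin,test,dev,staging
default_min_length = 5
"""


def load_settings(text):
    """Parse INI text and convert values to the types a caller would use."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    defaults = parser["DEFAULT"]
    words = defaults.get("default_exclude_words", fallback="")
    return {
        "api_url": defaults.get("api_url"),
        "timeout": defaults.getint("timeout", fallback=30),
        "max_retries": defaults.getint("max_retries", fallback=3),
        "retry_delay": defaults.getfloat("retry_delay", fallback=1.0),
        "min_length": defaults.getint("default_min_length", fallback=5),
        # Comma-separated list; empty entries are ignored
        "exclude_words": [w.strip() for w in words.split(",") if w.strip()],
    }
```

To read a file on disk instead of a string, `parser.read("config.ini")` can replace `parser.read_string(text)`.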

Command Line Options

Option             Description               Example
domain             Target domain to analyze  example.com
--output-dir, -o   Output directory          --output-dir results
--format, -f       Output formats            --format txt json csv
--filter           Regex filter pattern      --filter "test|dev"
--exclude-words    Words to exclude          --exclude-words admin,test
--min-length       Minimum length            --min-length 10
--max-length       Maximum length            --max-length 30
--max-results      Max results to fetch      --max-results 5000
--verbose, -v      Verbose output            --verbose
--config, -c       Config file               --config custom.ini
--log-level        Logging level             --log-level DEBUG
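
For readers who want to see how options like these fit together, here is a sketch of an `argparse` parser that accepts the same flags. It is not the parser from WebArchive.py: the defaults (such as `results` for the output directory and `INFO` for the log level) and the list of accepted log levels are assumptions.

```python
import argparse


def split_words(value):
    """Turn 'admin,test' into ['admin', 'test'], ignoring empty entries."""
    return [w.strip() for w in value.split(",") if w.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="WebArchive.py")
    parser.add_argument("domain", help="Target domain to analyze")
    parser.add_argument("--output-dir", "-o", default="results",  # assumed default
                        help="Output directory")
    parser.add_argument("--format", "-f", nargs="+", default=["txt"],
                        choices=["txt", "json", "csv"], help="Output formats")
    parser.add_argument("--filter", help="Regex filter pattern")
    parser.add_argument("--exclude-words", type=split_words, default=[],
                        help="Comma-separated words to exclude")
    parser.add_argument("--min-length", type=int, help="Minimum length")
    parser.add_argument("--max-length", type=int, help="Maximum length")
    parser.add_argument("--max-results", type=int, help="Max results to fetch")
    parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
    parser.add_argument("--config", "-c", help="Config file")
    parser.add_argument("--log-level", default="INFO",  # assumed default and choices
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="Logging level")
    return parser
```

Calling `build_parser().parse_args()` would then yield attributes such as `args.output_dir`, `args.format` and `args.exclude_words`.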

Note: Output files are saved in the directory given by --output-dir. Log files are saved in the logs/ directory by default; if the application cannot write to that directory, it falls back to console logging.
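
The fallback from file logging to console logging described in the note could look roughly like the sketch below. The function name `setup_logging`, the logger name and the log file name are hypothetical; only the behavior (try the log directory first, then fall back to the console) comes from the note.

```python
import logging
import os


def setup_logging(log_dir="logs", level="INFO", name="webarchive"):
    """Log to a file under log_dir, or to the console if that is not possible."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level.upper(), logging.INFO))
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    try:
        os.makedirs(log_dir, exist_ok=True)
        handler = logging.FileHandler(os.path.join(log_dir, f"{name}.log"),
                                      encoding="utf-8")
    except OSError:
        # Directory missing and not creatable, or not writable: use the console
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

A call such as `setup_logging(level="DEBUG")` would correspond to running the tool with `--log-level DEBUG`.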

Dependency Version Pinning: For maximum reproducibility, use the exact versions listed in requirements.txt.

Output Formats

1. TXT Format

Plain text file with one subdomain per line.

2. JSON Format

Structured JSON with metadata.

3. CSV Format

CSV file with index and subdomain columns.
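
To make the three formats concrete, the sketch below writes a list of subdomains in each of them. The CSV columns follow the description above; the JSON metadata keys (`domain`, `count`, `subdomains`) and the `<domain>.<format>` file naming are assumptions, since the exact schema is not documented here, and `write_results` is a hypothetical name.

```python
import csv
import json
from pathlib import Path


def write_results(subdomains, domain, output_dir, formats=("txt",)):
    """Write subdomains to output_dir in each requested format; return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for fmt in formats:
        path = out / f"{domain}.{fmt}"  # assumed naming scheme
        if fmt == "txt":
            # One subdomain per line
            path.write_text("\n".join(subdomains) + "\n", encoding="utf-8")
        elif fmt == "json":
            # Hypothetical metadata fields wrapped around the list
            payload = {"domain": domain, "count": len(subdomains),
                       "subdomains": list(subdomains)}
            path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        elif fmt == "csv":
            with path.open("w", newline="", encoding="utf-8") as fh:
                writer = csv.writer(fh)
                writer.writerow(["index", "subdomain"])
                writer.writerows(enumerate(subdomains, start=1))
        else:
            raise ValueError(f"unsupported format: {fmt}")
        written.append(path)
    return written
```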

Filtering Options

Regex Filtering

python WebArchive.py <domain> --filter "test|dev|staging"
python WebArchive.py <domain> --filter "api|service|backend"

Keyword Exclusion

python WebArchive.py <domain> --exclude-words admin,test,dev
python WebArchive.py <domain> --exclude-words staging,beta,old

Length Filtering

python WebArchive.py <domain> --min-length 8
python WebArchive.py <domain> --max-length 25
python WebArchive.py <domain> --min-length 8 --max-length 25
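
The three filter types can be combined, as in the earlier usage examples. The sketch below shows one way the combination might work. It assumes that the regex keeps only matching names, that excluded words are matched as case-insensitive substrings, and that the length limits apply to the full subdomain string; the tool's real semantics may differ, and `filter_subdomains` is a hypothetical name.

```python
import re


def filter_subdomains(subdomains, pattern=None, exclude_words=(),
                      min_length=None, max_length=None):
    """Keep names that pass the regex, exclusion and length checks (assumed semantics)."""
    regex = re.compile(pattern) if pattern else None
    excluded = [w.lower() for w in exclude_words if w]
    kept = []
    for name in subdomains:
        lowered = name.lower()
        if regex and not regex.search(lowered):
            continue  # --filter: keep only names matching the pattern
        if any(word in lowered for word in excluded):
            continue  # --exclude-words: drop names containing any listed word
        if min_length is not None and len(name) < min_length:
            continue  # --min-length
        if max_length is not None and len(name) > max_length:
            continue  # --max-length
        kept.append(name)
    return kept
```

Under these assumptions, combining `--filter "api|dev"` with `--exclude-words admin` would keep `api.example.com` and `dev.example.com` but drop `admin-api.example.com`.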

Advanced Features

  • Progress tracking with tqdm
  • Retry mechanism for failed requests
  • Comprehensive logging
  • Domain validation
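
The retry mechanism is configured through `max_retries` and `retry_delay` in config.ini. A minimal sketch of such a loop follows; it treats `max_retries` as the total number of attempts and uses a fixed delay, both of which are assumptions, and `with_retries` is a hypothetical helper rather than the tool's own function.

```python
import time


def with_retries(func, max_retries=3, retry_delay=1.0, exceptions=(Exception,)):
    """Call func(); on a listed exception wait retry_delay seconds and try again."""
    attempts = max(1, max_retries)  # always try at least once
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(retry_delay)  # fixed delay between attempts
    raise last_error  # every attempt failed: surface the last error
```

A network call can be wrapped as `with_retries(lambda: requests.get(url, timeout=30), exceptions=(requests.RequestException,))`, assuming the requests package from requirements.txt is installed.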

Troubleshooting

  • Connection Timeout: Increase timeout in config.ini or use a lower --max-results value.
  • No Results Found: Try without filters or check if the domain exists in the Wayback Machine.
  • Permission Errors: Ensure write permissions to the output and logs directories.
  • Memory Issues: Use --max-results to limit results for large domains.

FAQ

Q: Can I delete the __pycache__ or logs folders?
A: Yes, they are automatically recreated as needed.

Q: How do I change the output file names?
A: Use the --output-dir option to change the directory. File names are based on the domain.

Q: What if I get a permission error?
A: Make sure you have write permissions to the output and logs directories.

Q: How do I pin dependency versions?
A: Use the exact versions in requirements.txt.

License

This project is licensed under the GNU General Public License (GPL).

Developer

Developed by Cuma KURT
Email: cumakurt [at] gmail [dot] com