TGV2RayScraper is a Python project to collect Telegram data, extract V2Ray configs, and process them by cleaning, normalizing, and deduplicating, while keeping channel info up-to-date. It supports synchronous and asynchronous scraping and includes management tools.

denxv/TGV2RayScraper

TGV2RayScraper

TGV2RayScraper is a Python project designed for collecting data from Telegram channels, extracting and processing V2Ray configurations, including cleaning, normalizing, and deduplicating them. The project maintains up-to-date information about channels and includes tools for managing their lists. It provides both synchronous and asynchronous tools for data collection and V2Ray configuration processing.

The project requires Python 3.10 or higher.

For the Russian version, see docs/ru/README.md

Quick Start

Clone the repository

Clones the project to your computer:

git clone https://github.com/denxv/TGV2RayScraper.git

Changes into the project directory:

cd TGV2RayScraper

Working with the uv command

All uv commands work the same on Linux, macOS, and Windows.

Creating a virtual environment

Creates the virtual environment in .venv (uv sync and uv run use it automatically, so manual activation is not needed):

uv venv

Installing dependencies

Installs only the main dependencies for running the project:

uv sync --no-dev

Installs all dependencies, including dev packages for tests and development:

uv sync

Running the project

Runs the main project script:

uv run python main.py

Alternative way to run the project:

uv run main.py

This will update the channel list, collect data, and clean V2Ray configurations in a single run.

Testing and linting (only for development)

Runs all project tests automatically:

uv run pytest

Checks type correctness in all files:

uv run mypy .

Checks code style and errors:

uv run ruff check .

Working with the pip command

Creating a virtual environment

Creates a virtual environment for the project:

python -m venv venv

Activates the virtual environment on Linux/macOS:

source venv/bin/activate

Activates the virtual environment on Windows:

.\venv\Scripts\Activate.ps1

If PowerShell blocks script execution, temporarily allow it with:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process

Then run .\venv\Scripts\Activate.ps1 again.

Installing dependencies

Installs the required libraries for running the project:

pip install -r requirements.txt

Installs all dependencies, including dev packages for tests and development:

pip install -r requirements.txt -r requirements-dev.txt

Running the project

Runs the main project script:

python main.py

This will update the channel list, collect data, and clean V2Ray configurations in a single run.

Testing and linting (only for development)

Runs all project tests automatically:

pytest

Checks type correctness in all files:

mypy .

Checks code style and errors:

ruff check .

Dependencies

Main Dependencies

The project requires the following Python libraries (works with Python ≥ 3.10):

  • aiofiles – asynchronous file handling

  • asteval – safe evaluation of Python expressions (used for filtering configurations)

  • httpx – modern HTTP client with support for both synchronous and asynchronous requests

  • lxml – parsing and processing HTML/XML

  • tqdm – progress bar for long-running operations

The full list of dependencies is available in requirements.txt.

Development Dependencies (Dev-dependencies)

For development and testing of the project, additional tools are required:

  • mypy – type checking

  • pytest – testing framework

  • pytest-asyncio – support for asynchronous tests in pytest

  • pytest-cov – test coverage reporting

  • ruff – static code analysis and linting

All dev-dependencies are listed in requirements-dev.txt.

Project Structure

  • adapters/ — adapters for synchronous and asynchronous data operations

    • async_/ — asynchronous implementations (channels, configurations, scraping)

      • channels.py — asynchronous operations with channels

      • configs.py — asynchronous processing of configurations

      • scraper.py — asynchronous channel data scraper

    • sync/ — synchronous implementations

      • channels.py — synchronous operations with channels

      • configs.py — synchronous processing of configurations

      • scraper.py — synchronous channel data scraper

  • channels/ — folder for storing channel and URL list files

    • current.json — main file with Telegram channel information

    • urls.txt — main file with Telegram channel links

    • backups of these files are also stored (e.g., current-backup-<timestamp>.json, urls-backup-<timestamp>.txt)

  • configs/ — folder for storing V2Ray configurations

    • v2ray-clean.txt — cleaned configurations

    • v2ray-raw.txt — raw configurations

  • core/ — core utilities and constants

    • constants.py — constants, default paths, URL templates, regex patterns, script flags

    • decorators.py — decorators (e.g., for logging)

    • logger.py — logging utility with colored console output and microsecond timestamps

    • typing.py — custom type aliases for the project (channels, V2Ray configs, CLI, sessions, etc.)

    • utils.py — utility and helper functions

  • docs/ — project documentation in multiple languages

    • ru/ — Russian documentation

      • README.md — user guide in Russian

      • LICENSE — project license in Russian

  • domain/ — business logic and domain-specific functions

    • channel.py — operations with channels, sorting, filtering

    • config.py — processing and normalization of configurations

    • predicates.py — filtering logic and predicates

  • logs/ — folder for script logs

    • log files with timestamps (e.g., 2020-10-10.log)
  • scripts/ — helper scripts for performing project tasks

    • async_scraper.py — script collects data from Telegram channels asynchronously

    • scraper.py — script collects data from Telegram channels synchronously

    • update_channels.py — script to update channels (removing inactive channels and adding new ones)

    • v2ray_cleaner.py — script cleans, normalizes, and processes obtained V2Ray configurations

  • tests/ — directory with all project tests, verifying correctness, stability, and module functionality (currently in progress...)

    • e2e/ — end-to-end tests, covering full usage scenarios

    • fixtures/ — helper files and test data

    • integration/ — integration tests, checking module interactions

    • unit/ — unit tests, checking individual functions and classes in isolation

    • conftest.py – pytest configuration: fixtures, hooks, and common test settings

  • LICENSE — project license (default in English)

  • main.py — main script to run all project operations, including updating channels, collecting data, and processing configurations

  • pyproject.toml — configuration file for project metadata, dependencies, and development tools (e.g., mypy, ruff, pytest), centralizing build and tooling settings

  • README.md — main project documentation (default in English)

  • requirements-dev.txt — list of development dependencies (testing, type checking, linters — pytest, mypy, ruff, etc.)

  • requirements.txt — list of all required libraries for running the project

  • uv.lock — dependency lock file, recording exact package versions for a reproducible environment

Channel JSON Structure

The file channels/current.json stores metadata about Telegram channels. Top-level keys are channel usernames, and values are objects with channel state.

Example

{
    "channel_new_default": {
        "count": 0,
        "current_id": 1,
        "last_id": -1
    },
    "channel_is_not_live": {
        "count": -1,
        "current_id": 100,
        "last_id": -1
    },
    "channel_live": {
        "count": 500,
        "current_id": 100,
        "last_id": 100
    },
    "channel_will_be_deleted": {
        "count": -3,
        "current_id": 100,
        "last_id": -1
    }
}

Field Description

  • count

    • > 0 → number of V2Ray configurations found in an active channel (e.g., count = 1)

    • = 0 → nothing found, or channel temporarily unavailable (last_id = -1)

    • < 0 → the channel could not be accessed; the absolute value is the number of failed attempts

      • Each failed attempt decreases the value (-1, -2, …).

      • When count <= -3, the channel is considered inactive and removed from current.json and urls.txt.

  • current_id

    • starting message ID for scraping

    • 1 → start from the beginning of the channel

    • negative → take the last N messages

      • Example: if last_id = 150 and current_id = -100, the effective current_id is 150 - 100 = 50. Scraping will start from message 50 and move toward the last message (last_id = 150).
  • last_id

    • latest message ID in the channel

    • updated on each run

    • -1 → channel temporarily or permanently unavailable

    • otherwise, a positive integer
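The field rules above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the project's own code: the function names effective_start and channel_state are hypothetical, and clamping the start ID to 1 for channels shorter than the requested offset is an assumption.

```python
def effective_start(current_id: int, last_id: int) -> int:
    """Return the message ID at which scraping would start."""
    if current_id > 0:
        return current_id
    # A negative current_id means "take the last N messages".
    # Clamping to 1 is an assumption for channels with fewer than N messages.
    return max(last_id + current_id, 1)


def channel_state(count: int) -> str:
    """Classify a channel by its count field (labels are illustrative)."""
    if count > 0:
        return "active"
    if count == 0:
        return "empty or temporarily unavailable"
    if count <= -3:
        return "inactive, to be removed"
    return "failing"
```

With the values from the example, effective_start(-100, 150) gives 50, and the channel_will_be_deleted entry (count = -3) is classified as inactive.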

Supported Protocols

The cleaned configuration file (configs/v2ray-clean.txt) contains entries in one of the following formats:

AnyTLS

anytls://password@host:port/path?params#name
anytls://password@host:port?params#name

Hy2 / Hysteria2

hy2://password@host:port/path?params#name
hy2://password@host:port?params#name
hysteria2://password@host:port/path?params#name
hysteria2://password@host:port?params#name

Shadowsocks / ShadowsocksR

ss://base64(method:password)@host:port#name
ss://method:password@host:port#name
ss://base64(method:password@host:port)#name
ssr://base64(host:port:protocol:method:obfs:base64(password)/?param=base64(value))

Trojan

trojan://password@host:port/path?params#name
trojan://password@host:port?params#name

TUIC

tuic://uuid:password@host:port/path?params#name
tuic://uuid:password@host:port?params#name

VLESS

vless://uuid@host:port/path?params#name
vless://uuid@host:port?params#name

VMess

vmess://base64(json)
vmess://uuid@host:port/path?params#name
vmess://uuid@host:port?params#name

WireGuard

wireguard://privatekey@host:port/path?params#name
wireguard://privatekey@host:port?params#name
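As a rough illustration of how these formats can be read, the sketch below splits a URL-style entry with the standard library and decodes the vmess://base64(json) variant. The function names are hypothetical and this is not necessarily how the project parses entries; it also assumes the standard Base64 alphabet, while real-world links sometimes use the URL-safe one.

```python
import base64
import json
from urllib.parse import parse_qs, unquote, urlsplit


def parse_url_config(line: str) -> dict:
    """Split a protocol://user@host:port/path?params#name entry into fields."""
    parts = urlsplit(line.strip())
    return {
        "protocol": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "params": parse_qs(parts.query),
        "name": unquote(parts.fragment),
    }


def decode_vmess_json(line: str) -> dict:
    """Decode the vmess://base64(json) form into a dictionary."""
    payload = line.strip().removeprefix("vmess://")
    # Restore any padding that was stripped from the Base64 payload.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.b64decode(payload))
```

The ss:// and ssr:// Base64 variants need their own decoding step and are not covered here.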

Usage

1. Update Channels

You can run the channel update script as follows:

python -m scripts.update_channels

You can also prepend uv run before any python command to run it through uv.

An alternative method using PYTHONPATH is also available:

PYTHONPATH=. python scripts/update_channels.py

You can use the -h flag to see all available options:

python -m scripts.update_channels -h

Options include:

  • --no-dry-run — Disable dry run and actually assign current_id (check-only mode is enabled by default).

  • -B, --no-backup — Skip creating backup files for channel and Telegram URL lists before saving (backups are created by default).

  • -C, --channels FILE — Path to the input JSON file containing the list of channels (default: channels/current.json).

  • -D, --delete-channels — Delete channels that are unavailable or meet specific conditions (default: disabled).

  • -M, --message-offset N — Number of recent messages to include when assigning current_id.

  • -N, --include-new — Include new channels in processing.

  • -U, --urls FILE — Path to a text file containing new channel URLs (default: channels/urls.txt).

The script performs the following:

  • Loads the current list of channels from channels/current.json.

  • Merges with new URLs from channels/urls.txt.

  • By default, performs a dry run without making changes (--no-dry-run disables it).

  • Allows assigning current_id to channels taking message offset into account (--message-offset).

  • Can include new channels in processing (--include-new).

  • Supports deletion of unavailable or flagged channels (--delete-channels).

  • Creates backup copies of both files with a timestamp (can be disabled using the --no-backup option).

  • Saves the updated list back to current.json and urls.txt.

  • Logs detailed warnings and debug information for each channel.
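The merge step can be pictured with the following minimal sketch. It assumes that links in urls.txt are full https://t.me/... URLs (optionally with the /s/ preview prefix) and that new channels start with the values shown for channel_new_default above; the helper names are hypothetical and the real script may differ.

```python
from typing import Optional
from urllib.parse import urlsplit

# Default state for a new channel, taken from the "channel_new_default" example.
DEFAULT_CHANNEL = {"count": 0, "current_id": 1, "last_id": -1}


def username_from_url(url: str) -> Optional[str]:
    """Extract the channel username from a t.me link (assumed URL shape)."""
    parts = urlsplit(url.strip())
    if parts.hostname not in {"t.me", "telegram.me"}:
        return None
    segments = [s for s in parts.path.split("/") if s]
    # Web-preview links look like t.me/s/<username>; skip the "s" segment.
    if segments and segments[0] == "s":
        segments = segments[1:]
    return segments[0] if segments else None


def merge_channels(channels: dict, urls: list) -> dict:
    """Add channels found in urls to a copy of channels, keeping existing state."""
    merged = dict(channels)
    for url in urls:
        name = username_from_url(url)
        if name and name not in merged:
            merged[name] = dict(DEFAULT_CHANNEL)
    return merged
```

Existing entries keep their counters, so re-adding a known URL does not reset current_id.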

Example usage:

python -m scripts.update_channels -C channels/current.json --urls channels/urls.txt --delete-channels -M 50 --include-new --no-dry-run --no-backup


2. Running Scrapers

Asynchronous Scraper (faster, experimental)

You can run the asynchronous scraper as follows:

python -m scripts.async_scraper

You can also prepend uv run before any python command to run it through uv.

An alternative method using PYTHONPATH is also available:

PYTHONPATH=. python scripts/async_scraper.py

You can use the -h flag to see all available options:

python -m scripts.async_scraper -h

Options include:

  • -C, --channels FILE — Path to the input JSON file containing the list of channels (default: channels/current.json).

  • -E, --batch-extract N — Number of messages processed in parallel to extract V2Ray configs (default: 20).

  • -R, --configs-raw FILE — Path to the output file for saving scraped V2Ray configs (default: configs/v2ray-raw.txt).

  • -T, --time-out SECONDS — HTTP client timeout in seconds for requests used while updating channel info and extracting V2Ray configurations (default: 30.0).

  • -U, --batch-update N — Maximum number of channels updated in parallel (default: 100).
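One common way to cap the number of parallel updates, as --batch-update does, is an asyncio.Semaphore. The sketch below shows that idea only; whether the project uses a semaphore or fixed-size batches is not stated here, and fetch_last_id is a hypothetical stand-in that performs no network request.

```python
import asyncio


async def fetch_last_id(channel: str) -> int:
    """Hypothetical stand-in for a network call; returns a fake last_id."""
    await asyncio.sleep(0)
    return len(channel)


async def update_all(channels: list, batch_update: int = 100) -> dict:
    """Update all channels with at most batch_update requests in flight."""
    semaphore = asyncio.Semaphore(batch_update)

    async def worker(name: str) -> tuple:
        async with semaphore:
            return name, await fetch_last_id(name)

    results = await asyncio.gather(*(worker(name) for name in channels))
    return dict(results)
```

A lower limit is gentler on the remote server; a higher one finishes sooner but risks timeouts.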

Example usage:

python -m scripts.async_scraper -E 20 -U 100 --time-out 30.0 -C channels/current.json -R configs/v2ray-raw.txt


Synchronous Scraper (simpler, slower)

You can run the synchronous scraper as follows:

python -m scripts.scraper

You can also prepend uv run before any python command to run it through uv.

Alternatively, you can run it with PYTHONPATH:

PYTHONPATH=. python scripts/scraper.py

Use -h to see all available options:

python -m scripts.scraper -h

Options include:

  • -C, --channels FILE — Path to the input JSON file containing the list of channels (default: channels/current.json).

  • -R, --configs-raw FILE — Path to the output file for saving scraped V2Ray configs (default: configs/v2ray-raw.txt).

  • -T, --time-out SECONDS — HTTP client timeout in seconds for requests used while updating channel info and extracting V2Ray configurations (default: 30.0).

Example usage:

python -m scripts.scraper --time-out 30.0 -C channels/current.json -R configs/v2ray-raw.txt


3. Cleaning V2Ray Configurations

You can run the V2Ray configuration cleaner script as follows:

python -m scripts.v2ray_cleaner

You can also prepend uv run before any python command to run it through uv.

Alternatively, you can run it using PYTHONPATH:

PYTHONPATH=. python scripts/v2ray_cleaner.py

You can also run with -h to see all available options:

python -m scripts.v2ray_cleaner -h

Options include:

  • -D, --duplicate [FIELDS] — Remove duplicate entries by specified comma-separated fields. If used without value (e.g., -D), the default fields are protocol,host,port. If omitted, duplicates are not removed.

  • -F, --filter CONDITION — Filter entries using a Python-like condition. Example: "host == '1.1.1.1' and port > 1000". Only matching entries are kept.

  • -I, --configs-raw FILE — Path to the input file with raw V2Ray configs (default: configs/v2ray-raw.txt).

  • -N, --no-normalize — Disable normalization (enabled by default).

  • -O, --configs-clean FILE — Path to the output file for cleaned and processed configs (default: configs/v2ray-clean.txt).

  • -R, --reverse — Sort entries in descending order (only applies with --sort).

  • -S, --sort [FIELDS] — Sort entries by comma-separated fields. If used without value (e.g., -S), the default fields are host,port. If omitted, entries are not sorted.

The script performs the following:

  • Reads raw configs from configs/v2ray-raw.txt.

  • Applies regex-based filters and normalization.

  • Removes duplicates (if --duplicate is used).

  • Sorts entries (if --sort is used).

  • Saves cleaned and processed configs to configs/v2ray-clean.txt.
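The deduplication and sorting steps can be approximated as follows, using the default field lists from the options above. This is a sketch under assumptions: entries are taken to be dictionaries that already contain the listed fields, and the first entry of each duplicate group is the one kept, which may not match the script's actual choice.

```python
def deduplicate(entries: list, fields=("protocol", "host", "port")) -> list:
    """Keep the first entry for each distinct combination of fields."""
    seen = set()
    unique = []
    for entry in entries:
        key = tuple(entry.get(field) for field in fields)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


def sort_entries(entries: list, fields=("host", "port"), reverse: bool = False) -> list:
    """Sort entries by the given fields; reverse=True gives descending order."""
    return sorted(
        entries,
        key=lambda entry: tuple(entry[field] for field in fields),
        reverse=reverse,
    )
```

Deduplicating before sorting keeps the sort input smaller.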

Example usage:

python -m scripts.v2ray_cleaner -I configs/v2ray-raw.txt -O configs/v2ray-clean.txt --filter "re_search(r'speedtest|google', host)" -D "host, port" -S "protocol, host, port" --reverse


4. Running All Steps via main.py

python main.py

You can also prepend uv run before any python command to run it through uv.

You can also run with -h or --help-scripts to see all available options:

python main.py -h
python main.py --help-scripts

Options include:

  • -H, --help-scripts — Display help information for all internal pipeline scripts.

  • -N, --no-async — Use slower but simpler synchronous scraping mode instead of the default asynchronous mode.

The script performs the following:

  • Executes the pipeline steps in order; only one of the two scrapers (step 2 or step 3) runs, depending on --no-async:

    1. update_channels.py – updates the list of channels.

    2. async_scraper.py – collects channel data from Telegram asynchronously (faster, used by default).

    3. scraper.py – collects channel data synchronously if --no-async is used (slower, simpler).

    4. v2ray_cleaner.py – cleans, normalizes, and processes the scraped proxy configuration files.

  • Collects only relevant arguments for each script automatically.
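The same sequence can be reproduced by hand, which may help when debugging a single step. The sketch below builds the module commands shown in the Usage sections and runs them one after another; it is not how main.py is implemented (that is not described here), and it does not forward any per-script arguments.

```python
import subprocess
import sys


def build_pipeline(no_async: bool = False) -> list:
    """Return the commands for the pipeline steps, in order."""
    scraper = "scripts.scraper" if no_async else "scripts.async_scraper"
    modules = ["scripts.update_channels", scraper, "scripts.v2ray_cleaner"]
    return [[sys.executable, "-m", module] for module in modules]


def run_pipeline(no_async: bool = False) -> None:
    """Run each step from the project root, stopping at the first failure."""
    for command in build_pipeline(no_async):
        # check=True raises CalledProcessError if a step exits with an error.
        subprocess.run(command, check=True)
```

Calling run_pipeline(no_async=True) corresponds to python main.py --no-async with default options.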

Example usage:

python main.py --batch-extract 10 --batch-update 100 --filter "host and port" --duplicate --sort "protocol" --reverse

Notes

  • Always update the channel list before running the scrapers.

  • Use the V2Ray cleaner after scraping to normalize configurations.

  • Scripts are provided as-is; use at your own risk.

Disclaimer

This software is provided "as-is". The author is not responsible for any damage, data loss, or other consequences resulting from the use of this software.

Important: Intended for educational/personal use only. The author is not responsible for:

  • Misuse, including spamming or overloading Telegram servers

  • Unauthorized data collection

  • Any legal, financial, or other consequences

Use responsibly and comply with platform terms.

License

This project is licensed under the MIT License – see the LICENSE file for details.
