Migrate project from Python to Rust with new CLI and benchmark tools by Reim-developer · Pull Request #1 · Reim-developer/Sephera

Reim-developer · 2026-03-22T20:38:34Z

No description provided.

The entire Python codebase (77 files) has been removed as part of migrating the project to Rust. This includes:
- Core source files (main.py, command.py, handler.py, etc.)
- Configuration files (pyproject.toml, requirements.txt, Makefile)
- All module packages (sephera, utils, datalyzer, etc.)
- Documentation and tests

The repository is now positioned for a complete Rust-based implementation.

Add benchmarks/run.py that measures and compares performance of the Sephera Rust CLI against the Python CLI across multiple datasets. The script runs warmup and measured iterations, collects timing statistics (min/mean/median/max), parses output summaries, and generates JSON and Markdown reports with detailed environment information and benchmark results.
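The warmup-then-measure loop and the min/mean/median/max summary described above can be sketched as follows. This is an illustration in Rust, not the contents of benchmarks/run.py (which is a Python script); the names `TimingStats`, `summarize`, and `measure` are hypothetical.

```rust
use std::time::Instant;

/// Summary statistics over a set of measured runs, in seconds.
/// (Hypothetical type; the real script's report fields are not shown in this PR page.)
#[derive(Debug, Clone, PartialEq)]
pub struct TimingStats {
    pub min: f64,
    pub mean: f64,
    pub median: f64,
    pub max: f64,
}

/// Compute min/mean/median/max. Returns `None` for an empty sample.
pub fn summarize(samples: &[f64]) -> Option<TimingStats> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // For an even count, the median is the average of the two middle values.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    Some(TimingStats {
        min: sorted[0],
        mean: sorted.iter().sum::<f64>() / n as f64,
        median,
        max: sorted[n - 1],
    })
}

/// Run `work` `warmup` times without timing it, then `runs` times with timing.
/// Returns one elapsed-seconds sample per measured run.
pub fn measure<F: FnMut()>(warmup: usize, runs: usize, mut work: F) -> Vec<f64> {
    for _ in 0..warmup {
        work();
    }
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed().as_secs_f64()
        })
        .collect()
}
```

Discarding the warmup runs keeps cold-cache effects out of the reported statistics, which is presumably why the script separates the two phases.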

Add unit tests for the language data module covering:
- builtin language table verification (103 languages)
- extension-based language lookup (.rs -> Rust)
- exact file name resolution (Makefile, .vimrc)
- unsupported file handling (returns None)
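The three lookup behaviours these tests cover (extension match, exact file name match, and `None` for unsupported files) can be sketched as below. The tables are a tiny hand-written stand-in for the generated language data, and `detect_language`, `BY_EXTENSION`, `BY_EXACT_NAME`, and the language name given for `.vimrc` are assumptions made for illustration, not the crate's actual API.

```rust
use std::path::Path;

/// Hypothetical stand-in for the generated extension table.
const BY_EXTENSION: &[(&str, &str)] = &[("rs", "Rust"), ("py", "Python"), ("toml", "TOML")];

/// Hypothetical stand-in for the exact-file-name table.
const BY_EXACT_NAME: &[(&str, &str)] = &[("Makefile", "Makefile"), (".vimrc", "Vim script")];

/// Resolve a language name for `path`, or `None` if the file is unsupported.
pub fn detect_language(path: &Path) -> Option<&'static str> {
    let file_name = path.file_name()?.to_str()?;

    // Exact names are checked first so that extensionless files such as
    // `Makefile` and dotfiles such as `.vimrc` can still be resolved.
    if let Some((_, language)) = BY_EXACT_NAME.iter().find(|(name, _)| *name == file_name) {
        return Some(*language);
    }

    // Fall back to the extension (without the leading dot).
    let extension = path.extension()?.to_str()?;
    BY_EXTENSION
        .iter()
        .find(|(candidate, _)| *candidate == extension)
        .map(|(_, language)| *language)
}
```

Checking exact names before extensions matters because `Path::extension` returns `None` for both `Makefile` and `.vimrc`, so an extension-only lookup would miss them.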

…guages Adds template definitions for Rust, Python, TypeScript, HTML, JSON, TOML, and Shell that are used by the benchmark corpus generator. Each template includes the file extension, source body, and a repeat divisor for controlling fixture scaling.

Add a new writer module that generates benchmark corpus files based on dataset specifications and language templates. The module creates dataset directories with module subdirectories, writes Makefile and Dockerfile fixtures, and renders language-specific fixture files using templates.

Add data structures to represent language specifications including CommentStyleSpec, RawLanguageSpec, RawLanguageRegistry, LanguageSpec, and LanguageRegistry. These models support deserialization from YAML configuration files and provide the foundation for the language registry loader functionality.

Add a new module for generating Rust source files from language registry YAML configurations. This module provides functionality to render comment styles, builtin languages, and lookup indices into a static Rust module that can be used for fast language detection.

Add comprehensive unit tests for the language_data module covering:
- YAML registry parsing validation
- Missing comment style rejection
- Duplicate extension conflict detection
- Rendering determinism verification
- Snapshot comparison tests
- Legacy dotfile exact names support
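As one example of what the validation tests exercise, duplicate extension conflict detection can be sketched like this. It assumes a conflict means two different languages claiming the same extension; `LanguageEntry`, `ExtensionConflict`, and `find_extension_conflict` are hypothetical names, not the types defined in sephera_tools.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified registry entry: a language and its extensions.
pub struct LanguageEntry {
    pub name: &'static str,
    pub extensions: &'static [&'static str],
}

/// Describes the first extension found to be claimed by two languages.
#[derive(Debug, PartialEq, Eq)]
pub struct ExtensionConflict {
    pub extension: &'static str,
    pub first: &'static str,
    pub second: &'static str,
}

/// Return the first conflict in registry order, or `None` if every
/// extension maps to exactly one language.
pub fn find_extension_conflict(languages: &[LanguageEntry]) -> Option<ExtensionConflict> {
    // Maps each extension to the language that first claimed it.
    let mut owners: HashMap<&'static str, &'static str> = HashMap::new();
    for language in languages {
        for &extension in language.extensions {
            match owners.get(extension) {
                Some(&first) if first != language.name => {
                    return Some(ExtensionConflict {
                        extension,
                        first,
                        second: language.name,
                    });
                }
                // Same language listing the extension twice is not a conflict here.
                Some(_) => {}
                None => {
                    owners.insert(extension, language.name);
                }
            }
        }
    }
    None
}
```

Rejecting such conflicts at generation time keeps the rendered lookup index unambiguous, since each extension can then point at a single language.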

Adds a `dataset_names` parameter to the `generate_benchmark_corpus` function, allowing users to specify which datasets to include when generating the benchmark corpus. Also exposes the `default_benchmark_dataset_names()` and `available_benchmark_dataset_names()` helper functions. The README output now lists the selected datasets.

BREAKING CHANGE: the `generate_benchmark_corpus` function now requires a `dataset_names` parameter.

…dataset Introduce DatasetSize enum to support both fixed repeat counts and target total bytes for dataset sizing. Add new "extra-large" dataset (2GB) and functions to resolve dataset specs by name, enabling more flexible benchmark corpus generation.

Reim-developer added 30 commits March 22, 2026 09:39

docs: add GNU GPLv3 license file

7ef6425

chore: add .gitignore with common ignore patterns

32d89e6

style(config): add rustfmt configuration

3921244

chore(deps): add Cargo.lock for dependency locking

49d6035

build: add Cargo workspace configuration

02ef68f

chore(deps): add package-lock.json for dependency locking

2c52bfa

build: add package.json for pyright type checking

8ede01a

build: add pyright strict type checking configuration

a09e222

ci: add GitHub Actions CI workflow

af04e38

Add CI workflow with jobs for:
- Rust fmt and clippy checks
- Rust test suite
- Python type checking with pyright
- Benchmark suite

chore: add VS Code settings for formatOnSave and clippy

b11ea13

docs(benchmarks): add benchmark documentation

4fbfb28

chore(config): add programming language configuration file

d8553be

Add config/languages.yml with comment style definitions and file-extension mappings for various programming languages.

feat(cli): add sephera_cli crate with clap CLI parser

227c27e

Adds a new sephera_cli package to the workspace, using clap for CLI argument parsing and integrating with sephera_core. Includes tempfile for test support.

feat(cli): add args module with clap argument parsing for loc command

9a2125b

feat(cli): add lib.rs entry point for sephera_cli crate

2d4224f

feat(cli): add main.rs entry point for CLI crate

6327c2b

feat(cli): add output module for printing code loc reports

57850bf

feat(cli): add CLI runner module

5ee07d4

test(cli): add integration tests for loc command

8b0b13d

feat(core): add sephera_core crate

64f2e6b

feat(core): add code_loc, config, language_data module declarations

c382c64

feat(core): add lib.rs with core module and lint config

6315d76

feat(core): add code_loc module implementation

9bcc8a8

feat(core): add CommentStyle and LanguageConfig structs

c048d4a

feat(core): add generated language data with comment styles and file type mappings

ce27abe

feat(core): add language data module

0204859

feat(core): add code LOC analyzer implementation

452a26c

Add the CodeLoc analyzer that provides line-of-code counting functionality with support for parallel file scanning, language detection, and ignore pattern matching.
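A minimal sketch of the per-file counting step, assuming each line is classified as code, comment, or blank using only a single-line comment prefix. Block comments, parallel scanning, and ignore patterns are deliberately left out, and `LocCounts` and `count_lines` are hypothetical names rather than the analyzer's real types.

```rust
/// Hypothetical per-file line counts.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct LocCounts {
    pub code: usize,
    pub comment: usize,
    pub blank: usize,
}

/// Classify every line of `source` using a single-line comment prefix
/// such as "//" or "#". A line with code followed by a trailing comment
/// counts as code.
pub fn count_lines(source: &str, line_comment: &str) -> LocCounts {
    let mut counts = LocCounts::default();
    for line in source.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            counts.blank += 1;
        } else if trimmed.starts_with(line_comment) {
            counts.comment += 1;
        } else {
            counts.code += 1;
        }
    }
    counts
}
```

In a full analyzer the comment prefix would come from the detected language's comment style rather than being passed in by the caller.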

feat(core): add ignore matcher for code LOC analysis

b70bcf9

Reim-developer added 28 commits March 22, 2026 13:35

test(core): add unit tests for LOC metrics calculation

0fbcd68

feat(core): add types for LOC metrics calculation

91cd47a

feat(sephera_tools): add new CLI tools package

71217d1

Add a new sephera_tools binary crate to the workspace with CLI support using clap and serialization support via serde/serde_yaml.

feat(sephera_tools): add benchmark corpus generation functionality

8f89a22

Adds a new module for generating benchmark corpus files including:
- datasets, templates, and writer modules
- CLI command to generate benchmark data for performance testing
- Creates README.txt in the output directory

feat(sephera_tools): add language_data module for language registry management

b3cb9c9

feat(sephera_tools): add lib.rs with workspace_root utility and module declarations

3b02bf8

feat(sephera_tools): add CLI entry point with generate commands

14ea19f

feat(sephera_tools): add benchmark dataset specifications

22662b7

Add DatasetSpec struct and DATASETS constant defining small, medium, and large benchmark corpus configurations with module count, files per module, and body repeat parameters.

test(sephera_tools): add test for benchmark corpus generation layout

65264e2

Verifies that generate_benchmark_corpus creates the expected directory structure including small/medium/large directories and fixture files.

feat(sephera_tools): add language registry loader from YAML files

a6b9a0f

Add a new module for loading and parsing language registry data from YAML files, including functions to load from file path and from YAML string content.

feat(sephera_tools): add language data path utilities

d179301

feat(sephera_tools): add language registry validation module

583fd80

docs: add README with project documentation

8fe3531

Add comprehensive README covering project overview, current features, workspace layout, quick start instructions, benchmark results, and development setup for the Sephera Rust CLI project.

feat(bench): add extra-large dataset option to benchmark runner

b397224

Add support for generating and running benchmarks with an extra-large dataset. The generate-benchmark-corpus function now accepts dataset names and filters out the 'repo' dataset when generating synthetic datasets.

docs(sephera_cli): add help text to loc command arguments

60053d7

style(sephera_cli): format output strings for better readability

a665411

test(sephera_cli): remove output assertions from loc command test

2084f8f

feat(bench): add dataset selection to benchmark corpus generator

d6dcdbf

test(bench): add extra-large dataset minimum size verification test

da481b3

feat(bench): add target total bytes dataset sizing

87f083b

Add the ability to specify dataset sizes by target total bytes instead of just fixed repeat counts. Uses binary search to find the minimum body repeat needed to reach the target size.
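The binary search mentioned above can be sketched as follows, assuming the generated corpus size never decreases as the body repeat grows. The function name, the `max_repeat` bound, and the `size_for` callback are hypothetical; the commit does not show how the real generator estimates size.

```rust
/// Find the smallest `repeat` in `1..=max_repeat` such that
/// `size_for(repeat) >= target_bytes`.
///
/// Assumes `size_for` is non-decreasing in `repeat`. Returns `None` when
/// even `max_repeat` does not reach the target.
pub fn min_repeat_for_target<F>(target_bytes: u64, max_repeat: u64, size_for: F) -> Option<u64>
where
    F: Fn(u64) -> u64,
{
    if max_repeat == 0 || size_for(max_repeat) < target_bytes {
        return None;
    }
    let (mut lo, mut hi) = (1u64, max_repeat);
    // Invariant: size_for(hi) >= target_bytes, and every repeat below `lo`
    // is known to be too small.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if size_for(mid) >= target_bytes {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    Some(lo)
}
```

For example, with a size model of `100 + 40 * repeat` bytes and a 1000-byte target, the search settles on a repeat of 23 (1020 bytes), since 22 repeats would give only 980 bytes.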

Reim-developer merged commit 8c9516a into master Mar 22, 2026
10 checks passed

Reim-developer deleted the rust-version branch March 22, 2026 20:40

Reim-developer merged 60 commits into master from rust-version
