Migrate project from Python to Rust with new CLI and benchmark tools#1

Merged
Reim-developer merged 60 commits into master from rust-version
Mar 22, 2026

@Reim-developer
Owner

No description provided.

The entire Python codebase (77 files) has been removed as part of migrating the project to Rust. This includes:
- Core source files (main.py, command.py, handler.py, etc.)
- Configuration files (pyproject.toml, requirements.txt, Makefile)
- All module packages (sephera, utils, datalyzer, etc.)
- Documentation and tests

The repository is now positioned for a complete Rust-based implementation.

Add CI workflow with jobs for:
- Rust fmt and clippy checks
- Rust test suite
- Python type checking with pyright
- Benchmark suite

Add benchmarks/run.py that measures and compares performance of the Sephera
Rust CLI against the Python CLI across multiple datasets. The script runs
warmup and measured iterations, collects timing statistics (min/mean/median/max),
parses output summaries, and generates JSON and Markdown reports with detailed
environment information and benchmark results.
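The warmup-then-measure loop and the min/mean/median/max summary described above can be sketched as follows. This is a Rust illustration of the idea only, not the contents of benchmarks/run.py (which is a Python script); the names `TimingStats`, `summarize`, and `measure` are assumptions made for this example.

```rust
use std::time::{Duration, Instant};

/// Summary statistics over measured iterations, in seconds.
/// (Hypothetical type; not taken from the repository.)
#[derive(Debug, Clone, PartialEq)]
pub struct TimingStats {
    pub min: f64,
    pub mean: f64,
    pub median: f64,
    pub max: f64,
}

/// Computes min/mean/median/max; returns `None` for an empty sample set.
pub fn summarize(samples: &[Duration]) -> Option<TimingStats> {
    if samples.is_empty() {
        return None;
    }
    let mut secs: Vec<f64> = samples.iter().map(Duration::as_secs_f64).collect();
    secs.sort_by(|a, b| a.total_cmp(b));
    let n = secs.len();
    // For an even number of samples, take the average of the two middle values.
    let median = if n % 2 == 1 {
        secs[n / 2]
    } else {
        (secs[n / 2 - 1] + secs[n / 2]) / 2.0
    };
    Some(TimingStats {
        min: secs[0],
        mean: secs.iter().sum::<f64>() / n as f64,
        median,
        max: secs[n - 1],
    })
}

/// Runs `work` for `warmup` discarded iterations, then times `measured` iterations.
pub fn measure<F: FnMut()>(warmup: usize, measured: usize, mut work: F) -> Vec<Duration> {
    for _ in 0..warmup {
        work();
    }
    (0..measured)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}
```

Discarding the warmup iterations keeps cold-cache effects out of the reported statistics; the median is reported alongside the mean because it is less sensitive to a single slow run.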

Add config/languages.yml with comment style definitions and file
extension mappings for various programming languages.

Adds new sephera_cli package to the workspace with clap for CLI argument
parsing and integration with sephera_core. Includes tempfile for test
support.

Add the CodeLoc analyzer that provides line-of-code counting functionality with support for parallel file scanning, language detection, and ignore pattern matching.

Add unit tests for the language data module covering:
- builtin language table verification (103 languages)
- extension-based language lookup (.rs -> Rust)
- exact file name resolution (Makefile, .vimrc)
- unsupported file handling (returns None)
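The lookup behaviour these tests exercise (exact file names, extensions, and `None` for unsupported files) might look roughly like the sketch below. The two tables are tiny hypothetical stand-ins for the real 103-language table, the language label for `.vimrc` is a placeholder, and `detect_language` is an assumed name rather than the crate's actual API.

```rust
/// Hypothetical, tiny stand-ins for the builtin language table.
const BY_EXTENSION: &[(&str, &str)] = &[("py", "Python"), ("rs", "Rust"), ("toml", "TOML")];
/// The label for `.vimrc` is a placeholder, not the name the project uses.
const BY_EXACT_NAME: &[(&str, &str)] = &[(".vimrc", "Vim Script"), ("Makefile", "Makefile")];

/// Linear lookup in a (key, language) table.
fn lookup(table: &[(&str, &'static str)], key: &str) -> Option<&'static str> {
    table.iter().find(|(k, _)| *k == key).map(|(_, lang)| *lang)
}

/// Resolves a file name to a language, or `None` if it is unsupported.
pub fn detect_language(file_name: &str) -> Option<&'static str> {
    // Exact names are checked first so that extensionless files and dotfiles
    // such as `Makefile` and `.vimrc` are resolved before any extension logic.
    if let Some(lang) = lookup(BY_EXACT_NAME, file_name) {
        return Some(lang);
    }
    // Otherwise use the text after the last dot as the extension.
    let (_, ext) = file_name.rsplit_once('.')?;
    lookup(BY_EXTENSION, ext)
}
```

Checking exact names before extensions is a design choice made for this sketch; the source only states that both kinds of lookup are covered by tests.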

Add a new sephera_tools binary crate to the workspace with CLI
support using clap and serialization support via serde/serde_yaml.

Adds a new module for generating benchmark corpus files including:
- datasets, templates, and writer modules
- CLI command to generate benchmark data for performance testing
- Creates README.txt in the output directory

Add DatasetSpec struct and DATASETS constant defining small, medium,
and large benchmark corpus configurations with module count, files
per module, and body repeat parameters.
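A minimal sketch of what such a spec table could look like is shown below. The field names follow the parameters listed in the commit message, but the numeric values are placeholders chosen for illustration, and `total_files` and `find_dataset` are hypothetical helpers that do not come from the repository.

```rust
/// One benchmark corpus configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DatasetSpec {
    pub name: &'static str,
    pub module_count: usize,
    pub files_per_module: usize,
    pub body_repeat: usize,
}

/// Placeholder values: the real sizes are not given in the source.
pub const DATASETS: &[DatasetSpec] = &[
    DatasetSpec { name: "small", module_count: 2, files_per_module: 3, body_repeat: 1 },
    DatasetSpec { name: "medium", module_count: 4, files_per_module: 5, body_repeat: 4 },
    DatasetSpec { name: "large", module_count: 8, files_per_module: 10, body_repeat: 16 },
];

impl DatasetSpec {
    /// Number of fixture files the spec would generate (hypothetical helper).
    pub fn total_files(&self) -> usize {
        self.module_count * self.files_per_module
    }
}

/// Looks up a spec by name (hypothetical helper).
pub fn find_dataset(name: &str) -> Option<&'static DatasetSpec> {
    DATASETS.iter().find(|spec| spec.name == name)
}
```

Keeping the specs in a `const` slice means the corpus layout is fixed at compile time and can be iterated in a stable order.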

…guages

Adds template definitions for Rust, Python, TypeScript, HTML, JSON, TOML, and Shell
that are used by the benchmark corpus generator. Each template includes the file
extension, source body, and a repeat divisor for controlling fixture scaling.

Verifies that generate_benchmark_corpus creates the expected directory
structure including small/medium/large directories and fixture files.

Add a new writer module that generates benchmark corpus files based on
dataset specifications and language templates. The module creates dataset
directories with module subdirectories, writes Makefile and Dockerfile
fixtures, and renders language-specific fixture files using templates.

Add a new module for loading and parsing language registry data from YAML files, including functions to load from file path and from YAML string content.

Add data structures to represent language specifications including
CommentStyleSpec, RawLanguageSpec, RawLanguageRegistry, LanguageSpec,
and LanguageRegistry. These models support deserialization from YAML
configuration files and provide the foundation for the language registry
loader functionality.

Add a new module for generating Rust source files from language registry YAML configurations. This module provides functionality to render comment styles, builtin languages, and lookup indices into a static Rust module that can be used for fast language detection.

Add comprehensive unit tests for the language_data module covering:
- YAML registry parsing validation
- Missing comment style rejection
- Duplicate extension conflict detection
- Rendering determinism verification
- Snapshot comparison tests
- Legacy dotfile exact names support
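Two of the checks listed above, rejecting a missing comment style and detecting duplicate extensions, can be illustrated with the sketch below. It works on plain in-memory values rather than parsed YAML, and `RawLanguage`, `RegistryError`, and `validate` are assumed names for this example, not the crate's real types.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for one parsed language entry (hypothetical type).
pub struct RawLanguage<'a> {
    pub name: &'a str,
    pub comment_style: &'a str,
    pub extensions: &'a [&'a str],
}

/// Validation failures (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub enum RegistryError {
    UnknownCommentStyle { language: String, style: String },
    DuplicateExtension { extension: String, first: String, second: String },
}

/// Builds an extension -> language index, rejecting invalid registries.
pub fn validate<'a>(
    styles: &[&str],
    languages: &[RawLanguage<'a>],
) -> Result<BTreeMap<&'a str, &'a str>, RegistryError> {
    // A BTreeMap iterates in sorted key order, so anything rendered from it
    // comes out the same on every run.
    let mut owners: BTreeMap<&'a str, &'a str> = BTreeMap::new();
    for lang in languages {
        if !styles.iter().any(|style| *style == lang.comment_style) {
            return Err(RegistryError::UnknownCommentStyle {
                language: lang.name.to_string(),
                style: lang.comment_style.to_string(),
            });
        }
        for &ext in lang.extensions {
            // `insert` returns the previous owner if the extension was already claimed.
            if let Some(first) = owners.insert(ext, lang.name) {
                return Err(RegistryError::DuplicateExtension {
                    extension: ext.to_string(),
                    first: first.to_string(),
                    second: lang.name.to_string(),
                });
            }
        }
    }
    Ok(owners)
}
```

Using an ordered map here is one simple way to get the deterministic rendering the tests mention; the actual generator may achieve it differently.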

Add comprehensive README covering project overview, current features,
workspace layout, quick start instructions, benchmark results, and
development setup for the Sephera Rust CLI project.

Add support for generating and running benchmarks with an extra-large
dataset. The generate-benchmark-corpus function now accepts dataset names
and filters out the 'repo' dataset when generating synthetic datasets.

Adds a `dataset_names` parameter to the `generate_benchmark_corpus` function, allowing users to specify which datasets to include when generating the benchmark corpus. Also exposes the `default_benchmark_dataset_names()` and `available_benchmark_dataset_names()` helper functions. The README output now lists the selected datasets.

BREAKING CHANGE: `generate_benchmark_corpus` function now requires a `dataset_names` parameter
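Selecting datasets by name while skipping the non-synthetic `repo` entry could be sketched as below. The dataset names come from the commit messages in this pull request; the function name `select_synthetic`, the de-duplication, and the error on unknown names are assumptions made for this example and are not confirmed by the source.

```rust
/// Synthetic dataset names mentioned in this pull request.
const SYNTHETIC: &[&str] = &["small", "medium", "large", "extra-large"];

/// Keeps the requested synthetic datasets, in request order.
/// `repo` is skipped because it is not generated synthetically.
/// Rejecting unknown names is an assumption of this sketch.
pub fn select_synthetic<'a>(requested: &[&'a str]) -> Result<Vec<&'a str>, String> {
    let mut selected: Vec<&'a str> = Vec::new();
    for &name in requested {
        if name == "repo" {
            continue;
        }
        if !SYNTHETIC.iter().any(|known| *known == name) {
            return Err(format!("unknown dataset: {name}"));
        }
        // Ignore repeated names so each dataset is generated once.
        if !selected.contains(&name) {
            selected.push(name);
        }
    }
    Ok(selected)
}
```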

…dataset

Introduce DatasetSize enum to support both fixed repeat counts and target total
bytes for dataset sizing. Add new "extra-large" dataset (2GB) and functions
to resolve dataset specs by name, enabling more flexible benchmark corpus
generation.

Add the ability to specify dataset sizes by target total bytes instead
of just fixed repeat counts. Uses binary search to find the minimum
body repeat needed to reach the target size.
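The binary search for the minimum body repeat can be sketched as follows. The `DatasetSize` name comes from the commit message, but its variant names, the linear size model in `estimated_bytes`, and the `resolve_body_repeat` signature are all assumptions made for this illustration; the real generator presumably measures rendered fixture sizes.

```rust
/// Either a fixed repeat count or a target corpus size in bytes.
/// (Variant names are assumed.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DatasetSize {
    FixedRepeat(u64),
    TargetTotalBytes(u64),
}

/// Hypothetical size model: every file has a fixed overhead plus
/// `repeat` copies of a body of `body_len` bytes.
fn estimated_bytes(file_count: u64, overhead: u64, body_len: u64, repeat: u64) -> u64 {
    file_count.saturating_mul(overhead.saturating_add(repeat.saturating_mul(body_len)))
}

/// Returns the smallest repeat whose estimated size reaches the target,
/// or `None` if the target cannot be reached.
pub fn resolve_body_repeat(
    size: DatasetSize,
    file_count: u64,
    overhead: u64,
    body_len: u64,
) -> Option<u64> {
    let target = match size {
        DatasetSize::FixedRepeat(repeat) => return Some(repeat),
        DatasetSize::TargetTotalBytes(target) => target,
    };
    let bytes = |repeat| estimated_bytes(file_count, overhead, body_len, repeat);
    if file_count == 0 || body_len == 0 {
        // The size does not grow with the repeat count in this case.
        return (bytes(0) >= target).then_some(0);
    }
    // Double the upper bound until it reaches the target.
    let mut hi: u64 = 1;
    while bytes(hi) < target {
        hi = hi.checked_mul(2)?;
    }
    // Invariant: bytes(hi) >= target, and every repeat below `lo` is too small.
    let mut lo: u64 = 0;
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if bytes(mid) >= target {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    Some(lo)
}
```

Binary search applies because the estimated size never decreases as the repeat count grows, so the predicate "size reaches the target" flips from false to true exactly once.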

@Reim-developer Reim-developer merged commit 8c9516a into master Mar 22, 2026
10 checks passed
@Reim-developer Reim-developer deleted the rust-version branch March 22, 2026 20:40