Migrate project from Python to Rust with new CLI and benchmark tools #1
Merged
Reim-developer merged 60 commits into master on Mar 22, 2026
The entire Python codebase (77 files) has been removed as part of migrating the project to Rust. This includes:
- Core source files (main.py, command.py, handler.py, etc.)
- Configuration files (pyproject.toml, requirements.txt, Makefile)
- All module packages (sephera, utils, datalyzer, etc.)
- Documentation and tests

The repository is now positioned for a complete Rust-based implementation.
Add CI workflow with jobs for:
- Rust fmt and clippy checks
- Rust test suite
- Python type checking with pyright
- Benchmark suite
Add benchmarks/run.py that measures and compares performance of the Sephera Rust CLI against the Python CLI across multiple datasets. The script runs warmup and measured iterations, collects timing statistics (min/mean/median/max), parses output summaries, and generates JSON and Markdown reports with detailed environment information and benchmark results.
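The timing statistics mentioned above can be sketched as follows. The real script is Python; this Rust version only illustrates the min/mean/median/max computation over measured iterations, and the names `TimingSummary` and `summarize` are hypothetical.

```rust
/// Summary statistics over measured iteration timings, in seconds.
/// Hypothetical type: the real benchmark script is Python and may differ.
#[derive(Debug, PartialEq)]
pub struct TimingSummary {
    pub min: f64,
    pub mean: f64,
    pub median: f64,
    pub max: f64,
}

/// Summarizes measured samples (warmup runs are assumed to be excluded
/// by the caller). Returns `None` for an empty sample set.
pub fn summarize(samples: &[f64]) -> Option<TimingSummary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let mean = sorted.iter().sum::<f64>() / n as f64;
    // For an even sample count, the median is the mean of the two middle values.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    Some(TimingSummary {
        min: sorted[0],
        mean,
        median,
        max: sorted[n - 1],
    })
}
```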
Add config/languages.yml with comment style definitions and file extensions mapping for various programming languages.
Adds new sephera_cli package to the workspace with clap for CLI argument parsing and integration with sephera_core. Includes tempfile for test support.
Add the CodeLoc analyzer that provides line-of-code counting functionality with support for parallel file scanning, language detection, and ignore pattern matching.
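As a rough illustration of line-of-code counting, the sketch below classifies each line of a source string as code, comment, or blank. It is not the CodeLoc implementation: `LocCounts` and `count_lines` are hypothetical names, the comment markers are assumed to be non-empty, and parallel scanning and ignore patterns are left out.

```rust
/// Per-file line totals (hypothetical shape).
#[derive(Debug, Default, PartialEq)]
pub struct LocCounts {
    pub code: usize,
    pub comment: usize,
    pub blank: usize,
}

/// Classifies lines using an optional single-line comment prefix and an
/// optional (open, close) block comment pair. Simplification: a line counts
/// as a comment only when it starts with a marker or lies inside a block.
pub fn count_lines(
    source: &str,
    line_prefix: Option<&str>,
    block: Option<(&str, &str)>,
) -> LocCounts {
    let mut counts = LocCounts::default();
    let mut in_block = false;
    for raw in source.lines() {
        let line = raw.trim();
        if in_block {
            counts.comment += 1;
            // Leave the block once the closing marker shows up.
            if block.map_or(false, |(_, close)| line.contains(close)) {
                in_block = false;
            }
        } else if line.is_empty() {
            counts.blank += 1;
        } else if line_prefix.map_or(false, |prefix| line.starts_with(prefix)) {
            counts.comment += 1;
        } else if let Some((open, close)) =
            block.filter(|(open, _)| line.starts_with(*open))
        {
            counts.comment += 1;
            // Stay in the block unless it closes on the same line.
            in_block = !line[open.len()..].contains(close);
        } else {
            counts.code += 1;
        }
    }
    counts
}
```

With this simplification, a line that ends in a trailing comment still counts as code.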
Add unit tests for the language data module covering:
- builtin language table verification (103 languages)
- extension-based language lookup (.rs -> Rust)
- exact file name resolution (Makefile, .vimrc)
- unsupported file handling (returns None)
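The lookup behaviour these tests describe might look roughly like the sketch below. The three-entry table, the `Language` struct, the `detect_language` function, and the "Vim Script" display name are all assumptions made for illustration; the real builtin table has 103 languages.

```rust
use std::path::Path;

/// Hypothetical language entry; the generated table in the project may differ.
pub struct Language {
    pub name: &'static str,
    pub extensions: &'static [&'static str],
    pub exact_names: &'static [&'static str],
}

/// Tiny stand-in for the builtin language table.
static LANGUAGES: &[Language] = &[
    Language { name: "Rust", extensions: &["rs"], exact_names: &[] },
    Language { name: "Makefile", extensions: &["mk"], exact_names: &["Makefile"] },
    Language { name: "Vim Script", extensions: &["vim"], exact_names: &[".vimrc"] },
];

/// Resolves a path to a language: exact file names first, then extension.
/// Returns `None` for unsupported files.
pub fn detect_language(path: &str) -> Option<&'static Language> {
    let file_name = Path::new(path).file_name()?.to_str()?;
    // Exact names cover files such as `Makefile` and dotfiles like `.vimrc`,
    // which have no extension as far as `Path::extension` is concerned.
    if let Some(lang) = LANGUAGES
        .iter()
        .find(|lang| lang.exact_names.iter().any(|name| *name == file_name))
    {
        return Some(lang);
    }
    let extension = Path::new(file_name).extension()?.to_str()?;
    LANGUAGES
        .iter()
        .find(|lang| lang.extensions.iter().any(|ext| *ext == extension))
}
```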
Add a new sephera_tools binary crate to the workspace with CLI support using clap and serialization support via serde/serde_yaml.
Adds a new module for generating benchmark corpus files including:
- datasets, templates, and writer modules
- CLI command to generate benchmark data for performance testing
- Creates README.txt in the output directory
Add DatasetSpec struct and DATASETS constant defining small, medium, and large benchmark corpus configurations with module count, files per module, and body repeat parameters.
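A minimal sketch of what such a struct and constant could look like is shown below. The field names follow the description, but the field types and every number are placeholders, not the values used in the repository.

```rust
/// Benchmark dataset configuration. Field names follow the commit
/// description; types and values are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DatasetSpec {
    pub name: &'static str,
    pub module_count: usize,
    pub files_per_module: usize,
    pub body_repeat: usize,
}

impl DatasetSpec {
    /// Number of files generated per language template
    /// (illustrative helper, not from the source).
    pub const fn total_files(&self) -> usize {
        self.module_count * self.files_per_module
    }
}

/// Placeholder numbers chosen only to show increasing scale.
pub const DATASETS: &[DatasetSpec] = &[
    DatasetSpec { name: "small", module_count: 4, files_per_module: 8, body_repeat: 2 },
    DatasetSpec { name: "medium", module_count: 16, files_per_module: 16, body_repeat: 8 },
    DatasetSpec { name: "large", module_count: 64, files_per_module: 32, body_repeat: 32 },
];
```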
…guages Adds template definitions for Rust, Python, TypeScript, HTML, JSON, TOML, and Shell that are used by the benchmark corpus generator. Each template includes the file extension, source body, and a repeat divisor for controlling fixture scaling.
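One plausible reading of the repeat divisor is that it scales down a dataset's body repeat for templates with large bodies. The sketch below assumes exactly that; `LanguageTemplate`, `render_fixture`, and the division rule are guesses rather than the project's actual logic.

```rust
/// Hypothetical template entry mirroring the three fields in the description.
pub struct LanguageTemplate {
    pub extension: &'static str,
    pub body: &'static str,
    pub repeat_divisor: usize,
}

/// Renders a fixture by repeating the template body.
/// Assumption: repeats = body_repeat / repeat_divisor, with a floor of 1
/// so that every fixture contains the body at least once.
pub fn render_fixture(template: &LanguageTemplate, body_repeat: usize) -> String {
    let divisor = template.repeat_divisor.max(1); // guard against a zero divisor
    let repeats = (body_repeat / divisor).max(1);
    template.body.repeat(repeats)
}
```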
Verifies that generate_benchmark_corpus creates the expected directory structure including small/medium/large directories and fixture files.
Add a new writer module that generates benchmark corpus files based on dataset specifications and language templates. The module creates dataset directories with module subdirectories, writes Makefile and Dockerfile fixtures, and renders language-specific fixture files using templates.
Add a new module for loading and parsing language registry data from YAML files, including functions to load from file path and from YAML string content.
Add data structures to represent language specifications including CommentStyleSpec, RawLanguageSpec, RawLanguageRegistry, LanguageSpec, and LanguageRegistry. These models support deserialization from YAML configuration files and provide the foundation for the language registry loader functionality.
Add a new module for generating Rust source files from language registry YAML configurations. This module provides functionality to render comment styles, builtin languages, and lookup indices into a static Rust module that can be used for fast language detection.
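The idea of rendering a static lookup table as Rust source can be sketched as below. This is an illustration under assumptions: `render_extension_table` and the emitted `EXTENSION_TABLE` name are hypothetical, and the real generator also renders comment styles and builtin languages.

```rust
use std::collections::BTreeMap;

/// Renders an extension -> language-name map as a Rust source fragment.
/// A `BTreeMap` iterates in sorted key order, so the output is the same
/// regardless of the order in which entries were inserted.
pub fn render_extension_table(extensions: &BTreeMap<String, String>) -> String {
    let mut out = String::from("// @generated: do not edit by hand.\n");
    out.push_str("pub static EXTENSION_TABLE: &[(&str, &str)] = &[\n");
    for (ext, lang) in extensions {
        // `{:?}` on a string emits a quoted, escaped literal.
        out.push_str(&format!("    ({ext:?}, {lang:?}),\n"));
    }
    out.push_str("];\n");
    out
}
```

Sorted iteration is one simple way to get the deterministic output that snapshot-style tests depend on.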
Add comprehensive unit tests for the language_data module covering:
- YAML registry parsing validation
- Missing comment style rejection
- Duplicate extension conflict detection
- Rendering determinism verification
- Snapshot comparison tests
- Legacy dotfile exact names support
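Duplicate extension conflict detection, one of the checks listed above, could be implemented along these lines. The `ExtensionConflict` type and `check_extensions` function are hypothetical, and the input is simplified to (language name, extensions) pairs instead of the parsed YAML registry.

```rust
use std::collections::HashMap;

/// Describes two languages claiming the same extension (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct ExtensionConflict {
    pub extension: String,
    pub first: String,
    pub second: String,
}

/// Returns an error for the first extension that is claimed twice.
/// Note: a language listing the same extension twice is also reported.
pub fn check_extensions(languages: &[(&str, &[&str])]) -> Result<(), ExtensionConflict> {
    let mut owners: HashMap<&str, &str> = HashMap::new();
    for &(name, extensions) in languages {
        for &ext in extensions {
            // `insert` returns the previous owner if the key was already taken.
            if let Some(previous) = owners.insert(ext, name) {
                return Err(ExtensionConflict {
                    extension: ext.to_string(),
                    first: previous.to_string(),
                    second: name.to_string(),
                });
            }
        }
    }
    Ok(())
}
```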
Add comprehensive README covering project overview, current features, workspace layout, quick start instructions, benchmark results, and development setup for the Sephera Rust CLI project.
Add support for generating and running benchmarks with an extra-large dataset. The generate-benchmark-corpus function now accepts dataset names and filters out the 'repo' dataset when generating synthetic datasets.
Adds a `dataset_names` parameter to `generate_benchmark_corpus` function, allowing users to specify which datasets to include when generating the benchmark corpus. Also exposes `default_benchmark_dataset_names()` and `available_benchmark_dataset_names()` helper functions. The README output now lists the selected datasets.

BREAKING CHANGE: `generate_benchmark_corpus` function now requires a `dataset_names` parameter
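Dataset name selection, including the 'repo' filter mentioned earlier, might work roughly as sketched here. The function name `select_synthetic_datasets`, the constant, the error type, and the de-duplication step are assumptions; the actual signature of `generate_benchmark_corpus` is not shown in this pull request text.

```rust
/// Synthetic dataset names assumed to be available (hypothetical constant).
pub const AVAILABLE_DATASETS: &[&str] = &["small", "medium", "large", "extra-large"];

/// Filters requested names down to the synthetic datasets to generate.
/// - "repo" is skipped, since it refers to a real repository, not a generated corpus.
/// - Unknown names are rejected.
/// - Duplicates are dropped while preserving request order.
pub fn select_synthetic_datasets(requested: &[&str]) -> Result<Vec<String>, String> {
    let mut selected: Vec<String> = Vec::new();
    for &name in requested {
        if name == "repo" {
            continue;
        }
        if !AVAILABLE_DATASETS.iter().any(|&known| known == name) {
            return Err(format!("unknown dataset: {name}"));
        }
        if !selected.iter().any(|existing| existing.as_str() == name) {
            selected.push(name.to_string());
        }
    }
    Ok(selected)
}
```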
…dataset Introduce DatasetSize enum to support both fixed repeat counts and target total bytes for dataset sizing. Add new "extra-large" dataset (2GB) and functions to resolve dataset specs by name, enabling more flexible benchmark corpus generation.
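A sizing enum of this kind could be modelled as in the sketch below. Variant names, the `NamedDataset` struct, `find_dataset`, and the fixed repeat values are illustrative; "2GB" is interpreted here as 2 GiB, which is also an assumption.

```rust
/// How a dataset's size is determined (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DatasetSize {
    /// Repeat each template body a fixed number of times.
    FixedRepeat(usize),
    /// Grow the corpus until it reaches roughly this many bytes.
    TargetTotalBytes(u64),
}

/// Hypothetical named entry pairing a dataset with its sizing rule.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NamedDataset {
    pub name: &'static str,
    pub size: DatasetSize,
}

/// Placeholder repeat counts; only the "extra-large" target comes from the text.
pub const NAMED_DATASETS: &[NamedDataset] = &[
    NamedDataset { name: "small", size: DatasetSize::FixedRepeat(2) },
    NamedDataset { name: "medium", size: DatasetSize::FixedRepeat(8) },
    NamedDataset { name: "large", size: DatasetSize::FixedRepeat(32) },
    NamedDataset {
        name: "extra-large",
        size: DatasetSize::TargetTotalBytes(2 * 1024 * 1024 * 1024),
    },
];

/// Resolves a dataset by name, returning `None` for unknown names.
pub fn find_dataset(name: &str) -> Option<&'static NamedDataset> {
    NAMED_DATASETS.iter().find(|dataset| dataset.name == name)
}
```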
Add the ability to specify dataset sizes by target total bytes instead of just fixed repeat counts. Uses binary search to find the minimum body repeat needed to reach the target size.
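The binary search described above can be sketched as follows, assuming the corpus size is a non-decreasing function of the body repeat. The function name, the `max_repeat` bound, and the `size_for` callback are hypothetical stand-ins for however the project estimates corpus size.

```rust
/// Finds the smallest repeat in `1..=max_repeat` whose estimated corpus size
/// reaches `target_bytes`. `size_for` must be non-decreasing in its argument.
/// Returns `None` if even `max_repeat` falls short of the target.
pub fn min_repeat_for_target(
    target_bytes: u64,
    max_repeat: u64,
    size_for: impl Fn(u64) -> u64,
) -> Option<u64> {
    if max_repeat == 0 || size_for(max_repeat) < target_bytes {
        return None;
    }
    let (mut lo, mut hi) = (1u64, max_repeat);
    // Invariant: size_for(hi) >= target_bytes, and every repeat below lo is too small.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if size_for(mid) >= target_bytes {
            hi = mid; // mid is large enough; try smaller
        } else {
            lo = mid + 1; // mid is too small
        }
    }
    Some(lo)
}
```

For example, with a size model of 100 fixed bytes plus 30 bytes per repeat, a 1000-byte target resolves to a repeat of 30.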
No description provided.