Cacheless semantic code + document search that ties or beats transformer baselines. One binary, 19 grammars, three retrieval engines, zero setup.
ripvec finds code and documents by meaning, provides structural code intelligence across every language it knows, and ranks results by how important each file is in your project. The default engine runs CPU-only, holds no on-disk index, and matches or exceeds ModernBERT-class transformers on our benchmark matrix across code and prose. Transformer engines remain available, opt-in, for users who want a persistent index and the best coherent top-K on long-form narrative.
$ ripvec "retry logic with exponential backoff" ~/src/my-project
1. retry_handler.rs:42-78 [0.91]
pub async fn with_retry<F, T>(f: F, max_attempts: u32) -> Result<T>
where F: Fn() -> Future<Output = Result<T>> {
let mut delay = Duration::from_millis(100);
for attempt in 0..max_attempts {
match f().await {
Ok(v) => return Ok(v),
Err(e) if attempt < max_attempts - 1 => {
sleep(delay).await;
delay *= 2; // exponential backoff
...
2. http_client.rs:156-189 [0.84]
impl HttpClient {
async fn request_with_backoff(&self, req: Request) -> Response {
...
The function is called `with_retry`, the variable is `delay`. "exponential backoff" appears nowhere in the source. grep can't find this. ripvec can, because it embeds both your query and the code into the same vector space, fuses semantic scores with path-enriched BM25, layers a structural-importance signal from a PageRank percentile boost, and reranks the top candidates through a cross-encoder.
ripvec has three interfaces. Here's when each one matters:
| Interface | When to use it | Who uses it |
|---|---|---|
| CLI (`ripvec "query" .`) | Terminal search, interactive TUI, one-shot queries | You, directly |
| MCP server (`ripvec-mcp`) | AI agent needs to search or understand your codebase | Claude Code, Cursor, any MCP client |
| LSP server (`ripvec-mcp --lsp`) | Editor/agent needs symbols, definitions, diagnostics | Claude Code's LSP tool, editors |
The MCP server gives AI agents 8 semantic + structural tools plus 9 LSP tools. The LSP server gives editors structural intelligence (outlines, go-to-definition, syntax diagnostics) for all 19 languages from one binary. The CLI is for humans. Same binary for all three.
If you're using Claude Code, install the plugin. It sets up both MCP and LSP automatically; Claude will use search_code when you ask conceptual questions and the LSP for symbol navigation.
Three retrieval engines share the same CLI/MCP/LSP surface. Pick at runtime with --model:
graph TB
Q["Query"] --> S["Shared surface<br/>CLI / MCP / LSP"]
S --> R["--model ripvec<br/>(default)"]
S --> M["--model modernbert"]
S --> B["--model bert"]
R --> RP["Model2Vec 32M bi-encoder<br/>+ path-enriched BM25<br/>+ PageRank percentile boost<br/>+ TinyBERT-L-2 cross-encoder rerank (corpus-aware)<br/>= in-memory only"]
M --> MP["ModernBERT 768-dim transformer<br/>+ BM25 + PageRank<br/>= persistent on-disk index"]
B --> BP["BGE-small 384-dim transformer<br/>+ BM25 + PageRank<br/>= persistent on-disk index"]
| Engine | Pipeline | Cache | When to pick |
|---|---|---|---|
| `ripvec` (default) | Model2Vec potion-base-32M + path-enriched BM25 + function-level PageRank + TinyBERT-L-2 cross-encoder rerank (gated by corpus class) | none (in-memory per session) | Default. Sub-MCPs, fresh worktrees, fan-out agents, document archives, anywhere first-query latency matters. |
| `modernbert` | ModernBERT 768-dim transformer + BM25 + PageRank | `~/.cache/ripvec/` or `.ripvec/cache/` | Workstation with a persistent index. Best coherent top-10 on long-form narrative prose. GPU-capable (Metal/MLX/CUDA). |
| `bert` | BGE-small 384-dim transformer + BM25 + PageRank | `~/.cache/ripvec/` or `.ripvec/cache/` | Lighter transformer alternative. Lower memory footprint, lower top-K coherence than ModernBERT. |
The MCP daemon picks the engine at startup via RIPVEC_MCP_ENGINE (defaults to ripvec); the transformer engines require building ripvec-mcp with --features legacy-transformer-mcp. The CLI accepts --model per-invocation with no build-time gating.
Two reproducible benchmarks anchor ripvec's behavior, both run from a fresh checkout via cargo run --release --example corpus_bench. The corpora and query / target-file annotations are checked in under tests/corpus/.
Workload: tests/corpus/code, ~2 GB across nine codebases (tokio, redis, react, spring-boot, go, linux, ripgrep, flask, express). Query set: 20 architectural and semantic queries against tokio with file-level ground truth (tests/corpus/annotations/tokio.json). Scoring: NDCG@10, recall@10, precision@10 with suffix-path matching.
| metric | value |
|---|---|
| chunks indexed | 1,075,655 |
| index build | 65 s |
| PageRank graph build | 45 s |
| query p50 | 42 ms |
| query p90 | 168 ms |
| query p99 | 241 ms |
| NDCG@10 | 0.665 |
| recall@10 | 0.767 |
| precision@10 | 0.120 |
Workload: tests/corpus/gutenberg, 10 Project Gutenberg books (~2 MB plain text). Query set: 15 natural-language queries each mapping to a single relevant book (tests/corpus/annotations/gutenberg.json). Same scoring.
| metric | value |
|---|---|
| chunks indexed | 1,652 |
| index build | 120 ms |
| query p50 | 34 ms |
| query p90 | 36 ms |
| query p99 | 36 ms |
| NDCG@10 | 1.000 |
| recall@10 | 1.000 |
| precision@10 | 0.100 (one relevant book per query, top-10) |
Every query returns the correct book at rank 1.
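The three reported metrics can be sketched for the binary, file-level relevance used here. This is a minimal illustration, assuming the standard log2-discounted NDCG definition; the function names are hypothetical and the harness's suffix-path matching is left out.

```rust
/// Discounted cumulative gain at cutoff `k` for binary relevance.
/// `hits[i]` is true when the result at rank i+1 is relevant.
fn dcg_at_k(hits: &[bool], k: usize) -> f64 {
    let mut dcg = 0.0;
    for (i, hit) in hits.iter().take(k).enumerate() {
        if *hit {
            // Rank is i+1, so the discount is log2(rank + 1) = log2(i + 2).
            dcg += 1.0 / ((i as f64) + 2.0).log2();
        }
    }
    dcg
}

/// NDCG@k: DCG divided by the DCG of an ideal ranking that places all
/// `num_relevant` relevant items at the top.
fn ndcg_at_k(hits: &[bool], num_relevant: usize, k: usize) -> f64 {
    let ideal = vec![true; num_relevant.min(k)];
    let idcg = dcg_at_k(&ideal, k);
    if idcg == 0.0 { 0.0 } else { dcg_at_k(hits, k) / idcg }
}

/// Fraction of the relevant items that appear in the top k.
fn recall_at_k(hits: &[bool], num_relevant: usize, k: usize) -> f64 {
    if num_relevant == 0 {
        return 0.0;
    }
    hits.iter().take(k).filter(|h| **h).count() as f64 / num_relevant as f64
}

/// Fraction of the top k slots occupied by relevant items.
fn precision_at_k(hits: &[bool], k: usize) -> f64 {
    hits.iter().take(k).filter(|h| **h).count() as f64 / k as f64
}
```

With one relevant book per query and that book at rank 1, NDCG@10 and recall@10 are both 1.0 while precision@10 cannot exceed 0.1, which is why the table reports 0.100.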
semble is the closest published baseline for this stack: static-embedding bi-encoder, path-enriched BM25, ranking layer. ripvec runs semble's full published benchmark (63 repos, 19 languages, 1,251 queries) end-to-end. Full per-language tables, methodology, and raw JSON outputs live in docs/benchmarks/full_corpus.md.
Macro-averaged across languages:
| pipeline | NDCG@10 | q-p50 | q-p99 | index |
|---|---|---|---|---|
| semble (potion-code-16M) | 0.852 | 2.22 ms | 11.35 ms | 1347 ms |
| ripvec matched (same model, no PageRank, no rerank) | 0.845 | 0.33 ms | 4.20 ms | 110 ms |
| ripvec default (potion-base-32M + PageRank + auto-rerank) | 0.803 | 0.35 ms | 4.31 ms | 109 ms |
Matched-mode quality sits within 0.007 NDCG@10 of semble while running 6.7× faster at p50, 2.7× at p99, and 12.2× faster on index build. The matched cell answers "is the port faithful": same model, same algorithm shape, deltas attribute to the implementation. The default cell answers "what does a user get out of the box": ripvec's shipped configuration trades 0.049 NDCG@10 on this code-heavy corpus for the headroom the 32M model gives on prose (NDCG@10 = 1.000 on the Gutenberg benchmark above) and for the PageRank prior that helps architectural queries on import-graph-heavy codebases.
The pipelines differ on three axes:
- Embedding model. semble defaults to `potion-code-16M` (code-tuned). ripvec defaults to `potion-base-32M` (general). 16M leads 32M on this code corpus by 0.042 NDCG@10; 32M leads 16M on the prose benchmark by 0.058. Both models are available via `--model`.
- Reranker. semble has no cross-encoder. ripvec applies `ms-marco-TinyBERT-L-2-v2` on Docs and Mixed corpora when the query is natural-language; pure Code corpora skip it (the gate fires zero times across this 63-repo run, by design).
- Structural prior. ripvec computes function-level PageRank over the import / call graph and applies a percentile-based boost. semble has no equivalent.
# Single-corpus end-to-end harness (code, ~25 min).
cargo run --release --example corpus_bench -- \
tests/corpus/code tests/corpus/annotations/tokio.json --scope code
# Single-corpus end-to-end harness (prose, ~30 s).
cargo run --release --example corpus_bench -- \
tests/corpus/gutenberg tests/corpus/annotations/gutenberg.json --scope docs
# Full semble corpus replay (63 repos, ~25 min after one-time clone).
cd ~/src/semble && uv run python -m benchmarks.sync_repos # ~10 GB
cargo run --release --example semble_full_bench --features cpu-accelerate -- \
--mode matched --out docs/benchmarks/results/ripvec_matched.json
cargo run --release --example semble_full_bench --features cpu-accelerate -- \
--mode default --out docs/benchmarks/results/ripvec_default.json
Bench flags:
| flag | default | purpose |
|---|---|---|
| `--candidates N` | 50 | cap on candidates the reranker sees |
| `--rerank-model REPO` | `cross-encoder/ms-marco-TinyBERT-L-2-v2` | swap cross-encoder |
| `--model REPO` | `minishlab/potion-base-32M` | swap bi-encoder |
| `--scope {code,docs,all}` | (from arg) | corpus filter intent |
| `--repeats N` | 5 | timing reps per query |
| `--no-rerank` / `--rerank` | auto | force the gate one way |
For matched-model semble parity, cargo run --release --example semble_bench -- <repo> <annotations.json> mirrors the harness in ~/src/semble/benchmarks/run_benchmark.py.
graph LR
A["🗺️ Orient<br/>get_repo_map"] --> B["🔍 Search<br/>search(scope)"]
B --> C["🧭 Navigate<br/>LSP operations"]
C -->|"need more context"| B
C -->|"found it"| D["✏️ Edit"]
Orient. get_repo_map returns a structural overview ranked by function-level importance. One tool call replaces 10+ sequential file reads. Start here when working on unfamiliar code.
Search. search(query="authentication middleware", scope="code") finds implementations by meaning across all 19 languages simultaneously. Pass scope="docs" for documentation-only retrieval (with cross-encoder rerank), scope="all" (default) to search everything and let the corpus class decide whether rerank fires. Results are ranked by relevance and structural importance.
Navigate. LSP documentSymbol shows the file outline. goToDefinition jumps to the likely definition. findReferences shows usage sites. incomingCalls/outgoingCalls traces the call graph.
You describe behavior, ripvec finds the implementation:
| What you want | grep / ripgrep | ripvec |
|---|---|---|
| "retry with backoff" | Nothing (code says `delay *= 2`) | Finds the retry handler |
| "database connection pool" | Comments mentioning "pool" | The pool implementation |
| "authentication middleware" | `// TODO: add auth` | The auth guard |
| "WebSocket lifecycle" | String "WebSocket" | Connect/disconnect handlers |
Search modes: --mode hybrid (default, semantic + BM25 fusion), --mode semantic (pure vector similarity), --mode keyword (pure BM25). Hybrid is usually best.
Documents about a topic (READMEs, design specs, RFCs, code comments) literally use the topic's words. Code that implements the topic usually doesn't. Semantic similarity therefore systematically ranks docs above implementations on descriptive queries, and the right answer depends on what the agent is looking for.
scope lets the caller declare intent:
| Scope | Includes | Rerank | When to pick |
|---|---|---|---|
| `code` | code-language extensions (`.py`, `.rs`, `.ts`, `.go`, …) | off | "Find the implementation of X." |
| `docs` | prose extensions (`.md`, `.rst`, `.txt`, `.adoc`, `.org`, `.mdx`) | on (NL queries) | "Find documentation about X / how X is described." |
| `all` (default) | everything | corpus-aware | "Search everything; let the gate decide whether rerank fires." |
include_extensions and exclude_extensions give surgical control on top of scope (e.g. scope=all, exclude_extensions=["min.js"]). Same flags on CLI: --scope, --include-ext, --exclude-ext.
The MCP search tool exposes these as JSON params; the CLI exposes them as flags.
ripvec serves LSP from a single binary for all 19 grammars. No per-language server installs. It provides:
- `documentSymbol`: file outline (functions, fields, enum variants, constants, types, headings)
- `workspaceSymbol`: cross-language symbol search with PageRank boost
- `goToDefinition`: name-based resolution ranked by structural importance
- `findReferences`: usage sites via hybrid search + content filtering
- `hover`: scope chain, signature, enriched context
- `publishDiagnostics`: tree-sitter syntax error detection after every edit
- `incomingCalls` / `outgoingCalls`: function-level call graph
For languages with dedicated LSPs (Rust, Python, Go, TypeScript), ripvec runs alongside them. The dedicated server handles types, ripvec handles semantic search and cross-language features. For languages without dedicated LSPs (bash, HCL, Ruby, Kotlin, Swift, Scala), ripvec is the primary code intelligence.
JSON, YAML, TOML, and Markdown get structural outlines (keys, mappings, headings) and syntax diagnostics. Useful for navigating large config files, not comparable to language-aware intelligence.
The default engine is a four-stage composite pipeline. Each stage uses a fast cheap-to-rebuild signal; together they outperform a single transformer on retrieval quality.
graph TB
Q["Query"] --> EMB["Bi-encoder embed<br/>(Model2Vec potion-base-32M, 256-dim)"]
Q --> BM["BM25 score<br/>(path-enriched, postings-list inverted)"]
EMB --> SEM["Cosine similarity<br/>parallel sgemv across rayon row-shards<br/>top-N candidates"]
BM --> LEX["Lexical ranking<br/>par_iter over query terms<br/>top-N candidates"]
SEM --> RRF["Reciprocal Rank Fusion<br/>(k=60)"]
LEX --> RRF
RRF --> PR["× PageRank percentile boost<br/>(sigmoid curve, α=0.5)"]
PR --> GATE{"Corpus class<br/>(≥30% prose chunks?)"}
GATE -->|"Docs / Mixed"| RR["Cross-encoder rerank<br/>(ms-marco-TinyBERT-L-2-v2)<br/>top-50 candidates"]
GATE -->|"Code"| OUT["Top-k results"]
RR --> OUT
Static bi-encoder retrieval (Model2Vec). The bi-encoder is a lookup-and-mean-pool over a pretrained 256-dim embedding table (minishlab/potion-base-32M). No transformer forward pass; encoding cost is dominated by memory bandwidth, not FLOPs. About 5ms per query on a single CPU thread; ~250K chunks per second when indexing in parallel.
Path-enriched BM25. Lexical scoring with a code-aware tokenizer that splits parseJsonConfig into [parse, json, config] and my_func_name into [my, func, name]. Chunk text is enriched with the file stem (doubled) and the last three directory components before tokenization, so a query like "session encoding" hits both content and sessions.py paths.
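The identifier splitting and path enrichment described above can be sketched as follows. This is an illustration under assumptions, not ripvec's tokenizer: `split_identifier` and `enrich_with_path` are hypothetical names, the order in which path terms are prepended is a guess, and acronym runs such as `HTTPServer` are deliberately not split.

```rust
use std::path::Path;

/// Split an identifier on non-alphanumeric separators and on
/// lower-to-upper camelCase boundaries, lowercasing each piece.
fn split_identifier(ident: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in ident.chars() {
        if !ch.is_alphanumeric() {
            // Separator such as '_' or '-': close the current token.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        // camelCase boundary: an uppercase letter right after a lowercase one.
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            tokens.push(std::mem::take(&mut current));
        }
        current.extend(ch.to_lowercase());
        prev_lower = ch.is_lowercase() || ch.is_numeric();
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

/// Prepend the file stem (twice) and up to the last three directory
/// components to the chunk text before it is tokenized for BM25.
fn enrich_with_path(chunk_text: &str, path: &Path) -> String {
    let stem = path.file_stem().and_then(|s| s.to_str()).unwrap_or("");
    let dirs: Vec<&str> = path
        .parent()
        .map(|p| p.iter().filter_map(|c| c.to_str()).collect())
        .unwrap_or_default();
    let last_three = &dirs[dirs.len().saturating_sub(3)..];
    format!("{stem} {stem} {} {chunk_text}", last_three.join(" "))
}
```

Doubling the stem raises its term frequency, so a query term that matches the file name contributes more to the BM25 score than a single mention in the body would.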
Reciprocal Rank Fusion. Combines the semantic and lexical rankings via Cormack et al.'s rank-based fusion (k=60). Handles the scale mismatch between cosine similarity and BM25 without tuning.
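A minimal sketch of reciprocal rank fusion with k=60, assuming chunks are identified by integer ids and each input ranking lists ids best-first. The function name and the tie-break on id are illustrative choices, not taken from ripvec's source.

```rust
use std::collections::HashMap;

/// Fuse several best-first rankings. Each list contributes
/// 1 / (k + rank) per document, with rank starting at 1.
fn reciprocal_rank_fusion(rankings: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for ranking in rankings {
        for (index, &doc_id) in ranking.iter().enumerate() {
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + index as f64 + 1.0);
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the formula, so the raw cosine and BM25 magnitudes never need to be put on a common scale.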
PageRank percentile boost. A structural-importance signal on top of relevance. See the next section.
Cross-encoder rerank (prose-class corpora). When the index's corpus class is Docs or Mixed (at least 30% of indexed chunks are prose-extension files) and the query is natural-language, the top 50 candidates are re-scored by ms-marco-TinyBERT-L-2-v2: a 2-layer cross-encoder distilled from BERT-base, ~5 MB on disk, ~0.3 ms per pair on CPU. The model swaps in from a sweep against the larger ms-marco-MiniLM-L-12-v2 (33 MB, 12 layers): TinyBERT-L-2 holds NDCG@10 = 1.000 on the Gutenberg benchmark at 20× the throughput.
Wiring details: the BERT pooler (tanh(W_pool · cls)) runs between the trunk and the classifier head (matching the head the model was trained against). Raw classifier logits flow out (sentence-transformers Identity activation), and the ranking layer min-max normalizes both cross-encoder and bi-encoder score arrays within the candidate set before convex-combining (0.7 × cross + 0.3 × bi). Tokenizer truncation is LongestFirst at max_position_embeddings, preserving [CLS] / [SEP] on long inputs.
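The normalize-then-blend step can be sketched like this. It is a hedged illustration: the function names are hypothetical, and mapping a constant score array to all zeros is an assumption about the degenerate case that the text does not cover.

```rust
/// Rescale scores to [0, 1] within the candidate set.
/// A constant array maps to all zeros (assumed behaviour).
fn min_max_normalize(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().copied().fold(f32::INFINITY, f32::min);
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|&s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect()
}

/// Convex combination of normalized cross-encoder logits and
/// bi-encoder similarities: 0.7 * cross + 0.3 * bi.
fn blend_scores(cross_logits: &[f32], bi_scores: &[f32]) -> Vec<f32> {
    assert_eq!(cross_logits.len(), bi_scores.len());
    let cross = min_max_normalize(cross_logits);
    let bi = min_max_normalize(bi_scores);
    cross
        .iter()
        .zip(&bi)
        .map(|(&c, &b)| 0.7 * c + 0.3 * b)
        .collect()
}
```

Normalizing within the candidate set matters because raw classifier logits are unbounded while cosine similarities are not; without it the 0.7 / 0.3 weights would not mean what they say.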
Code-class corpora skip the reranker. The cross-encoder is trained on web-prose passage retrieval and adds latency without lifting NDCG on code: on the 8-Python-library benchmark, rerank-on costs roughly 0.09 NDCG@10 vs rerank-off regardless of which cross-encoder model is plugged in.
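The gate can be read as a two-condition check. The sketch below assumes the 30% threshold is inclusive and takes the natural-language decision as an input, since the text does not say how a query is classified; both function names are hypothetical.

```rust
/// True when at least 30% of indexed chunks come from prose-extension
/// files, i.e. the corpus class is Docs or Mixed rather than Code.
fn is_prose_class(prose_chunks: usize, total_chunks: usize) -> bool {
    // Integer arithmetic avoids floating-point edge cases at the threshold.
    total_chunks > 0 && prose_chunks * 100 >= total_chunks * 30
}

/// The cross-encoder runs only on prose-class corpora and only for
/// natural-language queries.
fn should_rerank(
    prose_chunks: usize,
    total_chunks: usize,
    query_is_natural_language: bool,
) -> bool {
    is_prose_class(prose_chunks, total_chunks) && query_is_natural_language
}
```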
graph LR
subgraph "Call Graph"
A["main()"] --> B["handle_request()"]
A --> C["init_db()"]
B --> D["authenticate()"]
B --> E["dispatch()"]
D --> F["verify_token()"]
E --> D
end
subgraph "PageRank"
D2["authenticate() ★★★"]
B2["handle_request() ★★"]
E2["dispatch() ★"]
end
ripvec extracts call expressions from every function body using tree-sitter, resolves callee names to definitions, and computes PageRank on the resulting call graph. Functions called by many others rank higher. authenticate() in the example above is more structurally important than dispatch() because more code depends on it.
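For readers unfamiliar with PageRank, here is a textbook power-iteration sketch over a caller-to-callee edge list. It is not ripvec's implementation: the damping factor, the iteration count, and the uniform redistribution of rank from functions that call nothing are all assumptions.

```rust
/// Power-iteration PageRank. `edges` holds (caller, callee) pairs of
/// node indices; rank flows from callers to the functions they call.
fn pagerank(
    num_nodes: usize,
    edges: &[(usize, usize)],
    damping: f64,
    iterations: usize,
) -> Vec<f64> {
    let n = num_nodes as f64;
    let mut out_degree = vec![0usize; num_nodes];
    for &(from, _) in edges {
        out_degree[from] += 1;
    }
    let mut rank = vec![1.0 / n; num_nodes];
    for _ in 0..iterations {
        // Rank held by nodes with no outgoing calls is spread uniformly,
        // which keeps the total rank equal to 1.
        let dangling: f64 = (0..num_nodes)
            .filter(|&i| out_degree[i] == 0)
            .map(|i| rank[i])
            .sum();
        let mut next = vec![(1.0 - damping) / n + damping * dangling / n; num_nodes];
        for &(from, to) in edges {
            next[to] += damping * rank[from] / out_degree[from] as f64;
        }
        rank = next;
    }
    rank
}
```

On the example graph, `authenticate()` receives rank from both `handle_request()` and `dispatch()`, so it ends above `dispatch()`, which has a single caller.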
The bi-encoder is structurally weaker than a transformer. Model2Vec doesn't model cross-token interactions and can't reliably distinguish a 1500-char canonical implementation from a 3-line example stub by dense similarity alone. Without a corrective signal, the engine ranks tests/hello_world.py competitively with src/auth/handler.py on a query like "register a route." PageRank carries the missing signal: implementations are imported by tests and callers; stubs are imported by nothing.
ripvec applies the structural prior as a sigmoid-on-percentile boost: boost(p) = 1 + α × sigmoid((p − 0.5) / s) where p is the file's PR percentile within the corpus, α=0.5 is the ceiling lift, and s=0.15 controls steepness.
| PR percentile | Example file | Boost (α=0.5) |
|---|---|---|
| 0 (not in graph) | isolated leaf file | 1.00× (no boost) |
| 0.10 (bottom decile) | rarely-imported impl | 1.03× |
| 0.25 (lower quartile) | hub of one small module | 1.08× |
| 0.50 (median) | typical impl file | 1.25× |
| 0.75 (upper quartile) | heavily-imported module | 1.42× |
| 0.95 (near top) | central trait / API surface | 1.48× |
| 1.00 (graph root) | e.g. `tokio/src/lib.rs` | ~1.48× (asymptote 1.5×) |
Two design constraints fall out of this curve:
- At-or-above-median PR gets a meaningfully different boost from low-PR. A median-importance impl with cosine 0.84 ends at 0.84 × 1.25 = 1.05; a near-zero-PR test with cosine 0.85 ends at 0.85 × 1.02 = 0.867. The impl flips above the test by ~21%, enough to reorder reliably when the bi-encoder is uncertain.
- The ceiling caps centers-of-universe. A graph-root file at p=1.0 gets at most 1.5×. It can't dominate when the query genuinely matches a less-central file.
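The boost formula is small enough to write out directly. In this sketch the function name is hypothetical, and treating a file that is absent from the graph as `None` with no boost is one reading of the first table row.

```rust
/// Logistic function.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// boost(p) = 1 + alpha * sigmoid((p - 0.5) / s).
/// `percentile` is None for files that are not in the graph.
fn pagerank_boost(percentile: Option<f64>, alpha: f64, steepness: f64) -> f64 {
    match percentile {
        None => 1.0,
        Some(p) => 1.0 + alpha * sigmoid((p - 0.5) / steepness),
    }
}
```

With α=0.5 and s=0.15, a median file gets exactly 1.25×, so the implementation with cosine 0.84 scores 0.84 × 1.25 = 1.05 and outranks a near-zero-percentile test file with cosine 0.85.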
The boost is applied via a composable RankingLayer chain shared across CLI, MCP, and LSP code paths. Adding a new ranking signal (recency, file-saturation diversification) is a single new impl RankingLayer.
ripvec retains two transformer engines for users who want a persistent on-disk index and the absolute coherent top-K on long-form prose. Both share the cache architecture, the BM25/RRF/PageRank ranking layers, and the GPU backends; they differ only in the embedding model.
graph TD
subgraph "~/.cache/ripvec/<project_hash>/v3-modernbert/"
M["manifest.json<br/>file entries + Merkle hashes"]
L["manifest.lock<br/>advisory fd-lock"]
subgraph "objects/ (content-addressed)"
O1["ab/cdef12...<br/>(zstd-compressed FileCache)"]
O2["3f/a891bc...<br/>(zstd-compressed FileCache)"]
end
end
Each file's chunks and embeddings are serialized into a FileCache object, compressed with zstd (~8x), and stored by blake3 content hash in a git-style xx/hash sharded object store. The manifest tracks metadata: mtime, size, content hash, chunk count per file, plus Merkle directory hashes.
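The git-style sharding can be sketched as a path computation. This assumes the shard directory is the first two hex characters of the content hash and the file name is the remainder, which is what the `ab/cdef12...` entries in the diagram suggest; `object_path` is a hypothetical helper.

```rust
use std::path::PathBuf;

/// Map a hex content hash to `<store_root>/<first two chars>/<rest>`.
/// Returns None for input that is too short or not hexadecimal.
fn object_path(store_root: &str, hash_hex: &str) -> Option<PathBuf> {
    if hash_hex.len() < 3 || !hash_hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // Safe to split at a byte offset: the string is pure ASCII.
    let (shard, rest) = hash_hex.split_at(2);
    Some(PathBuf::from(store_root).join(shard).join(rest))
}
```

Sharding by prefix keeps any single directory from accumulating one entry per indexed file.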
The ripvec engine never builds any of this. It holds the in-memory index across an MCP session lifetime, drops it on reindex or process exit, and rebuilds on next query.
graph TD
F["File on disk"] --> M{"mtime + size<br/>match manifest?"}
M -->|"yes"| SKIP["Unchanged<br/>(fast path, no I/O)"]
M -->|"no"| HASH{"blake3 content hash<br/>matches manifest?"}
HASH -->|"yes"| TOUCH["Touched but identical<br/>(heal mtime in manifest)"]
HASH -->|"no"| DIRTY["Dirty → re-embed"]
Level 1 (mtime+size) is a stat call (microseconds). Level 2 (blake3 hash) reads the file but avoids re-embedding if content hasn't changed. After git clone (where all mtimes are wrong), the first run hashes everything but re-embeds nothing, then heals the manifest mtimes for fast-path on subsequent runs.
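The two-level check can be sketched as a pure function. The types are hypothetical, a `u64` stands in for the 32-byte blake3 digest, and the hash is passed as a closure so the fast path visibly never reads the file.

```rust
#[derive(Debug, PartialEq)]
enum FileState {
    Unchanged,           // level 1 hit: stat matches the manifest
    TouchedButIdentical, // level 2 hit: content hash matches, heal mtime
    Dirty,               // content changed: re-embed
}

/// What the manifest recorded for a file on the previous run.
struct ManifestEntry {
    mtime_secs: u64,
    size: u64,
    content_hash: u64, // stand-in for a blake3 digest
}

/// Classify a file given its current stat data. `hash_file` is only
/// invoked when the cheap mtime + size comparison fails.
fn classify(
    entry: &ManifestEntry,
    mtime_secs: u64,
    size: u64,
    hash_file: impl FnOnce() -> u64,
) -> FileState {
    if entry.mtime_secs == mtime_secs && entry.size == size {
        return FileState::Unchanged;
    }
    if hash_file() == entry.content_hash {
        FileState::TouchedButIdentical
    } else {
        FileState::Dirty
    }
}
```

After a fresh clone every file takes the second branch once, lands in `TouchedButIdentical`, and returns to the fast path as soon as the manifest mtimes are healed.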
| Format | Used for | Portable? |
|---|---|---|
| rkyv (zero-copy) | User-level cache (~/.cache) | No (architecture-dependent) |
| bitcode | Repo-level cache (.ripvec/) | Yes (cross-architecture) |
Auto-detected on read via magic bytes: 0x42 0x43 = bitcode, otherwise rkyv. Both are zstd-compressed. Repo-level indices use bitcode so they can be committed to git and shared between x86 CI and ARM developer machines.
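The format sniffing amounts to a two-byte prefix check. In this sketch the enum and function names are hypothetical, and whether the magic bytes are inspected before or after zstd decompression is not stated in the text, so the input is simply "the bytes being sniffed".

```rust
#[derive(Debug, PartialEq)]
enum CacheFormat {
    Bitcode,
    Rkyv,
}

/// 0x42 0x43 ("BC") marks bitcode; anything else is treated as rkyv.
fn detect_format(bytes: &[u8]) -> CacheFormat {
    if bytes.starts_with(&[0x42, 0x43]) {
        CacheFormat::Bitcode
    } else {
        CacheFormat::Rkyv
    }
}
```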
sequenceDiagram
participant MCP as MCP Server
participant Watcher as File Watcher
participant Lock as manifest.lock
participant Cache as Object Store
Note over MCP: Query arrives
MCP->>Lock: acquire read lock
MCP->>Cache: load objects
Lock-->>MCP: release
Note over Watcher: File change detected (2s debounce)
Watcher->>Lock: acquire write lock
Watcher->>Cache: re-embed dirty files
Watcher->>Cache: write new objects
Watcher->>Cache: save manifest + GC
Lock-->>Watcher: release
The file watcher debounces for 2 seconds of quiet before triggering re-indexing. Advisory fd-lock on manifest.lock prevents readers from seeing a half-written manifest. Multiple readers can proceed concurrently; writers block all readers.
Garbage collection runs after each incremental update; unreferenced objects (from deleted or re-embedded files) are removed from the store.
ripvec --model modernbert --index --repo-level "query"
git add .ripvec/ && git commit -m "add search index"
Creates `.ripvec/config.toml` (pins model + version) and `.ripvec/cache/` (manifest + objects). Teammates who clone get instant search. The config is validated on load. If the model doesn't match the runtime model, ripvec falls back to the user-level cache with a warning.
Repo config can also exclude files from the index using .gitignore syntax:
[ignore]
patterns = [
"*.jsonl",
"*.md",
"docs/generated/**",
"!docs/README.md",
]
These patterns apply to CLI indexing, incremental cache diffing, MCP reindexing, and repo-map file discovery. The command-line `--exclude-extensions=jsonl,md` flag is useful for one-off extension filters.
graph TD
A["--cache-dir override"] -->|"highest priority"| R["Resolved cache dir"]
B[".ripvec/config.toml<br/>(repo-local)"] -->|"if model matches"| R
C["RIPVEC_CACHE env var"] --> R
D["~/.cache/ripvec/<br/>(XDG default)"] -->|"lowest priority"| R
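The resolution order in the diagram above can be sketched as a chain of fallbacks. This is an assumption-laden illustration: the relative priority of the repo-local config and the `RIPVEC_CACHE` variable is inferred from the top-to-bottom order of the diagram, and `RepoConfig` and `resolve_cache_dir` are hypothetical names.

```rust
use std::path::PathBuf;

/// The parts of a repo-local `.ripvec/config.toml` relevant here (hypothetical shape).
struct RepoConfig {
    model: String,
    cache_dir: PathBuf,
}

/// Pick the cache directory: explicit override first, then the repo-local
/// cache if its pinned model matches, then the environment variable,
/// then the XDG default.
fn resolve_cache_dir(
    cli_override: Option<PathBuf>,
    repo: Option<&RepoConfig>,
    runtime_model: &str,
    env_cache: Option<PathBuf>,
    xdg_default: PathBuf,
) -> PathBuf {
    if let Some(dir) = cli_override {
        return dir;
    }
    if let Some(cfg) = repo {
        if cfg.model == runtime_model {
            return cfg.cache_dir.clone();
        }
    }
    if let Some(dir) = env_cache {
        return dir;
    }
    xdg_default
}
```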
graph LR
subgraph "Stage 1: Chunk (rayon)"
F["Files"] --> TS["Tree-sitter<br/>parse"]
TS --> C["Semantic<br/>chunks"]
end
subgraph "Stage 2: Tokenize"
C --> T["Tokenizer<br/>(BPE / WordPiece)"]
T --> B["Padded<br/>batches"]
end
subgraph "Stage 3: Embed (GPU)"
B --> FW["Forward pass<br/>(22 layers ModernBERT,<br/>12 layers BGE-small)"]
FW --> P["Mean / CLS pool<br/>+ L2 norm"]
P --> V["Embedding<br/>vectors"]
end
C -.->|"bounded channel<br/>backpressure"| T
T -.->|"bounded channel<br/>backpressure"| FW
For large corpora (1000+ files), stages run concurrently as a streaming pipeline with bounded channels for backpressure. The GPU starts embedding after the first batch (~50ms), not after all files are chunked.
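Bounded-channel backpressure can be shown with the standard library alone. The sketch below is a toy three-stage pipeline, not ripvec's: lines stand in for chunks, whitespace splitting stands in for tokenization, and counting tokens stands in for the embedding stage.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Run chunk -> tokenize -> "embed" as concurrent stages connected by
/// bounded channels. Returns the token count of each chunk, in order.
fn run_pipeline(files: Vec<String>, capacity: usize) -> Vec<usize> {
    let (chunk_tx, chunk_rx) = sync_channel::<String>(capacity);
    let (token_tx, token_rx) = sync_channel::<Vec<String>>(capacity);

    // Stage 1: split files into chunks. `send` blocks once `capacity`
    // chunks are waiting, which throttles this stage to the consumer's pace.
    let chunker = thread::spawn(move || {
        for file in files {
            for line in file.lines() {
                if chunk_tx.send(line.to_string()).is_err() {
                    return;
                }
            }
        }
    });

    // Stage 2: tokenize each chunk as soon as it arrives.
    let tokenizer = thread::spawn(move || {
        for chunk in chunk_rx {
            let tokens: Vec<String> = chunk.split_whitespace().map(str::to_string).collect();
            if token_tx.send(tokens).is_err() {
                return;
            }
        }
    });

    // Stage 3: consume batches on the calling thread. The loop ends when
    // both upstream senders have been dropped.
    let lengths: Vec<usize> = token_rx.into_iter().map(|tokens| tokens.len()).collect();
    chunker.join().unwrap();
    tokenizer.join().unwrap();
    lengths
}
```

The final stage starts working as soon as the first item clears the two upstream stages, which is the same property that lets the GPU begin embedding before chunking has finished.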
The core design insight for the transformer engines: the forward pass is written ONCE as a generic ModernBertArch<D: Driver>, and each backend implements the Driver trait with platform-specific operations. Same model, same math, different hardware.
graph TB
subgraph "Architecture (written once)"
FP["ModernBertArch<D: Driver><br/>forward()"]
FP --> L1["Layer 1: Attention + FFN"]
L1 --> L2["Layer 2: Attention + FFN"]
L2 --> LN["...22 layers..."]
LN --> Pool["Mean pool + L2 norm"]
end
subgraph "Driver trait implementations"
FP -.->|"D = Metal"| M["MetalDriver<br/>MPS GEMMs + custom MSL kernels"]
FP -.->|"D = CUDA"| CU["CudaDriver<br/>cuBLAS tensor cores + NVRTC kernels"]
FP -.->|"D = CPU"| CP["CpuDriver<br/>ndarray + Accelerate/OpenBLAS"]
FP -.->|"D = MLX"| ML["MlxDriver<br/>lazy eval → auto-fused Metal"]
end
Each of the 22 ModernBERT layers runs attention + FFN. Here's how the same operations map to different hardware:
graph LR
subgraph "Attention"
LN1["LayerNorm"] --> QKV["QKV projection<br/>(GEMM)"]
QKV --> PAD["Pad + Split"]
PAD --> ROPE["RoPE rotation"]
ROPE --> ATTN["Q @ K^T<br/>(batched GEMM)"]
ATTN --> SM["Scale + Mask<br/>+ Softmax"]
SM --> AV["Scores @ V<br/>(batched GEMM)"]
AV --> UNPAD["Reshape + Unpad"]
UNPAD --> OPROJ["Output proj<br/>(GEMM)"]
OPROJ --> RES1["Residual add"]
end
subgraph "FFN"
RES1 --> LN2["LayerNorm"]
LN2 --> WI["Wi projection<br/>(GEMM)"]
WI --> GEGLU["Split + GeGLU"]
GEGLU --> WO["Wo projection<br/>(GEMM)"]
WO --> RES2["Residual add"]
end
| Operation | Metal | CUDA | CPU | MLX |
|---|---|---|---|---|
| GEMM | MPS (AMX) | cuBLAS FP16 tensor cores | Accelerate / OpenBLAS | Auto-fused |
| Softmax+Scale+Mask | Fused MSL kernel | Fused NVRTC kernel | Scalar loop | Auto-fused |
| RoPE | Custom MSL kernel | Custom NVRTC kernel | Scalar loop | Lazy ops |
| GeGLU (split+gelu+gate) | Fused MSL kernel | Fused NVRTC kernel | Scalar loop | Auto-fused |
| Pad/Unpad/Reshape | Custom MSL kernels | Custom NVRTC kernels | Rust loops | Free (metadata) |
| FP16 support | Yes (all kernels) | Yes (all kernels) | No | No |
Metal and CUDA have hand-written fused kernels for softmax, GeGLU, and attention reshape. These eliminate intermediate buffers and reduce memory bandwidth. MLX gets fusion automatically via lazy evaluation (the entire forward pass typically compiles to 2-3 Metal kernel dispatches). CPU uses explicit scalar loops for everything except GEMM.
graph LR
subgraph "HybridIndex"
subgraph "SearchIndex (dense vectors)"
EMB["embeddings<br/>(TurboQuant 4-bit compressed)"]
EMB --> CS["Cosine similarity scan"]
end
subgraph "Bm25Index (tantivy)"
TAN["Inverted index<br/>(code-aware tokenizer)"]
TAN --> BM["BM25 scoring<br/>(name 3× / path 1.5× / body 1×)"]
end
CS --> RRF["RRF fusion (k=60)"]
BM --> RRF
end
The transformer BM25 index uses a code-aware tokenizer that splits parseJsonConfig into [parse, json, config] and my_func_name into [my, func, name]. Keyword search finds json config parser even if the function is named in camelCase. Function names are boosted 3x over body text.
TurboQuant compresses 768-dim vectors from 3KB to ~380 bytes (4-bit) with a rotation matrix for better quantization. This enables ~5x faster scanning for large indices while maintaining ranking quality through exact re-ranking of the top candidates.
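The size arithmetic is easy to verify: 768 dimensions × 4 bytes is 3,072 bytes, and at 4 bits per dimension two codes fit in each byte, giving 384 bytes before any per-vector metadata. The sketch below is a plain uniform scalar quantizer used only to illustrate the packing; it omits the rotation step and is not the TurboQuant algorithm.

```rust
/// Quantize each value to a 4-bit code (0..=15) and pack two codes per byte.
/// Returns the packed bytes plus the (min, scale) needed to decode.
fn quantize_4bit(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().copied().fold(f32::INFINITY, f32::min);
    let max = v.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
    let code = |x: f32| (((x - min) / scale).round() as u8).min(15);
    let packed = v
        .chunks(2)
        .map(|pair| {
            let lo = code(pair[0]);
            let hi = if pair.len() == 2 { code(pair[1]) } else { 0 };
            lo | (hi << 4) // low nibble = even index, high nibble = odd index
        })
        .collect();
    (packed, min, scale)
}

/// Recover approximate values from the packed 4-bit codes.
fn dequantize_4bit(packed: &[u8], len: usize, min: f32, scale: f32) -> Vec<f32> {
    (0..len)
        .map(|i| {
            let byte = packed[i / 2];
            let code = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            min + code as f32 * scale
        })
        .collect()
}
```

Each decoded value is off by at most half a quantization step, which is why a scan over compressed vectors is followed by exact re-ranking of the top candidates.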
Cacheless (ripvec engine, the default). Wall time for a single query, end-to-end including model load on cold start:
| Corpus | First query (cold) | Warm | Notes |
|---|---|---|---|
| Small repo (~500 files) | ~7s | 0.3s | Model download + index build dominate cold path |
| Medium repo (~5K files, e.g. Tokio) | ~12s | 0.8s | |
| Large repo (~50K files) | ~50s | 8s | Linear in file count for indexing |
| Linux kernel (~92K files, 1.7 GB) | ~75s | n/a (in-memory drops between processes) | |
The MCP daemon holds the in-memory index for the session lifetime, so warm latency dominates after the first query. For sub-MCPs and agent fan-out where each spawn starts fresh, the cold-path numbers are what to budget against. Roughly 100× faster cold-path than ModernBERT cacheless (33s/79s on Gutenberg/Tokio respectively).
Indexed (transformer engines). Time to build the persistent index on first run; subsequent queries against the cached index are milliseconds.
| Hardware | Throughput | Time (Flask corpus, 2383 chunks) |
|---|---|---|
| RTX 4090 (CUDA) | 435 chunks/s | ~5s |
| M2 Max (Metal) | 73.8 chunks/s | ~32s |
| M2 Max (CPU/Accelerate) | 73.5 chunks/s | ~32s |
Metal and CPU show similar throughput on M2 Max because macOS Accelerate routes BLAS operations through the AMX coprocessor regardless of backend. The Metal backend has headroom on larger batches and non-BLAS operations.
Memory. Ripvec engine: ~200 MB for a typical project (embedding table + chunks + BM25). Transformer engines: ~500 MB during embedding (model weights + batch buffers), ~100 MB for query-time.
Where CPU goes on the ripvec engine (linux/92K corpus, sampled).
| Component | % of CPU-time |
|---|---|
| rayon worker synchronization (intrinsic par_iter joins) | ~38% |
| tokenizer Unicode normalization (upstream `tokenizers` crate) | ~10% |
| file I/O (read + open syscalls) | ~5% |
| pool_ids (SIMD f32x8, our kernel) | ~2% |
| tree-sitter parse | ~3% |
| BM25 build + interner | ~3% |
| useful work | ~36% |
The 38% sync floor is structural: rayon's par_iter join semantics require parking workers between stages. We've shipped what's worth shipping past that floor (mimalloc, hand-vectorized pool_ids, bounded-queue streaming pipeline, lasso term interning). Further compression would require restructuring around an async stage scheduler.
| Tool | Type | Key difference from ripvec |
|---|---|---|
| ripgrep | Text search | No semantic understanding |
| Sourcegraph | Cloud AI platform | $49-59/user/month, code leaves your machine |
| grepai | Local semantic search | Requires Ollama for embeddings |
| mgrep | Semantic search | Uses cloud embeddings (Mixedbread AI) |
| Serena | MCP symbol navigation | Requires per-language LSP servers installed |
| Bloop | Was semantic + navigation | Archived Jan 2025 |
| VS Code anycode | Tree-sitter outlines | Editor-only, no cross-file search |
| Cursor @Codebase | IDE semantic search | Cursor-only, sends embeddings to cloud |
ripvec is self-contained (no Ollama, no cloud, no per-language setup), runs locally, and combines search + LSP + structural ranking in one binary. The cacheless default fits sub-MCP / fan-out / fresh-worktree workflows where a persistent index isn't viable.
cargo binstall ripvec ripvec-mcp
Requires cargo-binstall. Downloads a pre-built binary for your platform; no compilation.
cargo install ripvec ripvec-mcp
For CUDA (Linux with NVIDIA GPU, transformer engines only):
cargo install ripvec ripvec-mcp --features cuda
To enable transformer engines on the MCP daemon:
cargo install ripvec-mcp --features legacy-transformer-mcp
(The default ripvec-mcp build ships only the ripvec engine. The CLI binary ripvec accepts all engines without feature gating.)
claude plugin install ripvec@fnordpig-my-claude-plugins
The plugin auto-downloads the binary for your platform on first use and configures both MCP and LSP servers. It includes 3 skills (codebase orientation, semantic discovery, change impact analysis), 3 commands (/map, /find, /repo-index), and a code exploration agent. CUDA is auto-detected via nvidia-smi.
| Platform | Backends | GPU |
|---|---|---|
| macOS Apple Silicon | Metal + MLX + CPU (Accelerate) | Metal auto-enabled |
| Linux x86_64 | CPU (OpenBLAS) | CUDA with --features cuda |
| Linux ARM64 (Graviton) | CPU (OpenBLAS) | CUDA with --features cuda |
Model weights download automatically on first run: ~33 MB (potion-base-32M, default ripvec engine) or ~100 MB (ModernBERT). The cross-encoder reranker (ms-marco-TinyBERT-L-2-v2, ~5 MB) downloads on first prose-class query under the ripvec engine.
ripvec "error handling" . # Default ripvec engine
ripvec "form validation hooks" -n 5 # Top 5 results
ripvec "database migration" --mode keyword # BM25 only
ripvec "session encoding" --model modernbert --index # ModernBERT with persistent index
ripvec --model modernbert --index --exclude-extensions=jsonl,md # Skip noisy extensions
ripvec -i --model modernbert --index . # Interactive TUI (transformer engines only)
{ "mcpServers": { "ripvec": { "command": "ripvec-mcp" } } }
Tools (7 retrieval + 9 LSP):
| Category | Tools |
|---|---|
| Retrieval | search (with scope / include_extensions / exclude_extensions), find_similar, find_duplicates, get_repo_map, reindex, index_status, up_to_date |
| LSP | lsp_document_symbols, lsp_workspace_symbols, lsp_hover, lsp_goto_definition, lsp_goto_implementation, lsp_references, lsp_prepare_call_hierarchy, lsp_incoming_calls, lsp_outgoing_calls |
| Diagnostics | debug_log, log_level |
A single search tool covers code and prose. The agent picks scope (code / docs / all); the corpus-aware rerank gate decides whether the cross-encoder fires on a given query.
Engine selection is per-daemon via RIPVEC_MCP_ENGINE={ripvec,modernbert,bert}; default is ripvec. Tool schemas are stable across engines: index_status reports engine: "ripvec" and cache_location: "in-memory" under the ripvec engine, engine: "modernbert" and an on-disk path under transformer engines.
ripvec-mcp --lsp # serves LSP over stdio
Same binary; the `--lsp` flag selects the protocol.
19 tree-sitter grammars, 30 file extensions:
| Language | Extensions | Extracted elements |
|---|---|---|
| Rust | `.rs` | functions, structs, enums, variants, fields, impls, traits, consts, mods |
| Python | `.py` | functions, classes, assignments |
| JavaScript | `.js` `.jsx` | functions, classes, methods, variables |
| TypeScript | `.ts` `.tsx` | functions, classes, interfaces, type aliases, enums |
| Go | `.go` | functions, methods, types, constants |
| Java | `.java` | methods, classes, interfaces, enums, fields, constructors |
| C | `.c` `.h` | functions, structs, enums, typedefs |
| C++ | `.cpp` `.cc` `.cxx` `.hpp` | functions, classes, namespaces, enums, fields |
| Bash | `.sh` `.bash` `.bats` | functions, variables |
| Ruby | `.rb` | methods, classes, modules, constants |
| HCL / Terraform | `.tf` `.tfvars` `.hcl` | blocks (resources, data, variables) |
| Kotlin | `.kt` `.kts` | functions, classes, objects, properties |
| Swift | `.swift` | functions, classes, protocols, properties |
| Scala | `.scala` | functions, classes, traits, objects, vals, types |
| TOML | `.toml` | tables, key-value pairs |
| JSON | `.json` | object keys |
| YAML | `.yaml` `.yml` | mapping keys |
| Markdown | `.md` | headings |
Unsupported file types get sliding-window plain-text chunking. The embedding model handles any language; tree-sitter just provides better chunk boundaries.
ripvec's static bi-encoder uses Model2Vec embeddings (potion-base-32M, potion-code-16M) from MinishLab, whose semble pipeline inspired the path-enriched BM25 and query-shape boosting design we ported to Rust and extended. Cross-encoder rerank uses ms-marco-TinyBERT-L-2-v2. See CREDITS.md for the full ledger of what we used, what we ported, and what we built on top.
- goToDefinition is best-effort: resolves by name matching and structural importance, not by type system analysis. Use dedicated LSPs (rust-analyzer, pyright, gopls) when you need exact resolution for overloaded symbols.
- Call graph is approximate: common names like `new`, `run`, `render` may resolve to the wrong definition. Cross-crate resolution is limited to workspace members.
- Ripvec engine top-10 coherence on long-form prose: ModernBERT retains an edge on narrative corpora where a single source document is the right answer for every position in the top-10. Top-3 quality is competitive; coherent top-10 is not. If you're searching a legal archive or a book collection and need 10 contiguous hits from the same source, `--model modernbert --index` is the better tool.
- Cacheless cold start scales linearly: first-query indexing on the ripvec engine is O(files). At 92K files (Linux kernel) it's ~75s. Persistent transformer engines amortize this across runs but pay model-download and disk-cache costs.
- English-centric: both engines were trained primarily on English text. Queries and code comments in other languages will have lower recall.
cargo fmt --check && cargo clippy --all-targets -- -D warnings && cargo test --workspace
See CLAUDE.md for detailed development conventions, architecture notes, and MCP tool namespace resolution.
Cargo workspace with three crates:
| Crate | Role |
|---|---|
| `ripvec-core` | Engines (ripvec + transformer), backends, chunking, embedding, search, repo map, cache, call graph, ranking layers |
| `ripvec` | CLI binary (clap + ratatui TUI) |
| `ripvec-mcp` | MCP + LSP server binary (rmcp + tower-lsp-server) |
- CREDITS.md: full attribution for models, libraries, and design inspiration
- Metal/MPS Architecture
- CUDA Architecture
- Development Learnings
Licensed under either of Apache-2.0 or MIT at your option.