101 changes: 101 additions & 0 deletions examples/quickstart/README.md
@@ -0,0 +1,101 @@
# Semantic Router Quickstart

This quickstart walks through the minimal set of commands needed to prove that
the semantic router can classify incoming chat requests, route them through
Envoy, and return OpenAI-compatible completions. The flow is optimized for
local laptops and uses a lightweight mock backend by default, so the entire
loop finishes in a few minutes.

## Prerequisites

- A Python environment (for example, an activated virtualenv) with the project’s dependencies installed.
- `make`, `curl`, `go`, `cargo`, `rustc`, and `python3` in `PATH`.
- All commands below are run from the repository root.

## Step-by-Step Runbook

0. **Download router support models**

   These assets (ModernBERT classifiers, LoRA adapters, embeddings, etc.) are
   required before the router can start.

   ```bash
   make download-models
   ```
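
   If you want to confirm the download finished, note that the quickstart config
   references classifier paths under a local `models/` directory, so listing it is
   a quick sanity check (this assumes the Make target populates that directory):

   ```bash
   ls models/
   ```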

1. **Start the OpenAI-compatible backend**

   The router expects at least one endpoint that serves `/v1/chat/completions`.
   You can point to a real vLLM deployment, but the fastest option is the
   bundled mock server:

   ```bash
   pip install -r tools/mock-vllm/requirements.txt
   python -m uvicorn tools.mock_vllm.app:app --host 0.0.0.0 --port 8000
   ```

   Leave this process running; it provides instant canned responses for
   `openai/gpt-oss-20b`.
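
   Before moving on, you can optionally confirm the backend responds. The request
   below is a minimal sketch that assumes the mock accepts the standard OpenAI
   chat payload; the exact canned response body may differ.

   ```bash
   # Hypothetical sanity check against the mock backend on port 8000.
   curl -s http://127.0.0.1:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "ping"}]}'
   ```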

2. **Launch Envoy**

   In a separate terminal, bring up the Envoy sidecar that listens on
   `http://127.0.0.1:8801/v1/*` and forwards traffic to the router’s gRPC
   ExtProc server.

   ```bash
   make run-envoy
   ```
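
   Envoy only proxies successfully once the router (next step) is running, but you
   can already confirm the listener is bound; this check assumes `nc` is available
   locally.

   ```bash
   # Optional: check that something is listening on Envoy's 8801 port.
   nc -z 127.0.0.1 8801 && echo "Envoy listener is up"
   ```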

3. **Start the router with the quickstart config**

   In another terminal, run the quickstart bootstrap. Point the health probe at
   the router’s local HTTP API (port 8080) so the script does not wait on the
   Envoy endpoint.

   ```bash
   QUICKSTART_HEALTH_URL=http://127.0.0.1:8080/health \
   ./examples/quickstart/quickstart.sh --skip-download --skip-build
   ```

   Keep this process alive; Ctrl+C will stop the router.
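
   With Envoy, the router, and the mock backend all up, a single request through
   the Envoy listener makes a quick smoke test before the full evaluation. This is
   a sketch that assumes the default `8801` listener and that a placeholder bearer
   token such as `sk-test` is accepted.

   ```bash
   # Hypothetical end-to-end check: Envoy -> router ExtProc -> mock backend.
   curl -s http://127.0.0.1:8801/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer sk-test" \
     -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "What is 2 + 2?"}]}'
   ```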

4. **Run the quick evaluation**

   With Envoy, the router, and the mock backend running, execute the benchmark
   to send a small batch of MMLU questions through the routing pipeline.

   ```bash
   OPENAI_API_KEY="sk-test" \
   ./examples/quickstart/quick-eval.sh \
     --mode router \
     --samples 5 \
     --vllm-endpoint ""
   ```

   - `--mode router` restricts the run to router-transparent requests.
   - `--vllm-endpoint ""` disables direct vLLM comparisons.

5. **Inspect the results**

   The evaluator writes all artifacts under
   `examples/quickstart/results/<timestamp>/`:

   - `raw/` – individual JSON summaries per dataset/model combination.
   - `quickstart-summary.csv` – tabular metrics (accuracy, tokens, latency).
   - `quickstart-report.md` – Markdown report suitable for sharing.

   You can re-run the evaluator with different flags (e.g., `--samples 10`,
   `--dataset arc`) and the outputs will land in fresh timestamped folders.
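
   For a quick look at the summary table from the most recent run, something like
   the snippet below works; it assumes the timestamped directories sort by
   modification time and that the CSV is comma-delimited.

   ```bash
   # Sketch: align the latest quickstart-summary.csv into readable columns.
   latest=$(ls -td examples/quickstart/results/*/ | head -n 1)
   column -s, -t < "${latest}quickstart-summary.csv"
   ```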

## Switching to a Real vLLM Backend

If you prefer to exercise a real language model:

1. Replace step 1 with a real vLLM launch (or any OpenAI-compatible server); see the sketch after this list.
2. Update `examples/quickstart/config-quickstart.yaml` so the `vllm_endpoints`
block points to that service (IP, port, and model name).
3. Re-run steps 2–4. No other changes to the quickstart scripts are needed.
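
A minimal sketch of such a launch, assuming a recent vLLM release that provides the
`vllm serve` entrypoint and a GPU with enough memory for the chosen model:

```bash
# Serves /v1/chat/completions on the host and port the quickstart config expects.
vllm serve openai/gpt-oss-20b --host 0.0.0.0 --port 8000
```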

The mock server remains the fastest option for quick demos; switch to a full vLLM
backend when you want latency and quality signals from the actual model.
90 changes: 90 additions & 0 deletions examples/quickstart/config-quickstart.yaml
@@ -0,0 +1,90 @@
# Quickstart configuration tuned for a single-node developer setup.
# Keeps routing options minimal while remaining compatible with the default assets
# shipped by `make download-models`.

bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true

semantic_cache:
  enabled: false
  backend_type: "memory"

prompt_guard:
  enabled: false
  use_modernbert: true
  model_id: "models/jailbreak_classifier_modernbert-base_model"
  threshold: 0.7
  use_cpu: true
  jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json"

classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    threshold: 0.6
    use_cpu: true
    use_modernbert: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"
  pii_model:
    model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json"

vllm_endpoints:
  - name: "local-vllm"
    address: "127.0.0.1"
    port: 8000
    models:
      - "openai/gpt-oss-20b"
    weight: 1

model_config:
  "openai/gpt-oss-20b":
    preferred_endpoints: ["local-vllm"]
    reasoning_family: "gpt-oss"
    pii_policy:
      allow_by_default: true

categories:
  - name: general
    system_prompt: "You are a helpful and knowledgeable assistant. Provide concise, accurate answers."
    model_scores:
      - model: openai/gpt-oss-20b
        score: 0.7
        use_reasoning: false

  - name: reasoning
    system_prompt: "You explain your reasoning with clear numbered steps before giving a final answer."
    model_scores:
      - model: openai/gpt-oss-20b
        score: 0.6
        use_reasoning: true

  - name: safety
    system_prompt: "You prioritize safe completions and refuse harmful requests."
    model_scores:
      - model: openai/gpt-oss-20b
        score: 0.5
        use_reasoning: false

default_model: openai/gpt-oss-20b

reasoning_families:
  gpt-oss:
    type: "chat_template_kwargs"
    parameter: "thinking"

api:
  batch_classification:
    metrics:
      enabled: false

# Tool auto-selection is available but disabled for quickstart.
tools:
  enabled: false
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: "config/tools_db.json"
  fallback_to_empty: true