
Commit 8eec8cc

blackwell + new models
1 parent 553ae47 commit 8eec8cc

4 files changed

Lines changed: 298 additions & 100 deletions

File tree

README.md

Lines changed: 79 additions & 66 deletions
@@ -108,12 +108,14 @@ TIDE auto-probes your model's architecture. No adapter code needed.

| Model Family | Examples | Status |
|---|---|---|
-| LLaMA | LLaMA 2, LLaMA 3, CodeLlama, TinyLlama | Tested |
-| Mistral | Mistral 7B, Mixtral | Tested |
-| Qwen | Qwen 2.5 series | Tested |
+| LLaMA | LLaMA 3.3, LLaMA 4 Scout/Maverick | Benchmarked |
+| DeepSeek | DeepSeek R1, R1 Distill 8B/32B/70B | Benchmarked |
+| Qwen | Qwen3 8B/32B, Qwen 2.5 | Benchmarked |
+| Mistral | Mistral Small 3.1, Mixtral | Supported |
+| Gemma | Gemma 3 12B/27B | Supported |
| GPT-2 | GPT-2, DistilGPT-2 | Tested |
| GPT-NeoX | Pythia, GPT-NeoX-20B | Supported |
-| Phi | Phi-2, Phi-3 | Supported |
+| Phi | Phi-3, Phi-4 | Supported |
| Falcon | Falcon 7B/40B | Supported |
| OPT | OPT-1.3B through OPT-30B | Supported |
| **Anything else** | Any `AutoModelForCausalLM` | Auto-probed |
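
The last row is the point of the table: models outside the listed families go through the auto-probe path. A minimal sketch of that flow, built around the `TIDE.TIDE(model, "router.pt")` entry point visible in the next hunk's context; the checkpoint ID and router filename here are placeholders, not a confirmed recipe:

```python
# Minimal sketch of the auto-probe path. The checkpoint ID and the
# "router.pt" filename are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM

import TIDE

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",              # any AutoModelForCausalLM checkpoint
    torch_dtype=torch.bfloat16,
)
engine = TIDE.TIDE(model, "router.pt")  # UniversalAdapter probes the layers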
@@ -130,108 +132,119 @@ engine = TIDE.TIDE(model, "router.pt") # UniversalAdapter handles it

GPU architecture is auto-detected at install time.

-| GPU | Status | Notes |
+| GPU | Arch | Status |
|---|---|---|
-| V100 | Supported | sm_70 |
-| T4 | Supported | sm_75, great for cost-efficient inference |
-| A100 | Supported | sm_80 |
-| A10G | Tested in CI | sm_86, Modal/AWS default |
-| L4 | Supported | sm_89 |
-| H100 | Supported | sm_90 |
+| V100 | sm_70 | Supported |
+| T4 | sm_75 | Supported |
+| A100 | sm_80 | Benchmarked |
+| A10G | sm_86 | Tested in CI |
+| L4 / L40S | sm_89 | Supported |
+| H100 / H200 | sm_90 | Supported |
+| B100 / B200 | sm_100 | Supported |
+| GB200 / GB300 | sm_120 | Supported (PTX fallback) |

Override: `TORCH_CUDA_ARCH_LIST="8.6" pip install .`

No GPU? TIDE works in pure PyTorch (CPU fallback, no CUDA kernels needed).

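For reference, what install-time detection amounts to, sketched on top of PyTorch's device query rather than TIDE's actual setup code; the function name is illustrative:

```python
# Sketch: how an install step might pick the sm_* target, mirroring the
# table above. Illustrative only; this is not TIDE's setup.py logic.
import os
import torch

def detect_cuda_arch() -> str:
    override = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if override:
        return override                      # manual override, as documented
    if not torch.cuda.is_available():
        return ""                            # CPU fallback: no kernels built
    major, minor = torch.cuda.get_device_capability(0)
    return f"{major}.{minor}"                # e.g. "8.0" on A100, "9.0" on H100

print(detect_cuda_arch() or "no GPU: pure-PyTorch fallback")
```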
## Benchmark Results

-Tested on **LLaMA 3.1 8B Instruct** (32 layers, 4096 hidden) on NVIDIA A100-SXM4-40GB.
-Calibrated with 2000 WikiText samples. CUDA kernels compiled for sm_80.
+All benchmarks on **NVIDIA A100-SXM4-40GB**, bf16 precision, 2000 WikiText calibration samples.
+16 real text prompts (science, code, history, economics).

### Prefill Exit Rates

-16 real text prompts (science, code, history), evaluated at different thresholds:
-
```
-Threshold  Exit Rate  Where Exits Happen
-=========  =========  ==================
-0.95       98.9%      L11: 16 tokens, L31: 158 tokens
-0.90       100.0%     L11: 16 tokens, L31: 160 tokens
-0.85       100.0%     L11: 16 tokens, L31: 160 tokens
-0.70       100.0%     L11: 16 tokens, L31: 160 tokens
-0.50       100.0%     L11: 16 tokens, L31: 160 tokens
+Model                      Layers Threshold Exit Rate Where Exits Happen
+========================== ====== ========= ========= ==========================
+Qwen3 8B                       36      0.95    100.0% L35: 155 tokens
+Qwen3 8B                       36      0.85    100.0% L35: 155 tokens
+Qwen3 8B                       36      0.50    100.0% L11:11 L23:5 L35:139
+DeepSeek R1 Distill 8B         32      0.95    100.0% L31: 176 tokens
+DeepSeek R1 Distill 8B         32      0.85    100.0% L11:16 L31:160
+DeepSeek R1 Distill 8B         32      0.50    100.0% L11:16 L31:160
```

-100% of tokens converge by Layer 31 (the last checkpoint before the final layer).
-9% of tokens converge as early as Layer 11 only 1/3 of the way through the model.
+100% of tokens converge by the last checkpoint. At lower thresholds, earlier exits
+appear — up to 10% of tokens exit at Layer 11, only 1/3 of the way through the model.

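The "up to 10%" figure comes straight from the exit counts in the table; a quick check of the arithmetic, using only numbers shown above:

```python
# Arithmetic behind "up to 10% of tokens exit at Layer 11", taken from the
# DeepSeek R1 Distill 8B row at threshold 0.85 in the table above.
exits = {11: 16, 31: 160}          # layer -> tokens exiting there
total = sum(exits.values())        # 176 prefill tokens
print(f"L11 share: {exits[11] / total:.1%}")  # 9.1%, i.e. roughly 10%
```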
### Prefill Latency

Single prompt, 20 runs averaged:

```
-Configuration          Latency  vs Baseline
-====================== ======= ===========
-Baseline (no TIDE)     54.04ms --
-TIDE (threshold=0.95)  50.94ms -5.7%
-TIDE (threshold=0.85)  50.52ms -6.5%
-TIDE (threshold=0.50)  50.21ms -7.1%
+Model                      Baseline   TIDE (t=0.85) Change
+========================== ========== ============= ======
+Qwen3 8B (36 layers)       46.82ms    44.14ms        -5.7%
+DeepSeek R1 Distill 8B     31.66ms    31.89ms        +0.7%
```

-TIDE is **faster than baseline** even in frozen-token mode (all layers still run)
-because the router evaluation + early output selection avoids redundant final-layer
-normalization for exited tokens.
-
### Batch Throughput

```
-Batch Size Baseline (tok/s) TIDE (tok/s) Improvement
-========== ================ ============ ===========
-1          231              252          +9.1%
-4          834              902          +8.2%
-8          1,618            1,773        +9.6%
+Model                      BS Baseline (tok/s) TIDE (tok/s) Change
+========================== == ================ ============ ======
+Qwen3 8B                    1              258          271  +5.0%
+Qwen3 8B                    4              923          961  +4.2%
+Qwen3 8B                    8            1,781        1,926  +8.1%
+DeepSeek R1 Distill 8B      1              403          403  +0.0%
+DeepSeek R1 Distill 8B      8            2,997        2,833  -5.5%
```

+Qwen3 (36 layers) shows consistent improvement. DeepSeek R1 Distill (32 layers,
+already optimized via distillation) has minimal headroom.
+
### Generation Quality

-100 tokens generated with `temperature=0` on the same prompt:
+100 tokens, `temperature=0`, same prompt across thresholds:

```
-Threshold   Exit Rate  Output
-=========   =========  =============================================
-1.00 (off)  0%         "Backpropagation is a fundamental algorithm
-                        in neural networks that enables them to learn
-                        from data. Here's a step-by-step guide on
-                        how it works: 1. Forward pass: The input..."
-
-0.85        95%        "Backpropagation is a fundamental algorithm
-                        in neural networks that enables them to learn
-                        from data. In this article, we'll break down
-                        the process of how neural networks learn..."
-
-0.50        96%        (same as 0.85 — stable)
+Model                  Threshold Exit Rate Output
+====================== ========= ========= ==============================
+DeepSeek R1 Distill 8B 1.0 (off)        0% "Transformers are a type of
+                                            neural network architecture
+                                            that uses self-attention
+                                            mechanisms to capture long-
+                                            range dependencies..."
+
+DeepSeek R1 Distill 8B 0.85          100% "Transformers are neural
+                                            networks that use self-
+                                            attention mechanisms to
+                                            process sequential data.
+                                            They are particularly
+                                            effective for tasks like
+                                            machine translation..."
+
+Qwen3 8B               1.0 (off)        0% "...the basic principles,
+                                            the role of the core, the
+                                            function of the windings,
+                                            and the importance of the
+                                            magnetic field..."
+
+Qwen3 8B               0.85          100% Quality degrades at 100%
+                                            exit rate (10 unique tokens).
+                                            Use threshold >= 0.90 for
+                                            Qwen3 to preserve quality.
```

-95% of decode tokens exit at Layer 31 — the output diverges slightly in phrasing
-("Here's a step-by-step guide" vs "In this article, we'll break down") but
-remains equally coherent and factually correct.
+**Key finding**: DeepSeek R1 Distill maintains quality at 100% exit rate.
+Qwen3 is more sensitive — use a higher threshold (0.90+) to preserve output quality.

### Convergence Analysis

-Layer-by-layer convergence (cosine similarity > 0.98 with final layer):
+Calibrated on 2000 WikiText samples, cosine similarity > 0.98 with final layer:

```
-Model              Layers  Convergence per Checkpoint Layer
-=================  ======  ===========================================
-LLaMA 3.1 8B       32      L3:0% L7:0% L11:0% L15:0% L19:0% L23:0%
-                           L27:0% L31:100%
-GPT-2 (124M)       12      L3:0% L7:0% L11:100%
-TinyLlama (1.1B)   22      L3:0% L7:0% L11:0% L15:0% L19:0%
+Model                   Layers Convergence per Checkpoint
+======================= ====== =======================================
+Qwen3 8B                    36 L3-L31: 0%  L35: 100%
+DeepSeek R1 Distill 8B      32 L3-L27: 0%  L31: 100%
+LLaMA 3.1 8B                32 L3-L27: 0%  L31: 100%
+GPT-2 (124M)                12 L3: 0%  L7: 0%  L11: 100%
```

-The convergence threshold (0.98) is strict — most tokens converge at the last
-checkpoint. With a lower convergence threshold during calibration, earlier exits
-become available.
+The strict threshold (0.98) means most tokens converge only at the last checkpoint.
+Lower `convergence_threshold` during calibration (e.g., 0.95) enables earlier exits.

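The signal behind these numbers is plain cosine similarity between each checkpoint's hidden state and the final layer's. A standalone sketch of that measurement using Hugging Face's `output_hidden_states` (independent of TIDE's internal calibration code); the prompt text is arbitrary:

```python
# Sketch: per-layer convergence measurement (cosine similarity vs. the
# final layer), the signal behind the table above. Not TIDE internals.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small model from the table, for a quick check
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("TIDE measures layerwise convergence.", return_tensors="pt").input_ids
with torch.no_grad():
    # hidden_states: tuple of (embeddings + one tensor per layer)
    hs = model(ids, output_hidden_states=True).hidden_states

final = hs[-1]
for layer in (3, 7, 11):  # GPT-2 checkpoint layers from the table
    sim = torch.nn.functional.cosine_similarity(hs[layer], final, dim=-1)
    converged = (sim > 0.98).float().mean().item()
    print(f"L{layer}: {converged:.0%} of tokens converged")
```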
## Tuning the Threshold

Lines changed: 35 additions & 32 deletions
@@ -1,55 +1,58 @@
# Model registry for TIDE benchmarks
-# Organized by phase (implementation priority)

phase_1:
-  # Small/medium models for initial validation
-  - name: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
-    short: "tinyllama-1.1b"
-    gpu: "A10G"
-    dtype: "float16"
-  - name: "meta-llama/Llama-3.1-8B-Instruct"
-    short: "llama3.1-8b"
+  # Small/medium models for validation
+  - name: "meta-llama/Llama-4-Scout-17B-16E-Instruct"
+    short: "llama4-scout-17b"
    gpu: "A100"
-    dtype: "float16"
-  - name: "mistralai/Mistral-7B-Instruct-v0.3"
-    short: "mistral-7b"
+    dtype: "bfloat16"
+  - name: "meta-llama/Llama-3.3-70B-Instruct"
+    short: "llama3.3-70b"
+    gpu: "H100:2"
+    dtype: "bfloat16"
+  - name: "Qwen/Qwen3-8B"
+    short: "qwen3-8b"
+    gpu: "A100"
+    dtype: "bfloat16"
+  - name: "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
+    short: "mistral-small-3.1"
    gpu: "A100"
-    dtype: "float16"
-  - name: "Qwen/Qwen2.5-7B-Instruct"
-    short: "qwen2.5-7b"
+    dtype: "bfloat16"
+  - name: "google/gemma-3-12b-it"
+    short: "gemma3-12b"
    gpu: "A100"
-    dtype: "float16"
+    dtype: "bfloat16"

phase_2:
-  # Medium models
-  - name: "meta-llama/Llama-3.1-70B-Instruct"
-    short: "llama3.1-70b"
-    gpu: "H100:2"
-    dtype: "float16"
-  - name: "Qwen/Qwen2.5-72B-Instruct"
-    short: "qwen2.5-72b"
+  # Large models
+  - name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+    short: "llama4-maverick-17b"
    gpu: "H100:2"
-    dtype: "float16"
+    dtype: "bfloat16"
+  - name: "Qwen/Qwen3-32B"
+    short: "qwen3-32b"
+    gpu: "H100"
+    dtype: "bfloat16"
+  - name: "google/gemma-3-27b-it"
+    short: "gemma3-27b"
+    gpu: "H100"
+    dtype: "bfloat16"

phase_3:
-  # Reasoning models (key TIDE targets)
+  # Reasoning models (key TIDE targets — long decode, many easy tokens)
  - name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
    short: "r1-distill-8b"
    gpu: "A100"
-    dtype: "float16"
+    dtype: "bfloat16"
  - name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
    short: "r1-distill-32b"
    gpu: "H100"
-    dtype: "float16"
+    dtype: "bfloat16"
  - name: "Qwen/QwQ-32B"
    short: "qwq-32b"
    gpu: "H100"
-    dtype: "float16"
+    dtype: "bfloat16"
  - name: "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
    short: "r1-distill-70b"
    gpu: "H100:2"
-    dtype: "float16"
-  - name: "deepseek-ai/DeepSeek-R1"
-    short: "r1-671b"
-    gpu: "H100:4"
-    dtype: "float16"
+    dtype: "bfloat16"
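
Consuming this registry is a few lines of PyYAML. A sketch, assuming the file is saved as `models.yaml` (its actual path is not shown in this diff):

```python
# Sketch: walking the benchmark registry. The "models.yaml" filename is
# an assumption; this diff does not show the file's actual path.
import yaml

with open("models.yaml") as f:
    registry = yaml.safe_load(f)

for phase, models in registry.items():   # phase_1, phase_2, phase_3
    for m in models:
        print(f"{phase}: {m['short']} -> {m['gpu']} ({m['dtype']})")
```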
