Token-compression sidecar for LLM gateways. Drop it next to any OpenAI-compatible gateway (Hivemind, LiteLLM, custom proxy) and reduce prompt tokens by 15-86% before they reach your backend.
Forked from chopratejas/headroom v0.23.0 compression modules, repackaged as a standalone HTTP sidecar service.
Client → LLM Gateway → headroom-sidecar :9100/compress → Compressed → Backend
(HTTP call, graceful fallback on failure)
The gateway sends the request body to the sidecar before forwarding it to the LLM backend. If the sidecar is down or compression doesn't help, the original body passes through unchanged, so a sidecar failure cannot drop or alter a request.
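The pass-through rule can be sketched as a small decision function on the gateway side. This is only an illustration: `chooseBody` and `errSidecarDown` are hypothetical names, and treating "doesn't help" as "not strictly smaller" is an assumption rather than the sidecar's documented behavior.

```go
package main

import "errors"

// errSidecarDown is a hypothetical error standing in for any sidecar failure
// (connection refused, timeout, bad response).
var errSidecarDown = errors.New("sidecar unavailable")

// chooseBody returns the body the gateway should forward. The compressed body
// is used only when the sidecar call succeeded and the result is strictly
// smaller than the original; in every other case the original is kept.
func chooseBody(original, compressed []byte, err error) []byte {
	if err != nil || len(compressed) == 0 || len(compressed) >= len(original) {
		return original
	}
	return compressed
}
```

Because the original body is never modified, the worst case of a failed or unhelpful compression is an uncompressed request.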
| Stage | What it does |
|---|---|
| `content_router` | Detects content type (plain text, JSON, code) |
| `compressor` | SmartCrusher for JSON, Kompress (abbreviation/filler removal) for plain text |
| `cache_aligner` | Separates static prefix from dynamic content for caching alignment |
All stages include size guards — if a stage would increase content size, it's skipped.
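A size guard of this kind might look like the following sketch. It is not the sidecar's actual code: `Stage`, `runPipeline`, and the two toy stages are hypothetical, and size is measured here in bytes via `len`, which is an assumption.

```go
package main

import "strings"

// Stage is a hypothetical pipeline step: a name plus a transform.
type Stage struct {
	Name string
	Run  func(string) string
}

// runPipeline applies each stage in order. A stage's output is kept only if
// it is not larger than its input; otherwise the stage is skipped and the
// content is left as it was. It returns the final content and the names of
// the stages whose output was kept.
func runPipeline(content string, stages []Stage) (string, []string) {
	applied := make([]string, 0, len(stages))
	for _, s := range stages {
		out := s.Run(content)
		if len(out) > len(content) {
			continue // size guard: this stage would grow the content
		}
		content = out
		applied = append(applied, s.Name)
	}
	return content, applied
}

// collapseSpaces is a toy stand-in for a real compressor.
func collapseSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// addBanner is a toy stage that always grows its input, so the guard skips it.
func addBanner(s string) string {
	return "[compressed] " + s
}
```

Running `runPipeline` with `collapseSpaces` followed by `addBanner` keeps the first stage's output and skips the second, since the banner makes the content longer.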

    # Docker
    docker run -d --name headroom-sidecar \
      -p 9100:9100 \
      -e HCP_MIN_BODY_SIZE=1000 \
      -e HCP_RATE_LIMIT_RPM=600 \
      headroom-sidecar

    # Or docker-compose
    docker compose up -d

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Status, uptime, stats |
| `/compress` | POST | Compress request body |
| `/expand` | POST | Expand `<<ccr:HASH>>` markers |
| `/metrics` | GET | Prometheus-compatible counters |
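The `/compress` endpoint takes the gateway's request wrapped in an envelope with `body` and `target` fields, as the curl example in this README shows. A gateway written in Go could build that envelope roughly as follows; `compressRequest` and `wrapCompressRequest` are hypothetical names, and only the two fields visible in the curl example are assumed.

```go
package main

import "encoding/json"

// compressRequest mirrors the envelope from the curl example:
// the original request under "body" and the field to compress under "target".
type compressRequest struct {
	Body   json.RawMessage `json:"body"`
	Target string          `json:"target"`
}

// wrapCompressRequest wraps an already-encoded JSON request body in the
// envelope. json.Marshal validates the raw body, so malformed JSON is
// reported as an error instead of being sent to the sidecar.
func wrapCompressRequest(body []byte, target string) ([]byte, error) {
	return json.Marshal(compressRequest{
		Body:   json.RawMessage(body),
		Target: target,
	})
}
```

Using `json.RawMessage` avoids decoding and re-encoding the client's request, so the gateway does not need to know its schema.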

    curl -X POST http://localhost:9100/compress \
      -H 'Content-Type: application/json' \
      -d '{
        "body": {
          "messages": [
            {"role": "user", "content": "Please review this code carefully..."}
          ]
        },
        "target": "messages"
      }'

Response:

    {
      "body": {"messages": [{"role": "user", "content": "Review code..."}]},
      "metadata": {
        "action": "compressed",
        "original_chars": 1128,
        "compressed_chars": 157,
        "savings_pct": 86.08,
        "stages_run": ["content_router", "compressor", "cache_aligner"],
        "elapsed_ms": 4.2
      }
    }

| Variable | Default | Description |
|---|---|---|
| `HCP_HOST` | `0.0.0.0` | Listen address |
| `HCP_PORT` | `9100` | Listen port |
| `HCP_MIN_BODY_SIZE` | `1000` | Skip messages shorter than this |
| `HCP_TIMEOUT` | `3.0` | Pipeline timeout (seconds) |
| `HCP_RATE_LIMIT_RPM` | `600` | Max requests per minute |
| `HCP_RATE_LIMIT_BURST` | `50` | Burst allowance |
| `HCP_DB_PATH` | `/var/lib/headroom/ccr.db` | CCR cache database path |
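Go tooling that needs to mirror these settings (a test harness or a gateway plugin, for instance) could read the same variables with the same defaults. The sketch below is illustrative only: `Config`, `loadConfig`, and `loadConfigFromEnv` are hypothetical names, and falling back to the default when a value fails to parse is an assumption, not documented sidecar behavior.

```go
package main

import (
	"os"
	"strconv"
)

// Config mirrors the HCP_* variables and defaults from the table above.
type Config struct {
	Host           string
	Port           int
	MinBodySize    int
	TimeoutSeconds float64
	RateLimitRPM   int
	RateLimitBurst int
	DBPath         string
}

// lookupFunc has the same signature as os.LookupEnv so values can be injected.
type lookupFunc func(key string) (string, bool)

// loadConfig reads each variable through lookup and falls back to the
// documented default when the variable is unset or cannot be parsed.
func loadConfig(lookup lookupFunc) Config {
	str := func(key, def string) string {
		if v, ok := lookup(key); ok && v != "" {
			return v
		}
		return def
	}
	num := func(key string, def int) int {
		if v, ok := lookup(key); ok {
			if n, err := strconv.Atoi(v); err == nil {
				return n
			}
		}
		return def
	}
	flt := func(key string, def float64) float64 {
		if v, ok := lookup(key); ok {
			if f, err := strconv.ParseFloat(v, 64); err == nil {
				return f
			}
		}
		return def
	}
	return Config{
		Host:           str("HCP_HOST", "0.0.0.0"),
		Port:           num("HCP_PORT", 9100),
		MinBodySize:    num("HCP_MIN_BODY_SIZE", 1000),
		TimeoutSeconds: flt("HCP_TIMEOUT", 3.0),
		RateLimitRPM:   num("HCP_RATE_LIMIT_RPM", 600),
		RateLimitBurst: num("HCP_RATE_LIMIT_BURST", 50),
		DBPath:         str("HCP_DB_PATH", "/var/lib/headroom/ccr.db"),
	}
}

// loadConfigFromEnv reads the real process environment.
func loadConfigFromEnv() Config {
	return loadConfig(os.LookupEnv)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the parsing logic testable without touching the process environment.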
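
    package gateway // package name is illustrative

    import (
    	"bytes"
    	"encoding/json"
    	"net/http"
    )

    // CompressBody posts body to the sidecar's /compress endpoint and returns
    // the compressed body, or the original body on any failure. body is sent
    // as-is, so it should already be in the envelope format shown in the curl
    // example.
    func CompressBody(endpoint string, body []byte) []byte {
    	resp, err := http.Post(endpoint+"/compress",
    		"application/json", bytes.NewReader(body))
    	if err != nil {
    		return body // graceful fallback: sidecar unreachable
    	}
    	defer resp.Body.Close()

    	var result struct {
    		Body     json.RawMessage `json:"body"`
    		Metadata struct {
    			Action string `json:"action"`
    		} `json:"metadata"`
    	}
    	// Fall back on a non-200 status or an undecodable response.
    	if resp.StatusCode != http.StatusOK ||
    		json.NewDecoder(resp.Body).Decode(&result) != nil {
    		return body
    	}
    	if result.Metadata.Action == "compressed" && len(result.Body) > 0 {
    		return result.Body
    	}
    	return body
    }

Point your gateway's compression endpoint at the sidecar: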

    [compression]
    enabled = true
    endpoint = "http://127.0.0.1:9100"
    min_body_size = 1000
    timeout_ms = 1000

Run the test suite:

    python3 -m pytest tests/ -q  # 657 tests

Cherry-picked from headroom v0.23.0:

- `compression/`: SmartCrusher, ContentRouter, CacheAligner, TagProtector, Kompress
- `ccr/`: Compress-Cache-Retrieve pipeline
- `scoring/`: Line importance scoring, TOIN observer
- `memory/`: Intelligent context injection, bubbling, hierarchy
- `validation/`: Input validation, injection detection
- `resilience/`: Circuit breaker, auth routing
- `simulation/`: Token cost prediction
Apache 2.0 (inherited from chopratejas/headroom)