pookNast/headroom-sidecar

headroom-sidecar

Token-compression sidecar for LLM gateways. Drop it next to any OpenAI-compatible gateway (Hivemind, LiteLLM, custom proxy) to reduce prompt tokens by 15-86% before they reach your backend.

Forked from chopratejas/headroom v0.23.0 compression modules, repackaged as a standalone HTTP sidecar service.

How it works

Client → LLM Gateway → headroom-sidecar :9100/compress → Compressed → Backend
                        (HTTP call, graceful fallback on failure)

The gateway sends the request body to the sidecar before forwarding it to the LLM. If the sidecar is down or compression doesn't help, the original body passes through unchanged, so a sidecar failure cannot drop or alter a request.
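The fallback behaviour described above can be sketched from the gateway's side. This is a minimal illustration, not the project's client: the function names `http_post` and `compress_or_passthrough`, the injectable `post` parameter, and the 1-second default timeout are assumptions made for the example. The request and response shapes follow the ones documented under "Compression request".

```python
import json
import urllib.request


def http_post(url: str, payload: bytes, timeout: float) -> bytes:
    """POST JSON bytes and return the raw response body."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def compress_or_passthrough(body: dict, endpoint: str, post=http_post,
                            timeout: float = 1.0) -> dict:
    """Return the compressed body, or the original body on any failure."""
    payload = json.dumps({"body": body, "target": "messages"}).encode()
    try:
        result = json.loads(post(endpoint + "/compress", payload, timeout))
        if result["metadata"]["action"] == "compressed":
            return result["body"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # sidecar down, timed out, or returned something unexpected
    return body
```

Taking the transport as a parameter keeps the fallback logic testable without a running sidecar; a gateway would normally call it with the default `http_post`.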

Compression pipeline

Stage           What it does
content_router  Detects content type (plain text, JSON, code)
compressor      SmartCrusher for JSON, Kompress (abbreviation/filler removal) for plain text
cache_aligner   Separates static prefix from dynamic content for caching alignment

All stages include size guards — if a stage would increase content size, it's skipped.
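As a rough sketch of what a size guard means in practice, the loop below keeps a stage's output only when it is no larger than its input. The function `run_pipeline` and the two toy stages are hypothetical and stand in for the real stages, whose internals this README does not describe.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(text: str, stages: List[Stage]) -> Tuple[str, List[str]]:
    """Apply each stage, keeping its output only if it is not larger."""
    applied = []
    for name, stage in stages:
        candidate = stage(text)
        if len(candidate) > len(text):
            continue  # size guard: this stage would grow the content
        text = candidate
        applied.append(name)
    return text, applied


# Hypothetical stages for illustration only.
stages = [
    ("strip_filler", lambda s: s.replace("Please ", "")),
    ("pad", lambda s: s + "!!!"),  # always grows the text, so it is skipped
]
print(run_pipeline("Please review this code", stages))
# ('review this code', ['strip_filler'])
```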

Quick start

# Docker
docker run -d --name headroom-sidecar \
  -p 9100:9100 \
  -e HCP_MIN_BODY_SIZE=1000 \
  -e HCP_RATE_LIMIT_RPM=600 \
  headroom-sidecar

# Or docker-compose
docker compose up -d

Endpoints

Endpoint   Method  Description
/health    GET     Status, uptime, stats
/compress  POST    Compress request body
/expand    POST    Expand <<ccr:HASH>> markers
/metrics   GET     Prometheus-compatible counters
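The README does not document how /expand resolves markers, so the following is only a conceptual sketch of marker expansion. It assumes that HASH is a run of hex digits and that each hash maps to the original text in some store (here a plain dict named `store`, which is hypothetical); the real service presumably consults its CCR cache instead.

```python
import re

# Assumed marker shape: <<ccr:HASH>> with a hex HASH; the real format may differ.
CCR_MARKER = re.compile(r"<<ccr:([0-9a-fA-F]+)>>")


def expand_markers(text: str, store: dict) -> str:
    """Replace each known <<ccr:HASH>> marker with its stored original."""
    def lookup(match):
        # Unknown hashes are left in place rather than dropped.
        return store.get(match.group(1), match.group(0))
    return CCR_MARKER.sub(lookup, text)
```

Leaving unknown markers untouched mirrors the pass-through behaviour of /compress: when expansion cannot help, the text is returned as it arrived.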

Compression request

curl -X POST http://localhost:9100/compress \
  -H 'Content-Type: application/json' \
  -d '{
    "body": {
      "messages": [
        {"role": "user", "content": "Please review this code carefully..."}
      ]
    },
    "target": "messages"
  }'

Response:

{
  "body": {"messages": [{"role": "user", "content": "Review code..."}]},
  "metadata": {
    "action": "compressed",
    "original_chars": 1128,
    "compressed_chars": 157,
    "savings_pct": 86.08,
    "stages_run": ["content_router", "compressor", "cache_aligner"],
    "elapsed_ms": 4.2
  }
}
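The metadata fields in this response are consistent with savings_pct being the share of characters removed. The helper below is a sketch under that assumption (the function name and the zero-length guard are not from the project), and it reproduces the figure shown above.

```python
def savings_pct(original_chars: int, compressed_chars: int) -> float:
    """Percentage of characters removed, rounded to two decimals."""
    if original_chars <= 0:
        return 0.0  # avoid dividing by zero on an empty body
    return round(100 * (original_chars - compressed_chars) / original_chars, 2)


print(savings_pct(1128, 157))  # 86.08
```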

Environment variables

Variable              Default                   Description
HCP_HOST              0.0.0.0                   Listen address
HCP_PORT              9100                      Listen port
HCP_MIN_BODY_SIZE     1000                      Skip messages shorter than this
HCP_TIMEOUT           3.0                       Pipeline timeout (seconds)
HCP_RATE_LIMIT_RPM    600                       Max requests per minute
HCP_RATE_LIMIT_BURST  50                        Burst allowance
HCP_DB_PATH           /var/lib/headroom/ccr.db  CCR cache database path
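For tooling that needs the same settings (a health checker or a test harness, say), the table above can be mirrored in a small loader. This is an illustrative sketch: `SidecarConfig` and `load_config` are hypothetical names, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SidecarConfig:
    host: str
    port: int
    min_body_size: int
    timeout: float
    rate_limit_rpm: int
    rate_limit_burst: int
    db_path: str


def load_config(env=os.environ) -> SidecarConfig:
    """Read HCP_* variables, falling back to the documented defaults."""
    return SidecarConfig(
        host=env.get("HCP_HOST", "0.0.0.0"),
        port=int(env.get("HCP_PORT", "9100")),
        min_body_size=int(env.get("HCP_MIN_BODY_SIZE", "1000")),
        timeout=float(env.get("HCP_TIMEOUT", "3.0")),
        rate_limit_rpm=int(env.get("HCP_RATE_LIMIT_RPM", "600")),
        rate_limit_burst=int(env.get("HCP_RATE_LIMIT_BURST", "50")),
        db_path=env.get("HCP_DB_PATH", "/var/lib/headroom/ccr.db"),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check overrides, for example `load_config({"HCP_PORT": "9200"}).port`.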

Integration examples

Go gateway (Hivemind pattern)

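import (
    "bytes"
    "encoding/json"
    "net/http"
    "time"
)

// sidecarClient bounds each call so a slow sidecar cannot stall the gateway.
var sidecarClient = &http.Client{Timeout: time.Second}

// CompressBody returns the compressed body, or the original body on any failure.
func CompressBody(endpoint string, body []byte) []byte {
    resp, err := sidecarClient.Post(endpoint+"/compress",
        "application/json", bytes.NewReader(body))
    if err != nil {
        return body // graceful fallback
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return body // unexpected status: pass through
    }
    var result struct {
        Body     json.RawMessage `json:"body"`
        Metadata struct {
            Action string `json:"action"`
        } `json:"metadata"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return body // malformed response: pass through
    }
    if result.Metadata.Action == "compressed" && len(result.Body) > 0 {
        return result.Body
    }
    return body
}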

Any gateway via config

Point your gateway's compression endpoint at the sidecar:

[compression]
enabled = true
endpoint = "http://127.0.0.1:9100"
min_body_size = 1000
timeout_ms = 1000

Tests

python3 -m pytest tests/ -q    # 657 tests

Modules included

Cherry-picked from headroom v0.23.0:

  • compression/ — SmartCrusher, ContentRouter, CacheAligner, TagProtector, Kompress
  • ccr/ — Compress-Cache-Retrieve pipeline
  • scoring/ — Line importance scoring, TOIN observer
  • memory/ — Intelligent context injection, bubbling, hierarchy
  • validation/ — Input validation, injection detection
  • resilience/ — Circuit breaker, auth routing
  • simulation/ — Token cost prediction

License

Apache 2.0 (inherited from chopratejas/headroom)
