phillza/fireworks-proxy

Fireworks Rate-Limiting Proxy

CI Python 3.10+ License: MIT

A local HTTP proxy that sits between Kimi CLI and the Fireworks AI API. It queues requests from concurrent Kimi terminals behind request-per-second and token-per-minute budgets so they are less likely to hit Fireworks' adaptive 429 "rate limit exceeded" errors.

Problem

Fireworks AI applies adaptive serverless rate limits. For Kimi K2.6 turbo, live response headers showed limits for total prompt tokens, uncached prompt tokens, and generated tokens per minute. When 4-8 Kimi terminals all send large project context at once, token bursts can hit those adaptive limits even when raw request count looks fine.

Solution

This proxy turns many hard rejections into queued waits. Requests are held locally until the request bucket and rolling token budgets have room, then forwarded to Fireworks.
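The rolling token budget mentioned above can be sketched as follows. This is a minimal illustration, not the code in fireworks_proxy.py: the class name `RollingTokenBudget`, its methods, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from collections import deque


class RollingTokenBudget:
    """Tracks tokens spent in the last `window` seconds against a limit.

    Hypothetical helper for illustration; not taken from fireworks_proxy.py.
    """

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens), oldest first
        self.used = 0

    def _expire(self, now):
        # Drop spends that have aged out of the rolling window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_spend(self, tokens):
        """Record the spend and return True if it fits, else return False."""
        now = self.clock()
        self._expire(now)
        if self.used + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True

    def seconds_until_room(self, tokens):
        """Seconds until `tokens` would fit, or None if it can never fit."""
        if tokens > self.limit:
            return None
        now = self.clock()
        self._expire(now)
        used = self.used
        if used + tokens <= self.limit:
            return 0.0
        # Walk the oldest spends until enough of them would have expired.
        for stamp, spent in self.events:
            used -= spent
            if used + tokens <= self.limit:
                return max(0.0, stamp + self.window - now)
        return 0.0
```

A request that does not fit is not rejected. The caller would sleep for `seconds_until_room(...)` and try again, which is what turns a rejection into a queued wait.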

Quick Start

# 1. Start the proxy (leave this terminal open)
cd "C:\Users\phillip\Coding projects\fireworks-proxy"
python fireworks_proxy.py

# 2. Point Kimi CLI at the proxy
# Edit: C:\Users\phillip\.kimi\config.toml
[providers.fireworks]
type = "anthropic"
base_url = "http://localhost:8787"
api_key = "your_key"

Then open 4–8 Kimi terminals normally. They'll route through the proxy automatically.

Configuration

Proxy settings (env vars)

| Variable | Default | Description |
| --- | --- | --- |
| FIREWORKS_PROXY_RPS | 20 | Requests per second (20 RPS = 1,200 RPM) |
| FIREWORKS_PROXY_PORT | 8787 | Local listen port |
| FIREWORKS_PROXY_MAX_WAIT | 120 | Max seconds to queue a request |
| FIREWORKS_PROXY_PROMPT_TPM_LIMIT | 4500000 | Total prompt tokens per minute before safety margin |
| FIREWORKS_PROXY_UNCACHED_PROMPT_TPM_LIMIT | 900000 | Uncached prompt tokens per minute before safety margin |
| FIREWORKS_PROXY_GENERATED_TPM_LIMIT | 36000 | Generated/output tokens per minute before safety margin |
| FIREWORKS_PROXY_TPM_SAFETY_RATIO | 0.80 | Fraction of Fireworks TPM limits used locally |
| FIREWORKS_PROXY_OUTPUT_TOKEN_RESERVE | 2048 | Output-token reserve when a request omits max_tokens |
| FIREWORKS_PROXY_CHARS_PER_TOKEN | 3.5 | Characters per token, used to estimate prompt tokens before forwarding |
| FIREWORKS_PROXY_DAILY_TOKEN_LIMIT | 200000000 | Daily token cap before the proxy returns 429 |
| FIREWORKS_PROXY_DAILY_TOKEN_WARN_RATIO | 0.80 | Log a warning after this share of the daily cap |
| FIREWORKS_PROXY_TIMEZONE | Australia/Melbourne | Calendar day used for daily usage rollover |
| FIREWORKS_PROXY_USAGE_STATE | daily_usage.json | Local file used to persist today's counted tokens |

# Default rate: slower but safer for 8+ terminals
$env:FIREWORKS_PROXY_RPS = 20
python fireworks_proxy.py

# Faster, less headroom (4 terminals)
$env:FIREWORKS_PROXY_RPS = 35
python fireworks_proxy.py
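As a rough sketch of how the token-related settings above might combine, the helpers below scale a limit by the safety ratio, estimate prompt tokens from character count, and fall back to the output reserve when `max_tokens` is missing. The function names and the rounding choices (truncate for budgets, round up for estimates) are assumptions made for illustration; the proxy's actual arithmetic may differ.

```python
import math
import os


def env_float(name, default):
    """Read a float setting from the environment, falling back to a default."""
    return float(os.environ.get(name, default))


def effective_tpm(limit, safety_ratio):
    # Local budget = Fireworks limit scaled by the safety ratio (assumed truncation).
    return int(limit * safety_ratio)


def estimate_prompt_tokens(body_text, chars_per_token):
    # Character-count estimate made before forwarding (assumed to round up).
    return math.ceil(len(body_text) / chars_per_token)


def reserved_output_tokens(payload, reserve):
    # Use the request's own max_tokens if present, else the configured reserve.
    max_tokens = payload.get("max_tokens")
    return max_tokens if max_tokens is not None else reserve


if __name__ == "__main__":
    ratio = env_float("FIREWORKS_PROXY_TPM_SAFETY_RATIO", 0.80)
    limit = env_float("FIREWORKS_PROXY_GENERATED_TPM_LIMIT", 36000)
    print("local generated-token budget:", effective_tpm(limit, ratio))
```

With the defaults in the table, the local generated-token budget works out to 36000 × 0.80 = 28800 tokens per minute, which is why a handful of long completions can fill it quickly.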

Kimi CLI settings

Also cap max_steps_per_turn in ~/.kimi/config.toml so no single terminal monopolises the bucket:

[loop_control]
max_steps_per_turn = 100

How It Works

  • Request bucket: 20 RPS sustained, with a burst capacity of 20 (1x the sustained rate)
  • Token budgets: rolling 60-second prompt, uncached prompt, and generated-token budgets
  • Adaptive limit learning: Fireworks X-Ratelimit-Limit-Tokens-* headers update live local budgets
  • Queueing: When request or token budgets are low, requests wait instead of being sent immediately
  • Streaming: SSE responses from Fireworks are forwarded chunk-by-chunk
  • Connection safety: A fresh TCP connection is opened for each request, which avoids corruption from reused connections
  • Metrics: Queue depth, wait times, token budgets, and daily usage are logged every 10 seconds
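The adaptive limit learning step in the list above could look roughly like the sketch below. The README only gives the header prefix `X-Ratelimit-Limit-Tokens-*`, so the sketch matches on that prefix and treats whatever follows as an opaque key. The function names, the `budgets` dictionary, and the suffix used in the usage example are hypothetical.

```python
PREFIX = "x-ratelimit-limit-tokens-"


def learn_token_limits(headers):
    """Return {suffix: limit} for every matching header with an integer value."""
    limits = {}
    for name, value in headers.items():
        key = name.lower()  # HTTP header names are case-insensitive
        if not key.startswith(PREFIX):
            continue
        try:
            limits[key[len(PREFIX):]] = int(value)
        except ValueError:
            continue  # ignore values that are not plain integers
    return limits


def apply_limits(budgets, learned, safety_ratio):
    """Update known budgets in place with learned limits scaled by the ratio."""
    for kind, limit in learned.items():
        if kind in budgets:  # unknown suffixes are left alone
            budgets[kind] = int(limit * safety_ratio)
    return budgets


if __name__ == "__main__":
    # "generated" is a made-up suffix for illustration only.
    response_headers = {"X-Ratelimit-Limit-Tokens-Generated": "30000"}
    budgets = {"generated": 28800}
    print(apply_limits(budgets, learn_token_limits(response_headers), 0.80))
```

Matching on the prefix rather than on fixed names keeps the sketch from guessing at header suffixes that the README does not spell out.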

Why Burst Matters

Your 8 terminals may sustain only a low request rate over time, but they can all send large prompts at once. Fireworks' adaptive limits can drop after spikes, so the proxy leaves headroom instead of running right at the dotted rate-limit line shown on the dashboard.
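To make the request-side burst behaviour concrete, here is a minimal token-bucket sketch using the 20 RPS rate and capacity of 20 listed under How It Works. The class, its `wait_time` method, and the choice to let the balance go negative to reserve a slot for queued callers are assumptions for illustration, not the proxy's actual implementation.

```python
import time


class TokenBucket:
    """Request bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst passes
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def wait_time(self):
        """Take one token and return how long the caller should wait first."""
        self._refill()
        self.tokens -= 1
        if self.tokens >= 0:
            return 0.0
        # A negative balance means this caller has a reserved future slot.
        return -self.tokens / self.rate


if __name__ == "__main__":
    bucket = TokenBucket(rate=20, capacity=20)
    waits = [bucket.wait_time() for _ in range(25)]
    print("requests delayed:", sum(1 for w in waits if w > 0))
```

In an asyncio server the caller would `await asyncio.sleep(wait)` before forwarding. A burst of up to 20 requests passes immediately, and anything beyond that is spaced out at the sustained rate.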

Example Output

2026-05-21 00:31:59,656 INFO Fireworks proxy starting on http://127.0.0.1:8787 -> https://api.fireworks.ai/inference (RPS limit=20.0, prompt TPM=4500000, uncached prompt TPM=900000, generated TPM=36000, TPM safety=0.80, daily limit=200000000)
2026-05-21 00:32:09,658 INFO stats | bucket={'rate': 20.0, 'capacity': 20.0, 'tokens': 20.0} | last_60s=0 req | token_budgets={...} | daily_tokens=722859/200000000 remaining=199277141

Files

| File | Purpose |
| --- | --- |
| fireworks_proxy.py | The proxy server (aiohttp) |
| kill_proxy.ps1 | PowerShell helper to kill a stuck proxy process |

Troubleshooting

Port already in use

# Run the kill script
.\kill_proxy.ps1

# Or manually
Get-NetTCPConnection -LocalPort 8787 | Select-Object OwningProcess
Stop-Process -Id <PID> -Force

Proxy not working

Check health endpoint:

curl http://localhost:8787/health

Check daily token usage:

curl http://localhost:8787/usage

Daily usage limit

The proxy defaults to a cap of 200M tokens per day. This is derived from the recent analytics total: 1.19B tokens over 3 days ≈ 396.7M/day, halved to 198.3M/day, then rounded up to 200M.

When the proxy sees Fireworks token usage in a response, it records it in daily_usage.json. At 80% of the cap it logs a warning. Once the cap is hit, new requests receive a clean 429 daily_token_limit response until the next Melbourne calendar day.
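The daily accounting described above can be sketched as follows. The JSON layout (`day` and `tokens` keys), the class name `DailyUsage`, and the three status strings are assumptions made for illustration; the real daily_usage.json format may differ.

```python
import json
from datetime import datetime
from pathlib import Path
from zoneinfo import ZoneInfo


def local_day(tz_name="Australia/Melbourne"):
    """Today's date in the rollover timezone, as an ISO string."""
    return datetime.now(ZoneInfo(tz_name)).date().isoformat()


class DailyUsage:
    """Persists a per-day token total and reports ok / warn / blocked."""

    def __init__(self, path, limit=200_000_000, warn_ratio=0.80):
        self.path = Path(path)
        self.limit = limit
        self.warn_ratio = warn_ratio

    def _load(self, day):
        try:
            state = json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            state = {}
        # A stored total from a previous day is discarded (rollover).
        return state.get("tokens", 0) if state.get("day") == day else 0

    def record(self, tokens, day):
        """Add tokens to the day's total, persist it, and return the total."""
        total = self._load(day) + tokens
        self.path.write_text(json.dumps({"day": day, "tokens": total}))
        return total

    def status(self, day):
        total = self._load(day)
        if total >= self.limit:
            return "blocked"  # the proxy would answer 429 here
        if total >= self.limit * self.warn_ratio:
            return "warn"  # the proxy would log a warning here
        return "ok"
```

A caller would pass `local_day()` as the `day` argument, so the total resets at midnight in the configured timezone. On Windows, `zoneinfo` usually needs the `tzdata` package installed to resolve names such as Australia/Melbourne.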

Still getting 429s

First, lower the token safety ratio, because Fireworks' adaptive limit (the dotted line on the dashboard) may have moved down:

$env:FIREWORKS_PROXY_TPM_SAFETY_RATIO = 0.60
python fireworks_proxy.py

If the log shows request bursts rather than token throttling, lower the RPS:

$env:FIREWORKS_PROXY_RPS = 15
python fireworks_proxy.py

Requirements

  • Python 3.10+
  • aiohttp and PyJWT (pip install aiohttp PyJWT)
  • Fireworks API key (set in Kimi config)
