Checklist
Background
The v2 refactor split AReaL into separate microservices — inference, agent, weight-update, training — each exposing a clean HTTP gateway.
The proposal: a single areal console-script that exposes one sub-CLI per service, drives the same gateways the controllers already drive, and persists the minimum local state needed for commands to find each other across invocations. Concretely it covers four scopes:
areal inf # operate an inference service (gateway+router+models)
areal agent # operate an agent service (gateway+router+sessions)
areal train # submit and observe training jobs
areal weight-update # diagnose weight-sync state between train and inference
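As a rough illustration of the "one command, four scopes" shape, here is a minimal argparse sketch. It is not the proposed implementation: the function names `build_parser` and `main` are hypothetical, the help strings are paraphrased from the list above, and only the `status` verb (which every scope has) is wired in.

```python
import argparse

# Scope names are taken from the proposal; help strings paraphrase it.
SCOPES = {
    "inf": "operate an inference service",
    "agent": "operate an agent service",
    "train": "submit and observe training jobs",
    "weight-update": "diagnose weight-sync state",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a two-level parser: `areal <scope> <verb>` (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="areal")
    scopes = parser.add_subparsers(dest="scope", required=True)
    for name, help_text in SCOPES.items():
        scope = scopes.add_parser(name, help=help_text)
        # Each scope owns its own verb namespace; only `status` is registered here.
        verbs = scope.add_subparsers(dest="verb", required=True)
        verbs.add_parser("status", help="show status")
    return parser


def main(argv=None) -> int:
    """Parse argv and print a placeholder, as a scaffold stage might."""
    args = build_parser().parse_args(argv)
    print(f"areal {args.scope} {args.verb}: not yet implemented")
    return 0
```

Calling `main(["inf", "status"])` prints the placeholder line and returns 0. A real entry point would dispatch to per-scope handlers instead.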
Potential Solution
The principles of CLI design include:
Packaging: one binary, four namespaces. A user installs the project and gets a single areal command. The four sub-CLIs (inf, agent, train, weight-update) are subcommands of that one binary, not separate scripts.
Lightness: areal --help must not load the training stack. A user on a login node that has no GPUs, no sglang/vllm install, no megatron, should still be able to install the project and run areal inf status against a remote service.
State: small files under ~/.areal/, no background process. Commands need to find each other across invocations: when I run areal inf register ten minutes after areal inf run, the second command has to know which gateway address the first one started on.
Process model: non-daemon, with eyes open. When the CLI launches a service, it spawns the component processes detached (start_new_session=True), records their PIDs in the state file, and exits.
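The State and Process model principles can be sketched together as follows. This is a minimal illustration under stated assumptions, not the proposed design: the one-JSON-file-per-service layout, the field names (`name`, `gateway_addr`, `pids`), and the helpers `launch_detached` and `load_state` are all hypothetical. Only `start_new_session=True` and the idea of recording PIDs in a state file come from the proposal.

```python
import json
import subprocess
from pathlib import Path


def launch_detached(name, argv, gateway_addr, state_dir):
    """Spawn argv in its own session, record it on disk, return the state path."""
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child keeps running after the CLI exits
    )
    # Assumed state shape: just enough for a later command to find the service.
    state = {"name": name, "gateway_addr": gateway_addr, "pids": [proc.pid]}
    path = state_dir / f"{name}.json"
    tmp = state_dir / f"{name}.json.tmp"
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)  # rename into place so readers never see a partial file
    return path


def load_state(name, state_dir):
    """Read back what an earlier invocation recorded."""
    return json.loads((Path(state_dir) / f"{name}.json").read_text())
```

A later command such as areal inf register could then call `load_state` to recover the gateway address, without any long-lived background process holding it in memory.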
Per-module design
areal inf — inference operator console
The user mental model is close to Ollama: bring up a service, attach one or more models, send requests, tear down.
areal inf run # launch gateway + router (optionally with --model inline)
areal inf stop # tear them down
areal inf status # health for one service
areal inf ps # list all locally known services
areal inf register # attach a model (external = HTTP only; internal = spawn backends)
areal inf deregister # detach a model (and stop its backends if internal)
areal inf models # list attached models
areal inf chat # one-shot or REPL chat with a model
areal inf collect # collect a batch of rollout trajectories
areal inf logs # show gateway / router / model logs
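Because there is no daemon, a verb like ps has to reconstruct liveness from the state files alone. A minimal sketch of that idea, assuming one JSON file per service with hypothetical `name`, `gateway_addr`, and `pids` fields; `pid_alive` and `list_services` are illustrative names, and the signal-0 probe is POSIX-only.

```python
import json
import os
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def list_services(state_dir):
    """Return (name, gateway_addr, status) for every state file found."""
    rows = []
    for path in sorted(Path(state_dir).glob("*.json")):
        state = json.loads(path.read_text())
        alive = [pid_alive(pid) for pid in state.get("pids", [])]
        # "stale" covers both crashed services and leftover state files.
        status = "running" if alive and all(alive) else "stale"
        rows.append((state.get("name", path.stem), state.get("gateway_addr", "?"), status))
    return rows
```

A PID check alone cannot distinguish a recycled PID from the original process, so a real implementation would likely also probe the gateway over HTTP before reporting a service as healthy.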
areal agent — agent operator console
The agent service has the same gateway/router/data-proxy/worker shape as inference, but the primary unit of interaction is different. For inf it's the model; for agent it's the session — a multi-turn conversation that carries tool state on the data-proxy side, can hold an RL session key obtained from the inference service, and needs explicit creation, switching, reward, and timeout semantics.
areal agent run # launch router + N pairs + gateway
areal agent stop
areal agent status
areal agent reward # send reward to inference for a session
areal agent logs
areal train — job submitter, not service controller
Training has a different lifecycle than inf or agent: training jobs terminate, services don't. The scheduling decision (local / slurm / ray) is already inside the driver, so the CLI's job is purely lifecycle wrapping. The verbs therefore look like a job submitter rather than a service controller:
areal train run # run a driver in the foreground (small jobs, debugging)
areal train start # spawn a detached driver process (cluster jobs)
areal train stop # signal a running job by name
areal train ps # list locally tracked jobs
areal train status # status of one job
areal train logs # tail a job's combined stdout/stderr
A driver entry is resolved by precedence: --driver module:func on the CLI, then a top-level driver: field in the YAML config, then a built-in fallback per command. The CLI stays scheduler-agnostic; whether the driver launches locally, on Slurm, or on Ray is decided inside the driver based on config.scheduler.type, exactly as today's hand-written example scripts already do.
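The precedence rule above can be sketched as two small functions. This is an illustration only: `resolve_driver_spec` and `load_driver` are hypothetical names, and the config is assumed to be already parsed from YAML into a mapping.

```python
import importlib
from typing import Any, Callable, Mapping, Optional


def resolve_driver_spec(
    cli_driver: Optional[str], config: Mapping[str, Any], fallback: str
) -> str:
    """Pick the driver spec: CLI flag, then config `driver:` field, then fallback."""
    if cli_driver:
        return cli_driver
    if config.get("driver"):
        return str(config["driver"])
    return fallback


def load_driver(spec: str) -> Callable[..., Any]:
    """Import a `module:func` spec and return the callable it names."""
    module_name, sep, func_name = spec.partition(":")
    if not sep or not module_name or not func_name:
        raise ValueError(f"expected 'module:func', got {spec!r}")
    func = getattr(importlib.import_module(module_name), func_name)
    if not callable(func):
        raise TypeError(f"{spec!r} does not name a callable")
    return func
```

Deferring the import to `load_driver` also fits the lightness principle: the driver module, and whatever training stack it pulls in, is only imported once a train verb actually needs to run it.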
areal weight-update — diagnostic console
The weight-update service sits between training and inference, exposing /connect, /update_weights, /disconnect, and a /weight_meta/* family. The operator-facing surface is genuinely small:
areal weight-update status # is the gateway alive? how many pairs connected?
areal weight-update pairs # list connected (train, inference) pairs and versions
areal weight-update ps # list locally known weight-update services
areal weight-update logs # tail the gateway log
There's no run in the first cut, because in the v2 flow the weight-update gateway is brought up by the training-side controller, not by the operator. If we later want operator-initiated launches (for testing or for split-cluster setups), run can be added without disturbing the rest of the surface.
Implementation plan
Earlier stages lock the surface; later stages fill in behavior behind that fixed surface. Features are enriched gradually as the v2 functionality becomes more complete.
| Stage | What lands | What it locks down |
| --- | --- | --- |
| 1 | Scaffold: console-script, full parser tree (every verb's --help matches its final shape), ~/.areal/ directory layout, lightness test. Verbs print "not yet implemented" but exit cleanly. | Command names, flag matrices, state file shapes: everything later stages depend on |
| 2 | areal inf run / stop / status / ps / logs real behavior | inference service lifecycle |
| 3 | areal inf scaleup / scaledown / models | inference service management |
| 4 | areal agent lifecycle + session management + chat | agent operator surface |
| 5 | areal train run / start / stop / ps / status / logs | job submitter |
| 6 | areal weight-update status / pairs / ps / logs (+ optional force-sync) | diagnostic console |
Stage 1 is by far the smallest and the most important: it freezes the command names, the flag shapes, and the on-disk layout. Once it merges, the remaining stages can land independently without changing the user's mental model.
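One way the Stage 1 lightness test could work is to import the CLI module in a fresh interpreter and check which heavy packages ended up in `sys.modules`. The sketch below is an assumption about how such a test might look, not the planned test; `heavy_modules_loaded` is a hypothetical helper.

```python
import json
import subprocess
import sys


def heavy_modules_loaded(module: str, heavy: tuple) -> list:
    """Import `module` in a clean interpreter; return the heavy names it pulled in."""
    probe = (
        "import importlib, json, sys\n"
        f"importlib.import_module({module!r})\n"
        f"print(json.dumps(sorted(set({list(heavy)!r}) & set(sys.modules))))\n"
    )
    out = subprocess.run(
        [sys.executable, "-c", probe],
        check=True,
        capture_output=True,
        text=True,
    )
    # The probe prints the JSON list last, so ignore any earlier import-time output.
    return json.loads(out.stdout.strip().splitlines()[-1])
```

A test could then assert that `heavy_modules_loaded("areal.cli", ("torch", "sglang", "vllm", "megatron"))` returns an empty list. The module path `areal.cli` and the inclusion of `torch` are assumptions; the other package names are the ones the lightness principle mentions.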
Additional Information
Existing controllers this complements rather than replaces:
areal/experimental/inference_service/controller/controller.py,
areal/experimental/agent_service/controller/controller.py,
areal/experimental/weight_update/controller/controller.py.