[Feature] Operator CLI (`areal`) for the v2 microservice architecture

## Checklist

- [x] This feature will maintain backward compatibility with the current APIs in
  `areal/api/`. If not, please raise a refactor issue first.

## Background

The v2 refactor split AReaL into separate microservices — inference, agent, weight-update, training — each exposing a clean HTTP gateway. 

The proposal: a single `areal` console-script that exposes one sub-CLI per service, drives the same gateways the controllers already drive, and persists the minimum local state needed for commands to find each other across invocations. Concretely, it covers four scopes:

```
areal inf            # operate an inference service (gateway+router+models)
areal agent          # operate an agent service (gateway+router+sessions)
areal train          # submit and observe training jobs
areal weight-update  # diagnose weight-sync state between train and inference
```
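
A minimal sketch of how the four scopes could hang off one parser, assuming `argparse` and using the verb lists from this proposal. The handler module path `areal_cli_sketch` is a hypothetical placeholder, not an existing AReaL module; handlers are stored as strings and imported only on dispatch, so building the parser stays cheap.

```python
import argparse
import importlib

# Verbs per namespace, copied from this proposal.
NAMESPACES = {
    "inf": ["run", "stop", "status", "ps", "register", "deregister",
            "models", "chat", "collect", "logs"],
    "agent": ["run", "stop", "status", "reward", "logs"],
    "train": ["run", "start", "stop", "ps", "status", "logs"],
    "weight-update": ["status", "pairs", "ps", "logs"],
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="areal")
    scopes = parser.add_subparsers(dest="scope", required=True)
    for scope, verbs in NAMESPACES.items():
        scope_parser = scopes.add_parser(scope)
        verb_parsers = scope_parser.add_subparsers(dest="verb", required=True)
        for verb in verbs:
            verb_parser = verb_parsers.add_parser(verb)
            # Hypothetical module path; kept as a string so nothing heavy
            # is imported while the parser is being built.
            module = f"areal_cli_sketch.{scope.replace('-', '_')}"
            verb_parser.set_defaults(handler=f"{module}:{verb}")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    module_name, func_name = args.handler.split(":")
    try:
        handler = getattr(importlib.import_module(module_name), func_name)
    except (ImportError, AttributeError):
        # Scaffold behavior: a verb without a handler reports and exits cleanly.
        print(f"areal {args.scope} {args.verb}: not yet implemented")
        return 0
    return handler(args) or 0
```

With this shape, `main(["inf", "status"])` resolves to the string `areal_cli_sketch.inf:status` before any handler code is loaded.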

## Potential Solution

The CLI design follows four principles:

- **Packaging: one binary, four namespaces.** A user installs the project and gets a single `areal` command. The four sub-CLIs (`inf`, `agent`, `train`, `weight-update`) are subcommands of that one binary, not separate scripts.
- **Lightness: `areal --help` must not load the training stack.** A user on a login node with no GPUs, no sglang/vllm install, and no megatron should still be able to install the project and run `areal inf status` against a remote service.
- **State: small files under `~/.areal/`, no background process.** Commands need to find each other across invocations: when I run `areal inf register` ten minutes after `areal inf run`, the second command has to know which gateway address the first one started on.
- **Process model: non-daemon, with eyes open.** When the CLI launches a service, it spawns the component processes detached (`start_new_session=True`), records their PIDs in the state file, and exits.
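
The state and process-model principles could be sketched as follows. This is an illustration only: the one-JSON-file-per-service layout, the key names, and the helper names (`launch_detached`, `load_state`, `is_alive`) are assumptions, not a settled format. Only `start_new_session=True` and the idea of recording PIDs in a state file come from the proposal.

```python
import json
import os
import subprocess
from pathlib import Path


def launch_detached(name, cmd, state_dir):
    """Spawn cmd in its own session, record it in a state file, return the PID."""
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    log_path = state_dir / f"{name}.log"
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the CLI's session (POSIX)
        )
    # Hypothetical state shape: one small JSON file per service.
    state = {"name": name, "pid": proc.pid, "cmd": list(cmd), "log": str(log_path)}
    # Write to a temp file and rename so a reader never sees a partial file.
    tmp = state_dir / f"{name}.json.tmp"
    tmp.write_text(json.dumps(state, indent=2))
    os.replace(tmp, state_dir / f"{name}.json")
    return proc.pid


def load_state(name, state_dir):
    """Read back what an earlier invocation recorded."""
    return json.loads((Path(state_dir) / f"{name}.json").read_text())


def is_alive(pid):
    """Signal 0 checks that a process exists without delivering a signal (POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True
```

A later command such as `status` or `ps` would call `load_state` and `is_alive` instead of talking to a long-lived daemon. A recorded PID can be reused by an unrelated process after the original exits, so a real implementation would likely want an additional check against the gateway itself.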

### Per-module design

#### `areal inf` — inference operator console

The user mental model is close to Ollama: bring up a service, attach one or more models, send requests, tear down.

```
areal inf run          # launch gateway + router (optionally with --model inline)
areal inf stop         # tear them down
areal inf status       # health for one service
areal inf ps           # list all locally known services
areal inf register     # attach a model (external = HTTP only; internal = spawn backends)
areal inf deregister   # detach a model (and stop its backends if internal)
areal inf models       # list attached models
areal inf chat         # one-shot or REPL chat with a model
areal inf collect      # collect a batch of rollout trajectories
areal inf logs         # show gateway / router / model logs
```

#### `areal agent` — agent operator console

The agent service has the same gateway/router/data-proxy/worker shape as inference, but the primary unit of interaction is different. For `inf` it's the model; for `agent` it's the **session** — a multi-turn conversation that carries tool state on the data-proxy side, can hold an RL session key obtained from the inference service, and needs explicit creation, switching, reward, and timeout semantics.

```
areal agent run             # launch router + N pairs + gateway
areal agent stop
areal agent status
areal agent reward          # send reward to inference for a session
areal agent logs
```

#### `areal train` — job submitter, not service controller

Training has a different lifecycle than `inf` or `agent`: training jobs terminate, services don't. The scheduling decision (local / slurm / ray) is already inside the driver, so the CLI's job is purely lifecycle wrapping. The verbs therefore look like a job submitter rather than a service controller:

```
areal train run     # run a driver in the foreground (small jobs, debugging)
areal train start   # spawn a detached driver process (cluster jobs)
areal train stop    # signal a running job by name
areal train ps      # list locally tracked jobs
areal train status  # status of one job
areal train logs    # tail a job's combined stdout/stderr
```

A driver entry is resolved by precedence: `--driver module:func` on the CLI, then a top-level `driver:` field in the YAML config, then a built-in fallback per command. The CLI stays scheduler-agnostic; whether the driver launches locally, on Slurm, or on Ray is decided inside the driver based on `config.scheduler.type`, exactly as today's hand-written example scripts already do.
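
The precedence rule above can be sketched as a small resolver. The function names and the assumption that the YAML config has already been parsed into a `dict` are illustrative; only the `module:func` form and the CLI, then YAML, then fallback order come from the proposal.

```python
import importlib


def resolve_driver_spec(cli_driver, config, fallback):
    """Pick the 'module:func' spec by precedence: CLI flag, YAML field, fallback."""
    if cli_driver:
        return cli_driver
    yaml_driver = (config or {}).get("driver")  # top-level `driver:` field
    if yaml_driver:
        return yaml_driver
    return fallback


def load_driver(spec):
    """Import the module half of 'module:func' and return the named callable."""
    module_name, sep, func_name = spec.partition(":")
    if not sep or not module_name or not func_name:
        raise ValueError(f"expected 'module:func', got {spec!r}")
    module = importlib.import_module(module_name)
    try:
        return getattr(module, func_name)
    except AttributeError:
        raise ValueError(f"{module_name!r} has no attribute {func_name!r}") from None
```

Keeping resolution separate from loading means `areal train run --help` never has to import the driver module, which fits the lightness principle.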

#### `areal weight-update` — diagnostic console

The weight-update service sits between training and inference, exposing `/connect`, `/update_weights`, `/disconnect`, and a `/weight_meta/*` family. The operator-facing surface is genuinely small:

```
areal weight-update status      # is the gateway alive? how many pairs connected?
areal weight-update pairs       # list connected (train, inference) pairs and versions
areal weight-update ps          # list locally known weight-update services
areal weight-update logs        # tail the gateway log
```

There's no `run` in the first cut, because in the v2 flow the weight-update gateway is brought up by the training-side controller, not by the operator. If we later want operator-initiated launches (for testing or for split-cluster setups), `run` can be added without disturbing the rest of the surface.

## Implementation plan

Earlier stages lock the surface; later stages fill in behavior behind that fixed surface, and features are enriched gradually as the v2 functionality becomes more complete.

| Stage | What lands | What it locks down |
|---|---|---|
| 1 | Scaffold: console-script, full parser tree (every verb's `--help` matches its final shape), `~/.areal/` directory layout, lightness test. Verbs print "not yet implemented" but exit cleanly. | Command names, flag matrices, state file shapes — everything later stages depend on |
| 2 | `areal inf run / stop / status / ps / logs` real behavior | inference service lifecycle |
| 3 | `areal inf scaleup / scaledown / models`| inference service management |
| 4 | `areal agent` lifecycle + session management + chat | agent operator surface |
| 5 | `areal train run / start / stop / ps / status / logs` | job submitter |
| 6 | `areal weight-update status / pairs / ps / logs` (+ optional `force-sync`) | diagnostic console |

Stage 1 is by far the smallest and the most important: it freezes the command names, the flag shapes, and the on-disk layout. Once it merges, the remaining stages can land independently without changing the user's mental model.
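
One way the Stage 1 lightness test could be written is to import the CLI entry module in a fresh interpreter and check which heavy modules ended up in `sys.modules`. The helper name and the entry-module path in the usage note are assumptions for illustration.

```python
import json
import subprocess
import sys


def modules_loaded_by(import_name, watched):
    """Import import_name in a fresh interpreter; return the watched modules it pulled in."""
    probe = (
        "import importlib, json, sys\n"
        f"importlib.import_module({import_name!r})\n"
        f"print(json.dumps(sorted(m for m in {list(watched)!r} if m in sys.modules)))\n"
    )
    out = subprocess.run(
        [sys.executable, "-c", probe],
        check=True, capture_output=True, text=True,
    )
    # The probe prints its JSON result last, after anything the import itself printed.
    return json.loads(out.stdout.strip().splitlines()[-1])
```

A test could then assert that `modules_loaded_by("areal.cli", ["torch", "sglang", "vllm", "megatron"])` returns an empty list, where `areal.cli` stands in for wherever the console-script entry point ends up living and `torch` is an assumed member of the training stack. Running the probe in a subprocess keeps the result independent of whatever the test runner has already imported.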

## Additional Information

Existing controllers this complements rather than replaces:

- `areal/experimental/inference_service/controller/controller.py`
- `areal/experimental/agent_service/controller/controller.py`
- `areal/experimental/weight_update/controller/controller.py`

