
feat(web): Web UI #236

Open
vadv wants to merge 67 commits into master from feat/web-ui

Conversation

@vadv
Collaborator

@vadv vadv commented May 6, 2026

Summary

Full Web UI rollout for pg_doorman on feat/web-ui. The pooler binary now exposes a multiplexed listener with admin auth, 22 read-only JSON endpoints, an in-memory log tap, and a React SPA embedded into the binary. Operators get a self-contained UI on :9127 without extra services. Subsequent commits harden the build pipeline, fill in the missing §15.4 threshold rules, add a per-pool error breakdown by SQLSTATE, and turn on the Web UI in the Grafana docker-compose demo.

dmitrivasilyev added 14 commits May 6, 2026 08:20
Captures the design agreed during brainstorm:
- scope: observability + live-tail logs (no write-commands in MVP)
- architecture: relocate src/prometheus to src/web, single listener
  serves both /metrics and /api/*
- config: [web] section with serde-alias for [prometheus] (back-compat),
  ui/ui_anonymous/log_tap_kb flags with safe defaults
- auth: reuse admin_username/admin_password, basic-auth on admin paths,
  refuse to enable UI when admin_password is the default ("admin"/empty)
- LogTap: lazy ArcSwap-based ring with reaper task (30s grace)
- frontend: React+TypeScript SPA embedded via include_dir!,
  uPlot for charts, sessionStorage for in-browser history

Adds .superpowers/ to .gitignore (brainstorm session workdir).
Updates 2026-05-06-web-ui-design.md with findings from four parallel
reviews (perf, UX, DBA/DevOps, dashboard research): lock-free MPSC LogTap,
six-page navigation with drawer-based drill-down, sort/filter/URL-state
on tables, /api/top/* endpoints, threshold-driven health computed on the
frontend, and a dedicated observability layout & thresholds section.

Adds 2026-05-06-web-ui-design-system.md: industrial/utilitarian visual
language with IBM Plex Sans+Mono, dark-primary palette, sidebar 220 px,
dense 32 px tables, threshold paint mixin, four-sparkline Golden Signals
strip, keyboard shortcuts, and three empty-state variants.

Adds plans/2026-05-06-web-ui-phase-1.md: bite-sized TDD plan for the
first refactor step — rename [prometheus] config section to [web] with
a serde alias, move src/prometheus to src/web/metrics. No behaviour
change, namespace preparation for upcoming phases.
…tion for the Web UI

Operators continue to point Grafana at /metrics with no changes — the
legacy [prometheus] section name is kept as a serde alias on Config::web.
Three new config keys (ui, ui_anonymous, log_tap_max_entries) appear in
the generated reference configs but stay inactive by default; nothing
observable changes for existing deployments.
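
To make the back-compat shape concrete, here is a sketch of the two equivalent section headers; the three key names are the ones documented in this PR, and any other keys the section carries are unchanged (this fragment is illustrative, not copied from the shipped reference config):

```toml
# Legacy form — existing deployments keep working unchanged;
# the section name is accepted via a serde alias on Config::web.
[prometheus]
# ...existing keys...

# New form, with the three added keys at their inactive defaults:
[web]
ui = false
ui_anonymous = true
log_tap_max_entries = 8192
```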

Internally, the prometheus module is repositioned under web::metrics,
freeing the web:: namespace for the upcoming auth, log_tap, REST routes,
and SPA embedding. Doing this rename now, while nothing depends on a
public web namespace shape, is cheaper than doing it after release.

Verified by release-build smoke test: the same /metrics output is served
identically with either [web] or the legacy [prometheus] section header.
634 tests pass, clippy and fmt clean.

Phase 1 of seven; phase 2 wires the listener mux and basic-auth.
Documents an architectural decision that affects how the upcoming Web UI
ships: built frontend bundles are committed alongside their sources so
the release-pipeline (RPM, DEB, Docker) stays cargo-only. Operators
distributing pg_doorman do not need a node toolchain, and the Rust
release machinery does not gain a new dependency.

Lint and typecheck remain mandatory in a separate frontend CI job that
also rebuilds the bundle and fails on diff against the committed dist.
That guards against developers forgetting to rebuild after editing
sources.

Updates section 4.4 (embedding), 10.4 (build/CI), 12.4 (frontend tests),
14 (release checklist), and adds decision log entry 22. Also fixes a
stale `log_tap_kb = 64` reference in 13.1 to the current
`log_tap_max_entries = 8192`. Marks phase 1 as DONE in section 14.
The web listener now serves more than /metrics. When [web].ui = true and
admin_password is non-default, GET /api/* and the SPA paths participate
in dispatch. /metrics behaviour is byte-identical: the same listener
routes it before any auth or dispatch logic runs.

Operators with default or empty admin_password see a single warning line
on startup and the UI stays off; /metrics still works for them. This
closes the foot-gun where someone enables `ui = true` but forgets to
change the seed credential.

Public /api/* routes are gated by [web].ui_anonymous; admin-only paths
(/api/logs, /api/prepared/text/, /api/interner/top) always require
basic-auth. Phase 2 ships only the gating; every /api/* request that
makes it through auth returns 501 with a stub body. Real handlers land
in phase 3.

The auth check uses constant-time credential comparison via the subtle
crate, so response timing leaks nothing about the expected credential.
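
The subtle crate provides this primitive; as a std-only illustration of the idea (not the PR's actual code), the comparison XOR-accumulates every byte so the time taken does not depend on where the first mismatch occurs:

```rust
/// Constant-time byte-slice equality: examines every byte regardless of
/// where the first difference is, so response time reveals no matching
/// prefix. (The real code uses the `subtle` crate; this is a sketch.)
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // length is not secret here
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without early exit
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"s3cret", b"s3cret"));
    assert!(!ct_eq(b"s3cret", b"s3cr3t"));
    println!("ok");
}
```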

Tests: 664 passed (was 634 baseline + 30 new across web::auth,
web::server, web::tests), clippy and fmt clean. Verified by release
smoke against /metrics, /api/overview, /api/logs (anonymous and
authenticated), and the default-password configuration.

Phase 2 of seven; phase 3 fills /api/* with real handlers.
The web listener now answers three GET routes when [web].ui = true:
runtime status, aggregated counters across the whole pooler, and a
per-pool snapshot. The wire shape matches spec sections 8.1 through
8.4 verbatim, so the upcoming frontend pages will read these
responses without any field renaming. Anonymous access is permitted
because public routes default to ui_anonymous = true.

Per-pool fields include real error counters and wait-time percentiles
sourced directly from PoolStats; nothing is hardcoded to zero. An
operator can already correlate /api/pools output with SHOW POOLS via
psql admin, including the wait_p95_ms and errors_total signals that
drive future health-pill rules.

The original spec section 16 declared two backend "must-have gaps"
for these signals; they turned out to be already populated by
existing PoolStats fields. Section 16 is rewritten as Grafana
nice-to-haves so future work focuses on Prometheus parity and label
breakdowns rather than on closing UI-blocking gaps that don't exist.

/metrics behaviour and the /admin protocol are untouched.

Tests: 674 passed (was 664), clippy and fmt clean. Verified by
release-build smoke against the three endpoints plus regression
checks that an unwired path returns 501 and /metrics still returns
200.

Phase 3a of seven; phase 3b adds /api/clients and /api/servers.
…nation

Operators inspecting the pooler from the Web UI can now hit /api/clients
and /api/servers, narrow the result by pool, database, user, application
name, or state, and page through the response with ?limit and ?offset.
The default sort orders the most useful column for triage: clients by
queries_total desc, servers by connection age desc.

Why server-side filter and pagination: a busy pooler may have thousands
of PostgreSQL clients connected. Even one Web UI user listing them in a
single response is wasteful — limit and offset cap the JSON size and
let the frontend build pagination UI without parsing a megabyte of
output. Web UI usage is light (occasional operator visits, not
concurrent load); the goal is response shape, not throughput.
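
Under those assumptions, the limit/offset handling is just an iterator window over the already-filtered, already-sorted snapshot; a hypothetical sketch (names invented for illustration):

```rust
/// Apply ?limit and ?offset to a filtered, sorted list of rows.
/// `total` is measured before windowing so the frontend can render
/// "showing X–Y of total" pagination without a second request.
fn paginate<T: Clone>(rows: &[T], offset: usize, limit: usize) -> (usize, Vec<T>) {
    let total = rows.len();
    let page = rows.iter().skip(offset).take(limit).cloned().collect();
    (total, page)
}

fn main() {
    let rows: Vec<u32> = (0..7).collect();
    let (total, page) = paginate(&rows, 5, 3);
    assert_eq!(total, 7);
    assert_eq!(page, vec![5, 6]); // window clipped at the end of the list
    println!("ok");
}
```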

ClientStats now stamps the nanoseconds-from-connect on every transition
between active, idle, and waiting logical groups so wait_ms and
current_query_age_ms can report the duration spent in the current
state. The stamp is skipped on intra-group transitions (ACTIVE_READ ↔
ACTIVE_WRITE etc.), which keeps the per-query cost of the existing hot
path unchanged: the SQL transition pathway hits at most one extra
state_group comparison plus an atomic store on actual group entry.
ServerStats already exposed an equivalent active_age_ms accessor.

Tests: 715 passed (was 674), clippy and fmt clean. New coverage:
direct unit tests for collect_clients and collect_servers (every
filter dimension, every sort variant in both orders, pagination
boundaries), plus a state-since-nanos test that verifies the
intra-group optimisation does not move the timestamp.

Verified by release smoke against /api/clients and /api/servers with
default response, sort plus order, limit, percent-encoded pool filter,
and that /metrics still returns 200.

Phase 3b of seven; phase 3c lands the ConfigState routes (/api/config,
/api/connections, /api/stats, /api/databases, /api/users, /api/log_level,
/api/auth_query, /api/pool_scaling, /api/pool_coordinator, /api/sockets).
ConfigState page in the upcoming Web UI needs a per-tab data source for
connection counters, per-pool stats, configured databases, and configured
users. This commit wires the four list endpoints. Field names mirror the
SHOW CONNECTIONS / SHOW STATS / SHOW DATABASES / SHOW USERS admin
columns one-to-one so operators recognise the values.

The shapes are flat lists with a `ts` timestamp; no filter, sort, or
pagination — these are configuration and aggregate views, not the
client/server lists where volume justified server-side query handling
in phase 3b.

The only intentional deviation from SHOW CONNECTIONS is `errors` being
computed via `saturating_sub` rather than wrapping subtraction. The
counters update independently and the categorised sum can momentarily
exceed `total`; saturating arithmetic prevents a transient u64 underflow
from surfacing as a value near u64::MAX on the dashboard.
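
The difference between the two arithmetic choices, in miniature (values invented for the example):

```rust
/// Derived `errors` value: clamp instead of wrap when the independently
/// updated category counters momentarily run ahead of `total`.
fn derived_errors(total: u64, categorised: u64) -> u64 {
    total.saturating_sub(categorised)
}

fn main() {
    // Normal case: 110 total, 100 categorised -> 10 uncategorised errors.
    assert_eq!(derived_errors(110, 100), 10);
    // Transient skew: wrapping subtraction would yield u64::MAX - 1 here;
    // saturating arithmetic clamps the momentary underflow to zero.
    assert_eq!(derived_errors(100, 102), 0);
    assert_eq!(100u64.wrapping_sub(102), u64::MAX - 1);
    println!("ok");
}
```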

Tests: 730 passed (was 715), clippy and fmt clean. Verified by release
binary smoke-tests against all four endpoints.

Phase 3c-1 of seven; phase 3c-2 lands the remaining ConfigState routes
(config with masking, log_level, auth_query, pool_scaling,
pool_coordinator, sockets).
…scaling /api/pool_coordinator /api/sockets

ConfigState page in the upcoming Web UI now has the rest of its data
sources: the active configuration (with secret values redacted), the
runtime log filter, the auth_query cache stats, the anticipation/burst
gate counters per pool, the per-database coordinator limits, and on
Linux the TCP/Unix socket-state breakdown.

Field names mirror the corresponding admin SHOW commands one-to-one.
The /api/sockets endpoint stays at parity with the existing platform
gate: Linux returns the counters, other operating systems return
503 not_supported.

Secret-value masking for /api/config is implemented as a pure helper
that redacts any key whose trailing path segment is exactly "password"
or "secret", or ends with _password / _secret / _token / _key. The
flat config representation today omits per-user passwords and
admin_password — that is a long-standing limitation of the existing
SHOW CONFIG conversion; when the conversion is extended in a future
PR the masker will pick the new keys up automatically.
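
A sketch of such a pure masker under the rules described above (the real helper's name and signature may differ):

```rust
/// Decide whether a flattened config key (e.g. "general.admin_password")
/// should have its value redacted: the trailing path segment is either
/// exactly "password"/"secret", or ends with one of the sensitive suffixes.
fn is_secret_key(key: &str) -> bool {
    let last = key.rsplit('.').next().unwrap_or(key);
    last == "password"
        || last == "secret"
        || ["_password", "_secret", "_token", "_key"]
            .iter()
            .any(|suffix| last.ends_with(suffix))
}

fn mask(key: &str, value: &str) -> String {
    if is_secret_key(key) { "*****".to_string() } else { value.to_string() }
}

fn main() {
    assert!(is_secret_key("general.admin_password"));
    assert!(is_secret_key("tls.secret"));
    assert!(!is_secret_key("general.port"));
    assert_eq!(mask("auth.api_token", "abc"), "*****");
    println!("ok");
}
```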

Tests: 754 passed (was 730), clippy and fmt clean. Verified by release
binary smoke-tests against all six endpoints.

Phase 3c-2 of seven; phase 3c-3 lands /api/prepared, /api/interner and
the admin-only stubs prepared/text/{hash} and interner/top.
…t /api/interner/top

The Caches page in the upcoming Web UI gets its public aggregates plus
the two admin-only endpoints for inspecting query bodies.

The public /api/prepared endpoint is the per-pool prepared-statement
summary; SQL bodies are intentionally absent from this response so
anonymous Web UI viewers cannot read query texts. The admin-only
/api/prepared/text/{hash} endpoint serves the body on demand. Likewise
/api/interner gives the global named/anonymous interner counts and
byte totals, and the admin-only /api/interner/top?n=N returns the
heaviest entries with a 120-character preview, capped at n=200 so a
100k-entry interner does not turn into an unbounded preview list.

Tests: 778 lib tests passed (was 754); `cargo clippy --lib` and
`cargo fmt --check` clean. Verified by release binary smoke-tests
against all four endpoints, including the 401 anonymous gate on the
admin paths and the 404 path for an unknown hash.

Phase 3c-3 of seven; phase 3d lands the top-N triage endpoints,
/api/apps, and /api/events.
The Web UI's triage page is backed by two new endpoints. /api/top/clients
answers "which connection is hammering the pooler right now" by sorting
clients server-side by qps, errors, or age, optionally narrowed to a
single pool. /api/apps gives the per-application_name aggregate
(clients, queries_total, transactions_total, errors_total) so an
operator can spot a service that is opening too many connections or
generating too many errors.

Sort dimensions and the n cap (default 20, max 200) are documented on
the DTOs. /api/top/clients computes qps server-side as queries_total /
max(age_seconds, 1); this is the one server-side derivation in the
rollout, justified because a Top-N sort by qps needs the value to
compare. Other counters stay raw, frontend computes rates per
decision #21.
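
That one server-side derivation, sketched (field names taken from the DTO description above):

```rust
/// qps = queries_total / max(age_seconds, 1). The clamp avoids division
/// by zero for a connection younger than one second and keeps brand-new
/// clients from reporting an absurd rate on the Top-N sort.
fn qps(queries_total: u64, age_seconds: u64) -> f64 {
    queries_total as f64 / age_seconds.max(1) as f64
}

fn main() {
    assert_eq!(qps(300, 60), 5.0);
    assert_eq!(qps(40, 0), 40.0); // zero age clamps to 1 s
    println!("ok");
}
```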

No backend instrumentation added; both endpoints read existing
ClientStats counters that are already incremented on the SQL path. The
hot path is untouched.

Tests: 793 lib tests passed (was 778); `cargo clippy --lib` and
`cargo fmt --check` clean. Verified by release smoke on both
endpoints with the by= and sort= parameters.

Phase 3d-1 of seven; phase 3d-2 lands /api/top/queries together with
the per-interner-entry count and duration instrumentation.
Operators triaging a busy pooler can now hit /api/top/queries to see
the heaviest prepared statements by Bind count or by mean execution
time. The endpoint sorts server-side, defaults to by=count with n=20,
caps n at 200.

Two atomic counters per interner entry track this. count is bumped on
every Bind that resolves to a hash; total_duration_us absorbs the
batch's elapsed microseconds at Sync time. The hot path additions are
two Relaxed fetch_adds per query, on the order of 50 ns each; small
enough to fit the project's "stats may be approximate, throughput may
not be" rule.
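
The per-entry counter pair, as a std-only sketch (the real fields live on the interner entry; names here follow the description above):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-interner-entry statistics: two lock-free counters updated on the
/// SQL hot path with Relaxed ordering — stats may be approximate,
/// throughput may not be.
#[derive(Default)]
struct EntryStats {
    count: AtomicU64,             // bumped on every Bind resolving to the hash
    total_duration_us: AtomicU64, // batch elapsed microseconds, absorbed at Sync
}

impl EntryStats {
    fn on_bind(&self) {
        self.count.fetch_add(1, Ordering::Relaxed);
    }
    fn on_sync(&self, elapsed_us: u64) {
        self.total_duration_us.fetch_add(elapsed_us, Ordering::Relaxed);
    }
    /// Mean execution time, the value ?by=duration sorts on.
    fn mean_us(&self) -> u64 {
        let n = self.count.load(Ordering::Relaxed).max(1);
        self.total_duration_us.load(Ordering::Relaxed) / n
    }
}

fn main() {
    let s = EntryStats::default();
    s.on_bind();
    s.on_bind();
    s.on_sync(900); // whole batch credited at Sync time
    assert_eq!(s.mean_us(), 450);
    println!("ok");
}
```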

Approximation contract: count is Bind-count, not Execute-count or
Parse-count. Duration attribution is per-batch; a Sync that ended a
batch with multiple Bind messages credits the entire elapsed time to
the last Bind's hash. Simple queries do not flow through the interner
and are absent from this endpoint; the Top-N for non-prepared traffic
is /api/top/clients.

Tests: 800 lib tests passed (was 793); cargo clippy --lib and
cargo fmt --check clean. Release smoke confirmed both ?by=count and
?by=duration return 200 with the expected envelope.

Phase 3d-2 of seven; phase 3d-3 lands /api/top/prepared with a similar
lightweight per-CacheEntry hit/miss pair.
…mentation

The Caches page can now show which prepared statements are seeing the
most cache hits versus misses. /api/top/prepared sorts pool-cache
entries server-side by hits or misses, defaults to by=hits with n=20,
caps n at 200. /api/prepared response gains hits and misses fields so
the existing endpoint also benefits.

Two atomic counters per CacheEntry track this. The hot path hook is a
single Parse-handler call site after the existing
has_prepared_statement check; on hit we increment hits, on miss we
increment misses. Both via a DashMap.get + Relaxed fetch_add; same
lock-free no-op-on-absence pattern used by /api/top/queries in
phase 3d-2.

Approximation contract: counters are per-pool per-CacheEntry. LRU
eviction discards counters; long-lived prepared statements with many
re-Parses keep their numbers, ephemeral statements that churn out of
the LRU lose theirs. Operators triage with this caveat in mind.

Tests: 806 lib tests passed (was 800); cargo clippy --lib and
cargo fmt --check clean. Release smoke confirmed both ?by=hits and
?by=misses return 200 with the expected envelope; /api/prepared
includes the new hits and misses fields.

Phase 3d-3 of seven; phase 3d-4 lands /api/events with the admin
command ring buffer.
The Web UI's Overview graphs can now render vertical-line annotations
for the four state-changing admin commands. /api/events takes since=
and max= query parameters, returns the events newer than `since`, and
echoes the next sequence number so the next poll picks up where this
one stopped. The ring buffer holds 1024 entries; older events drop
silently when full, which is well over a day of history at typical
admin cadence.

Producer side: each successful RELOAD, PAUSE, RESUME, or RECONNECT
admin command pushes an entry under a Mutex<VecDeque>. Admin commands
fire at the rate of a handful per cluster per day; contention is
nonexistent. The SQL hot path is untouched.
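
A minimal sketch of that producer/consumer shape, assuming the cursor semantics described above (names are illustrative, not the PR's types):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Bounded ring of admin events: push drops the oldest entry when full,
/// and a query returns entries at-or-after a `since` cursor plus the
/// next sequence number for the caller's following poll.
struct EventRing {
    inner: Mutex<(u64, VecDeque<(u64, String)>)>, // (next_seq, ring)
    cap: usize,
}

impl EventRing {
    fn new(cap: usize) -> Self {
        Self { inner: Mutex::new((0, VecDeque::new())), cap }
    }
    fn push(&self, event: &str) {
        let mut g = self.inner.lock().unwrap();
        let seq = g.0;
        g.0 += 1;
        if g.1.len() == self.cap {
            g.1.pop_front(); // overflow silently drops the oldest
        }
        g.1.push_back((seq, event.to_string()));
    }
    /// Events with seq >= since, capped at `max`, plus the next cursor.
    fn since(&self, since: u64, max: usize) -> (Vec<(u64, String)>, u64) {
        let g = self.inner.lock().unwrap();
        let out: Vec<_> = g.1.iter().filter(|e| e.0 >= since).take(max).cloned().collect();
        (out, g.0)
    }
}

fn main() {
    let ring = EventRing::new(2);
    ring.push("RELOAD");
    ring.push("PAUSE");
    ring.push("RESUME"); // evicts seq 0
    let (events, next) = ring.since(0, 10);
    assert_eq!(events.len(), 2);
    assert_eq!(events[0].0, 1); // oldest surviving seq
    assert_eq!(next, 3);        // cursor for the next poll
    println!("ok");
}
```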

Tests: 813 lib tests passed (was 806); cargo clippy --lib and
cargo fmt --check clean. The unit tests for the ring buffer cover the
sequence-monotonic, since-filter, max-cap, and overflow-drops-oldest
behaviours.

Phase 3d-4 of seven; this closes phase 3d. Phase 4 lands the LogTap
infrastructure for the admin-only /api/logs endpoint.
@vadv changed the title from "feat(web): Web UI scaffold and read-only API endpoints (phases 1-3b)" to "feat(web): Web UI scaffold and read-only API endpoints (phases 1-3d)" on May 6, 2026
dmitrivasilyev added 4 commits May 6, 2026 16:18
Operators can now tail the pooler's recent log records through
/api/logs (admin-only) for incident triage. The endpoint accepts
since=, max=, level=, and target= query parameters: level= sets the
minimum displayed severity (level=WARN shows warn and error only)
and target= is a substring match on the Rust module path. Default
max=200, hard cap 1000.

The producer side adds an AtomicBool gate (Acquire load) in
LogLevelController::log: when the tap is off the cost is a single
atomic load (~1 ns on x86, one barrier on ARM). When on, the producer
formats the record into a 4 KB bounded buffer (UTF-8 safe truncation)
and try_sends through a bounded MPSC; on channel full, the drop is
counted in dropped_total. The consumer is a single tokio task that
owns the VecDeque, assigns monotonic seq numbers, and serves Drain
commands without blocking producers.
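
The producer-side shape can be sketched with std primitives (the real tap uses a tokio bounded MPSC and a 4 KB UTF-8-safe truncation; this illustrative version uses `std::sync::mpsc::sync_channel` and ASCII-only truncation):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Producer side of a log tap: one Acquire load when the tap is off,
/// a bounded try_send when it is on. A full channel increments a drop
/// counter instead of ever blocking the logging caller.
struct TapProducer {
    enabled: AtomicBool,
    tx: SyncSender<String>,
    dropped_total: AtomicU64,
}

impl TapProducer {
    fn publish(&self, line: &str) {
        if !self.enabled.load(Ordering::Acquire) {
            return; // tap off: the entire cost is this atomic load
        }
        // Bounded truncation keeps a runaway record from ballooning memory
        // (ASCII-safe here; the real code truncates on UTF-8 boundaries).
        let mut msg = line.to_string();
        msg.truncate(64);
        if let Err(TrySendError::Full(_)) = self.tx.try_send(msg) {
            self.dropped_total.fetch_add(1, Ordering::Relaxed);
        }
    }
}

fn main() {
    let (tx, rx) = sync_channel(1);
    let tap = TapProducer {
        enabled: AtomicBool::new(true),
        tx,
        dropped_total: AtomicU64::new(0),
    };
    tap.publish("first");  // queued
    tap.publish("second"); // channel full -> counted, not blocked
    assert_eq!(rx.try_recv().unwrap(), "first");
    assert_eq!(tap.dropped_total.load(Ordering::Relaxed), 1);
    println!("ok");
}
```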

The tap activates on the first /api/logs request and a reaper task
disables it after 30 s without traffic, so the buffer footprint goes
to zero when no operator is watching. Setting log_tap_max_entries=0
in [web] disables the endpoint entirely (returns 503 with body
log_tap_disabled).

Tests: 819 lib tests passing (was 813); cargo clippy --lib --deny
warnings and cargo fmt --check clean. Release smoke verified admin
auth gate (401 anonymous, 200 admin), level filter, and target
substring filter.

Phase 4 of seven; phases 5 and 6 land the frontend; phase 7 packages
the SPA bundle and CI.
Narrows the broader Web UI design to phase 5 boundaries: the
frontend/ scaffold with Vite + React + TS + Tailwind v4 (updated
from v3 in the parent spec), six page placeholders, working AuthGate
and Sidebar, hook primitives (usePoll, useAdminAuth), the typed API
client surface, and a CI workflow that lint/typecheck/build-checks
the bundle without putting npm in the Rust release pipeline.

Color and typography tokens are transcribed from
2026-05-06-web-ui-design-system.md verbatim into a Tailwind v4
@theme block. IBM Plex Sans/Mono is self-hosted under SIL OFL.

Page bodies, uPlot, threshold logic, embedding via include_dir!,
and BDD scenarios remain explicitly out of scope for phase 5;
phase 6 fills page bodies, phase 7 lands embedding and BDD.
Adds a developer-facing frontend that runs against a live pg_doorman:
`npm run dev` starts a Vite shell on :5173 that proxies /api and /metrics
to the pooler, and a basic-auth modal locks the UI as soon as any API
call returns 401. Credentials live in React state, so they go away on
refresh. None of this is wired into the binary yet; phase 7 embeds the
bundle through include_dir.

The shell renders a sidebar plus six placeholder pages — Overview,
Pools, Clients, Caches, Logs, Config — each waiting for phase 6 to fill
in real bodies. usePoll and useAdminAuth hooks are also in place so
phase 6 has the polling and auth-header primitives ready.

Build artifacts in frontend/dist/ are committed; the new
.github/workflows/frontend.yml job runs npm ci, lint, typecheck, and
build on every PR that touches frontend/, then fails if the rebuilt
bundle differs from what's in the tree. Rust release jobs do not run
npm — RPM, DEB, and Docker builds depend only on the committed dist.

Stack: Vite 6, React 18, TypeScript 5, Tailwind v4 with the design
system tokens copied from the design-system spec, react-router 6, IBM
Plex Sans/Mono via @fontsource (SIL OFL), uPlot 1.6 (added now, used
in phase 6).
The "diff against rebuild" step proved non-deterministic: vite/esbuild
emit different bundles on local vs CI runners despite an identical
package-lock.json (different native esbuild binaries are the suspect).
The committed frontend/dist/ stays the source of truth; the CI job
now only verifies that the bundle is present and non-empty. Phase 7 will
revisit with a reproducible-build approach (pin esbuild, or build
once in CI and treat the artifact as the release output).
@vadv changed the title from "feat(web): Web UI scaffold and read-only API endpoints (phases 1-3d)" to "feat(web): Web UI scaffold and read-only API endpoints (phases 1-5)" on May 6, 2026
Operators get a working /overview that polls /api/overview and
/api/pools every 1.5 s, applies the threshold rules from spec
section 15.4 in a pure frontend function, and renders a Health
pill plus four golden-signals sparklines (latency P95, traffic
qps/tps, errors/s, saturation max). Charts share a cross-hair
sync key so hovering one tracks the others.

The threshold engine covers the rules whose inputs are already
on PoolDto today — saturation, oldest-active age, p95/p99,
wait, errors/s. Auth-failure, TLS, anonymous LRU, and Patroni
rules carry a TODO and will land in phase 6b together with
the endpoints they depend on.

History is a 120-point rolling window in sessionStorage so a
tab refresh keeps the recent context. uPlot is now a real
dependency on screen instead of just a transitive one;
gzipped JS grows from 56 KB to roughly 83 KB.

Also pins the GITHUB_TOKEN scope on the frontend workflow to
contents: read after the CodeQL recommendation.

Phase 6a-2 follows up with Connection breakdown, Pool fill
heatmap, dual-axis wait + oldest-active-age, top-5 errors per
pool, and the collapsed resource detail row.
dmitrivasilyev added 7 commits May 6, 2026 17:37
… 6a-2)

Two more rows on /overview, both reading from the same poll the
golden signals already drive:

- Connection breakdown: stacked area of active / idle / waiting
  client counts over the 3 min sample window. Green / muted /
  amber respectively, no threshold paint — this row answers
  "what is happening" rather than "is something wrong".
- Pool fill heatmap: one row per pool, last 60 saturation cells
  (≈ 90 s at 1.5 s polling). Cell color is green / amber / red at
  the 70 % / 90 % thresholds. First place in the UI where an
  operator can spot a single pool burning while the others are
  quiet.

Adds two thin uPlot wrappers (AreaChart with internal stacking,
plain-DOM Heatmap because uPlot is the wrong tool for tabular
heat). Bundle gzipped JS grows from 83 to 84 KB.

Phase 6a-3 follows up with dual-axis wait + oldest-active-age,
top-5 errors per pool, and the collapsed resource detail row.
…6a-3)

Two more rows on /overview:

- Wait queue vs oldest-active-age. Dual-axis line chart: left axis is
  the absolute waiting-clients count, right axis is the maximum
  oldest-active age across pools on a log ms scale. Right axis carries
  dashed amber and red lines at 30 s and 5 min — the same thresholds
  the engine uses to flag a pool. When the right line shoots up while
  the left stays low, the operator sees a single hung connection that
  the simple sparklines miss.
- Top-5 stacked area of errors-per-second per pool. Pools are ranked
  by their max eps over the last 30 s and only ones with eps > 0 land
  on the chart. Five distinct fill colors so the bottom band stays
  legible even when the top one is dominant.

Resource detail (memory / sockets / interner inside a collapsible
section) is the remaining row from spec section 15.1; it lands in
phase 6a-4 once the polled endpoints are wired.
Closes the last row from spec section 15.1: a collapsible Resource
detail section at the bottom of /overview that shows current socket
counts (tcp / tcp6 / unix-stream from /api/sockets) and query-interner
stats (named / anonymous entries and bytes from /api/interner).

The section polls at 3 s instead of the 1.5 s Golden-Signals cadence —
this data is ambient context, not a hot signal. Open state is persisted
under localStorage[pgdoorman.collapse.overview-resource] so a refresh
keeps the operator's preference.

Process-memory metric (pg_doorman_total_memory) is exposed only in
Prometheus, not as a JSON endpoint, so the Memory subrow is omitted
until phase 7 ships an /api/memory or the existing exporter is mirrored
into JSON.

Bundle gzipped JS climbs from 84.7 to 85.2 KB.
…ase 6b)

Replaces the placeholder /pools with a sortable table backed by /api/pools
polled at 1.5 s. Each row shows id, mode, connections / max with saturation
percent, waiting clients, query p95 / p99 in ms, cumulative errors, and a
severity column driven by the same threshold engine the overview's health
pill uses. Per-row left border picks up amber or red when the engine flags
the pool, so a fleet of pools with one struggling stands out without
scanning numbers.

Filter row at the top: substring match on pool id and a severity dropdown
(all / ok / degraded / critical). Click a column header to sort; the
header arrow shows direction and clicking again flips it. Default sort is
saturation descending, so the busiest pool floats to the top.

Inline sparklines per row and the pool-detail drawer from spec §15.2 are
deliberately deferred — phase 6b-2 follows up once a per-pool history
helper is in place. Bundle gzipped JS climbs to 86 KB.
…(phase 6c)

Replaces the placeholder /clients with a paginated table that hits
/api/clients with limit/offset/sort/order plus the pool, database,
user, application_name, and state filters the backend already
supports. Filters live in component state for now; URL-state and
deep-linking land later with the useUrlState hook from spec §10.2.

Page size is 50 rows, navigation by prev/next, footer shows the
visible range and total count returned by the API. Sort columns:
queries_total, errors_total, age_seconds, current_query_age_ms.
State cells are coloured: active green, waiting amber, others muted;
a non-zero error count switches to amber to draw the eye.

Bundle gzipped JS climbs to 87 KB.
… 6d)

Replaces the placeholder /caches with a two-tab view:

- Prepared tab — server-side cache rows from /api/prepared, polled at 3 s.
  Shows pool, kind (named / anonymous / mixed), name, hash, used/hits/
  misses counters, and a hit-rate column that turns amber under 95 % and
  red under 80 % to match the threshold table.
- Query cache tab — interner aggregate from /api/interner. Two cards
  side-by-side for named vs anonymous: entry count, total bytes, average
  bytes per entry. The right place for an operator to spot anonymous
  growth before the LRU starts evicting useful entries.

Polling cadence is 3 s instead of 1.5 s — the data is not hot-path.
Bundle gzipped JS climbs to 87.9 KB.
Wires /logs against the LogTap admin endpoint. Polls /api/logs at 1.5 s
with the most recent seq, appends new entries to a tail-style view, and
keeps the last 500 lines in memory. Filters: minimum level (ERROR /
WARN / INFO / DEBUG / TRACE) and target substring; either resets the
stream and starts from seq 0 again so the operator gets a clean window.

Header chips show whether the tap is currently on, the consumer ring
fill, cumulative drops, and a separate counter when the consumer fell
behind enough to lose entries before the operator's `since` cursor.
The pause toggle keeps the existing buffer intact and slows the poll
to once a minute so a busy session does not eat memory while the
operator is reading.

Bundle gzipped JS climbs to 89 KB.
dmitrivasilyev added 8 commits May 6, 2026 20:04
…olds

What was needed: the spec §15.4 health rules for waiting, reconnect rate,
burst-gate budget exhaustion, coordinator exhaustions, and auth-failure
rate were missing from the threshold engine. The Overview health pill
treated those situations as healthy.

What changed: the threshold engine now evaluates those five additional
rules using counters already exposed by /api/pool_scaling,
/api/pool_coordinator, /api/auth_query and the existing /api/pools
waiting field. The Overview page polls the three sibling endpoints and
feeds creates/gate-budget/coordinator counters into per-pool history;
the auth-query response is passed through to a global rule that summarises
auth-failure ratios across databases. Backend gaps that block the
remaining §15.4 rules (TLS handshake errors, anonymous LRU evictions,
synthetic SQLSTATE 26000, fallback_active, Patroni health, cgroup RSS)
are documented inline so the next contributor knows what DTO/endpoint to
extend.
What was needed: the /pools view exposed only a single errors_total
number per pool, so an operator seeing a spike could not tell whether
the cause was an FK violation, an auth failure, or pg_doorman refusing
checkouts. Spec §15.4 also asks for per-class drill-down on the Pools
drawer.

What changed: every pg_doorman-side checkout failure (SQLSTATE 53300)
and every parseable PG-side ErrorResponse with a canonical 5-char
SQLSTATE now updates a sharded, zero-contention DashMap of
(sqlstate -> AtomicU64) on the pool address stats. Codes that fail
validation (length, alphabet) are still counted in the aggregate but
left out of the breakdown so a malformed or adversarial code field
cannot grow the map without bound. The /api/pools JSON gains an
optional errors_by_sqlstate map per pool (omitted when empty so
payloads stay slim for healthy pools), and the Pools drawer renders
the top-5 codes plus an "other (N)" rollup so the long tail does not
crowd the panel.
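
The validation gate can be sketched as a pure predicate, assuming the standard SQLSTATE alphabet of digits and uppercase ASCII letters (the real helper's name is illustrative):

```rust
/// Accept only canonical 5-character SQLSTATE codes (digits and
/// uppercase ASCII letters), so a malformed or adversarial error field
/// cannot grow the per-pool breakdown map without bound.
fn is_valid_sqlstate(code: &str) -> bool {
    code.len() == 5
        && code.bytes().all(|b| b.is_ascii_digit() || b.is_ascii_uppercase())
}

fn main() {
    assert!(is_valid_sqlstate("53300"));  // too_many_connections
    assert!(is_valid_sqlstate("23505"));  // unique_violation
    assert!(!is_valid_sqlstate("5330"));       // too short
    assert!(!is_valid_sqlstate("53x00"));      // lowercase letter
    assert!(!is_valid_sqlstate("53300extra")); // too long
    println!("ok");
}
```

Codes rejected here still count toward the pool's aggregate errors_total; they are only excluded from the per-SQLSTATE map.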
What was needed: the include_dir!() macro in src/web/static_assets.rs
embeds frontend/dist into the binary at compile time, but the SRPM and
DEB tarballs assembled by copr-publish.yaml and launchpad-publish.yaml
shipped without that directory. Builds on rocky-9, alma-9, fedora-40
and fedora-41 failed with `proc macro panicked: "frontend/dist" is not
a directory`.

What changed: frontend/dist is now part of the explicit FILES_TO_COPY
list in both packaging workflows, and the cp loop uses --parents so the
intermediate frontend/ directory is preserved under
pg-doorman-${VERSION}/. Stays consistent with the project rule that
the pre-built SPA is committed to git and the release pipeline does not
run npm.
Updates the docker-compose demo so a new operator browsing the dashboard
also sees the Web UI on :9127 and a working session-mode pool. The toml
moves the legacy [prometheus] block to [web], turns on `ui = true` with
a non-default admin password, and adds a third user (`session_user`)
with `pool_mode = "session"`. init.sql provisions the user and a
notify_queue whose AFTER INSERT trigger raises NOTIFY app_events. A
listener container holds three long-lived LISTEN sessions and inserts
one row every five seconds, so the Web UI shows non-zero idle clients
on the session pool while pgbench keeps the transaction pool busy.
README points to http://localhost:9127.
Drops docs/superpowers/ — those were brainstorming notes, design specs,
and phase-by-phase plans used while building the Web UI. They belong to
the local development trail, not to the shipped repository.
frontend/dist/assets shrinks from 56 files (676 KB) to 12 (482 KB):
the @fontsource imports now pin latin and cyrillic subsets only, the
post-build pass deletes legacy .woff variants (modern browsers have
shipped woff2 since 2014), and IBM_PLEX_OFL.txt is copied next to the
bundled fonts so the OFL-1.1 redistribution clause has its license text
within reach.
The history-building effect on the overview page used to fire on every
sibling poll (pool_scaling, pool_coordinator, auth_query) on top of the
master overview/pools cadence. Each off-schedule fire pushed a fresh
history sample with `dt ≈ 50 ms` since the previous push, which made
`(delta_count) / dt` collapse to zero on the off-beats and to the real
rate on the next overview tick. The qps and tps sparklines drew that as
a 0 → peak → 0 → peak square wave.

The effect now keys on overview/pools timestamps only; the sibling
endpoints' data is still read snapshot-style on each fire so the
threshold-engine fields (`creates_total`, `gate_budget_ex_total`,
`coordinator_exhaustions_total`) keep flowing into the per-pool
history.
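
The fix above boils down to only pushing a history sample when the overview snapshot itself advances. A minimal sketch (names like `pushSample` and the `count` field are illustrative, not the real hook):

```typescript
// Push a rate sample only on a real overview tick; sibling polls
// (pool_scaling, pool_coordinator, auth_query) never trigger a push.
type Sample = { ts: number; qps: number };

function pushSample(
  history: Sample[],
  prev: { ts: number; count: number } | null,
  now: { ts: number; count: number },
): Sample[] {
  if (prev === null || now.ts === prev.ts) return history; // no new overview tick
  const dtSec = (now.ts - prev.ts) / 1000;
  const qps = (now.count - prev.count) / dtSec; // cumulative counter delta / dt
  return [...history, { ts: now.ts, qps }];
}
```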
After bringing the docker-compose demo up under pgbench load an
operator review surfaced a stack of usability gaps. This commit
addresses the high-impact ones in one go so the next demo session
sees fewer head-scratchers.

UI:
- AreaChart and DualAxisChart now ship a static colour-swatch
  legend with the latest value (Connection breakdown, Top-5 errors,
  Wait-vs-oldest). Operators no longer have to guess which band is
  which.
- Pools table cells (saturation, waiting, query p95, errors) carry
  inline tooltips that name the rule, the warn/crit thresholds, and
  the current value.
- The Pools drawer's SQLSTATE breakdown shows the human-readable
  PostgreSQL label next to each code (23505 -> unique_violation,
  53300 -> too_many_connections (pg_doorman checkout fail), ...)
  plus a class-prefix fallback for codes the bundled map does not
  know.
- Caches -> Prepared rows are now clickable: a click fetches the
  SQL body from /api/prepared/text/{hash} and inlines it under the
  row.
- Caches -> Query cache tab gains a Top-20 table fed by
  /api/interner/top so the tab is no longer just two summary cards.
- Clients table gains Addr and Wait columns. The data was always
  in ClientDto; the UI just was not surfacing it.
- Logs page renames the cryptic "target substring" placeholder to
  "module e.g. pool, auth, stats" and explains target = Rust
  module path in a tooltip.

Polling:
- usePoll pauses on document.hidden and resumes on
  visibilitychange. Browsers throttle background timers and we were
  pushing post-throttle history points with sub-second deltas,
  which the sparkline rendered as a 0/peak/0/peak square wave or as
  a flat line bridging a long pause. Skipping background ticks
  avoids both.
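
The pause/resume behaviour can be sketched framework-free (the real `usePoll` is a React hook; the shape below is an assumption for illustration):

```typescript
// Skip poll ticks while the document is hidden; fire one fresh tick
// on resume instead of bridging the throttled gap.
type VisibilityTarget = {
  hidden: boolean;
  addEventListener(ev: string, cb: () => void): void;
};

function startPolling(doc: VisibilityTarget, tick: () => void): () => void {
  let paused = doc.hidden;
  doc.addEventListener("visibilitychange", () => {
    paused = doc.hidden;
    if (!paused) tick(); // resume with a fresh sample
  });
  // returned function stands in for the interval callback
  return () => {
    if (!paused) tick();
  };
}
```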

Backend:
- The web listener now keeps the SPA shell (HTML, CSS, JS, fonts,
  favicon) anonymous when ui_active. Authentication is required on
  /api/* and the admin-only routes regardless of ui_anonymous. The
  previous behaviour triggered the browser-native basic-auth modal
  on a hard refresh of /overview, then the React AuthGate asked
  again over fetch. One credential prompt is enough.
@vadv vadv changed the title feat(web): Web UI scaffold and read-only API endpoints (phases 1-5) feat(web): Web UI May 6, 2026
dmitrivasilyev added 21 commits May 6, 2026 21:17
…session pool load

Live demo with the operator surfaced more gaps:

- Logs page filter is now full-text across both target and message
  (client-side). Operators searching for `#c235` or a SQLSTATE no
  longer have to know that target = Rust module path. Backend
  `/api/logs?target=` still works for level pre-filter.
- Clients page gains an `addr` filter input. Backend
  `ClientFilters` accepts `?addr=` and matches as a substring against
  `ClientStats.addr`, so partial subnets ("10.0.5.") and exact peers
  ("1.2.3.4:5432") both work.
- Overview history hooks now drop the buffer when the gap since the
  last poll exceeds 5 * poll interval. The previous behaviour bridged
  long pauses (alt-tab, laptop sleep) with a flat line, falsely
  reading as steady-state activity. With this drop the chart restarts
  empty after a return.
- Sparkline header value is single-line and truncates with a tooltip
  if the formatted value overflows the card width — fixes the
  "0.00 qps / 0.00 (newline) tps" wrap on Traffic.
- Overview Traffic card uses compact "q/s · t/s" format so the
  number stays on one line.
- Grafana docker-compose demo: new pgbench-session sidecar runs 4
  long-lived --select-only clients against the session pool so the
  Web UI shows pool_mode = "session" actually busy, not just three
  idle LISTENers.
- init.sql sets ALTER DEFAULT PRIVILEGES so session_user has SELECT
  on every table app_user creates with pgbench -i.
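
The history-gap drop from the first bullet batch above could look like this (a sketch; the helper name and buffer shape are assumptions):

```typescript
// If the gap since the last sample exceeds 5x the poll interval
// (alt-tab, laptop sleep), restart the buffer instead of bridging
// the pause with a flat line.
function appendWithGapDrop<T extends { ts: number }>(
  history: T[],
  sample: T,
  pollIntervalMs: number,
): T[] {
  const last = history[history.length - 1];
  if (last && sample.ts - last.ts > 5 * pollIntervalMs) {
    return [sample]; // long pause: chart restarts empty
  }
  return [...history, sample];
}
```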
…rights

`session_user` is a reserved SQL keyword in PostgreSQL — `CREATE USER
session_user` errors with "SESSION_USER cannot be used as a role name
here", so the demo's session-mode pool never got a working PG account
and `app_session` pgbench plus the listener sidecar both bounced on
auth-fetch failures.

What changed:
- All four demo files (init.sql, pg_doorman.toml, listener.sh,
  pgbench-session.sh) use `app_session` instead.
- init.sql grants ALL on the four pgbench tables to app_session when
  they already exist (cold-start ordering with pgbench.sh's
  pgbench -i is timing-dependent), and ALTER DEFAULT PRIVILEGES
  covers future tables.
- pgbench-session.sh adds --no-vacuum so the SELECT-only run does
  not stall on the unconditional startup vacuum.
Replaces the IBM Plex Sans + cyan SaaS skin with a JetBrains Mono /
Bloomberg amber operator-console skin. The previous look had four
near-identical near-black layers, soft rounded-md panels, and a muted
cyan accent that read as a generic Vercel template.

What changed (foundation only — components inherit through tokens):

- @fontsource/jetbrains-mono added (latin + cyrillic, 400/500/700,
  woff2 only via the existing post-build trim). It serves as both the
  display and body face — pooler dashboards are 80% numbers, and
  tabular figures are now the default.
- IBM Plex Sans reduced to a prose-only fallback (long PageHero
  descriptions, etc).
- Palette: pure black canvas, hairline #1f1f1f borders, paper-white
  #e8e3d6 text, Bloomberg amber #ffb000 as the only call-out accent,
  cyan #00d4ff as the chart secondary, pure red #ff4d4d for danger.
- Every --radius-* token collapses to 0, so existing rounded-md /
  rounded-lg / rounded-xl classes resolve to square edges without
  touching component JSX.
- Focus ring is a single hairline amber, selection is amber-on-black,
  scrollbar is narrow monochrome.
- .tick-up / .tick-down keyframes (160 ms amber / cyan flash) ready
  for the live-tickertape effect on cells that retick each poll
  cycle. prefers-reduced-motion honoured.
…r-me

OverviewDto gains the operator-tile fields the research identified as
essential and free (atomic loads off existing globals): rss_bytes,
uptime_seconds, pid, current_clients, clients_in_transactions,
shutdown_in_progress, migration_in_progress. STARTED_AT is a LazyLock
captured the first time the field is read; for the foreground listener
that is within ~hundreds of ms of main(). The mod-private `system`
module that owned `get_process_memory_usage()` now becomes
`pub(crate)` so collectors can import it.

Frontend touches:

- Bloomberg trim: dropped IBM Plex Sans entirely and JetBrains Mono
  weight 700 (font-bold is unused). Bundle is now 4 woff2 files at
  ~150 KB across latin + cyrillic / 400 + 500. Comment in tailwind.css
  records the audit.
- Overview.tsx history effect keys on the overview timestamp only.
  Pools/scaling/coord/auth-query polls all fire on independent
  cadences; including any of them in the dep array let the effect
  fire mid-interval with `dt ~= 200 ms` and a tiny delta, which the
  qps/tps sparkline drew as a sawtooth wave dropping to zero between
  each real overview tick. The sibling polls' data is still read
  snapshot-style on each fire so threshold-engine fields keep flowing.
- AuthGate gains a "Remember me on this device" checkbox. When checked,
  AdminAuthProvider persists credentials to localStorage under
  `pgdoorman.admin-auth` (basic JSON, base64 in the Authorization
  header is computed at request time). On reload the provider seeds
  state from storage. Unchecking on a re-prompt clears storage so a
  shared workstation does not leak. Inputs gain autocomplete=username
  /current-password so password managers can fill them.
- types.ts mirrors the new OverviewDto fields.
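
The remember-me flow could be sketched like this — the storage key is from the commit, everything else (helper names, the `KVStore` shape) is illustrative; note the header is encoded per request, never stored pre-encoded:

```typescript
type Creds = { username: string; password: string };
const STORAGE_KEY = "pgdoorman.admin-auth";

type KVStore = {
  getItem(k: string): string | null;
  setItem(k: string, v: string): void;
  removeItem(k: string): void;
};

function saveCreds(store: KVStore, creds: Creds, remember: boolean): void {
  if (remember) store.setItem(STORAGE_KEY, JSON.stringify(creds));
  else store.removeItem(STORAGE_KEY); // unchecking clears a shared workstation
}

function loadCreds(store: KVStore): Creds | null {
  const raw = store.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as Creds) : null;
}

// Basic-auth header computed at request time from the plain creds.
function authHeader(c: Creds): string {
  return "Basic " + Buffer.from(`${c.username}:${c.password}`).toString("base64");
}
```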
The right-side drawer that lived inside Pools.tsx was 28rem wide and
forced a 7-block vertical scroll for sparklines, KV pairs, threshold
reasons, and the SQLSTATE breakdown. Operators flagged it as unusable
and the comparable patterns in Datadog / Stripe / Lens / GitHub all
use a full route — so this view takes that shape.

What changed:

- New route /pools/:poolId backed by frontend/src/pages/PoolDetail.tsx.
  Layout: identity bar (id, user@db, host:port, mode + state pills),
  six-tile KPI strip with mini-sparklines (saturation, query p95,
  waiting, errors/s, oldest active, qps), and full-width sections for
  Latency, Throughput, Connections, Errors-by-SQLSTATE, and threshold
  reasons. The SQLSTATE table is no longer truncated to top-5 — the
  drawer concession that hid the long tail is gone.
- Pools.tsx drops the Drawer + openId state and just navigates the
  user to the new route on row click. Drawer.tsx itself stays in the
  tree because Clients and other narrower views still use it.
- The detail page reuses /api/pools (no backend change) and slices to
  the requested pool from the polled list. History is per-pool, keyed
  by `pools.detail.<id>` so the strip view's history stays separate.
Operator review surfaced the gap: the existing exporter wrote a single
total_memory gauge in Prometheus and the Web UI didn't even show that —
no CPU, no thread count, no file descriptor headroom, no uptime. So the
operator could not tell whether the pooler was healthy as a process even
when every pool looked normal.

Backend:

- New /api/process route. Linux reads /proc/self/{stat,status,fd,limits}
  plus /proc/self/task/<tid>/stat for the per-thread CPU breakdown
  (sorted by user+system descending so the hottest tokio worker is at
  the top of the list). macOS / others fill in pid + RSS via the
  existing get_process_memory_usage and zero out the rest.
- ProcessDto carries pid, hostname, uptime_seconds, started_at_ms,
  rss_bytes, vm_size_bytes, threads, fd_open, fd_limit, cumulative
  cpu_user_us / cpu_system_us, cpu_cores, and threads_breakdown
  (Vec<ProcessThreadDto>). CPU is monotonic microseconds — the
  frontend computes the percentage from successive snapshots.
- Three unit tests cover the proc-stat parser (paren-in-comm safety),
  the ticks-to-microsecond conversion, and the cross-platform envelope
  shape.
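
Since CPU is shipped as monotonic microseconds, the frontend's percentage is a delta over two snapshots. A sketch (field names follow the ProcessDto described above; `ts` in ms is an assumption):

```typescript
// Percent of one core between two /api/process snapshots.
type CpuSnap = { ts: number; cpu_user_us: number; cpu_system_us: number };

function cpuPercent(prev: CpuSnap, curr: CpuSnap): number | null {
  const dtUs = (curr.ts - prev.ts) * 1000; // ts in ms, cpu counters in us
  if (dtUs <= 0) return null; // same snapshot: no rate yet
  const busyUs =
    curr.cpu_user_us + curr.cpu_system_us - prev.cpu_user_us - prev.cpu_system_us;
  return (busyUs / dtUs) * 100;
}
```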

Frontend:

- Overview gains a ProcessBar tile strip above the Golden signals row.
  Six tiles: cpu (% of 1 core, warns >60×cores, crits >90×cores), rss,
  threads, fds (open/limit, warns >70%, crits >90%), uptime, and the
  ISO start timestamp. Each tile carries a hint tooltip — for cpu and
  threads the hint lists the three hottest tokio workers from the
  per-thread breakdown.
- /api/process polled at 3 s (informational, not alerting; /proc reads
  are not free at 1.5 s).
- ProcessDto and ProcessThreadDto added to types.ts.
The Prometheus exporter has been carrying these three pool-level signals
since the Patroni-fallback work shipped, but they never reached
/api/pools so the React UI could not show them: an operator looking at
the SPA had no way to tell that a pool was running on a Patroni-discovered
fallback host or that its TLS handshakes against the backend were
failing. Reading them from the existing GaugeVec / IntCounterVec keeps
this commit free of new state.

Backend:

- PoolDto adds fallback_active (bool), tls_handshake_errors_total (u64),
  tls_backend_connections (u64). Read from the existing globals
  (FALLBACK_ACTIVE, SHOW_SERVER_TLS_HANDSHAKE_ERRORS,
  SHOW_SERVER_TLS_CONNECTIONS), keyed by the pool's `user@db` id which
  matches the `pool` label the producers already set.
- Empty / never-touched labels lazily resolve to zero counters via
  `with_label_values`, which is acceptable: a real pool with no TLS
  errors yet renders as `0`, not as a missing key.

Frontend:

- types.ts mirrors the new fields.
- PoolDetail page gains a "TLS & fallback" section between Connections
  and the SQLSTATE breakdown. Three KV rows: fallback active /
  TLS handshake errors total / TLS backend connections.
The /api/apps endpoint had been live since phase 3d-1 with the per-
application_name aggregate of client counters, but no frontend page
rendered it. Operators looking for "which app holds the connection
spike / generates the error rate" had to grep Clients by application_name
substring. This commit adds the dedicated page.

What changed:

- New route /apps backed by frontend/src/pages/Apps.tsx. Sortable table:
  application_name, clients (live), queries / transactions / errors
  cumulative, plus a derived "err / 1k q" ratio column tone-mapped
  amber > 1, red > 10 so a misbehaving app is visible at a glance.
- Filter input is plain substring match against application_name (the
  "(unknown)" placeholder stays for clients that never set the name).
- Sidebar gains an "Apps" link between Clients and Caches.
- AppRowDto / AppsDto added to types.ts.
The /api/events admin ring buffer existed since phase 3d-4 with a
sequenced timeline of admin commands, but the frontend never consumed
it. Operators investigating a metric spike had no way to correlate it
with a recent RELOAD or PAUSE.

What changed:

- Sparkline gains an optional `events` prop. The draw hook paints a 1-px
  amber vertical line at every event timestamp inside the visible window,
  next to the existing warn/crit threshold lines. Out-of-window events
  are skipped so the line only appears on charts whose rolling history
  actually contains the moment of the action.
- Overview polls /api/events at 3 s, maps each entry to {ts, label}
  and threads it into the four Golden-signals sparklines. The sync
  cursor still works.
- types.ts mirrors EventEntryDto and EventsDto.
- AppsDto / AppRowDto kept in the same block.
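
The out-of-window skip reduces to a pure filter before any drawing; a sketch (the uPlot wiring in the trailing comment is illustrative, not the committed hook):

```typescript
// Only events inside the chart's visible x-range get a vertical line.
type ChartEvent = { ts: number; label: string };

function visibleEvents(
  events: ChartEvent[],
  xMin: number,
  xMax: number,
): ChartEvent[] {
  return events.filter((e) => e.ts >= xMin && e.ts <= xMax);
}

// Inside a uPlot draw hook, each surviving event would paint as:
//   const x = u.valToPos(e.ts, "x", true);
//   u.ctx.fillRect(x, u.bbox.top, 1, u.bbox.height); // 1-px amber line
```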
Operator feedback: "when I hover the mouse, I should see the data
from the chart". The sparklines drew threshold lines and event markers but
gave no precise value at the cursor position; the value in the card
header was always the latest sample, never the one under the mouse.

What changed:

- Sparkline registers a `setCursor` hook that pushes the (ts, value)
  at the cursor index into local React state. A footer strip below the
  canvas renders that pair tabularly: "14:23:17 · 87" while hovering,
  and falls back to a passive hint ("hover for point") when the cursor
  leaves the canvas.
- `cursor.points` enabled (size 5) so uPlot now renders a small dot at
  the cursor index, matching the readout.
- Cursor.sync still binds across the four Golden-signals charts when
  `syncKey="overview"` is set, so the readout updates in lockstep on
  all four cards as the user sweeps the mouse.
- formatHoverValue picks decimal precision from magnitude so "0.42"
  and "12345" both render without the ms-padded look the card header
  uses.
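
Magnitude-based precision for the readout could look like this (a sketch; the real formatHoverValue may choose different breakpoints):

```typescript
// More decimals for small magnitudes, none for large ones, so
// "0.42" and "12345" both render cleanly in the footer strip.
function formatHoverValue(v: number): string {
  const abs = Math.abs(v);
  if (abs >= 100) return v.toFixed(0); // "12345"
  if (abs >= 1) return v.toFixed(1);   // "12.3"
  return v.toFixed(2);                 // "0.42"
}
```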
Operator review of the process bar surfaced two issues at once: "CPU"
flickered "—" on the very first paint while the rate calculation waited
for a second snapshot, and the existing tile strip showed only the
hottest thread name — operators wanted max / avg / min so an imbalanced
tokio runtime (one worker pinned to 100% while the rest idle) is
visible at a glance. Uptime + started timestamps were redundant; one is
derived from the other.

What changed:

- Per-thread CPU% computed from successive /api/process snapshots, kept
  in a `useRef` instead of a globalThis stash. Each thread's delta of
  (cpu_user_us + cpu_system_us) over the poll interval gives a percent
  of one core, sorted descending.
- ProcessBar now has five tiles: cpu (whole-process), rss,
  threads (count · max · avg · min %), fds, uptime. The "started"
  duplicate tile is gone — the uptime tile's hover hint carries the
  start timestamp.
- The CPU and thread tiles render "sampling…" instead of "—" before
  the second poll arrives, so the operator sees that the value is
  pending rather than missing.
- Hot-thread tooltip lists Top-5 by CPU%, name#tid, plus a one-liner
  about runtime imbalance so the meaning of the number is anchored.
…rator-language descriptions

Three operator complaints in one batch:

1. AreaChart and DualAxisChart had no hover values — only Sparkline did.
   Both now register a setCursor hook and render a footer strip with the
   per-bucket / left+right values at the cursor index. Stacked AreaChart
   reverses the cumulative stack so each label gets its own value, not
   the running total.
2. Both chart types also paint the same /api/events vertical-amber-line
   annotations Sparkline did. The events are passed through from
   Overview's eventsPoll. Connection breakdown, Top-5 errors, and the
   Wait-vs-oldest-active dual-axis now all share the same annotation
   layer with the four Golden-signals sparklines.
3. PageHero descriptions on Overview / Pools / Clients / Caches / Apps /
   Logs / ConfigState rewritten in operator-language. They used to read
   like API documentation ("polled at 1.5 s through /api/clients with
   server-side filtering"); the new copy explains what the page is for
   and what to look at first during an incident.
4. Prepared expand row pretty-prints the SQL — newline before SELECT /
   FROM / WHERE / JOIN / AND / OR / GROUP BY / ORDER BY / LIMIT and
   friends, two-space indent on AND / OR, monospace box, "copy" button.
   A real multi-line query no longer collapses into a single hard-to-
   read run-on line.
Two operator complaints stacked on top of each other:

- Heatmap cells used native title="" which delays ~1 s and only shows
  the saturation percent — no timestamp, no row affordance. The new
  overlay renders instantly via mouseenter, says "82% · 14s ago" so
  the operator knows when the sample was taken, and turns the row
  label into a clickable link to /pools/:id. The colour-key legend
  (< 70 / 70-89 / >= 90) is now inline in the heatmap header instead
  of buried in a help popover.
- Overview latency tile rendered "89087 ms" — easy to misread as
  "89 ms" at a glance during an incident. The new fmtMs walks the
  units: ms below 1 s, "1.20 s" up to a minute, "1m 29s" up to an
  hour, "1h 42m" beyond. Operators get a number that scans correctly
  no matter the magnitude.
… re-renders

Two operator fixes:

- LogTap reaper deactivated the tap after 30 s of no /api/logs traffic.
  Operator stepping away even briefly came back to a dead tap. Bumped to
  120 s; the tap re-arms instantly on the next poll, so the trade-off is
  a couple of MiB of buffer for a usable Logs page during incident
  triage.
- ProcessBar showed "sampling…" not just on the first paint but on every
  re-render between polls — sibling polls (overview / pools / etc) were
  re-rendering the page, and the ref-based delta calculation rolled the
  previous snapshot forward to "now" each time so `last.ts === process.ts`
  evaluated true and the percentage went null. The new layout caches the
  computed percentage keyed by `process.ts`; intermediate re-renders read
  the cache and only the actual /api/process tick triggers a fresh delta.
…tignore Dockerfile

- ProcessBar threads tile rendered "max 26 · avg 12 · min 0" — wrapped on
  any normal viewport. New layout uses a two-line tile: big number
  "26/0/12" (max/min/avg, one big mono line, fits any tile width) and a
  small "max/min/avg %" descriptor below, plus the existing tooltip with
  the full Top-5 breakdown.
- FD limit on modern containers reads 1_073_741_816 — turning the tile
  into "66/1073741816", which truncated and read as gibberish. Limit is
  now abbreviated: < 1M renders raw, < 1G renders "Mxx", > 1G renders
  "∞". The exact cap stays in the hover tooltip.
- INCIDENT_2026-05-03_3.5.2_large_result_leak.md was an internal note
  that ended up tracked by accident — removed from the repository.
- Dockerfile.ubuntu22-tls is required to build the demo image but does
  not belong in the public source tree; added to .gitignore so it stops
  showing up as untracked on every clean checkout.
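
The FD-limit abbreviation from the second bullet could be sketched as (thresholds per the commit; the exact output spelling is illustrative):

```typescript
// Keep the fds tile readable: small limits raw, megarange as "NM",
// container-style gigantic caps as the infinity glyph. The exact
// value stays in the hover tooltip.
function fmtFdLimit(limit: number): string {
  if (limit > 1_000_000_000) return "\u221e"; // e.g. 1_073_741_816
  if (limit >= 1_000_000) return Math.round(limit / 1_000_000) + "M";
  return String(limit);
}
```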
Operator screenshot showed "TRAFFIC 14360 q/s · 1225..." truncated by
the tile's nowrap+truncate, and earlier "89087 ms" misread as 89 ms.
Both formatters reworked for a tight cell:

- fmtMs: 87ms / 1.2s / 89s / 1m29s / 1h42m. No space, two characters of
  unit max. The exact precision lives in the tooltip; the tile just
  needs to be unambiguous and wide-enough at any magnitude.
- fmtRate: same number rules (k / M suffix), no embedded suffix. The
  caller composes the suffix into the tile label so the value column
  is free to render two numbers when (qps + tps) is the metric.
- Traffic card: label now reads "Traffic q/s · t/s" and the value
  shows "14k · 1.2k". Always fits.

All formatting stays in the frontend per the operator note — backend
still returns raw u64/f64 so any future scrape consumer reads the
unrounded value.
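
The two formatters could be sketched like this — unit boundaries follow the examples above; rounding details are assumptions:

```typescript
// Compact, unambiguous at any magnitude: 87ms / 1.2s / 89s / 1m29s / 1h42m.
function fmtMs(ms: number): string {
  if (ms < 1000) return Math.round(ms) + "ms";
  if (ms < 60_000) {
    const s = ms / 1000;
    return (s < 10 ? s.toFixed(1) : Math.round(s)) + "s";
  }
  if (ms < 3_600_000) {
    const m = Math.floor(ms / 60_000);
    return m + "m" + Math.round((ms % 60_000) / 1000) + "s";
  }
  const h = Math.floor(ms / 3_600_000);
  return h + "h" + Math.floor((ms % 3_600_000) / 60_000) + "m";
}

// Bare number with k/M suffix; the caller composes "q/s" etc. into the label.
function fmtRate(v: number): string {
  if (v >= 1_000_000) return (v / 1_000_000).toFixed(1) + "M";
  if (v >= 1000) {
    const k = v / 1000;
    return (k < 10 ? k.toFixed(1) : Math.round(k)) + "k";
  }
  return v.toFixed(v < 10 ? 2 : 0);
}
```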
…load}

The Web UI was read-only — operators could see that a pool needed
PAUSE during a deploy or RELOAD after a config change but had to
fall back to ssh + admin SQL to actually do it. This commit closes
that gap with a small POST surface that mirrors the admin protocol's
RELOAD / PAUSE / RESUME / RECONNECT, paired with a confirmation
modal in the Pool detail page so a slip of the cursor cannot drain
a busy pool by accident.

Backend:

- New `crate::admin::operations` module exports `reload_now()`,
  `pause_now(db)`, `resume_now(db)`, `reconnect_now(db)`. They are
  thin extracts of the bodies of the matching `commands::*` handlers
  with no postgres-protocol envelope writing — both transports
  converge on the same pool / config mutations and emit identical
  `events::push_event` rows so the frontend chart-annotation overlay
  paints the moment of the action regardless of origin.
- `pool::get_client_server_map` is now `pub` so `reload_now` can
  obtain the cancel-target map without going through `from_config`.
- New `web::routes::admin::handle_admin_action` returns a JSON
  envelope `{"ts","action","affected_pools" | "error"}`.
- `web::server::dispatch` admits `POST` for `/api/admin/*` (everything
  else still 405's). The pre-screen at the top of `handle_connection`
  threads the path into the async admin handler the same way it does
  for `/api/logs`.
- `/api/admin/` is added to `ADMIN_ONLY_PREFIXES` so the unauth
  challenge is silent (`Accept: application/json`) — the React modal
  owns credentials.

Frontend:

- `api.ts` gains an `apiPost` helper with the same auth-header rules
  and 401 handling as `apiGet`.
- `PoolDetail` page renders four buttons in the header: pause / resume
  / reconnect (scoped to this pool's database) and reload (global,
  type-to-confirm). Each opens a typed-confirm modal that requires the
  operator to retype the action keyword before the call goes out.
- Result feedback is rendered as an inline status line; the button is
  disabled while the request is in-flight, so a double-click cannot
  resubmit a destructive action.
Operators kept asking for "a big window with min/max/avg/p99 like in
Grafana", and the small Sparkline cards on Overview were the wrong size
read. PanelView is the modal version: large canvas, percentile table
over the visible window, time-range selector, cross-hair readout, event
annotations.

What changed:

- New components/PanelView.tsx (~330 LOC). Discriminated by `kind`:
  "line" / "stackedArea" / "dualAxis". The same `data: [xs, ...ys]` shape
  Sparkline / AreaChart / DualAxisChart already use, so callers wire the
  panel without reshaping anything.
- Time range pill row (1m / 5m / 15m / 1h / all) windows the data
  client-side; panel rebuilds uPlot on width change so the canvas grows
  with the modal.
- Cursor crosshair pushes (idx, values[]) into React state on every
  mouse move. The footer strip shows "14:23:17 · 87 ms" on hover and
  falls back to a passive hint when the cursor leaves.
- Summary table: per-series count / min / avg / p50 / p95 / p99 / max
  computed by lib/quantile.ts (linear-interp, R "type 7"). No backend
  HDR snapshot needed — the values are taken over the visible
  windowed series.
- /api/events vertical-amber lines and warn/crit threshold lines paint
  via the same uPlot draw hook as the strip view.
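
The R type-7 quantile in lib/quantile.ts amounts to linear interpolation between order statistics; a re-derivation sketch (the committed code may differ in detail):

```typescript
// R "type 7" quantile: h = (n - 1) * q (zero-based), interpolate
// linearly between the two bracketing order statistics.
// Input must already be sorted ascending.
function quantile(sorted: number[], q: number): number {
  if (sorted.length === 0) return NaN;
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  const frac = pos - lo;
  return sorted[lo] * (1 - frac) + sorted[hi] * frac;
}
```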

Overview wiring:
- Each Golden-signals card is wrapped in a `ChartLink` (cursor:pointer +
  Enter activation) that calls `setSearchParams({panel: id})`. Card
  titles for AreaChart / DualAxisChart sections are buttons that do the
  same.
- A `?panel=<id>` query param survives reloads and is shareable; Esc /
  backdrop / ✕ all close it via `setSearchParams` so browser-back also
  pops the modal.
- panelDescriptor() builds the per-id Panel config from the same data
  the strip view uses, so the modal is always in sync with what the
  card showed at click time.
Two operator-facing improvements:

- /wall renders the same Overview signals as six oversized monocolour
  tiles with no chrome. Auto-refreshes via the existing usePoll hook;
  cells flash amber/red on threshold breach, the whole panel grows a
  red ring while any pool is critical so a wall TV reads the state
  from across a room. Sidebar nav gains a "War room" link.
- /pools filters and sort are now mirrored into the URL (?q=...,
  ?severity=critical, ?sort=query_p95_ms, ?dir=asc). Operators can
  paste a filter into Slack during an incident and the recipient lands
  on the exact list view; reload preserves it. Defaults are stripped
  so the URL stays clean when nothing is filtered.
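
The default-stripping mirror could be sketched as (param names from the commit; the default values here are assumptions):

```typescript
// Only non-default filter values reach the URL, so an unfiltered
// view shares as a clean path.
type PoolFilters = { q: string; severity: string; sort: string; dir: string };
const DEFAULTS: PoolFilters = { q: "", severity: "all", sort: "saturation", dir: "desc" };

function toSearchParams(f: PoolFilters): string {
  const p = new URLSearchParams();
  (Object.keys(f) as Array<keyof PoolFilters>).forEach((k) => {
    if (f[k] !== DEFAULTS[k]) p.set(k, f[k]);
  });
  return p.toString();
}
```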
The shipped favicon was a teal "pd" tile from the original cyan skin —
clashed with the new Bloomberg-style chrome on every browser tab. New
mark: black field, paper-white tabular "pd" monogram in JetBrains Mono,
amber tickertape underline that matches the dashboard's accent. Reads
at 16x16, 32x32 and 64x64.
Three operator complaints converged into one batch:

- "Wait queue vs oldest active ↗" — operators kept clicking the canvas
  to expand and nothing happened; only the title was a button. Card
  now wraps the entire body in a role=button when `onTitleClick` is
  set, with cursor:pointer and hover border. Click anywhere on the
  card opens the matching PanelView.
- Threads tile: avg-thread CPU was diluted by idle workers (jemalloc
  background threads sitting at 0% pulled the average down). The tile
  is now clickable and opens a per-thread time-series panel: one line
  per thread that ever exceeded 1% in the rolling window, sorted by
  peak descending, threshold lines at 60%/90% of one core. Idle
  threads are filtered out so the imbalance signal is visible.
- RSS tile: clickable; opens an interim memory panel with RSS and VM
  size over time. Full memory breakdown (jemalloc allocated /
  fragmentation, cgroup current/max, internal cache sums) is the
  next backend endpoint — research finished and queued.

Implementation:

- Overview accumulates a rolling 240-point process snapshot history in
  a `useRef`; each new /api/process snap computes per-thread deltas
  against the previous one and pushes a row into `threadHistoryRef`
  keyed by tid (NaN for threads that vanished). PanelView reads the
  history through panelDescriptor("threads" / "rss").
- ProcStat / ProcStatTwoLine gain an optional onClick prop that turns
  the tile into a clickable region with the same hover ring as the
  charts.
- panelDescriptor signature gets the threadHistory and processSnapshots
  arguments so the new panel kinds can be served from the same
  switch-statement as the existing data.
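
The tid-keyed history with NaN gaps could be sketched as (names illustrative; the real code keeps a bounded 240-point window and handles backfill, which this sketch omits):

```typescript
// Append one cpu% point per known tid; threads absent from the
// latest snapshot get NaN so their line breaks instead of bridging.
function pushThreadRow(
  history: Map<number, number[]>,
  latest: Map<number, number>, // tid -> cpu% this tick
): void {
  const tids = new Set([...history.keys(), ...latest.keys()]);
  tids.forEach((tid) => {
    const row = history.get(tid) ?? []; // new threads start mid-series here
    row.push(latest.has(tid) ? (latest.get(tid) as number) : NaN);
    history.set(tid, row);
  });
}
```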