Conversation
added 14 commits
May 6, 2026 08:20
Captures the design agreed during brainstorm:
- scope: observability + live-tail logs (no write-commands in MVP)
- architecture: relocate src/prometheus to src/web, single listener
serves both /metrics and /api/*
- config: [web] section with serde-alias for [prometheus] (back-compat),
ui/ui_anonymous/log_tap_kb flags with safe defaults
- auth: reuse admin_username/admin_password, basic-auth on admin paths,
refuse to enable UI when admin_password is the default ("admin"/empty)
- LogTap: lazy ArcSwap-based ring with reaper task (30s grace)
- frontend: React+TypeScript SPA embedded via include_dir!,
uPlot for charts, sessionStorage for in-browser history
Adds .superpowers/ to .gitignore (brainstorm session workdir).
Updates 2026-05-06-web-ui-design.md with findings from four parallel reviews (perf, UX, DBA/DevOps, dashboard research): lock-free MPSC LogTap, six-page navigation with drawer-based drill-down, sort/filter/URL-state on tables, /api/top/* endpoints, threshold-driven health computed on the frontend, and a dedicated observability layout & thresholds section.

Adds 2026-05-06-web-ui-design-system.md: industrial/utilitarian visual language with IBM Plex Sans+Mono, dark-primary palette, sidebar 220 px, dense 32 px tables, threshold paint mixin, four-sparkline Golden Signals strip, keyboard shortcuts, and three empty-state variants.

Adds plans/2026-05-06-web-ui-phase-1.md: bite-sized TDD plan for the first refactor step — rename [prometheus] config section to [web] with a serde alias, move src/prometheus to src/web/metrics. No behaviour change, namespace preparation for upcoming phases.
…tion for the Web UI

Operators continue to point Grafana at /metrics with no changes — the legacy [prometheus] section name is kept as a serde alias on Config::web. Three new config keys (ui, ui_anonymous, log_tap_max_entries) appear in the generated reference configs but stay inactive by default; nothing observable changes for existing deployments.

Internally, the prometheus module is repositioned under web::metrics, freeing the web:: namespace for the upcoming auth, log_tap, REST routes, and SPA embedding. Doing the rename now, while nothing depends on the public shape of the web namespace, is cheaper than doing it after that shape has shipped.

Verified by release-build smoke test: the same /metrics output is served identically with either [web] or the legacy [prometheus] section header. 634 tests pass, clippy and fmt clean. Phase 1 of seven; phase 2 wires the listener mux and basic-auth.
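For readers skimming the back-compat mechanism: a serde field alias is what keeps old `[prometheus]` config files deserializing into the renamed section. The sketch below is illustrative only — struct and field names are assumptions, not the actual pg_doorman types.

```rust
use serde::Deserialize;

// Hypothetical shapes; the `alias` attribute is the point here.
#[derive(Debug, Default, Deserialize)]
#[serde(default)]
pub struct WebConfig {
    pub ui: bool,
    pub ui_anonymous: bool,
    pub log_tap_max_entries: usize,
}

#[derive(Debug, Default, Deserialize)]
pub struct Config {
    // `[web]` is the new section name; the alias lets existing configs that
    // still say `[prometheus]` land in the same field unchanged.
    #[serde(default, alias = "prometheus")]
    pub web: WebConfig,
}
```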
Documents an architectural decision that affects how the upcoming Web UI ships: built frontend bundles are committed alongside their sources so the release pipeline (RPM, DEB, Docker) stays cargo-only. Operators distributing pg_doorman do not need a Node toolchain, and the Rust release machinery does not gain a new dependency. Lint and typecheck remain mandatory in a separate frontend CI job that also rebuilds the bundle and fails on any diff against the committed dist; that guards against developers forgetting to rebuild after editing sources.

Updates sections 4.4 (embedding), 10.4 (build/CI), 12.4 (frontend tests), and 14 (release checklist), and adds decision log entry 22. Also fixes a stale `log_tap_kb = 64` reference in 13.1 to the current `log_tap_max_entries = 8192`. Marks phase 1 as DONE in section 14.
The web listener now serves more than /metrics. When [web].ui = true and admin_password is non-default, GET /api/* and the SPA paths participate in dispatch. /metrics behaviour is byte-identical: the same listener routes it before any auth or dispatch logic runs.

Operators with a default or empty admin_password see a single warning line on startup and the UI stays off; /metrics still works for them. This closes the foot-gun where someone enables `ui = true` but forgets to change the seed credential. Public /api/* routes are gated by [web].ui_anonymous; admin-only paths (/api/logs, /api/prepared/text/, /api/interner/top) always require basic-auth.

Phase 2 ships only the gating; every /api/* request that makes it through auth returns 501 with a stub body. Real handlers land in phase 3. The auth check uses constant-time credential comparison via the subtle crate to defeat timing oracles.

Tests: 664 passed (was 634 baseline + 30 new across web::auth, web::server, web::tests), clippy and fmt clean. Verified by release smoke against /metrics, /api/overview, /api/logs (anonymous and authenticated), and the default-password configuration. Phase 2 of seven; phase 3 fills /api/* with real handlers.
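A minimal sketch of the constant-time check, assuming credentials are compared as byte strings; the helper name is hypothetical, only the use of the subtle crate comes from the commit.

```rust
use subtle::ConstantTimeEq;

/// Hypothetical helper: compare a presented credential against the configured
/// one without short-circuiting on the first mismatching byte.
fn credentials_match(presented: &str, expected: &str) -> bool {
    // The slice impl of `ct_eq` examines every byte regardless of where a
    // mismatch occurs, so response timing does not leak a prefix of the
    // password (only its length).
    presented.as_bytes().ct_eq(expected.as_bytes()).into()
}
```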
The web listener now answers three GET routes when [web].ui = true: runtime status, aggregated counters across the whole pooler, and a per-pool snapshot. The wire shape matches spec sections 8.1 through 8.4 verbatim, so the upcoming frontend pages will read these responses without any field renaming. Anonymous access is permitted because public routes default to ui_anonymous = true.

Per-pool fields include real error counters and wait-time percentiles sourced directly from PoolStats; nothing is hardcoded to zero. An operator can already correlate /api/pools output with SHOW POOLS via psql admin, including the wait_p95_ms and errors_total signals that drive future health-pill rules.

The original spec section 16 declared two backend "must-have gaps" for these signals; they turned out to be already populated by existing PoolStats fields. Section 16 is rewritten as Grafana nice-to-haves so future work focuses on Prometheus parity and label breakdowns rather than on closing UI-blocking gaps that don't exist. /metrics behaviour and the /admin protocol are untouched.

Tests: 674 passed (was 664), clippy and fmt clean. Verified by release-build smoke against the three endpoints plus regression checks that an unwired path returns 501 and /metrics still returns 200. Phase 3a of seven; phase 3b adds /api/clients and /api/servers.
…nation

Operators inspecting the pooler from the Web UI can now hit /api/clients and /api/servers, narrow the result by pool, database, user, application name, or state, and page through the response with ?limit and ?offset. The default sort orders the most useful column for triage: clients by queries_total desc, servers by connection age desc.

Why server-side filter and pagination: a busy pooler may have thousands of PostgreSQL clients connected. Even one Web UI user listing them in a single response is wasteful — limit and offset cap the JSON size and let the frontend build pagination UI without parsing a megabyte of output. Web UI usage is light (occasional operator visits, not concurrent load); the goal is response shape, not throughput.

ClientStats now stamps the nanoseconds-from-connect on every transition between active, idle, and waiting logical groups so wait_ms and current_query_age_ms can report the duration spent in the current state. The stamp is skipped on intra-group transitions (ACTIVE_READ ↔ ACTIVE_WRITE etc.), which keeps the per-query cost of the existing hot path unchanged: the SQL transition pathway hits at most one extra state_group comparison plus an atomic store on actual group entry. ServerStats already exposed an equivalent active_age_ms accessor.

Tests: 715 passed (was 674), clippy and fmt clean. New coverage: direct unit tests for collect_clients and collect_servers (every filter dimension, every sort variant in both orders, pagination boundaries), plus a state-since-nanos test that verifies the intra-group optimisation does not move the timestamp. Verified by release smoke against /api/clients and /api/servers with default response, sort plus order, limit, percent-encoded pool filter, and that /metrics still returns 200. Phase 3b of seven; phase 3c lands the ConfigState routes (/api/config, /api/connections, /api/stats, /api/databases, /api/users, /api/log_level, /api/auth_query, /api/pool_scaling, /api/pool_coordinator, /api/sockets).
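The stamping rule, as a minimal sketch; the field names, the u8 group encoding, and the method signature are assumptions — only the skip-on-intra-group-transition idea is from the commit.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

struct ClientStats {
    state_group: AtomicU8,        // 0 = active, 1 = idle, 2 = waiting (illustrative)
    state_since_nanos: AtomicU64, // nanoseconds-from-connect at last group entry
}

impl ClientStats {
    fn on_transition(&self, new_group: u8, nanos_from_connect: u64) {
        // Intra-group transitions (ACTIVE_READ <-> ACTIVE_WRITE, ...) fall
        // through here with an equal group id: one comparison, no store.
        if self.state_group.load(Ordering::Relaxed) != new_group {
            self.state_group.store(new_group, Ordering::Relaxed);
            // Stamp only on genuine group entry so wait_ms /
            // current_query_age_ms report time spent in the current state.
            self.state_since_nanos
                .store(nanos_from_connect, Ordering::Relaxed);
        }
    }
}
```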
ConfigState page in the upcoming Web UI needs a per-tab data source for connection counters, per-pool stats, configured databases, and configured users. This commit wires the four list endpoints. Field names mirror the SHOW CONNECTIONS / SHOW STATS / SHOW DATABASES / SHOW USERS admin columns one-to-one so operators recognise the values. The shapes are flat lists with a `ts` timestamp; no filter, sort, or pagination — these are configuration and aggregate views, not the client/server lists where volume justified server-side query handling in phase 3b.

The only intentional deviation from SHOW CONNECTIONS is `errors` being computed via `saturating_sub` rather than wrapping subtraction. The counters update independently and the categorised sum can momentarily exceed `total`; saturating arithmetic prevents a transient u64 underflow from surfacing as a value near u64::MAX on the dashboard.

Tests: 730 passed (was 715), clippy and fmt clean. Verified by release binary smoke-tests against all four endpoints. Phase 3c-1 of seven; phase 3c-2 lands the remaining ConfigState routes (config with masking, log_level, auth_query, pool_scaling, pool_coordinator, sockets).
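A minimal sketch of that deviation, with hypothetical counter names (the real category set lives in the SHOW CONNECTIONS conversion):

```rust
// The categorised counters update independently of `total`, so a plain or
// wrapping subtraction can transiently underflow when their sum momentarily
// exceeds `total`.
fn derived_errors(total: u64, ok: u64, cancels: u64) -> u64 {
    // saturating_sub clamps the transient underflow to 0 instead of letting
    // the dashboard display a value near u64::MAX for one poll cycle.
    total.saturating_sub(ok).saturating_sub(cancels)
}
```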
…scaling /api/pool_coordinator /api/sockets
ConfigState page in the upcoming Web UI now has the rest of its data
sources: the active configuration (with secret values redacted), the
runtime log filter, the auth_query cache stats, the anticipation/burst
gate counters per pool, the per-database coordinator limits, and on
Linux the TCP/Unix socket-state breakdown.
Field names mirror the corresponding admin SHOW commands one-to-one.
The /api/sockets endpoint stays at parity with the existing platform
gate: Linux returns the counters, other operating systems return
503 not_supported.
Secret-value masking for /api/config is implemented as a pure helper
that redacts any key whose trailing path segment is exactly "password"
or "secret", or ends with _password / _secret / _token / _key. The
flat config representation today omits per-user passwords and
admin_password — that is a long-standing limitation of the existing
SHOW CONFIG conversion; when the conversion is extended in a future
PR the masker will pick the new keys up automatically.
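For reference, a pure-helper sketch of that masking rule. The dotted-path key shape, the helper names, and the "****" replacement are assumptions; only the suffix/segment rules come from this commit.

```rust
/// Decide whether a flat config key holds a secret. Keys are assumed to be
/// dotted paths such as "users.0.auth_token"; only the trailing segment
/// decides redaction.
fn is_secret_key(flat_key: &str) -> bool {
    let last = flat_key.rsplit('.').next().unwrap_or(flat_key);
    last == "password"
        || last == "secret"
        || last.ends_with("_password")
        || last.ends_with("_secret")
        || last.ends_with("_token")
        || last.ends_with("_key")
}

fn mask_value(flat_key: &str, value: &str) -> String {
    if is_secret_key(flat_key) {
        "****".to_string() // placeholder; the actual redaction text may differ
    } else {
        value.to_string()
    }
}
```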
Tests: 754 passed (was 730), clippy and fmt clean. Verified by release
binary smoke-tests against all six endpoints.
Phase 3c-2 of seven; phase 3c-3 lands /api/prepared, /api/interner and
the admin-only stubs prepared/text/{hash} and interner/top.
…t /api/interner/top
The Caches page in the upcoming Web UI gets its public aggregates plus
the two admin-only endpoints for inspecting query bodies.
The public /api/prepared endpoint is the per-pool prepared-statement
summary; SQL bodies are intentionally absent from this response so
anonymous Web UI viewers cannot read query texts. The admin-only
/api/prepared/text/{hash} endpoint serves the body on demand. Likewise
/api/interner gives the global named/anonymous interner counts and
byte totals, and the admin-only /api/interner/top?n=N returns the
heaviest entries with a 120-character preview, capped at n=200 so a
100k-entry interner does not turn into an unbounded preview list.
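The two guard rails (the n cap and the bounded preview) are simple enough to show as a sketch; constant and function names here are illustrative, not the shipped code.

```rust
const TOP_N_CAP: usize = 200;     // ?n= is clamped so a huge interner never
const PREVIEW_CHARS: usize = 120; // produces an unbounded preview list

fn clamp_n(requested: Option<usize>, default_n: usize) -> usize {
    requested.unwrap_or(default_n).min(TOP_N_CAP)
}

fn preview(sql: &str) -> String {
    // Take at most 120 *characters*, so a multi-byte UTF-8 body is never
    // split in the middle of a code point.
    sql.chars().take(PREVIEW_CHARS).collect()
}
```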
Tests: 778 lib tests passed (was 754); `cargo clippy --lib` and
`cargo fmt --check` clean. Verified by release binary smoke-tests
against all four endpoints, including the 401 anonymous gate on the
admin paths and the 404 path for an unknown hash.
Phase 3c-3 of seven; phase 3d lands the top-N triage endpoints,
/api/apps, and /api/events.
The Web UI's triage page is backed by two new endpoints. /api/top/clients answers "which connection is hammering the pooler right now" by sorting clients server-side by qps, errors, or age, optionally narrowed to a single pool. /api/apps gives the per-application_name aggregate (clients, queries_total, transactions_total, errors_total) so an operator can spot a service that is opening too many connections or generating too many errors. Sort dimensions and the n cap (default 20, max 200) are documented on the DTOs.

/api/top/clients computes qps server-side as queries_total / max(age_seconds, 1); this is the one server-side derivation in the rollout, justified because a Top-N sort by qps needs the value to compare. Other counters stay raw, frontend computes rates per decision #21. No backend instrumentation added; both endpoints read existing ClientStats counters that are already incremented on the SQL path. The hot path is untouched.

Tests: 793 lib tests passed (was 778); `cargo clippy --lib` and `cargo fmt --check` clean. Verified by release smoke on both endpoints with the by= and sort= parameters. Phase 3d-1 of seven; phase 3d-2 lands /api/top/queries together with the per-interner-entry count and duration instrumentation.
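The derivation in one line, for clarity (field names are illustrative):

```rust
/// The one server-side derivation: qps = queries_total / max(age_seconds, 1).
/// Clamping the age avoids a divide-by-zero for a connection younger than 1 s.
fn qps(queries_total: u64, age_seconds: u64) -> f64 {
    queries_total as f64 / age_seconds.max(1) as f64
}
```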
Operators triaging a busy pooler can now hit /api/top/queries to see the heaviest prepared statements by Bind count or by mean execution time. The endpoint sorts server-side, defaults to by=count with n=20, caps n at 200.

Two atomic counters per interner entry track this. count is bumped on every Bind that resolves to a hash; total_duration_us absorbs the batch's elapsed microseconds at Sync time. The hot path additions are two Relaxed fetch_adds per query, on the order of 50 ns each; small enough to fit the project's "stats may be approximate, throughput may not be" rule.

Approximation contract: count is Bind-count, not Execute-count or Parse-count. Duration attribution is per-batch; a Sync that ended a batch with multiple Bind messages credits the entire elapsed time to the last Bind's hash. Simple queries do not flow through the interner and are absent from this endpoint; the Top-N for non-prepared traffic is /api/top/clients.

Tests: 800 lib tests passed (was 793); cargo clippy --lib and cargo fmt --check clean. Release smoke confirmed both ?by=count and ?by=duration return 200 with the expected envelope. Phase 3d-2 of seven; phase 3d-3 lands /api/top/prepared with a similar lightweight per-CacheEntry hit/miss pair.
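A sketch of the per-entry instrumentation described above; the struct and method names are assumptions, only the two Relaxed counters and the Bind/Sync attribution are from the commit.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct EntryStats {
    count: AtomicU64,             // bumped on every Bind that resolves to this hash
    total_duration_us: AtomicU64, // the batch's elapsed time, absorbed at Sync
}

impl EntryStats {
    fn on_bind(&self) {
        // Relaxed is enough: these are approximate stats, not synchronisation.
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    fn on_sync(&self, batch_elapsed_us: u64) {
        self.total_duration_us
            .fetch_add(batch_elapsed_us, Ordering::Relaxed);
    }

    /// Mean used for the by=duration sort; approximate by design.
    fn mean_duration_us(&self) -> u64 {
        let n = self.count.load(Ordering::Relaxed).max(1);
        self.total_duration_us.load(Ordering::Relaxed) / n
    }
}
```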
…mentation

The Caches page can now show which prepared statements are seeing the most cache hits versus misses. /api/top/prepared sorts pool-cache entries server-side by hits or misses, defaults to by=hits with n=20, caps n at 200. The /api/prepared response gains hits and misses fields so the existing endpoint also benefits.

Two atomic counters per CacheEntry track this. The hot path hook is a single Parse-handler call site after the existing has_prepared_statement check; on hit we increment hits, on miss we increment misses. Both go through a DashMap.get + Relaxed fetch_add — the same lock-free, no-op-on-absence pattern used by /api/top/queries in phase 3d-2.

Approximation contract: counters are per-pool per-CacheEntry. LRU eviction discards counters; long-lived prepared statements with many re-Parses keep their numbers, ephemeral statements that churn out of the LRU lose theirs. Operators triage with this caveat in mind.

Tests: 806 lib tests passed (was 800); cargo clippy --lib and cargo fmt --check clean. Release smoke confirmed both ?by=hits and ?by=misses return 200 with the expected envelope; /api/prepared includes the new hits and misses fields. Phase 3d-3 of seven; phase 3d-4 lands /api/events with the admin command ring buffer.
The Web UI's Overview graphs can now render vertical-line annotations for the four state-changing admin commands. /api/events takes since= and max= query parameters, returns the events newer than `since`, and echoes the next sequence number so the next poll picks up where this one stopped. The ring buffer holds 1024 entries; older events drop silently when full, which is well over a day of history at typical admin cadence.

Producer side: each successful RELOAD, PAUSE, RESUME, or RECONNECT admin command pushes an entry under a Mutex<VecDeque>. Admin commands fire at the rate of a handful per cluster per day; contention is nonexistent. The SQL hot path is untouched.

Tests: 813 lib tests passed (was 806); cargo clippy --lib and cargo fmt --check clean. The unit tests for the ring buffer cover the sequence-monotonic, since-filter, max-cap, and overflow-drops-oldest behaviours. Phase 3d-4 of seven; this closes phase 3d. Phase 4 lands the LogTap infrastructure for the admin-only /api/logs endpoint.
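The ring's behaviour (monotonic seq, since-filter, max-cap, overflow drops oldest) fits in a short sketch; type and method names here are illustrative, not the actual events module.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

const RING_CAPACITY: usize = 1024;

#[derive(Clone)]
struct AdminEvent {
    seq: u64,
    ts_ms: u64,
    action: String, // "RELOAD" | "PAUSE" | "RESUME" | "RECONNECT"
}

#[derive(Default)]
struct EventRing {
    inner: Mutex<(u64, VecDeque<AdminEvent>)>, // (next_seq, entries)
}

impl EventRing {
    fn push(&self, ts_ms: u64, action: &str) {
        let mut guard = self.inner.lock().unwrap();
        let seq = guard.0;
        guard.0 += 1;
        if guard.1.len() == RING_CAPACITY {
            guard.1.pop_front(); // overflow drops the oldest entry silently
        }
        guard.1.push_back(AdminEvent { seq, ts_ms, action: action.to_string() });
    }

    /// Events at or after `since`, capped at `max`, plus the next sequence
    /// number so the poller resumes exactly where this response stopped.
    fn events_since(&self, since: u64, max: usize) -> (Vec<AdminEvent>, u64) {
        let guard = self.inner.lock().unwrap();
        let events = guard.1.iter().filter(|e| e.seq >= since).take(max).cloned().collect();
        (events, guard.0)
    }
}
```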
added 4 commits
May 6, 2026 16:18
Operators can now tail the pooler's recent log records through /api/logs (admin-only) for incident triage. The endpoint accepts since=, max=, level=, and target= query parameters: level= sets the minimum displayed severity (level=WARN shows warn and error only) and target= is a substring match on the Rust module path. Default max=200, hard cap 1000.

The producer side adds an AtomicBool gate (Acquire load) in LogLevelController::log: when the tap is off the cost is a single atomic load (~1 ns on x86, one barrier on ARM). When on, the producer formats the record into a 4 KB bounded buffer (UTF-8 safe truncation) and try_sends through a bounded MPSC; on channel full, the drop is counted in dropped_total. The consumer is a single tokio task that owns the VecDeque, assigns monotonic seq numbers, and serves Drain commands without blocking producers.

The tap activates on the first /api/logs request and a reaper task disables it after 30 s without traffic, so the buffer footprint goes to zero when no operator is watching. Setting log_tap_max_entries=0 in [web] disables the endpoint entirely (returns 503 with body log_tap_disabled).

Tests: 819 lib tests passing (was 813); cargo clippy --lib --deny warnings and cargo fmt --check clean. Release smoke verified admin auth gate (401 anonymous, 200 admin), level filter, and target substring filter. Phase 4 of seven; phases 5 and 6 land the frontend; phase 7 packages the SPA bundle and CI.
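A producer-side sketch of that gate, assuming a bounded tokio MPSC channel; type and method names are illustrative, and the real hook sits inside the logging layer rather than a free-standing struct.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use tokio::sync::mpsc;

const MAX_RECORD_BYTES: usize = 4096;

struct LogTapProducer {
    enabled: AtomicBool,        // flipped on by the first /api/logs request
    dropped_total: AtomicU64,   // incremented instead of ever blocking
    tx: mpsc::Sender<String>,   // bounded; the consumer task owns the VecDeque
}

impl LogTapProducer {
    fn publish(&self, level: &str, target: &str, message: &str) {
        // Off path: a single Acquire load, nothing formatted or allocated.
        if !self.enabled.load(Ordering::Acquire) {
            return;
        }
        let mut line = format!("{level} {target} {message}");
        if line.len() > MAX_RECORD_BYTES {
            // Truncate on a char boundary so the record stays valid UTF-8.
            let cut = (0..=MAX_RECORD_BYTES)
                .rev()
                .find(|&i| line.is_char_boundary(i))
                .unwrap_or(0);
            line.truncate(cut);
        }
        // Never block the producer: on a full channel, count the drop and move on.
        if self.tx.try_send(line).is_err() {
            self.dropped_total.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```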
Narrows the broader Web UI design to phase 5 boundaries: the frontend/ scaffold with Vite + React + TS + Tailwind v4 (updated from v3 in the parent spec), six page placeholders, working AuthGate and Sidebar, hook primitives (usePoll, useAdminAuth), the typed API client surface, and a CI workflow that lint/typecheck/build-checks the bundle without putting npm in the Rust release pipeline. Color and typography tokens are transcribed from 2026-05-06-web-ui-design-system.md verbatim into a Tailwind v4 @theme block. IBM Plex Sans/Mono is self-hosted under SIL OFL. Page bodies, uPlot, threshold logic, embedding via include_dir!, and BDD scenarios remain explicitly out of scope for phase 5; phase 6 fills page bodies, phase 7 lands embedding and BDD.
Adds a developer-facing frontend that runs against a live pg_doorman: `npm run dev` starts a Vite shell on :5173 that proxies /api and /metrics to the pooler, and a basic-auth modal locks the UI as soon as any API call returns 401. Credentials live in React state, so they go away on refresh. None of this is wired into the binary yet; phase 7 embeds the bundle through include_dir.

The shell renders a sidebar plus six placeholder pages — Overview, Pools, Clients, Caches, Logs, Config — each waiting for phase 6 to fill in real bodies. usePoll and useAdminAuth hooks are also in place so phase 6 has the polling and auth-header primitives ready.

Build artifacts in frontend/dist/ are committed; the new .github/workflows/frontend.yml job runs npm ci, lint, typecheck, and build on every PR that touches frontend/, then fails if the rebuilt bundle differs from what's in the tree. Rust release jobs do not run npm — RPM, DEB, and Docker builds depend only on the committed dist.

Stack: Vite 6, React 18, TypeScript 5, Tailwind v4 with the design system tokens copied from the design-system spec, react-router 6, IBM Plex Sans/Mono via @fontsource (SIL OFL), uPlot 1.6 (added now, used in phase 6).
The "diff against rebuild" step proved non-deterministic: vite/esbuild emit different bundles on local vs CI runners despite an identical package-lock.json (different native esbuild binaries are the suspect). The committed frontend/dist/ stays the source of truth, the CI job now only verifies the bundle is present and non-empty. Phase 7 will revisit with a reproducible-build approach (pin esbuild, or build once in CI and treat the artifact as the release output).
Operators get a working /overview that polls /api/overview and /api/pools every 1.5 s, applies the threshold rules from spec section 15.4 in a pure frontend function, and renders a Health pill plus four golden-signals sparklines (latency P95, traffic qps/tps, errors/s, saturation max). Charts share a cross-hair sync key so hovering one tracks the others.

The threshold engine covers the rules whose inputs are already on PoolDto today — saturation, oldest-active age, p95/p99, wait, errors/s. Auth-failure, TLS, anonymous LRU, and Patroni rules carry a TODO and will land in phase 6b together with the endpoints they depend on. History is a 120-point rolling window in sessionStorage so a tab refresh keeps the recent context.

uPlot is now a real dependency on screen instead of just a transitive one; gzipped JS grows from 56 KB to roughly 83 KB. Also pins the GITHUB_TOKEN scope on the frontend workflow to contents: read after the CodeQL recommendation. Phase 6a-2 follows up with Connection breakdown, Pool fill heatmap, dual-axis wait + oldest-active-age, top-5 errors per pool, and the collapsed resource detail row.
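The shipping engine is a pure TypeScript function in the frontend; the Rust-flavoured sketch below only illustrates the rule shape. The 70 %/90 % saturation and 30 s/5 min oldest-active thresholds are the ones quoted in the later overview commits — the names and the reduced rule set here are assumptions.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Severity {
    Ok,
    Degraded,
    Critical,
}

fn rule(value: f64, warn: f64, crit: f64) -> Severity {
    if value >= crit {
        Severity::Critical
    } else if value >= warn {
        Severity::Degraded
    } else {
        Severity::Ok
    }
}

/// A pool's health is the worst severity across its rules; the page-level
/// health pill is then the worst severity across pools.
fn pool_severity(saturation_pct: f64, oldest_active_ms: f64) -> Severity {
    rule(saturation_pct, 70.0, 90.0)
        .max(rule(oldest_active_ms, 30_000.0, 300_000.0))
}
```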
added 7 commits
May 6, 2026 17:37
… 6a-2)

Two more rows on /overview, both reading from the same poll the golden signals already drive:
- Connection breakdown: stacked area of active / idle / waiting client counts over the 3 min sample window. Green / muted / amber respectively, no threshold paint — this row answers "what is happening" rather than "is something wrong".
- Pool fill heatmap: one row per pool, last 60 saturation cells (≈ 90 s at 1.5 s polling). Cell color is green / amber / red at the 70 % / 90 % thresholds. First place in the UI where an operator can spot a single pool burning while the others are quiet.

Adds two thin uPlot wrappers (AreaChart with internal stacking, plain-DOM Heatmap because uPlot is the wrong tool for tabular heat). Bundle gzipped JS grows from 83 to 84 KB. Phase 6a-3 follows up with dual-axis wait + oldest-active-age, top-5 errors per pool, and the collapsed resource detail row.
…6a-3)

Two more rows on /overview:
- Wait queue vs oldest-active-age. Dual-axis line chart: left axis is the absolute waiting-clients count, right axis is the maximum oldest-active age across pools on a log ms scale. Right axis carries dashed amber and red lines at 30 s and 5 min — the same thresholds the engine uses to flag a pool. When the right line shoots up while the left stays low, the operator sees a single hung connection that the simple sparklines miss.
- Top-5 stacked area of errors-per-second per pool. Pools are ranked by their max eps over the last 30 s and only ones with eps > 0 land on the chart. Five distinct fill colors so the bottom band stays legible even when the top one is dominant.

Resource detail (memory / sockets / interner inside a collapsible section) is the remaining row from spec section 15.1; it lands in phase 6a-4 once the polled endpoints are wired.
Closes the last row from spec section 15.1: a collapsible Resource detail section at the bottom of /overview that shows current socket counts (tcp / tcp6 / unix-stream from /api/sockets) and query-interner stats (named / anonymous entries and bytes from /api/interner). The section polls at 3 s instead of the 1.5 s Golden-Signals cadence — this data is ambient context, not a hot signal. Open state is persisted under localStorage[pgdoorman.collapse.overview-resource] so a refresh keeps the operator's preference. Process-memory metric (pg_doorman_total_memory) is exposed only in Prometheus, not as a JSON endpoint, so the Memory subrow is omitted until phase 7 ships an /api/memory or the existing exporter is mirrored into JSON. Bundle gzipped JS climbs from 84.7 to 85.2 KB.
…ase 6b)

Replaces the placeholder /pools with a sortable table backed by /api/pools polled at 1.5 s. Each row shows id, mode, connections / max with saturation percent, waiting clients, query p95 / p99 in ms, cumulative errors, and a severity column driven by the same threshold engine the overview's health pill uses. Per-row left border picks up amber or red when the engine flags the pool, so a fleet of pools with one struggling stands out without scanning numbers.

Filter row at the top: substring match on pool id and a severity dropdown (all / ok / degraded / critical). Click a column header to sort; the header arrow shows direction and clicking again flips it. Default sort is saturation descending, so the busiest pool floats to the top.

Inline sparklines per row and the pool-detail drawer from spec §15.2 are deliberately deferred — phase 6b-2 follows up once a per-pool history helper is in place. Bundle gzipped JS climbs to 86 KB.
…(phase 6c)

Replaces the placeholder /clients with a paginated table that hits /api/clients with limit/offset/sort/order plus the pool, database, user, application_name, and state filters the backend already supports. Filters live in component state for now; URL-state and deep-linking land later with the useUrlState hook from spec §10.2.

Page size is 50 rows, navigation by prev/next, footer shows the visible range and total count returned by the API. Sort columns: queries_total, errors_total, age_seconds, current_query_age_ms. State cells are coloured: active green, waiting amber, others muted; a non-zero error count switches to amber to draw the eye. Bundle gzipped JS climbs to 87 KB.
… 6d)

Replaces the placeholder /caches with a two-tab view:
- Prepared tab — server-side cache rows from /api/prepared, polled at 3 s. Shows pool, kind (named / anonymous / mixed), name, hash, used/hits/misses counters, and a hit-rate column that turns amber under 95 % and red under 80 % to match the threshold table.
- Query cache tab — interner aggregate from /api/interner. Two cards side-by-side for named vs anonymous: entry count, total bytes, average bytes per entry. The right place for an operator to spot anonymous growth before the LRU starts evicting useful entries.

Polling cadence is 3 s instead of 1.5 s — the data is not hot-path. Bundle gzipped JS climbs to 87.9 KB.
Wires /logs against the LogTap admin endpoint. Polls /api/logs at 1.5 s with the most recent seq, appends new entries to a tail-style view, and keeps the last 500 lines in memory. Filters: minimum level (ERROR / WARN / INFO / DEBUG / TRACE) and target substring; either resets the stream and starts from seq 0 again so the operator gets a clean window. Header chips show whether the tap is currently on, the consumer ring fill, cumulative drops, and a separate counter when the consumer fell behind enough to lose entries before the operator's `since` cursor. The pause toggle keeps the existing buffer intact and slows the poll to once a minute so a busy session does not eat memory while the operator is reading. Bundle gzipped JS climbs to 89 KB.
added 8 commits
May 6, 2026 20:04
…olds

What was needed: the spec §15.4 health rules for waiting, reconnect rate, burst-gate budget exhaustion, coordinator exhaustions, and auth-failure rate were missing from the threshold engine. The Overview health pill treated those situations as healthy.

What changed: the threshold engine now evaluates those five additional rules using counters already exposed by /api/pool_scaling, /api/pool_coordinator, /api/auth_query and the existing /api/pools waiting field. The Overview page polls the three sibling endpoints and feeds creates/gate-budget/coordinator counters into per-pool history; the auth-query response is passed through to a global rule that summarises auth-failure ratios across databases.

Backend gaps that block the remaining §15.4 rules (TLS handshake errors, anonymous LRU evictions, synthetic SQLSTATE 26000, fallback_active, Patroni health, cgroup RSS) are documented inline so the next contributor knows what DTO/endpoint to extend.
What was needed: the /pools view exposed only a single errors_total number per pool, so an operator seeing a spike could not tell whether the cause was an FK violation, an auth failure, or pg_doorman refusing checkouts. Spec §15.4 also asks for per-class drill-down on the Pools drawer.

What changed: every pg_doorman-side checkout failure (SQLSTATE 53300) and every parseable PG-side ErrorResponse with a canonical 5-char SQLSTATE now updates a sharded, zero-contention DashMap of (sqlstate -> AtomicU64) on the pool address stats. Codes that fail validation (length, alphabet) are still counted in the aggregate but left out of the breakdown so a malformed or adversarial code field cannot grow the map without bound.

The /api/pools JSON gains an optional errors_by_sqlstate map per pool (omitted when empty so payloads stay slim for healthy pools), and the Pools drawer renders the top-5 codes plus an "other (N)" rollup so the long tail does not crowd the panel.
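A sketch of the breakdown map and its validation gate; the struct names and the exact alphabet check are assumptions (the real field lives on the pool address stats), only the 5-char validation and the bounded-map rationale come from the commit.

```rust
use dashmap::DashMap;
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct ErrorBreakdown {
    by_sqlstate: DashMap<String, AtomicU64>,
}

fn is_canonical_sqlstate(code: &str) -> bool {
    // Exactly five ASCII alphanumerics; anything else stays out of the map so
    // a malformed or adversarial code field cannot grow it without bound.
    code.len() == 5 && code.bytes().all(|b| b.is_ascii_alphanumeric())
}

impl ErrorBreakdown {
    fn record(&self, code: &str) {
        if !is_canonical_sqlstate(code) {
            return; // still counted in the aggregate errors_total elsewhere
        }
        self.by_sqlstate
            .entry(code.to_string())
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(1, Ordering::Relaxed);
    }
}
```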
What was needed: the include_dir!() macro in src/web/static_assets.rs
embeds frontend/dist into the binary at compile time, but the SRPM and
DEB tarballs assembled by copr-publish.yaml and launchpad-publish.yaml
shipped without that directory. Builds on rocky-9, alma-9, fedora-40
and fedora-41 failed with `proc macro panicked: "frontend/dist" is not
a directory`.
What changed: frontend/dist is now part of the explicit FILES_TO_COPY
list in both packaging workflows, and the cp loop uses --parents so the
intermediate frontend/ directory is preserved under
pg-doorman-${VERSION}/. Stays consistent with the project rule that
the pre-built SPA is committed to git and the release pipeline does not
run npm.
Updates the docker-compose demo so a new operator browsing the dashboard also sees the Web UI on :9127 and a working session-mode pool. The toml moves the legacy [prometheus] block to [web], turns on `ui = true` with a non-default admin password, and adds a third user (`session_user`) with `pool_mode = "session"`. init.sql provisions the user and a notify_queue whose AFTER INSERT trigger raises NOTIFY app_events. A listener container holds three long-lived LISTEN sessions and inserts one row every five seconds, so the Web UI shows non-zero idle clients on the session pool while pgbench keeps the transaction pool busy. README points to http://localhost:9127.
Drops docs/superpowers/ — those were brainstorming notes, design specs, and phase-by-phase plans used while building the Web UI. They belong to the local development trail, not to the shipped repository.
frontend/dist/assets shrinks from 56 files (676 KB) to 12 (482 KB): the @fontsource imports now pin latin and cyrillic subsets only, the post-build pass deletes legacy .woff variants (modern browsers have shipped woff2 since 2014), and IBM_PLEX_OFL.txt is copied next to the bundled fonts so the OFL-1.1 redistribution clause has its license text within reach.
The history-building effect on the overview page used to fire on every sibling poll (pool_scaling, pool_coordinator, auth_query) on top of the master overview/pools cadence. Each side fire pushed a fresh history sample with `dt ≈ 50 ms` since the previous push, which made `(delta_count) / dt` collapse to zero on the off-beats and to the real rate on the next overview tick. The qps and tps sparklines drew that as a 0 → peak → 0 → peak square wave. The effect now keys on overview/pools timestamps only; the sibling endpoints' data is still read snapshot-style on each fire so the threshold-engine fields (`creates_total`, `gate_budget_ex_total`, `coordinator_exhaustions_total`) keep flowing into the per-pool history.
After bringing the docker-compose demo up under pgbench load an
operator review surfaced a stack of usability gaps. This commit
addresses the high-impact ones in one go so the next demo session
sees fewer head-scratchers.
UI:
- AreaChart and DualAxisChart now ship a static colour-swatch
legend with the latest value (Connection breakdown, Top-5 errors,
Wait-vs-oldest). Operators no longer have to guess which band is
which.
- Pools table cells (saturation, waiting, query p95, errors) carry
inline tooltips that name the rule, the warn/crit thresholds, and
the current value.
- The Pools drawer's SQLSTATE breakdown shows the human-readable
PostgreSQL label next to each code (23505 -> unique_violation,
53300 -> too_many_connections (pg_doorman checkout fail), ...)
plus a class-prefix fallback for codes the bundled map does not
know.
- Caches -> Prepared rows are now clickable: a click fetches the
SQL body from /api/prepared/text/{hash} and inlines it under the
row.
- Caches -> Query cache tab gains a Top-20 table fed by
/api/interner/top so the tab is no longer just two summary cards.
- Clients table gains Addr and Wait columns. The data was always
in ClientDto; the UI just was not surfacing it.
- Logs page renames the cryptic "target substring" placeholder to
"module e.g. pool, auth, stats" and explains target = Rust
module path in a tooltip.
Polling:
- usePoll pauses on document.hidden and resumes on
visibilitychange. Browsers throttle background timers and we were
pushing post-throttle history points with sub-second deltas,
which the sparkline rendered as a 0/peak/0/peak square wave or as
a flat line bridging a long pause. Skipping background ticks
avoids both.
Backend:
- The web listener now keeps the SPA shell (HTML, CSS, JS, fonts,
favicon) anonymous when ui_active. Authentication is required on
/api/* and the admin-only routes regardless of ui_anonymous. The
previous behaviour triggered the browser-native basic-auth modal
on a hard refresh of /overview, then the React AuthGate asked
again over fetch. One credential prompt is enough.
added 21 commits
May 6, 2026 21:17
…session pool load
Live demo with the operator surfaced more gaps:
- Logs page filter is now full-text across both target and message
(client-side). Operators searching for `#c235` or a SQLSTATE no
longer have to know that target = Rust module path. Backend
`/api/logs?target=` still works for level pre-filter.
- Clients page gains an `addr` filter input. Backend
`ClientFilters` accepts `?addr=` and matches as a substring against
`ClientStats.addr`, so partial subnets ("10.0.5.") and exact peers
("1.2.3.4:5432") both work.
- Overview history hooks now drop the buffer when the gap since the
last poll exceeds 5 * poll interval. The previous behaviour bridged
long pauses (alt-tab, laptop sleep) with a flat line, falsely
reading as steady-state activity. With this drop the chart restarts
empty after a return.
- Sparkline header value is single-line and truncates with a tooltip
if the formatted value overflows the card width — fixes the
"0.00 qps / 0.00 (newline) tps" wrap on Traffic.
- Overview Traffic card uses compact "q/s · t/s" format so the
number stays on one line.
- Grafana docker-compose demo: new pgbench-session sidecar runs 4
long-lived --select-only clients against the session pool so the
Web UI shows pool_mode = "session" actually busy, not just three
idle LISTENers.
- init.sql sets ALTER DEFAULT PRIVILEGES so session_user has SELECT
on every table app_user creates with pgbench -i.
…rights

`session_user` is a reserved SQL keyword in PostgreSQL — `CREATE USER session_user` errors with "SESSION_USER cannot be used as a role name here", so the demo's session-mode pool never got a working PG account and `app_session` pgbench plus the listener sidecar both bounced on auth-fetch failures.

What changed:
- All four demo files (init.sql, pg_doorman.toml, listener.sh, pgbench-session.sh) use `app_session` instead.
- init.sql grants ALL on the four pgbench tables to app_session when they already exist (cold-start ordering with pgbench.sh's pgbench -i is timing-dependent), and ALTER DEFAULT PRIVILEGES covers future tables.
- pgbench-session.sh adds --no-vacuum so the SELECT-only run does not stall on the unconditional startup vacuum.
Replaces the IBM Plex Sans + cyan SaaS skin with a JetBrains Mono / Bloomberg amber operator-console skin. The previous look had four near-identical near-black layers, soft rounded-md panels, and a muted cyan accent that read as a generic Vercel template.

What changed (foundation only — components inherit through tokens):
- @fontsource/jetbrains-mono added (latin + cyrillic, 400/500/700, woff2 only via the existing post-build trim). Serves as both display and body face — pooler dashboards are 80 % numbers and tabular figures are now the default.
- IBM Plex Sans reduced to a prose-only fallback (long PageHero descriptions, etc).
- Palette: pure black canvas, hairline #1f1f1f borders, paper-white #e8e3d6 text, Bloomberg amber #ffb000 as the only call-out accent, cyan #00d4ff as the chart secondary, pure red #ff4d4d for danger.
- Every --radius-* token collapses to 0, so existing rounded-md / rounded-lg / rounded-xl classes resolve to square edges without touching component JSX.
- Focus ring is a single hairline amber, selection is amber-on-black, scrollbar is narrow monochrome.
- .tick-up / .tick-down keyframes (160 ms amber / cyan flash) ready for the live-tickertape effect on cells that retick each poll cycle. prefers-reduced-motion honoured.
…r-me

OverviewDto gains the operator-tile fields the research identified as essential and free (atomic loads off existing globals): rss_bytes, uptime_seconds, pid, current_clients, clients_in_transactions, shutdown_in_progress, migration_in_progress. STARTED_AT is a LazyLock captured the first time the field is read; for the foreground listener that is within ~hundreds of ms of main(). The mod-private `system` module that owned `get_process_memory_usage()` now becomes `pub(crate)` so collectors can import it.

Frontend touches:
- Bloomberg trim: dropped IBM Plex Sans entirely and JetBrains Mono weight 700 (font-bold is unused). Bundle is now 4 woff2 files at ~150 KB across latin + cyrillic / 400 + 500. Comment in tailwind.css records the audit.
- Overview.tsx history effect keys on the overview timestamp only. Pools/scaling/coord/auth-query polls all fire on independent cadences; including any of them in the dep array let the effect fire mid-interval with `dt ~= 200 ms` and a tiny delta, which the qps/tps sparkline drew as a sawtooth wave dropping to zero between each real overview tick. The sibling polls' data is still read snapshot-style on each fire so threshold-engine fields keep flowing.
- AuthGate gains a "Remember me on this device" checkbox. When checked, AdminAuthProvider persists credentials to localStorage under `pgdoorman.admin-auth` (basic JSON, base64 in the Authorization header is computed at request time). On reload the provider seeds state from storage. Unchecking on a re-prompt clears storage so a shared workstation does not leak. Inputs gain autocomplete=username / current-password so password managers can fill them.
- types.ts mirrors the new OverviewDto fields.
The right-side drawer that lived inside Pools.tsx was 28rem wide and forced a 7-block vertical scroll for sparklines, KV pairs, threshold reasons, and the SQLSTATE breakdown. Operators flagged it as unusable and the comparable patterns in Datadog / Stripe / Lens / GitHub all use a full route — so this view takes that shape.

What changed:
- New route /pools/:poolId backed by frontend/src/pages/PoolDetail.tsx. Layout: identity bar (id, user@db, host:port, mode + state pills), six-tile KPI strip with mini-sparklines (saturation, query p95, waiting, errors/s, oldest active, qps), and full-width sections for Latency, Throughput, Connections, Errors-by-SQLSTATE, and threshold reasons. The SQLSTATE table is no longer truncated to top-5 — the drawer concession that hid the long tail is gone.
- Pools.tsx drops the Drawer + openId state and just navigates the user to the new route on row click. Drawer.tsx itself stays in the tree because Clients and other narrower views still use it.
- The detail page reuses /api/pools (no backend change) and slices to the requested pool from the polled list. History is per-pool, keyed by `pools.detail.<id>` so the strip view's history stays separate.
Operator review surfaced the gap: the existing exporter wrote a single
total_memory gauge in Prometheus and the Web UI didn't even show that —
no CPU, no thread count, no file descriptor headroom, no uptime. So the
operator could not tell whether the pooler was healthy as a process even
when every pool looked normal.
Backend:
- New /api/process route. Linux reads /proc/self/{stat,status,fd,limits}
plus /proc/self/task/<tid>/stat for the per-thread CPU breakdown
(sorted by user+system descending so the hottest tokio worker is at
the top of the list). macOS / others fill in pid + RSS via the
existing get_process_memory_usage and zero out the rest.
- ProcessDto carries pid, hostname, uptime_seconds, started_at_ms,
rss_bytes, vm_size_bytes, threads, fd_open, fd_limit, cumulative
cpu_user_us / cpu_system_us, cpu_cores, and threads_breakdown
(Vec<ProcessThreadDto>). CPU is monotonic microseconds — the
frontend computes the percentage from successive snapshots.
- Three unit tests cover the proc-stat parser (paren-in-comm safety),
the ticks-to-microsecond conversion, and the cross-platform envelope
shape.
Frontend:
- Overview gains a ProcessBar tile strip above the Golden signals row.
Six tiles: cpu (% of 1 core, warns >60×cores, crits >90×cores), rss,
threads, fds (open/limit, warns >70%, crits >90%), uptime, and the
ISO start timestamp. Each tile carries a hint tooltip — for cpu and
threads the hint lists the three hottest tokio workers from the
per-thread breakdown.
- /api/process polled at 3 s (informational, not alerting; /proc reads
are not free at 1.5 s).
- ProcessDto and ProcessThreadDto added to types.ts.
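For context on the paren-in-comm hazard those unit tests target: /proc/self/stat wraps the executable name in parentheses, and that name may itself contain spaces or ')'. Below is a minimal sketch of a safe parse and the tick conversion; the function names, the exact fields pulled, and the error handling are assumptions rather than the shipped parser.

```rust
/// Extract (utime, stime) in clock ticks from a /proc/[pid]/stat line.
fn parse_cpu_ticks(stat_line: &str) -> Option<(u64, u64)> {
    // comm (field 2) is wrapped in parentheses and may contain spaces or ')',
    // so split on the *last* ')' instead of naively whitespace-splitting.
    let rest = &stat_line[stat_line.rfind(')')? + 1..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // After the comm field, element 0 is `state`; utime/stime are kernel
    // fields 14 and 15, i.e. elements 11 and 12 here.
    let utime = fields.get(11)?.parse().ok()?;
    let stime = fields.get(12)?.parse().ok()?;
    Some((utime, stime))
}

/// Convert clock ticks to monotonic microseconds, the unit ProcessDto carries.
fn ticks_to_micros(ticks: u64, ticks_per_second: u64) -> u64 {
    // ticks_per_second is sysconf(_SC_CLK_TCK), typically 100 on Linux.
    ticks * 1_000_000 / ticks_per_second.max(1)
}
```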
The Prometheus exporter has been carrying these three pool-level signals since the Patroni-fallback work shipped, but they never reached /api/pools so the React UI could not show them: an operator looking at the SPA had no way to tell that a pool was running on a Patroni-discovered fallback host or that its TLS handshakes against the backend were failing. Reading them from the existing GaugeVec / IntCounterVec keeps this commit free of new state.

Backend:
- PoolDto adds fallback_active (bool), tls_handshake_errors_total (u64), tls_backend_connections (u64). Read from the existing globals (FALLBACK_ACTIVE, SHOW_SERVER_TLS_HANDSHAKE_ERRORS, SHOW_SERVER_TLS_CONNECTIONS), keyed by the pool's `user@db` id which matches the `pool` label the producers already set.
- Empty / never-touched labels lazily resolve to zero counters via `with_label_values`, which is acceptable: a real pool with no TLS errors yet renders as `0`, not as a missing key.

Frontend:
- types.ts mirrors the new fields.
- PoolDetail page gains a "TLS & fallback" section between Connections and the SQLSTATE breakdown. Three KV rows: fallback active / TLS handshake errors total / TLS backend connections.
The /api/apps endpoint had been live since phase 3d-1 with the per-application_name aggregate of client counters, but no frontend page rendered it. Operators looking for "which app holds the connection spike / generates the error rate" had to grep Clients by application_name substring. This commit adds the dedicated page.

What changed:
- New route /apps backed by frontend/src/pages/Apps.tsx. Sortable table: application_name, clients (live), queries / transactions / errors cumulative, plus a derived "err / 1k q" ratio column tone-mapped amber > 1, red > 10 so a misbehaving app is visible at a glance.
- Filter input is plain substring match against application_name (the "(unknown)" placeholder stays for clients that never set the name).
- Sidebar gains an "Apps" link between Clients and Caches.
- AppRowDto / AppsDto added to types.ts.
The /api/events admin ring buffer existed since phase 3d-4 with a
sequenced timeline of admin commands, but the frontend never consumed
it. Operators investigating a metric spike had no way to correlate it
with a recent RELOAD or PAUSE.
What changed:
- Sparkline gains an optional `events` prop. The draw hook paints a 1-px
amber vertical line at every event timestamp inside the visible window,
next to the existing warn/crit threshold lines. Out-of-window events
are skipped so the line only appears on charts whose rolling history
actually contains the moment of the action.
- Overview polls /api/events at 3 s, maps each entry to {ts, label}
and threads it into the four Golden-signals sparklines. The sync
cursor still works.
- types.ts mirrors EventEntryDto and EventsDto.
- AppsDto / AppRowDto kept in the same block.
Operator feedback: "when I hover the mouse, the data from the chart
should be shown". The sparklines drew threshold lines and event markers
but gave no precise value at the cursor position; the value in the card
header was always the latest sample, never the one under the mouse.
What changed:
- Sparkline registers a `setCursor` hook that pushes the (ts, value)
at the cursor index into local React state. A footer strip below the
canvas renders that pair tabularly: "14:23:17 · 87" while hovering,
and falls back to a passive hint ("hover for point") when the cursor
leaves the canvas.
- `cursor.points` enabled (size 5) so uPlot now renders a small dot at
the cursor index, matching the readout.
- Cursor.sync still binds across the four Golden-signals charts when
`syncKey="overview"` is set, so the readout updates in lockstep on
all four cards as the user sweeps the mouse.
- formatHoverValue picks decimal precision from magnitude so "0.42"
and "12345" both render without the ms-padded look the card header
uses.
Operator review of the process bar surfaced two issues at once: "CPU" flickered "—" on the very first paint while the rate calculation waited for a second snapshot, and the existing tile strip showed only the hottest thread name — operators wanted max / avg / min so an imbalanced tokio runtime (one worker pinned to 100% while the rest idle) is visible at a glance. Uptime + started timestamps were redundant; one is derived from the other.

What changed:
- Per-thread CPU% computed from successive /api/process snapshots, kept in a `useRef` instead of a globalThis stash. Each thread's delta of (cpu_user_us + cpu_system_us) over the poll interval gives a percent of one core, sorted descending.
- ProcessBar now has five tiles: cpu (whole-process), rss, threads (count · max · avg · min %), fds, uptime. The "started" duplicate tile is gone — the uptime tile's hover hint carries the start timestamp.
- The CPU and thread tiles render "sampling…" instead of "—" before the second poll arrives, so the operator sees that the value is pending rather than missing.
- Hot-thread tooltip lists Top-5 by CPU%, name#tid, plus a one-liner about runtime imbalance so the meaning of the number is anchored.
…rator-language descriptions
Four operator complaints in one batch:
1. AreaChart and DualAxisChart had no hover values — only Sparkline did.
Both now register a setCursor hook and render a footer strip with the
per-bucket / left+right values at the cursor index. Stacked AreaChart
reverses the cumulative stack so each label gets its own value, not
the running total.
2. Both chart types also paint the same /api/events vertical-amber-line
annotations Sparkline did. The events are passed through from
Overview's eventsPoll. Connection breakdown, Top-5 errors, and the
Wait-vs-oldest-active dual-axis now all share the same annotation
layer with the four Golden-signals sparklines.
3. PageHero descriptions on Overview / Pools / Clients / Caches / Apps /
Logs / ConfigState rewritten in operator-language. They used to read
like API documentation ("polled at 1.5 s through /api/clients with
server-side filtering"); the new copy explains what the page is for
and what to look at first during an incident.
4. Prepared expand row pretty-prints the SQL — newline before SELECT /
FROM / WHERE / JOIN / AND / OR / GROUP BY / ORDER BY / LIMIT and
friends, two-space indent on AND / OR, monospace box, "copy" button.
A real multi-line query no longer collapses into a single hard-to-
read run-on line.
Two operator complaints stacked on top of each other:
- Heatmap cells used native title="" which delays ~1 s and only shows the saturation percent — no timestamp, no row affordance. The new overlay renders instantly via mouseenter, says "82% · 14s ago" so the operator knows when the sample was taken, and turns the row label into a clickable link to /pools/:id. The colour-key legend (< 70 / 70-89 / >= 90) is now inline in the heatmap header instead of buried in a help popover.
- Overview latency tile rendered "89087 ms" — easy to misread as "89 ms" at a glance during an incident. The new fmtMs walks the units: ms below 1 s, "1.20 s" up to a minute, "1m 29s" up to an hour, "1h 42m" beyond. Operators get a number that scans correctly no matter the magnitude.
… re-renders

Two operator fixes:
- LogTap reaper deactivated the tap after 30 s of no /api/logs traffic. Operator stepping away even briefly came back to a dead tap. Bumped to 120 s; the tap re-arms instantly on the next poll, so the trade-off is a couple of MiB of buffer for a usable Logs page during incident triage.
- ProcessBar showed "sampling…" not just on the first paint but on every re-render between polls — sibling polls (overview / pools / etc) were re-rendering the page, and the ref-based delta calculation rolled the previous snapshot forward to "now" each time so `last.ts === process.ts` evaluated true and the percentage went null. The new layout caches the computed percentage keyed by `process.ts`; intermediate re-renders read the cache and only the actual /api/process tick triggers a fresh delta.
…tignore Dockerfile

- ProcessBar threads tile rendered "max 26 · avg 12 · min 0" — wrapped on any normal viewport. New layout uses a two-line tile: big number "26/0/12" (max/min/avg, one big mono line, fits any tile width) and a small "max/min/avg %" descriptor below, plus the existing tooltip with the full Top-5 breakdown.
- FD limit on modern containers reads 1_073_741_816 — turning the tile into "66/1073741816", which truncated and read as gibberish. Limit is now abbreviated: < 1M renders raw, < 1G renders "Mxx", > 1G renders "∞". The exact cap stays in the hover tooltip.
- INCIDENT_2026-05-03_3.5.2_large_result_leak.md was an internal note that ended up tracked by accident — removed from the repository.
- Dockerfile.ubuntu22-tls is required to build the demo image but does not belong in the public source tree; added to .gitignore so it stops showing up as untracked on every clean checkout.
Operator screenshot showed "TRAFFIC 14360 q/s · 1225..." truncated by the tile's nowrap+truncate, and earlier "89087 ms" misread as 89 ms. Both formatters reworked for a tight cell:
- fmtMs: 87ms / 1.2s / 89s / 1m29s / 1h42m. No space, two characters of unit max. The exact precision lives in the tooltip; the tile just needs to be unambiguous and wide enough at any magnitude.
- fmtRate: same number rules (k / M suffix), no embedded suffix. The caller composes the suffix into the tile label so the value column is free to render two numbers when (qps + tps) is the metric.
- Traffic card: label now reads "Traffic q/s · t/s" and the value shows "14k · 1.2k". Always fits.

All formatting stays in the frontend per the operator note — backend still returns raw u64/f64 so any future scrape consumer reads the unrounded value.
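The unit-walking idea fits in a few lines. The shipping helper is a TypeScript function in the frontend; the Rust-flavoured sketch below only illustrates the tiering, and the exact tier boundaries are assumptions.

```rust
/// Format a millisecond value so it scans correctly at any magnitude
/// (e.g. 89_087.0 renders as "1m29s" rather than the ambiguous "89087 ms").
fn fmt_ms(ms: f64) -> String {
    if ms < 1_000.0 {
        format!("{:.0}ms", ms)
    } else if ms < 60_000.0 {
        format!("{:.1}s", ms / 1_000.0)
    } else if ms < 3_600_000.0 {
        let total_s = (ms / 1_000.0) as u64;
        format!("{}m{}s", total_s / 60, total_s % 60)
    } else {
        let total_m = (ms / 60_000.0) as u64;
        format!("{}h{}m", total_m / 60, total_m % 60)
    }
}
```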
…load}
The Web UI was read-only — operators could see that a pool needed
PAUSE during a deploy or RELOAD after a config change but had to
fall back to ssh + admin SQL to actually do it. This commit closes
that gap with a small POST surface that mirrors the admin protocol's
RELOAD / PAUSE / RESUME / RECONNECT, paired with a confirmation
modal in the Pool detail page so a slip of the cursor cannot drain
a busy pool by accident.
Backend:
- New `crate::admin::operations` module exports `reload_now()`,
`pause_now(db)`, `resume_now(db)`, `reconnect_now(db)`. They are
thin extracts of the bodies of the matching `commands::*` handlers
with no postgres-protocol envelope writing — both transports
converge on the same pool / config mutations and emit identical
`events::push_event` rows so the frontend chart-annotation overlay
paints the moment of the action regardless of origin.
- `pool::get_client_server_map` is now `pub` so `reload_now` can
obtain the cancel-target map without going through `from_config`.
- New `web::routes::admin::handle_admin_action` returns a JSON
envelope `{"ts","action","affected_pools" | "error"}`.
- `web::server::dispatch` admits `POST` for `/api/admin/*` (everything
else still 405's). The pre-screen at the top of `handle_connection`
threads the path into the async admin handler the same way it does
for `/api/logs`.
- `/api/admin/` is added to `ADMIN_ONLY_PREFIXES` so the unauth
challenge is silent (`Accept: application/json`) — the React modal
owns credentials.
Frontend:
- `api.ts` gains an `apiPost` helper with the same auth-header rules
and 401 handling as `apiGet`.
- `PoolDetail` page renders four buttons in the header: pause / resume
/ reconnect (scoped to this pool's database) and reload (global,
type-to-confirm). Each opens a typed-confirm modal that requires the
operator to retype the action keyword before the call goes out.
- Result feedback is rendered as an inline status line; the button is
disabled while the request is in-flight, so a double-click cannot
resubmit a destructive action.
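To make the response envelope and the method gate concrete, a hedged sketch follows; the field names are taken from the commit text, but the types, the helper name, and the routing shape are illustrative rather than the actual src/web code.

```rust
use serde::Serialize;

// Success and error envelopes for /api/admin/*.
#[derive(Serialize)]
struct AdminActionOk {
    ts: u64,
    action: String,
    affected_pools: Vec<String>,
}

#[derive(Serialize)]
struct AdminActionErr {
    ts: u64,
    action: String,
    error: String,
}

/// Hypothetical method gate: POST is admitted only under /api/admin/*;
/// every other non-GET request keeps the existing 405 behaviour.
fn post_allowed(path: &str) -> bool {
    path.starts_with("/api/admin/")
}
```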
Operators kept asking for "a big window with min/max/avg/p99 like in Grafana"
and the small Sparkline cards on Overview were the wrong size for that
read. PanelView is the modal version: large canvas, percentile table
over the visible window, time-range selector, cross-hair readout, event
annotations.
What changed:
- New components/PanelView.tsx (~330 LOC). Discriminated by `kind`:
"line" / "stackedArea" / "dualAxis". The same `data: [xs, ...ys]` shape
Sparkline / AreaChart / DualAxisChart already use, so callers wire the
panel without reshaping anything.
- Time range pill row (1m / 5m / 15m / 1h / all) windows the data
client-side; panel rebuilds uPlot on width change so the canvas grows
with the modal.
- Cursor crosshair pushes (idx, values[]) into React state on every
mouse move. The footer strip shows "14:23:17 · 87 ms" on hover and
falls back to a passive hint when the cursor leaves.
- Summary table: per-series count / min / avg / p50 / p95 / p99 / max
computed by lib/quantile.ts (linear-interp, R "type 7"). No backend
HDR snapshot needed — the values are taken over the visible
windowed series.
- /api/events vertical-amber lines and warn/crit threshold lines paint
via the same uPlot draw hook as the strip view.
Overview wiring:
- Each Golden-signals card is wrapped in a `ChartLink` (cursor:pointer +
Enter activation) that calls `setSearchParams({panel: id})`. Card
titles for AreaChart / DualAxisChart sections are buttons that do the
same.
- A `?panel=<id>` query param survives reloads and is shareable; Esc /
backdrop / ✕ all close it via `setSearchParams` so browser-back also
pops the modal.
- panelDescriptor() builds the per-id Panel config from the same data
the strip view uses, so the modal is always in sync with what the
card showed at click time.
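The summary table's estimator is the standard linear-interpolation quantile (R "type 7"). lib/quantile.ts is TypeScript; the Rust-flavoured sketch below is only an illustration of the math, assuming a pre-sorted window.

```rust
/// Type-7 quantile over a sorted slice: interpolate linearly between the two
/// order statistics that bracket the fractional rank.
fn quantile(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() {
        return None;
    }
    let h = (sorted.len() - 1) as f64 * p.clamp(0.0, 1.0);
    let lo = h.floor() as usize;
    let hi = (lo + 1).min(sorted.len() - 1);
    let frac = h - lo as f64;
    Some(sorted[lo] + frac * (sorted[hi] - sorted[lo]))
}

// e.g. p95 over the visible window: quantile(&window, 0.95)
```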
Two operator-facing improvements:
- /wall renders the same Overview signals as six oversized monocolour tiles with no chrome. Auto-refreshes via the existing usePoll hook; cells flash amber/red on threshold breach, the whole panel grows a red ring while any pool is critical so a wall TV reads the state from across a room. Sidebar nav gains a "War room" link.
- /pools filters and sort are now mirrored into the URL (?q=..., ?severity=critical, ?sort=query_p95_ms, ?dir=asc). Operators can paste a filter into Slack during an incident and the recipient lands on the exact list view; reload preserves it. Defaults are stripped so the URL stays clean when nothing is filtered.
The shipped favicon was a teal "pd" tile from the original cyan skin — clashed with the new Bloomberg-style chrome on every browser tab. New mark: black field, paper-white tabular "pd" monogram in JetBrains Mono, amber tickertape underline that matches the dashboard's accent. Reads at 16x16, 32x32 and 64x64.
Three operator complaints converged into one batch:
- "Wait queue vs oldest active ↗" — operators kept clicking the canvas
to expand and nothing happened; only the title was a button. Card
now wraps the entire body in a role=button when `onTitleClick` is
set, with cursor:pointer and hover border. Click anywhere on the
card opens the matching PanelView.
- Threads tile: avg-thread CPU was diluted by idle workers (jemalloc
background threads sitting at 0% pulled the average down). The tile
is now clickable and opens a per-thread time-series panel: one line
per thread that ever exceeded 1% in the rolling window, sorted by
peak descending, threshold lines at 60%/90% of one core. Idle
threads are filtered out so the imbalance signal is visible.
- RSS tile: clickable; opens an interim memory panel with RSS and VM
size over time. Full memory breakdown (jemalloc allocated /
fragmentation, cgroup current/max, internal cache sums) is the
next backend endpoint — research finished and queued.
Implementation:
- Overview accumulates a rolling 240-point process snapshot history in
a `useRef`; each new /api/process snap computes per-thread deltas
against the previous one and pushes a row into `threadHistoryRef`
keyed by tid (NaN for threads that vanished). PanelView reads the
history through panelDescriptor("threads" / "rss").
- ProcStat / ProcStatTwoLine gain an optional onClick prop that turns
the tile into a clickable region with the same hover ring as the
charts.
- panelDescriptor signature gets the threadHistory and processSnapshots
arguments so the new panel kinds can be served from the same
switch-statement as the existing data.
Summary
Full Web UI rollout for pg_doorman on feat/web-ui. The pooler binary now exposes a multiplexed listener with admin auth, 22 read-only JSON endpoints, an in-memory log tap, and a React SPA embedded into the binary. Operators get a self-contained UI on :9127 without extra services. Subsequent commits harden the build pipeline, fill in the missing §15.4 threshold rules, add per-pool error breakdown by SQLSTATE, and turn on the Web UI in the Grafana docker-compose demo.