feat(telemetry): Add Prometheus metrics endpoint for node monitoring #552
dvansari65 wants to merge 3 commits into solana-foundation:main
MicaiahReid
left a comment
Thanks @dvansari65! I can tell a lot of work went into this!
I've got some minor comments now, and a general architectural question. In the coming days I'll test and likely have some more questions/comments.
But for now: it seems like you went with a "half-way" feature-gated approach. We always initialize the prometheus mod, but if the feature isn't enabled it's a no-op/warn. We always receive the SimnetEvent::MetricsData event, but if the feature isn't enabled it's a no-op again.
Why not just fully feature gate? The event doesn't exist/isn't received/isn't emitted, the telemetry mod isn't imported, and the CLI flags aren't there at all, unless the feature is enabled. Let me know if you've got thoughts, and thanks again!
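For illustration, the "fully feature-gated" approach described above might look roughly like the sketch below. Only the `SimnetEvent::MetricsData` name and the `prometheus` feature name come from this PR; the `u64` payload, the `Ready` variant, the `telemetry::record` helper, and the `handle` function are hypothetical stand-ins, not the project's actual code.

```rust
// Hypothetical, trimmed-down event enum. With the `prometheus` feature off,
// the MetricsData variant does not exist, so nothing can emit or receive it.
#[derive(Debug)]
pub enum SimnetEvent {
    Ready,
    #[cfg(feature = "prometheus")]
    MetricsData(u64), // assumed payload: the current slot
}

// The telemetry module is compiled only when the feature is enabled.
#[cfg(feature = "prometheus")]
mod telemetry {
    pub fn record(slot: u64) {
        println!("surfpool_slot {slot}");
    }
}

// Hypothetical event handler: the metrics arm is gated together with the variant,
// so the match stays exhaustive in both configurations.
pub fn handle(event: &SimnetEvent) -> &'static str {
    match event {
        SimnetEvent::Ready => "ready",
        #[cfg(feature = "prometheus")]
        SimnetEvent::MetricsData(slot) => {
            telemetry::record(*slot);
            "metrics"
        }
    }
}
```

The same `#[cfg(feature = "prometheus")]` attribute can be placed on CLI argument fields, so the flags are absent from builds without the feature instead of being accepted and then ignored with a warning.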
crates/core/src/telemetry.rs
Outdated
// Create prometheus exporter using the new 0.28 API
// In 0.28, opentelemetry-prometheus uses a different approach
Suggested change: remove these two comment lines.
Doesn't seem like a helpful comment :)
crates/core/src/telemetry.rs
Outdated
    }
};

// Build resource using 0.28 API
Suggested change: remove `// Build resource using 0.28 API`.
crates/core/src/telemetry.rs
Outdated
    .with_attributes(vec![KeyValue::new("service.name", service_name_owned)])
    .build();

// Build meter provider using 0.28 API
Suggested change: remove `// Build meter provider using 0.28 API`.
crates/core/src/telemetry.rs
Outdated
if METER_PROVIDER.set(provider).is_err() {
    result = Err("Meter provider already initialized".into());
    return;
}

if METRICS.set(metrics).is_err() {
    result = Err("Metrics already initialized".into());
    return;
}
Are these the only reasons for an error here? Why .is_err rather than if let Err(e) ... and returning the actual error?
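As a point of reference for this question, here is a minimal sketch that assumes the statics are `std::sync::OnceLock` cells (the diff does not show their type; `once_cell::sync::OnceCell::set` has the same `Result<(), T>` shape). The `String` payload and the `init_metrics` function are hypothetical stand-ins for the PR's types.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the PR's METRICS static; assumed to be a OnceLock.
static METRICS: OnceLock<String> = OnceLock::new();

// Hypothetical initializer that propagates the failure instead of discarding it.
pub fn init_metrics(value: String) -> Result<(), String> {
    // OnceLock::set returns Err(value) only when the cell was already set.
    // The Err payload is the rejected value itself, not an error type.
    if let Err(rejected) = METRICS.set(value) {
        return Err(format!(
            "Metrics already initialized (rejected value: {rejected})"
        ));
    }
    Ok(())
}
```

Under that assumption, "already initialized" is the only failure mode of `set`, and `if let Err(e)` would hand back the value that could not be stored. That is useful if the caller wants to log it or shut it down cleanly, but it adds no extra diagnostic detail.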
crates/core/src/telemetry.rs
Outdated
rt.block_on(async {
    let registry_clone = registry.clone();

    // Build axum 0.8 router with new path syntax
Suggested change: remove `// Build axum 0.8 router with new path syntax`.
@MicaiahReid So you mean that when the feature flag is disabled, the code of my implementation should not exist at all, am I right?
Just tell me the end-to-end workflow and I will try to add it!
…lemetryConfig, CLI flags, and telemetry mod behind prometheus feature
--Add Prometheus metrics endpoint at /metrics (default: 0.0.0.0:9000)
--Track key node metrics: slot, epoch, transaction counts, WebSocket subscriptions
--Record metrics on every transaction via event-driven architecture
--Include feature flag prometheus with zero overhead when disabled
--Add CLI flag --metrics-enabled for easy activation
--Expose metrics: surfpool_slot, surfpool_epoch, surfpool_slot_index, surfpool_transactions_count, surfpool_transactions_processed_total, surfpool_uptime_seconds, surfpool_ws_subscriptions_total, surfpool_ws_signature_subscriptions, surfpool_ws_account_subscriptions, surfpool_ws_slot_subscriptions, surfpool_ws_logs_subscriptions
--Enable real-time monitoring and Prometheus integration for production observability
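To make the shape of the `/metrics` output concrete, the following standard-library-only sketch renders three of the listed metrics in the Prometheus text exposition format. It is not the PR's implementation, which, judging by the review snippets, goes through `opentelemetry-prometheus`. The `NodeSnapshot` struct, the `render_metrics` function, the help strings, and the gauge/counter type assignments are assumptions made for illustration.

```rust
/// Hypothetical snapshot of a few node values; field names are illustrative.
pub struct NodeSnapshot {
    pub slot: u64,
    pub epoch: u64,
    pub transactions_processed_total: u64,
}

/// Render the snapshot in the Prometheus text exposition format:
/// a `# HELP` line, a `# TYPE` line, then `name value` for each metric.
pub fn render_metrics(s: &NodeSnapshot) -> String {
    let mut out = String::new();
    // Help texts and metric types below are assumed, not taken from the PR.
    for (name, kind, help, value) in [
        ("surfpool_slot", "gauge", "Current slot", s.slot),
        ("surfpool_epoch", "gauge", "Current epoch", s.epoch),
        (
            "surfpool_transactions_processed_total",
            "counter",
            "Transactions processed since startup",
            s.transactions_processed_total,
        ),
    ] {
        out.push_str(&format!(
            "# HELP {name} {help}\n# TYPE {name} {kind}\n{name} {value}\n"
        ));
    }
    out
}
```

A Prometheus server scraping the endpoint would parse each `name value` line as one sample; the `_total` suffix follows the usual naming convention for counters.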