Skip to content

feat(telemetry): Add Prometheus metrics endpoint for node monitoring#552

Open
dvansari65 wants to merge 3 commits intosolana-foundation:mainfrom
dvansari65:feat/prometheus-metrics
Open

feat(telemetry): Add Prometheus metrics endpoint for node monitoring#552
dvansari65 wants to merge 3 commits intosolana-foundation:mainfrom
dvansari65:feat/prometheus-metrics

Conversation

@dvansari65
Copy link
Contributor

--Add Prometheus metrics endpoint at /metrics (default: 0.0.0.0:9000)
--Track key node metrics: slot, epoch, transaction counts, WebSocket subscriptions
--Record metrics on every transaction via event-driven architecture
--Include feature flag prometheus with zero overhead when disabled
--Add CLI flag --metrics-enabled for easy activation
--Expose metrics: surfpool_slot, surfpool_epoch, surfpool_slot_index, surfpool_transactions_count, surfpool_transactions_processed_total, surfpool_uptime_seconds, surfpool_ws_subscriptions_total, surfpool_ws_signature_subscriptions, surfpool_ws_account_subscriptions, surfpool_ws_slot_subscriptions, surfpool_ws_logs_subscriptions
--Enable real-time monitoring and Prometheus integration for production observability

Copy link
Collaborator

@MicaiahReid MicaiahReid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dvansari65! I can tell a lot of work went into this!

I've got some minor comments now, and a general architectural question. In the coming days I'll test and likely have some more questions/comments.

But for now: it seems like you went with a "half-way" feature-gated approach. We always initialize the prometheus mod, but if the feature isn't enabled it's a no-op/warn. We always receive the SimnetEvent::MetricsData event, but if the feature isn't enabled it's a no-op again.

Why not just fully feature gate? The event doesn't exist/isn't received/isn't emitted, the telemetry mod isn't imported, and the CLI flags aren't there at all, unless the feature is enabled. Let me know if you've got thoughts, and thanks again!

Comment on lines +148 to +149
// Create prometheus exporter using the new 0.28 API
// In 0.28, opentelemetry-prometheus uses a different approach
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Create prometheus exporter using the new 0.28 API
// In 0.28, opentelemetry-prometheus uses a different approach

Doesn't seem like a helpful comment :)

}
};

// Build resource using 0.28 API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Build resource using 0.28 API

.with_attributes(vec![KeyValue::new("service.name", service_name_owned)])
.build();

// Build meter provider using 0.28 API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Build meter provider using 0.28 API

Comment on lines +175 to +183
if METER_PROVIDER.set(provider).is_err() {
result = Err("Meter provider already initialized".into());
return;
}

if METRICS.set(metrics).is_err() {
result = Err("Metrics already initialized".into());
return;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these the only reasons for an error here? Why .is_err rather than if let Err(e) ... and returning the actual error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it

rt.block_on(async {
let registry_clone = registry.clone();

// Build axum 0.8 router with new path syntax
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Build axum 0.8 router with new path syntax

@dvansari65
Copy link
Contributor Author

dvansari65 commented Mar 5, 2026

@MicaiahReid means , when feature flag is disabled then the code of my implementation should not exist at all , am I right?

@dvansari65
Copy link
Contributor Author

just tell me end to end workflow , I will try to add!

…lemetryConfig, CLI flags, and telemetry mod behind prometheus feature
@dvansari65 dvansari65 requested a review from MicaiahReid March 6, 2026 05:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants