Addition of metrics to track when self-healing started/ended#2268
Addition of metrics to track when self-healing started/ended#2268ShubhamRwt wants to merge 3 commits intolinkedin:mainfrom
Conversation
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
tomncooper
left a comment
There was a problem hiding this comment.
A couple of nits.
Can you give an example output of these metrics? I'm not that familiar with dropwizard metrics. Does the dropwizard counter emit a timestamp when something is added to the internal map? I was curious how this gives us the time and type of anomaly healing being triggered.
Also, can you fix the typos in the PR description?
| MetricRegistry.name(ANOMALY_DETECTOR_SENSOR, String.format("%s-detect-to-fix-complete-timer", anomalyType.toString().toLowerCase()))); | ||
| _anomalyDetectToFixCompleteTimer.put(anomalyType, timer); | ||
|
|
||
| Counter counter = dropwizardMetricRegistry.counter( |
There was a problem hiding this comment.
If these counter and counter1 variables are not actually used anywhere other than the map.put call, can you just in-line them?
Same with timer variable above?
| _ongoingAnomalyDurationSumForAverageMs = 0; | ||
| _numSelfHealingStarted = new AtomicLong(0L); | ||
| _numSelfHealingFailedToStart = new AtomicLong(0L); | ||
| _numSelfHealingStartedPerAnomaly = new HashMap<>(); |
There was a problem hiding this comment.
Do you need to do anything in the refreshMetrics method to reset these maps? There seems to be a numSelfHealingStarted() call in there as part of creating a new AnomalyMetrics instance.
I don't know if that is relevant but you should check if these maps need to be reset.
|
Looks like you have CI failures in the |
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
|
I had a look but it looks to me we are adding kind of "historical" anomaly metrics, so for example how many anomalies (per type) started and ended. But what about what's going on in a specific time? How many anomaly fixes are actually running and on which detected anomalies? |
|
To know if self healing still running, we can have the sum(self healing started) minus sum(self healing ended) that can tell if the self healing is still in progress for a particular anomaly. |
|
@CCisGG Hi, If you get some time, can you have a look at this. It would be great to know your views on this. Thankyou |
|
I'm pretty out of bandwidth recently. @mhratson @allenxwang could you please help to take a look? |
Summary
New metrics:
Categorization
This PR resolves #2266.