Commit c93b1dd

Author: t00939662
Commit message: delete offline metrics part
1 parent cd511b3

1 file changed: 14 additions, 124 deletions
````diff
@@ -1,21 +1,12 @@
-# UCM Observability
+# Observability with Prometheus
 
-UCM (Unified Cache Management) provides comprehensive observability features to monitor cache performance and behavior. This document describes two complementary monitoring approaches:
-
-1. **Prometheus Metrics**: Real-time metrics exposed via Prometheus endpoints for live monitoring and visualization
-2. **Operation Logging**: File-based operation logs for offline analysis, debugging, and auditing
-
-Both features can be used independently or together, depending on your monitoring needs.
+UCM (Unified Cache Management) provides detailed metrics monitoring through Prometheus endpoints, allowing in-depth monitoring of cache performance and behavior. This document describes how to enable and configure observability from the embedded vLLM `/metrics` API endpoint.
 
 ---
 
-## Part 1: Prometheus Metrics
-
-Prometheus metrics provide real-time monitoring of UCM operations through the embedded vLLM `/metrics` API endpoint. This approach is ideal for live dashboards, alerting, and performance monitoring.
+## Quick Start Guide
 
-### Quick Start Guide
-
-#### 1) On UCM Side
+### 1) On UCM Side
 
 First, set the `PROMETHEUS_MULTIPROC_DIR` environment variable.
 
````
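The environment-variable step kept by this hunk can be sketched as shell commands; the directory path and the commented launch invocation are assumptions for illustration, not taken from this commit:

```shell
# Scratch directory for prometheus_client's multiprocess mode (assumed path;
# any directory writable by the vLLM workers will do).
export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc

# The directory must exist and should be empty before the server starts,
# so stale per-process .db files don't pollute the aggregated metrics.
rm -rf "$PROMETHEUS_MULTIPROC_DIR"
mkdir -p "$PROMETHEUS_MULTIPROC_DIR"

# Then start the vLLM service as usual, e.g. (hypothetical invocation):
# vllm serve <model> --port 8000
```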
````diff
@@ -78,9 +69,9 @@ curl http://$<vllm-worker-ip>:8000/metrics | grep ucm:
 
 You will also find some `.db` files in the `$PROMETHEUS_MULTIPROC_DIR` directory, which are temporary files used by Prometheus.
 
-#### 2) Start Prometheus and Grafana with Docker Compose
+### 2) Start Prometheus and Grafana with Docker Compose
 
-##### Create Docker Compose Configuration Files
+#### Create Docker Compose Configuration Files
 
 First, create the `docker-compose.yaml` file:
 
````
````diff
@@ -123,7 +114,7 @@ scrape_configs:
 
 **Note**: Make sure the port number in `prometheus.yaml` matches the port number used when starting the vLLM service.
 
-##### Start Services
+#### Start Services
 
 Run the following command in the directory containing `docker-compose.yaml` and `prometheus.yaml`:
 
````
````diff
@@ -133,21 +124,21 @@ docker compose up
 
 This will start Prometheus and Grafana services.
 
-#### 3) Configure Grafana Dashboard
+### 3) Configure Grafana Dashboard
 
-##### Access Grafana
+#### Access Grafana
 
 Navigate to `http://<your-host>:3000`. Log in with the default username (`admin`) and password (`admin`). You will be prompted to change the password on first login.
 
-##### Add Prometheus Data Source
+#### Add Prometheus Data Source
 
 1. Navigate to `http://<your-host>:3000/connections/datasources/new` and select **Prometheus**.
 
 2. On the Prometheus configuration page, add the Prometheus server URL in the **Connection** section. For this Docker Compose setup, Grafana and Prometheus run in separate containers, but Docker creates DNS names for each container. You can directly use `http://prometheus:9090`.
 
 3. Click **Save & Test**. You should see a green checkmark showing "Successfully queried the Prometheus API."
 
-##### Import Dashboard
+#### Import Dashboard
 
 1. Navigate to `http://<your-host>:3000/dashboard/import`.
 
````
````diff
@@ -159,7 +150,7 @@ Navigate to `http://<your-host>:3000`. Log in with the default username (`admin`
 
 You should now be able to see the UCM monitoring dashboard with real-time visualization of all 9 metrics.
 
-### Available Metrics
+## Available Metrics
 
 UCM exposes various metrics to monitor its performance. The following table lists all available metrics organized by category:
 
````
````diff
@@ -178,7 +169,7 @@ UCM exposes various metrics to monitor its performance. The following table list
 | **Lookup Hit Rate Metrics** | | |
 | `ucm:interval_lookup_hit_rates` | Histogram | Hit rate of UCM lookup requests |
 
-### Prometheus Configuration
+## Prometheus Configuration
 
 Metrics configuration is defined in the `ucm/metrics/metrics_configs.yaml` file:
 
````
````diff
@@ -201,105 +192,4 @@ prometheus:
 # ... other metric configurations
 ```
 
----
-
-## Part 2: Operation Logging
-
-In addition to Prometheus metrics, UCM provides a file-based operation logging feature that records detailed operation data (load and dump operations) to log files. This feature is useful for offline analysis, debugging, and auditing.
-
-
-### Quick Start Guide
-
-#### 1) Enable Operation Logging
-
-1. Create or modify the metrics configuration file (`ucm/metrics/metrics_configs.yaml`).
-
-2. Start the UCM service. If the configuration has `enabled: True`, operation logging will be automatically enabled.
-
-#### 2) View Log Files
-
-Log files are written to the directory specified by `log_dir` in the configuration file:
-
-```bash
-# List log files
-ls -lh /vllm-workspace/ucm_logs/
-
-# View active log file
-tail -f /vllm-workspace/ucm_logs/ucm_operation.log
-
-# View compressed log file
-zcat /vllm-workspace/ucm_logs/ucm_operation.log.gz | head -20
-```
-
-#### 3) Analyze Log Data
-
-Since log files are in JSON Lines format (one JSON object per line), you can easily analyze them:
-
-```bash
-# Count load operations
-grep '"op_type":"load"' /vllm-workspace/ucm_logs/ucm_operation.log | wc -l
-
-# Count dump operations
-grep '"op_type":"dump"' /vllm-workspace/ucm_logs/ucm_operation.log | wc -l
-
-# Extract all block IDs from load operations
-grep '"op_type":"load"' /vllm-workspace/ucm_logs/ucm_operation.log | jq -r '.blocks[]'
-
-# Count unique blocks
-grep '"op_type":"load"' /vllm-workspace/ucm_logs/ucm_operation.log | jq -r '.blocks[]' | sort -u | wc -l
-```
````
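The removed `grep`/`jq` commands above amount to counting and extracting fields from JSON Lines records. The same analysis can be sketched in standalone Python; the log path and the `op_type`/`blocks` fields come from the removed docs, while the helper itself is hypothetical:

```python
import json
from collections import Counter

def summarize_ops(lines):
    """Count operations per op_type and collect the unique block IDs
    seen in load operations, given UCM operation-log lines
    (one JSON object per line)."""
    counts = Counter()
    load_blocks = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record["op_type"]] += 1
        if record["op_type"] == "load":
            load_blocks.update(record.get("blocks", []))
    return counts, load_blocks

# Example with the log file from the removed docs:
# with open("/vllm-workspace/ucm_logs/ucm_operation.log") as f:
#     counts, blocks = summarize_ops(f)
```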
````diff
-
-### Configuration Parameters
-
-The operation logging feature is configured in the `operation_db` section of the metrics configuration file. You can use `ucm/metrics/metrics_configs.yaml` or create a separate configuration file.
-
-| Parameter | Default Value | Description |
-|-----------|---------------|-------------|
-| `enabled` | False | Enable/disable operation logging |
-| `log_dir` | `/vllm-workspace/ucm_logs` | Directory where log files are stored |
-| `log_name` | `ucm_operation` | Base name for log files |
-| `max_file_size` | 104857600 | Maximum size of a single log file in bytes (100MB). When exceeded, file rotation occurs |
-| `batch_size` | 100 | Number of log entries to batch before writing to disk |
-| `flush_interval` | 5.0 | Time interval (seconds) to force flush buffered logs |
-| `encoding` | `utf-8` | File encoding |
-| `compress_rotated` | True | Whether to compress rotated log files using gzip |
-| `compress_level` | 6 | Gzip compression level (1-9, where 1 is fastest and 9 is smallest) |
-| `max_log_files` | 30 | Maximum number of log files to retain (including compressed files) |
-| `max_log_days` | 7 | Maximum number of days to retain logs |
-| `max_log_total_size` | 1073741824 | Maximum total size of all log files in bytes (1GB) |
-
-### Example Configuration
-
-Add the following section to your `ucm/metrics/metrics_configs.yaml` file:
-
-```yaml
-operation_db:
-  enabled: True
-  log_dir: "/vllm-workspace/ucm_logs"
-  log_name: "ucm_operation"
-  max_file_size: 104857600
-  batch_size: 100
-  flush_interval: 5.0
-  encoding: "utf-8"
-  compress_rotated: True
-  compress_level: 6
-  max_log_files: 30
-  max_log_days: 7
-  max_log_total_size: 1073741824
-```
-
----
-
-## Comparison: Prometheus Metrics vs. Operation Logging
-
-| Feature | Prometheus Metrics | Operation Logging |
-|---------|-------------------|-------------------|
-| **Purpose** | Real-time monitoring and alerting | Offline analysis and debugging |
-| **Data Format** | Time-series metrics | JSON Lines (detailed operation records) |
-| **Storage** | Prometheus time-series database | File system (compressed logs) |
-| **Retention** | Configurable in Prometheus | Configurable (file count, days, total size) |
-| **Query Interface** | PromQL queries, Grafana dashboards | File-based analysis (grep, jq, etc.) |
-| **Performance Impact** | Minimal (async metrics collection) | Minimal (async file writes) |
-| **Use Cases** | Live dashboards, alerting, performance monitoring | Debugging, audit trails, detailed analysis |
-
-Both features can be enabled simultaneously. Prometheus metrics are ideal for real-time monitoring, while operation logs provide detailed historical records for in-depth analysis.
+---
````
