6 changes: 3 additions & 3 deletions docs/source/getting-started/installation_npu.md
@@ -6,7 +6,7 @@ This document describes how to install unified-cache-management when using Ascend
- Python: >= 3.9, < 3.12
- Hardware with an Ascend NPU, usually the Atlas 800 A2 series.

The current version of unified-cache-management is based on vLLM-Ascend v0.9.2rc1; refer to [vLLM-Ascend Installation Requirements](https://vllm-ascend.readthedocs.io/en/latest/installation.html#requirements) to meet the requirements.
The current version of unified-cache-management is based on vLLM-Ascend v0.11.0rc1 and v0.9.1; refer to [vLLM-Ascend Installation Requirements](https://vllm-ascend.readthedocs.io/en/latest/installation.html#requirements) to meet the requirements.

You have 2 ways to install for now:
- Set up from source code: first prepare the vLLM-Ascend environment, then install unified-cache-management from source.
@@ -17,14 +17,14 @@ You have 2 ways to install for now:
### Prepare vLLM-Ascend Environment
For the sake of environment isolation and simplicity, we recommend preparing the vLLM-Ascend environment by pulling the official, pre-built vLLM-Ascend Docker image.
```bash
docker pull quay.io/ascend/vllm-ascend:v0.9.2rc1
docker pull quay.io/ascend/vllm-ascend:v0.9.1
```
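If you want to confirm which tag is now present locally, a quick optional check (plain Docker CLI, repository name taken from the command above):
```bash
# List locally available vllm-ascend images and their tags
docker images quay.io/ascend/vllm-ascend
```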
Use the following command to run your own container:
```bash
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.2rc1
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1
# Review comment (@yuanzhg078, Dec 4, 2025): Confirm if it is this vllm-ascend version?
docker run --rm \
--name vllm-ascend-env \
--device $DEVICE \
50 changes: 40 additions & 10 deletions docs/source/getting-started/quick_start.md
@@ -14,7 +14,9 @@
- NPU: Atlas 800 A2/A3 series
- CANN: CANN Version 8.1.RC1
- vLLM: v0.9.2
- vLLM Ascend: v0.9.2rc1
- vLLM Ascend: v0.9.1/0.9.2rc1

Note: If you are using Prefix Cache, please choose the vLLM Ascend v0.9.1 version; if you are using Sparse Attention, please choose the vLLM Ascend v0.9.2rc1 version.

## Installation
Before you start with UCM, please make sure that you have installed UCM correctly by following the [GPU Installation](./installation_gpu.md) guide or [NPU Installation](./installation_npu.md) guide.
@@ -54,19 +56,47 @@ python offline_inference.py

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

First, set the Python hash seed so that hash-based cache keys stay consistent across processes:
```bash
export PYTHONHASHSEED=123456
```


Create a config YAML like the following and save it to your own directory:
```yaml
# UCM Configuration File Example
# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
ucm_connector_name: "UcmNfsStore"

ucm_connector_config:
  storage_backends: "/mnt/test"
#
# This file demonstrates how to configure UCM using YAML.
# You can use this config file by setting its path in kv_connector_extra_config in the launch script or on the command line, like this:
# kv_connector_extra_config={"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
#
# Alternatively, you can still use kv_connector_extra_config in KVTransferConfig
# for backward compatibility.

# Connector name (e.g., "UcmNfsStore")
ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/mnt/test"
      use_direct: false

load_only_first_rank: false

# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"

# Sparse attention configuration
# Format 1: Dictionary format (for methods like ESA, KvComp)
# ucm_sparse_config:
#   ESA:
#     init_window_sz: 1
#     local_window_sz: 2
#     min_blocks: 4
#     sparse_ratio: 0.3
#     retrieval_stride: 5
# Or for GSA:
#   GSA: {}


# Whether to use layerwise loading/saving (optional, default: True for UnifiedCacheConnectorV1)
# use_layerwise: true
# hit_ratio: 0.9
```

Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model and your config file path:
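The full launch command is collapsed in this diff. As a minimal sketch — assuming the connector is registered as `UnifiedCacheConnectorV1` (the name used in the config comments above), a `kv_both` role, port 7800, and a config file saved at `/workspace/ucm_config.yaml`; all four are assumptions to adjust for your deployment — it might look like:
```bash
# Sketch only: connector name, role, port, and config path are assumptions.
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --port 7800 \
  --kv-transfer-config \
  '{"kv_connector": "UnifiedCacheConnectorV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/ucm_config.yaml"}}'
```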
48 changes: 39 additions & 9 deletions docs/source/user-guide/prefix-cache/nfs_store.md
@@ -90,12 +90,44 @@ To use the NFS connector, you need to configure the `connector_config` dictionary
Create a config YAML like the following and save it to your own directory:
```yaml
# UCM Configuration File Example
# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
ucm_connector_name: "UcmNfsStore"
#
# This file demonstrates how to configure UCM using YAML.
# You can use this config file by setting its path in kv_connector_extra_config in the launch script or on the command line, like this:
# kv_connector_extra_config={"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
#
# Alternatively, you can still use kv_connector_extra_config in KVTransferConfig
# for backward compatibility.

# Connector name (e.g., "UcmNfsStore")
ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/mnt/test"
      use_direct: false

load_only_first_rank: false

# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"

# Sparse attention configuration
# Format 1: Dictionary format (for methods like ESA, KvComp)
# ucm_sparse_config:
#   ESA:
#     init_window_sz: 1
#     local_window_sz: 2
#     min_blocks: 4
#     sparse_ratio: 0.3
#     retrieval_stride: 5
# Or for GSA:
#   GSA: {}


# Whether to use layerwise loading/saving (optional, default: True for UnifiedCacheConnectorV1)
# use_layerwise: true
# hit_ratio: 0.9


ucm_connector_config:
  storage_backends: "/mnt/test"
  transferStreamNumber: 32
```
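Before launching, it is worth checking that the directory configured in `storage_backends` exists and is writable. A minimal sketch, assuming the `/mnt/test` path from the example config above:
```bash
# Verify the NFS-backed storage path (path taken from the example config; substitute your own)
mkdir -p /mnt/test
touch /mnt/test/.ucm_write_test && rm /mnt/test/.ucm_write_test && echo "storage_backends path is writable"
```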

## Launching Inference
@@ -116,7 +148,6 @@ Then run the script as follows:

```bash
cd examples/
export PYTHONHASHSEED=123456
python offline_inference.py
```

@@ -166,10 +197,9 @@ curl http://localhost:7800/v1/completions \
```
To quickly experience the NFS Connector's effect:

1. Start the service with:
`--no-enable-prefix-caching`
1. Start the service with: `--no-enable-prefix-caching`
2. Send the same request (exceeding 128 tokens) twice consecutively; see the sketch after this list
3. Remember to enable prefix caching (do not add `--no-enable-prefix-caching`) in production environments.
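
As an illustration of step 2, a sketch of the repeated request is shown below; the port matches the curl example above, while the model name and prompt are placeholders to replace with whatever the server was started with:
```bash
# Send an identical long prompt twice; the second run should hit the NFS-backed cache
for i in 1 2; do
  curl http://localhost:7800/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2.5-14B-Instruct",
          "prompt": "<same prompt of more than 128 tokens>",
          "max_tokens": 64
        }'
done
```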

### Log Message Structure
```text
[UCMNFSSTORE] [I] Task(<task_id>,<direction>,<task_count>,<size>) finished, elapsed <time>s
```
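For illustration only, a hypothetical line in this format might read:
```text
[UCMNFSSTORE] [I] Task(140235671,dump,16,33554432) finished, elapsed 0.042s
```
Here the task id, direction, task count, size, and elapsed time are invented values purely to show the field layout.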