6 changes: 3 additions & 3 deletions docs/source/getting-started/installation_npu.md
@@ -6,7 +6,7 @@ This document describes how to install unified-cache-management when using Ascend
- Python: >= 3.9, < 3.12
- Hardware with an Ascend NPU, usually the Atlas 800 A2 series.

The current version of unified-cache-management is based on vLLM-Ascend v0.9.2rc1; refer to [vLLM-Ascend Installation Requirements](https://vllm-ascend.readthedocs.io/en/latest/installation.html#requirements) to meet the requirements.
The current version of unified-cache-management is based on vLLM-Ascend v0.11.0rc1 and v0.9.1; refer to [vLLM-Ascend Installation Requirements](https://vllm-ascend.readthedocs.io/en/latest/installation.html#requirements) to meet the requirements.

You have 2 ways to install for now:
- Set up from source code: first prepare the vLLM-Ascend environment, then install unified-cache-management from source.
@@ -17,14 +17,14 @@ You have 2 ways to install for now:
### Prepare vLLM-Ascend Environment
For the sake of environment isolation and simplicity, we recommend preparing the vLLM-Ascend environment by pulling the official, pre-built vLLM-Ascend Docker image.
```bash
docker pull quay.io/ascend/vllm-ascend:v0.9.2rc1
docker pull quay.io/ascend/vllm-ascend:v0.9.1
```
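If you want to confirm which tag is now present locally, a quick optional check (plain Docker CLI, repository name taken from the command above):
```bash
# List locally available vllm-ascend images and their tags
docker images quay.io/ascend/vllm-ascend
```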
Use the following command to run your own container:
```bash
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.2rc1
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1
# Review comment (@yuanzhg078, Dec 4, 2025): Confirm if it is this vllm-ascend version?
docker run --rm \
--name vllm-ascend-env \
--device $DEVICE \
50 changes: 40 additions & 10 deletions docs/source/getting-started/quick_start.md
@@ -14,7 +14,9 @@
- NPU: Atlas 800 A2/A3 series
- CANN: CANN Version 8.1.RC1
- vLLM: v0.9.2
- vLLM Ascend: v0.9.2rc1
- vLLM Ascend: v0.9.1/0.9.2rc1

Note: If you are using Prefix Cache, please choose the vLLM Ascend v0.9.1 version; if you are using Sparse Attention, please choose the vLLM Ascend v0.9.2rc1 version.

## Installation
Before you start with UCM, please make sure that you have installed UCM correctly by following the [GPU Installation](./installation_gpu.md) guide or [NPU Installation](./installation_npu.md) guide.
@@ -54,19 +56,47 @@ python offline_inference.py

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

First, set the Python hash seed so that hash-based cache keys stay consistent across processes:
```bash
export PYTHONHASHSEED=123456
```


Create a config YAML like the following and save it to your own directory:
```yaml
# UCM Configuration File Example
# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
ucm_connector_name: "UcmNfsStore"

ucm_connector_config:
  storage_backends: "/mnt/test"
#
# This file demonstrates how to configure UCM using YAML.
# You can use this config file by setting its path in kv_connector_extra_config in the launch script or on the command line, like this:
# kv_connector_extra_config={"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
#
# Alternatively, you can still use kv_connector_extra_config in KVTransferConfig
# for backward compatibility.

# Connector name (e.g., "UcmNfsStore")
ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/mnt/test"
      use_direct: false

load_only_first_rank: false

# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"

# Sparse attention configuration
# Format 1: Dictionary format (for methods like ESA, KvComp)
# ucm_sparse_config:
#   ESA:
#     init_window_sz: 1
#     local_window_sz: 2
#     min_blocks: 4
#     sparse_ratio: 0.3
#     retrieval_stride: 5
# Or for GSA:
#   GSA: {}


# Whether to use layerwise loading/saving (optional, default: True for UnifiedCacheConnectorV1)
# use_layerwise: true
# hit_ratio: 0.9
```

Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model and your config file path:
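The full launch command is collapsed in this diff. As a minimal sketch — assuming the connector is registered as `UnifiedCacheConnectorV1` (the name used in the config comments above), a `kv_both` role, port 7800, and a config file saved at `/workspace/ucm_config.yaml`; all four are assumptions to adjust for your deployment — it might look like:
```bash
# Sketch only: connector name, role, port, and config path are assumptions.
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --port 7800 \
  --kv-transfer-config \
  '{"kv_connector": "UnifiedCacheConnectorV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/ucm_config.yaml"}}'
```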
48 changes: 39 additions & 9 deletions docs/source/user-guide/prefix-cache/nfs_store.md
@@ -90,12 +90,44 @@ To use the NFS connector, you need to configure the `connector_config` dictionary
Create a config YAML like the following and save it to your own directory:
```yaml
# UCM Configuration File Example
# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
ucm_connector_name: "UcmNfsStore"
#
# This file demonstrates how to configure UCM using YAML.
# You can use this config file by setting its path in kv_connector_extra_config in the launch script or on the command line, like this:
# kv_connector_extra_config={"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
#
# Alternatively, you can still use kv_connector_extra_config in KVTransferConfig
# for backward compatibility.

# Connector name (e.g., "UcmNfsStore")
ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/mnt/test"
      use_direct: false

load_only_first_rank: false

# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"

# Sparse attention configuration
# Format 1: Dictionary format (for methods like ESA, KvComp)
# ucm_sparse_config:
#   ESA:
#     init_window_sz: 1
#     local_window_sz: 2
#     min_blocks: 4
#     sparse_ratio: 0.3
#     retrieval_stride: 5
# Or for GSA:
#   GSA: {}


# Whether to use layerwise loading/saving (optional, default: True for UnifiedCacheConnectorV1)
# use_layerwise: true
# hit_ratio: 0.9


ucm_connector_config:
  storage_backends: "/mnt/test"
  transferStreamNumber: 32
```
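Before launching, it is worth checking that the directory configured in `storage_backends` exists and is writable. A minimal sketch, assuming the `/mnt/test` path from the example config above:
```bash
# Verify the NFS-backed storage path (path taken from the example config; substitute your own)
mkdir -p /mnt/test
touch /mnt/test/.ucm_write_test && rm /mnt/test/.ucm_write_test && echo "storage_backends path is writable"
```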

## Launching Inference
@@ -116,7 +148,6 @@ Then run the script as follows:

```bash
cd examples/
export PYTHONHASHSEED=123456
python offline_inference.py
```

@@ -166,10 +197,9 @@ curl http://localhost:7800/v1/completions \
```
To quickly experience the NFS Connector's effect:

1. Start the service with:
`--no-enable-prefix-caching`
1. Start the service with: `--no-enable-prefix-caching`
2. Send the same request (exceeding 128 tokens) twice consecutively; see the sketch after this list
3. Remember to enable prefix caching (do not add `--no-enable-prefix-caching`) in production environments.
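
As an illustration of step 2, a sketch of the repeated request is shown below; the port matches the curl example above, while the model name and prompt are placeholders to replace with whatever the server was started with:
```bash
# Send an identical long prompt twice; the second run should hit the NFS-backed cache
for i in 1 2; do
  curl http://localhost:7800/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2.5-14B-Instruct",
          "prompt": "<same prompt of more than 128 tokens>",
          "max_tokens": 64
        }'
done
```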

### Log Message Structure
```text
[UCMNFSSTORE] [I] Task(<task_id>,<direction>,<task_count>,<size>) finished, elapsed <time>s
```
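For illustration only, a hypothetical line in this format might read:
```text
[UCMNFSSTORE] [I] Task(140235671,dump,16,33554432) finished, elapsed 0.042s
```
Here the task id, direction, task count, size, and elapsed time are invented values purely to show the field layout.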