26 changes: 20 additions & 6 deletions examples/llm-api/llm_runtime.py
@@ -48,15 +48,29 @@ def example_cuda_graph_config():


def example_kv_cache_config():
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please put the document to at the beginning of the llm_runtime.py? Here is a reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_kv_cache_offloading.py

Example demonstrating KV cache configuration for memory management and performance.
KV cache configuration helps with:
- Controlling GPU memory allocation for key-value cache
- Enabling block reuse to optimize memory usage for shared prefixes
- Balancing memory usage between model weights and cache storage
Please refer to the api reference for more details.
"""

print("\n=== KV Cache Configuration Example ===")
print("\n1. KV Cache Configuration:")

    llm_advanced = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                       max_batch_size=8,
                       max_seq_len=1024,
                       kv_cache_config=KvCacheConfig(
                           free_gpu_memory_fraction=0.5,
                           enable_block_reuse=True))
Collaborator comment: Please submit this PR to the release/1.1 branch. Thanks.

    llm_advanced = LLM(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
max_batch_size=8,
max_seq_len=1024,
kv_cache_config=KvCacheConfig(
# free_gpu_memory_fraction: the fraction of free GPU memory to allocate to the KV cache
free_gpu_memory_fraction=0.5,
# enable_block_reuse: whether to enable block reuse
enable_block_reuse=True))

    prompts = [
        "Hello, my name is",
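
For reference, here is a minimal end-to-end sketch of how the configured LLM might be driven. It is illustrative only and not part of the diff above; it assumes the standard TensorRT-LLM LLM API, with LLM and SamplingParams importable from tensorrt_llm, KvCacheConfig from tensorrt_llm.llmapi, and the sampling settings chosen purely for demonstration.

# Hedged usage sketch: not part of this PR's diff.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig


def main():
    # Allocate half of the free GPU memory to the KV cache and reuse blocks
    # for requests that share prompt prefixes.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              max_batch_size=8,
              max_seq_len=1024,
              kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5,
                                            enable_block_reuse=True))

    prompts = ["Hello, my name is"]
    # Sampling settings chosen for illustration only.
    sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()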