
Q Scaling factor for FP8 KV Cache? #1294

@Syst3m1cAn0maly

Description


When using an FP8-quantized model with FP8 KV cache scales, I get this warning in newer versions of vLLM:
WARNING 03-25 22:34:30 [kv_cache.py:82] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.

I indeed use Flash Attention 3.
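
To double-check which scales actually end up in the checkpoint, here is a quick way to list the scale tensors in the exported safetensors files (a sketch; the directory name is a placeholder for my output path). Consistent with the warning, it shows k_scale and v_scale entries but nothing named q_scale:

import os
from safetensors import safe_open

model_dir = "my-model-FP8"  # placeholder for the exported checkpoint directory
scale_keys = []
for fname in sorted(os.listdir(model_dir)):
    if fname.endswith(".safetensors"):
        with safe_open(os.path.join(model_dir, fname), framework="pt") as f:
            # collect every quantization scale tensor stored in this shard
            scale_keys += [k for k in f.keys() if k.endswith("_scale")]

print("\n".join(sorted(scale_keys)))  # k_scale / v_scale present, no q_scale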

How can I provide the Q scaling factor to vLLM?

I use this recipe:

recipe = """

quant_stage:
quant_modifiers:
QuantizationModifier:
ignore: ["re:.*lm_head"]
config_groups:
group_0:
weights:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
input_activations:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
targets: ["Linear"]
kv_cache_scheme:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
"""
