@lizexu123 lizexu123 commented Sep 28, 2025

This PR adds end-to-end inference support for pooling (embedding) models.

Usage

Online serving

  • Launch the server.
  • Embeddings can then be obtained via either the EmbeddingCompletionRequest API or the EmbeddingChatRequest API.
# Start the service demo
model_path=Qwen3-Embedding-0.6B
# The following three environment variables are required
export ENABLE_V1_KVCACHE_SCHEDULER=1  # enable the V1 KV-cache block scheduler
export FD_DISABLE_CHUNKED_PREFILL=1   # force-disable the default chunked prefill
export FD_USE_GET_SAVE_OUTPUT_V1=1    # use the new get_output / save_output path

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-num-seqs 256 --max-model-len 32768 \
    --port 13331 --engine-worker-queue-port 7132 \
    --metrics-port 7431 --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --load-choices "default_v1" \
    --runner pooling \
    --graph-optimization-config '{"use_cudagraph":false}'
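
The three required environment variables above can be sanity-checked before launch. A minimal Python sketch (the variable names are taken from the launch script above; the check itself is illustrative, not part of FastDeploy):

```python
import os

# The three env vars required for pooling-model inference in this PR,
# with the values the launch script above exports.
REQUIRED = {
    "ENABLE_V1_KVCACHE_SCHEDULER": "1",  # V1 KV-cache block scheduler
    "FD_DISABLE_CHUNKED_PREFILL": "1",   # disable chunked prefill
    "FD_USE_GET_SAVE_OUTPUT_V1": "1",    # new get_output/save_output path
}

# Set any variable that was not already exported in the shell.
for name, expected in REQUIRED.items():
    os.environ.setdefault(name, expected)

# Anything still missing or mis-set would abort the launch.
missing = [n for n, v in REQUIRED.items() if os.environ.get(n) != v]
print("missing:", missing)
```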

Request methods (curl examples)

A. EmbeddingCompletionRequest example (plain text input)

curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'
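
The same request can be issued from Python with only the standard library. This sketch builds the identical body; the URL, port, and model name mirror the curl example and the launch command above, and the actual send is commented out so the snippet runs without a live server:

```python
import json

# EmbeddingCompletionRequest body, mirroring the curl example above.
payload = {
    "model": "text-embedding-chat-model",
    "input": [
        "This is a sentence for pooling embedding.",
        "Another input text.",
    ],
    "user": "test_client",
}
body = json.dumps(payload).encode()

# To send it against a running server (port 13331 from the launch command):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:13331/v1/embeddings",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(len(result["data"]))  # one embedding per input string

print(json.loads(body)["model"])
```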

B. EmbeddingChatRequest example (message-sequence input)

curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
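
Both examples hit the same `/v1/embeddings` route; the two request schemas appear to be distinguished by whether the body carries `input` or `messages`. A small illustrative helper (an assumption for clarity, not part of the FastDeploy API) that classifies a payload accordingly:

```python
# EmbeddingChatRequest body, mirroring the curl example above.
chat_payload = {
    "model": "text-embedding-chat-model",
    "messages": [
        {"role": "user", "content": "Generate embedding for user query."}
    ],
}

def request_kind(payload: dict) -> str:
    """Classify an /v1/embeddings payload by which field it carries.

    Illustrative only: assumes the server dispatches on the presence of
    'messages' vs 'input', as the two examples above suggest.
    """
    if "messages" in payload:
        return "EmbeddingChatRequest"
    if "input" in payload:
        return "EmbeddingCompletionRequest"
    raise ValueError("payload must contain 'input' or 'messages'")

print(request_kind(chat_payload))
```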


paddle-bot bot commented Sep 28, 2025

Thanks for your contribution!
