Dataset, prompts, evaluation code, and result artifacts for the CCiV paper (ICASSP).
CCiV is a comprehensive benchmark for evaluating Large Language Models (LLMs) on Chinese Ci (Song dynasty poetry) generation tasks. It provides multi-dimensional evaluation metrics including structural format accuracy, tonal pattern compliance, and LLM-as-judge semantic assessment.
- 300 High-Quality Ci Samples: Covering 36 classical Cipai (tune patterns) including 浣溪沙, 鹧鸪天, 菩萨蛮, etc.
- Multi-Dimensional Evaluation:
  - Structural Format Accuracy: Validates sentence count and character count per line
  - Tonal Pattern Score: Evaluates compliance with classical ping-ze (平仄) patterns
  - LLM-as-Judge: Assesses informativeness and aesthetic quality
- Flexible Model Support: Local vLLM inference and API-based evaluation (OpenAI, Kimi, Doubao, Gemini)
- Complete Evaluation Pipeline: From data preparation to result analysis
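
The structural format check above can be sketched in a few lines. This is a minimal illustration, not the repository's exact logic: the function name `structural_format_ok` and the punctuation-based line splitting are assumptions; the real implementation lives in `src/cipai_utils.py`.

```python
import re

def structural_format_ok(poem: str, char_counts: list) -> bool:
    """Sketch: does each line of a generated ci match the expected
    character count for the cipai? (Illustrative only; the actual
    checks live in src/cipai_utils.py.)"""
    # Split on Chinese punctuation to recover individual lines.
    lines = [s for s in re.split(r"[，。、？！；]", poem) if s]
    if len(lines) != len(char_counts):
        return False  # sentence count mismatch
    return all(len(line) == n for line, n in zip(lines, char_counts))

# 浣溪沙 upper stanza: three seven-character lines.
print(structural_format_ok("一曲新词酒一杯，去年天气旧亭台。夕阳西下几时回？", [7, 7, 7]))
```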
```
CCiV/
├── ci_gen.json              # 300 Ci generation samples
├── dev_fewshot.json         # Few-shot development examples
├── evaluate_cipai.py        # Main evaluation script
├── gen_csv.py               # Aggregate metrics to CSV
├── requirements.txt         # Python dependencies
├── README.md                # English documentation
├── README_zh.md             # Chinese documentation
├── src/                     # Utility modules
│   ├── cipai_utils.py       # Format evaluation utilities
│   ├── cipai2info.json      # Cipai format database (36 patterns)
│   └── xinyun/              # Tonal pattern conversion toolkit
│       ├── __init__.py
│       ├── converter.py     # Poem to tonal pattern converter
│       ├── pinyin_utils.py  # Pinyin processing utilities
│       └── exceptions.py
└── llm_eval/                # LLM-as-judge evaluation
    ├── main.py              # LLM evaluation script
    ├── k2/                  # Kimi API results
    ├── doubao/              # Doubao API results
    └── gemini/              # Gemini API results
```
```bash
# Clone the repository
git clone https://github.com/your-username/CCiV.git
cd CCiV

# Install dependencies
pip install -r requirements.txt

# For local vLLM inference (optional; quote to avoid shell redirection)
pip install "vllm>=0.6.0"
```

Run generation and evaluation with a local vLLM model:

```bash
python evaluate_cipai.py \
    -s ./results_enhanced \
    -m /path/to/model \
    -v qwen2.5_7b_inst \
    -t ./ci_gen.json \
    -g 1
```

```bash
# Using OpenAI-compatible API
python evaluate_cipai.py \
    -s ./results_api \
    --mode api \
    -m gpt-4o \
    -v gpt-4o \
    --api_key $API_KEY \
    --api_base $API_BASE
```
```bash
# Using custom chat endpoint
python evaluate_cipai.py \
    -s ./results_api \
    --mode api \
    -m your-model \
    -v v1 \
    --api_key $API_KEY \
    --chat_url https://your-api.com/chat
```

To aggregate metrics into a CSV, run `python gen_csv.py`. Edit `gen_csv.py` to customize the target folder and metrics:

```python
# For form-aware results
folder = './results_enhanced'
main(folder, metric_key=['score', 'tonal_score', 'tonal_multiple_score'])

# For zero-shot results
folder = './results_direct'
main(folder, metric_key=['score'])
```

For LLM-as-judge evaluation, change into the `llm_eval` directory first:

```bash
cd llm_eval
```
```bash
# Using Kimi API
python main.py --model kimi-k2-0711-preview --prefix k2

# Using Gemini API
python main.py --model gemini-2.0-flash --prefix gemini

# Using Doubao API
python main.py --model your-model --prefix doubao
```

Set environment variables before running:

```bash
# Kimi API
export KIMI_API_KEY="your-key"
export KIMI_API_BASE="https://api.moonshot.cn"

# Doubao API
export ARK_API_KEY="your-key"
export ARK_API_BASE="https://ark.cn-beijing.volces.com/api/v3"
```

Each sample contains:
```json
{
    "instruction": "按照提供的词牌名和题目写一首词...",
    "input": "词牌: 浣溪沙\n题目: 春雨渡江有忆",
    "output": "参考词作内容",
    "cipai": "浣溪沙",
    "format_standard": "{...}",
    "sample_id": "fuxi_CiG_0"
}
```

(The `instruction` asks the model to write a ci for the given cipai and title; `input` supplies the cipai and title, and `output` holds the reference ci.)

Evaluation produces two files:

- `<save_dir>/<version>.json`: raw generation results
- `<save_dir>/<version>_evaluated.json`: results with evaluation metrics
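
The sample schema above can be checked with a small script. This is a sketch under the assumption that `ci_gen.json` is a JSON array of such objects; `validate_sample` and `REQUIRED_KEYS` are illustrative names, not part of the repository.

```python
import json

# Fields every CCiV sample is documented to carry.
REQUIRED_KEYS = {"instruction", "input", "output", "cipai",
                 "format_standard", "sample_id"}

def validate_sample(sample: dict) -> bool:
    """Return True if the sample carries every documented field."""
    return REQUIRED_KEYS.issubset(sample)

# The example sample from the documentation above.
sample = json.loads("""{
    "instruction": "按照提供的词牌名和题目写一首词...",
    "input": "词牌: 浣溪沙\\n题目: 春雨渡江有忆",
    "output": "参考词作内容",
    "cipai": "浣溪沙",
    "format_standard": "{...}",
    "sample_id": "fuxi_CiG_0"
}""")
print(validate_sample(sample))  # True
```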
| Metric | Description |
|---|---|
| `score` | Structural format accuracy (boolean) |
| `tonal_score` | Tonal pattern compliance score (0.0-1.0) |
| `tonal_multiple_score` | Best match across multiple templates (0.0-1.0) |
| `informativeness` | Information density score (1-5) |
| `aesthetic` | Artistic quality score (1-5) |
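
A character-level tonal score can be sketched as the fraction of positions where the poem's ping/ze value matches the cipai template. This is an assumed semantics for illustration, not the repository's exact implementation; the convention that `中` in a template accepts either tone is standard in cipai notation.

```python
def tonal_score(poem_pattern: str, template: str) -> float:
    """Sketch: fraction of positions where the poem's 平/仄 value
    matches the template; '中' in the template accepts either tone.
    (Assumed semantics; see src/xinyun/ for the real toolkit.)"""
    assert len(poem_pattern) == len(template)
    hits = sum(
        t == "中" or p == t
        for p, t in zip(poem_pattern, template)
    )
    return hits / len(template)

# 5 of 7 positions match under this template.
print(tonal_score("平平仄仄平平仄", "中平中仄仄平平"))
```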
| Cipai | Cipai | Cipai | Cipai |
|---|---|---|---|
| 浣溪沙 | 鹧鸪天 | 菩萨蛮 | 蝶恋花 |
| 临江仙 | 满江红 | 清平乐 | 水调歌头 |
| 虞美人 | 沁园春 | 念奴娇 | 满庭芳 |
| 西江月 | 金缕曲 | 点绛唇 | 减字木兰花 |
| 踏莎行 | 浪淘沙 | 水龙吟 | 望江南 |
| 如梦令 | 南乡子 | 贺新郎 | 卜算子 |
| 采桑子 | 摸鱼儿 | 忆江南 | 渔家傲 |
| 江城子 | 鹊桥仙 | 忆秦娥 | 青玉案 |
| 苏幕遮 | 一剪梅 | 声声慢 | 醉花阴 |
| Argument | Short | Default | Description |
|---|---|---|---|
| `--mode` | | `vllm` | Mode: `vllm` or `api` |
| `--save_dir` | `-s` | `./results` | Output directory |
| `--version` | `-v` | `v1` | Version identifier |
| `--model_name_or_path` | `-m` | None | Model path or name |
| `--test_path` | `-t` | `./ci_gen.json` | Test data path |
| `--gpu_num` | `-g` | `1` | Number of GPUs |
| `--quantization` | `-q` | None | Quantization type |
| `--max_length` | `-l` | `1024` | Max generation length |
| `--api_key` | `-k` | `EMPTY` | API key |
| `--api_base` | `-b` | None | API base URL |
| `--chat_url` | `-c` | None | Custom chat URL |
| Argument | Default | Description |
|---|---|---|
| `--model` | `kimi-k2-0711-preview` | Model name for evaluation |
| `--prefix` | `k2` | API prefix (`k2`/`doubao`/`gemini`) |
| `--input-dir` | None | Input directory |
| `--output-dir` | None | Output directory |
If you use this benchmark in your research, please cite:
```bibtex
@inproceedings{cciv2025,
  title={CCiV: A Benchmark for Chinese Ci-Generation and Evaluation},
  author={Your Name},
  booktitle={Arxiv},
  year={2025}
}
```

This project is licensed under the MIT License.
- Tonal pattern evaluation based on the 14-rhyme system (中华新韵, 2005)
- Cipai format data sourced from classical Chinese poetry databases
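
Under 中华新韵, tone classification reduces to the Mandarin tone number: tones 1-2 (阴平/阳平) are level (平) and tones 3-4 (上/去) are oblique (仄). The sketch below shows only this mapping; pinyin extraction and edge cases are handled by the repository's `src/xinyun/` toolkit, and `tone_to_pingze` is an illustrative name.

```python
def tone_to_pingze(tone: int) -> str:
    """Map a Mandarin tone number to 平/仄 under 中华新韵:
    tones 1-2 are level (平), tones 3-4 are oblique (仄)."""
    if tone in (1, 2):
        return "平"
    if tone in (3, 4):
        return "仄"
    raise ValueError(f"unexpected tone: {tone}")

# 春(chūn, 1) 雨(yǔ, 3) 渡(dù, 4) 江(jiāng, 1) → 平仄仄平
print("".join(tone_to_pingze(t) for t in (1, 3, 4, 1)))
```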