
CCiV: Chinese Ci-Generation & Evaluation Benchmark

Chinese documentation: README_zh.md

Dataset, prompts, evaluation code, and result artifacts for the CCiV paper (ICASSP).

CCiV is a comprehensive benchmark for evaluating Large Language Models (LLMs) on Chinese Ci (Song-dynasty lyric poetry) generation. It provides multi-dimensional evaluation metrics covering structural format accuracy, tonal pattern compliance, and LLM-as-judge semantic assessment.

Features

  • 300 High-Quality Ci Samples: Covering 36 classical Cipai (tune patterns) such as 浣溪沙, 鹧鸪天, and 菩萨蛮
  • Multi-Dimensional Evaluation:
    • Structural Format Accuracy: Validates sentence count and character count per line
    • Tonal Pattern Score: Evaluates compliance with classical ping-ze (平仄) patterns (both rule-based checks are sketched below)
    • LLM-as-Judge: Assesses informativeness and aesthetic quality
  • Flexible Model Support: Local vLLM inference and API-based evaluation (OpenAI, Kimi, Doubao, Gemini)
  • Complete Evaluation Pipeline: From data preparation to result analysis
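
For intuition, here is a minimal sketch of the two rule-based checks above. It is not the repository's implementation (that lives in src/cipai_utils.py and src/xinyun/); it assumes the third-party pypinyin package and the 中华新韵 convention that Mandarin tones 1-2 are ping (平) and tones 3-4 are ze (仄).

# Illustrative sketch only -- not the repo's evaluator.
from pypinyin import Style, pinyin

def tone_class(ch: str) -> str:
    """Classify one character as ping ('平') or ze ('仄')."""
    syllable = pinyin(ch, style=Style.TONE3)[0][0]   # e.g. '春' -> 'chun1'
    # Neutral-tone syllables carry no tone digit and fall through to ze here.
    return "平" if syllable[-1] in "12" else "仄"

def format_ok(lines: list[str], expected_lengths: list[int]) -> bool:
    """Structural check: right sentence count and character count per line."""
    return len(lines) == len(expected_lengths) and all(
        len(line) == n for line, n in zip(lines, expected_lengths)
    )

print(tone_class("春"), tone_class("月"))                       # 平 仄
print(format_ok(["一曲新词酒一杯", "去年天气旧亭台"], [7, 7]))  # True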

Project Structure

CCiV/
├── ci_gen.json              # 300 Ci generation samples
├── dev_fewshot.json         # Few-shot development examples
├── evaluate_cipai.py        # Main evaluation script
├── gen_csv.py               # Aggregate metrics to CSV
├── requirements.txt         # Python dependencies
├── README.md                # English documentation
├── README_zh.md             # Chinese documentation
├── src/                     # Utility modules
│   ├── cipai_utils.py       # Format evaluation utilities
│   ├── cipai2info.json      # Cipai format database (36 patterns)
│   └── xinyun/              # Tonal pattern conversion toolkit
│       ├── __init__.py
│       ├── converter.py     # Poem to tonal pattern converter
│       ├── pinyin_utils.py  # Pinyin processing utilities
│       └── exceptions.py
└── llm_eval/                # LLM-as-judge evaluation
    ├── main.py              # LLM evaluation script
    ├── k2/                  # Kimi API results
    ├── doubao/              # Doubao API results
    └── gemini/              # Gemini API results

Installation

# Clone the repository
git clone https://github.com/cubenlp/CCiV.git
cd CCiV

# Install dependencies
pip install -r requirements.txt

# For local vLLM inference (optional)
pip install "vllm>=0.6.0"

Quick Start

1. Local Evaluation (vLLM)

python evaluate_cipai.py \
    -s ./results_enhanced \
    -m /path/to/model \
    -v qwen2.5_7b_inst \
    -t ./ci_gen.json \
    -g 1

2. API Mode Evaluation

# Using OpenAI-compatible API
python evaluate_cipai.py \
    -s ./results_api \
    --mode api \
    -m gpt-4o \
    -v gpt-4o \
    --api_key $API_KEY \
    --api_base $API_BASE

# Using custom chat endpoint
python evaluate_cipai.py \
    -s ./results_api \
    --mode api \
    -m your-model \
    -v v1 \
    --api_key $API_KEY \
    --chat_url https://your-api.com/chat

3. Aggregate Metrics to CSV

python gen_csv.py

Edit gen_csv.py to customize the target folder and metrics:

# For form-aware results
folder = './results_enhanced'
main(folder, metric_key=['score', 'tonal_score', 'tonal_multiple_score'])

# For zero-shot results
folder = './results_direct'
main(folder, metric_key=['score'])

4. LLM-as-Judge Evaluation

cd llm_eval

# Using Kimi API
python main.py --model kimi-k2-0711-preview --prefix k2

# Using Gemini API
python main.py --model gemini-2.0-flash --prefix gemini

# Using Doubao API
python main.py --model your-model --prefix doubao

Set environment variables before running:

# Kimi API
export KIMI_API_KEY="your-key"
export KIMI_API_BASE="https://api.moonshot.cn"

# Doubao API
export ARK_API_KEY="your-key"
export ARK_API_BASE="https://ark.cn-beijing.volces.com/api/v3"

Data Format

Input Data (ci_gen.json)

Each sample is a JSON object with the following fields (prompt text is in Chinese):

{
  "instruction": "按照提供的词牌名和题目写一首词...",
  "input": "词牌: 浣溪沙\n题目: 春雨渡江有忆",
  "output": "参考词作内容",
  "cipai": "浣溪沙",
  "format_standard": "{...}",
  "sample_id": "fuxi_CiG_0"
}
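
A minimal loading sketch, assuming ci_gen.json is a JSON array of such objects (field names as above):

import json

with open("ci_gen.json", encoding="utf-8") as f:
    samples = json.load(f)          # assumed: a list of sample dicts

for sample in samples[:3]:
    # The full prompt given to a model is instruction plus input.
    prompt = sample["instruction"] + "\n" + sample["input"]
    print(sample["sample_id"], sample["cipai"], len(prompt))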

Output Format

  • <save_dir>/<version>.json: Raw generation results
  • <save_dir>/<version>_evaluated.json: Results with evaluation metrics

Evaluation Metrics

| Metric | Description |
| --- | --- |
| score | Structural format accuracy (boolean) |
| tonal_score | Tonal pattern compliance score (0.0-1.0) |
| tonal_multiple_score | Best match across multiple tonal templates (0.0-1.0) |
| informativeness | LLM-judged information density (1-5) |
| aesthetic | LLM-judged artistic quality (1-5) |
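
gen_csv.py is the repository's aggregator, but a metric can also be averaged directly as a sanity check. A minimal sketch, assuming <version>_evaluated.json is a JSON array whose records carry the metric fields above (the path matches the Quick Start example):

import json

def average_metric(path: str, key: str) -> float:
    """Mean of one numeric metric over all evaluated samples."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)      # assumed: one dict per evaluated sample
    values = [float(r[key]) for r in records if r.get(key) is not None]
    return sum(values) / len(values) if values else float("nan")

# 'score' is boolean, so its mean is the fraction of format-correct generations.
print(average_metric("results_enhanced/qwen2.5_7b_inst_evaluated.json", "tonal_score"))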

Supported Cipai (36 Patterns)

| Cipai | Cipai | Cipai | Cipai |
| --- | --- | --- | --- |
| 浣溪沙 | 鹧鸪天 | 菩萨蛮 | 蝶恋花 |
| 临江仙 | 满江红 | 清平乐 | 水调歌头 |
| 虞美人 | 沁园春 | 念奴娇 | 满庭芳 |
| 西江月 | 金缕曲 | 点绛唇 | 减字木兰花 |
| 踏莎行 | 浪淘沙 | 水龙吟 | 望江南 |
| 如梦令 | 南乡子 | 贺新郎 | 卜算子 |
| 采桑子 | 摸鱼儿 | 忆江南 | 渔家傲 |
| 江城子 | 鹊桥仙 | 忆秦娥 | 青玉案 |
| 苏幕遮 | 一剪梅 | 声声慢 | 醉花阴 |
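
The format definitions live in src/cipai2info.json. Its exact schema is not documented here; a minimal inspection sketch, assuming the file is a JSON object keyed by cipai name:

import json

with open("src/cipai2info.json", encoding="utf-8") as f:
    cipai2info = json.load(f)        # assumed: {cipai name: format spec}

print(len(cipai2info), "patterns")   # expected: 36
print(cipai2info.get("浣溪沙"))       # format spec for one tune pattern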

Command Line Arguments

evaluate_cipai.py

| Argument | Short | Default | Description |
| --- | --- | --- | --- |
| --mode | | vllm | Mode: vllm or api |
| --save_dir | -s | ./results | Output directory |
| --version | -v | v1 | Version identifier |
| --model_name_or_path | -m | None | Model path or name |
| --test_path | -t | ./ci_gen.json | Test data path |
| --gpu_num | -g | 1 | Number of GPUs |
| --quantization | -q | None | Quantization type |
| --max_length | -l | 1024 | Max generation length |
| --api_key | -k | EMPTY | API key |
| --api_base | -b | None | API base URL |
| --chat_url | -c | None | Custom chat URL |

llm_eval/main.py

| Argument | Default | Description |
| --- | --- | --- |
| --model | kimi-k2-0711-preview | Model name for evaluation |
| --prefix | k2 | API prefix (k2/doubao/gemini) |
| --input-dir | None | Input directory |
| --output-dir | None | Output directory |

Citation

If you use this benchmark in your research, please cite:

@inproceedings{cciv2025,
  title={CCiV: A Benchmark for Chinese Ci-Generation and Evaluation},
  author={Your Name},
  booktitle={Arxiv},
  year={2025}
}

License

This project is licensed under the MIT License.

Acknowledgments

  • Tonal pattern evaluation is based on the 14-rhyme system of 中华新韵 (China New Rhymes, 2005)
  • Cipai format data is sourced from classical Chinese poetry databases
