LLM-jp LongBench.

This repository provides code for running and evaluating large language models on Japanese long-context datasets.

Setup

This benchmark uses uv as its package manager. Set up the environment with the following command:

uv sync

JEMHopQA

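Overview

JEMHopQA is a Japanese multi-hop QA dataset annotated with supporting evidence, including the answer derivation steps (Ishii et al., 2024). The dataset used here was created by supplying the corresponding Wikipedia articles as context for each question.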

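Directory structure

jemhop/src/
- filter_by_length.py
  Filters the input CSV (columns: context, question) by token length under the specified model's tokenizer.
  Samples that exceed the maximum length are excluded, and the remainder is saved to a new CSV.
- inference.py
  Runs inference on the filtered CSV using vLLM.
  Returns a DataFrame (question_id, question, gold, answer).
- eval.py
  Compares the inference results with the gold labels by exact match and computes the number of correct answers, the total, and the accuracy.
- eval_inference.py
  Pipeline script that runs the three steps above (filter → inference → evaluation) in one go.
  Parameters can be set via command-line arguments.
- load_dataset.py
  Script that downloads the dataset to jemhop/data/.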
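The length filter described for filter_by_length.py can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `filter_rows_by_token_length`, the `count_tokens` callable, and the character-count stand-in tokenizer are all assumptions made so that the example runs without downloading a model. In practice one would likely pass something like `lambda text: len(tokenizer.encode(text))` with a tokenizer loaded for the target model.

```python
from typing import Callable, Dict, List


def filter_rows_by_token_length(
    rows: List[Dict[str, str]],
    count_tokens: Callable[[str], int],
    max_len: int,
) -> List[Dict[str, str]]:
    """Keep rows whose context plus question fit within max_len tokens.

    `count_tokens` is a hypothetical hook; the real script presumably uses
    the tokenizer of the model given on the command line.
    """
    kept = []
    for row in rows:
        n_tokens = count_tokens(row["context"]) + count_tokens(row["question"])
        if n_tokens <= max_len:
            kept.append(row)
    return kept


def char_count(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per character."""
    return len(text)


rows = [
    {"context": "a" * 10, "question": "b" * 3},   # 13 "tokens": kept
    {"context": "a" * 100, "question": "b" * 3},  # 103 "tokens": dropped
]
kept = filter_rows_by_token_length(rows, char_count, max_len=20)
```

Passing the token counter as an argument keeps the filter independent of any particular tokenizer library, which also makes it easy to test.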

Example usage

uv run python ./jemhop/src/load_dataset.py

uv run python ./jemhop/src/eval_inference.py \
--input_csv ./jemhop/data/jemhop_with_context.csv \
--model_name meta-llama/Llama-3.2-3B-Instruct \
--max_model_len 65536 \
--tensor_parallel_size 8 \
--gpu_memory_utilization 0.87 \
--swap_space 16 \
--max_num_seqs 4 \
--max_new_tokens 512
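The inference flags in the command above share their names with vLLM engine and sampling arguments, so they are presumably forwarded more or less directly. The sketch below shows one way such a mapping could look; the helper names `build_engine_kwargs` and `build_sampling_kwargs` are hypothetical, and the rule of omitting unset sampling options is an assumption rather than the script's confirmed behavior. The vLLM calls appear only in comments so that the example runs without a GPU.

```python
from typing import Any, Dict, Optional


def build_engine_kwargs(
    model_name: str,
    max_model_len: int = 65536,
    tensor_parallel_size: int = 8,
    gpu_memory_utilization: float = 0.87,
    swap_space: int = 16,
    max_num_seqs: int = 1,
) -> Dict[str, Any]:
    """Collect engine options under the keyword names vLLM uses."""
    return {
        "model": model_name,
        "max_model_len": max_model_len,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
        "swap_space": swap_space,
        "max_num_seqs": max_num_seqs,
    }


def build_sampling_kwargs(
    max_new_tokens: int = 512,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect sampling options, leaving out any that were not set (assumption)."""
    kwargs: Dict[str, Any] = {"max_tokens": max_new_tokens}
    optional = (("temperature", temperature), ("top_p", top_p), ("seed", seed))
    for name, value in optional:
        if value is not None:
            kwargs[name] = value
    return kwargs


engine_kwargs = build_engine_kwargs(
    "meta-llama/Llama-3.2-3B-Instruct", max_num_seqs=4
)
sampling_kwargs = build_sampling_kwargs(max_new_tokens=512)

# With vLLM installed and GPUs available, these would typically be used as:
#   from vllm import LLM, SamplingParams
#   llm = LLM(**engine_kwargs)
#   outputs = llm.generate(prompts, SamplingParams(**sampling_kwargs))
```

Accepted keyword arguments can differ between vLLM releases, so the names should be checked against the installed version.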

Command-line arguments (JEMHop)

Argument	Type	Default	Description
`--input_csv`	str	Required	Input CSV file (must contain columns: context, question, answer, qid)
`--output_prefix`	str	None	Output CSV prefix. Results will be saved as `{prefix}_answers_{model_name}_{max_model_len}.csv`
`--built_csv`	str	None	Path to save built CSV. If omitted, auto-named as `{input_stem}_built_{model_short}_{max_model_len}.csv`

Build options:

Argument	Type	Default	Description
`--build_max_model_len`	int	65536	Maximum model length for building contexts
`--build_min_model_len`	int	0	Minimum model length for building contexts

Inference options:

Argument	Type	Default	Description
`--model_name`	str	meta-llama/Llama-3.2-3B-Instruct	Model name (e.g., meta-llama/Llama-3.2-3B-Instruct)
`--tensor_parallel_size`	int	8	Tensor parallel size
`--max_model_len`	int	65536	Maximum model input length
`--gpu_memory_utilization`	float	0.87	GPU memory utilization
`--swap_space`	int	16	Swap space (GB)
`--max_num_seqs`	int	1	Maximum number of concurrent sequences
`--max_new_tokens`	int	512	Maximum number of tokens to generate
`--temperature`	float	None	Sampling temperature
`--top_p`	float	None	Top-p value for nucleus sampling
`--seed`	int	None	Random seed

Evaluation options:

Argument	Type	Default	Description
`--pred_col`	str	prediction	Column name for model output
`--gold_col`	str	gold	Column name for ground truth label
`--add_correct_col`	flag	False	Add correctness column to CSV

Accuracy comparison of Llama models on the JEMHop dataset (max_seq_len = 65536)

Model	Parameters	Correct / Total	Accuracy
meta-llama/Llama-3.2-3B-Instruct	3B	87 / 978	8.90%
meta-llama/Meta-Llama-3.1-8B-Instruct	8B	242 / 978	24.74%
meta-llama/Llama-3.3-70B-Instruct	70B	458 / 978	46.83%
meta-llama/Llama-4-Scout-17B-16E-Instruct	17B (Activated) 109B (Total)	587 / 996	58.94%
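The accuracy column is the number of correct answers divided by the total, with correctness decided by exact match. A minimal sketch of that scoring follows; the function name `exact_match_accuracy` is hypothetical, and stripping surrounding whitespace before comparison is an assumption, since the normalization applied by eval.py is not documented here.

```python
from typing import Sequence, Tuple


def exact_match_accuracy(
    predictions: Sequence[str], golds: Sequence[str]
) -> Tuple[int, int, float]:
    """Return (correct, total, accuracy) under exact string match."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    # Assumption: leading and trailing whitespace is ignored.
    correct = sum(
        1 for pred, gold in zip(predictions, golds) if pred.strip() == gold.strip()
    )
    total = len(golds)
    accuracy = correct / total if total else 0.0
    return correct, total, accuracy


result = exact_match_accuracy(["東京", "大阪 ", "京都"], ["東京", "大阪", "奈良"])

# The first table row can be reproduced from its counts.
first_row = f"{87 / 978:.2%}"
```

Exact match is strict: a prediction that contains the gold answer inside a longer sentence counts as incorrect.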

NIILC

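Overview

NIILC is a dataset created for research on question answering systems (Sekine, 2003). Only a subset is used: questions whose answer is uniquely determined, does not change over time, and is accompanied by a Wikipedia article that supports it.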

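Directory structure

niilc/src/
- build.py
  Vertically concatenates the test and dev sets and, for each question, builds a context that packs the positive context plus other contexts within a token-length limit. The limit is hard to enforce exactly, so an appropriate buffer is required.
  The position of the positive context (first, last, or random) is selectable.
- inference.py
  Runs inference with vLLM on the built NIILC-format CSV (columns: question_id, question, context, answer).
  Returns a DataFrame (question_id, question, gold, answer).
- eval.py
  Compares the inference results with the gold labels by exact match and computes the number of correct answers, the total, and the accuracy.
- eval_inference.py
  Pipeline script that runs the three steps above (context building → inference → evaluation) in one go.
  Parameters can be set via command-line arguments.
- load_dataset.py
  Script that downloads the dataset to niilc/data/.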
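The context-building step described for build.py can be sketched roughly as below. This is an illustrative approximation, not the repository's code: `build_context`, its parameters, the greedy packing order, and the passage separator are all assumptions. Token counts are summed per passage, which only approximates the length of the joined text under a real tokenizer; that is presumably one reason a buffer is needed.

```python
import random
from typing import Callable, List, Optional


def build_context(
    positive: str,
    distractors: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
    position: str = "random",
    rng: Optional[random.Random] = None,
    separator: str = "\n\n",
) -> str:
    """Pack the positive passage plus distractors into a token budget."""
    if position not in ("first", "last", "random"):
        raise ValueError(f"unknown position: {position}")
    rng = rng or random.Random(42)

    sep_cost = count_tokens(separator)
    used = count_tokens(positive)  # the positive passage is always included
    chosen: List[str] = []
    for passage in distractors:
        cost = count_tokens(passage) + sep_cost
        if used + cost > max_tokens:
            continue  # skip passages that do not fit; later ones may
        chosen.append(passage)
        used += cost

    # Decide where the positive passage goes among the chosen distractors.
    if position == "first":
        index = 0
    elif position == "last":
        index = len(chosen)
    else:
        index = rng.randint(0, len(chosen))
    return separator.join(chosen[:index] + [positive] + chosen[index:])


# Stand-in token counter (assumption): one token per character.
pos = "P" * 5
others = ["a" * 10, "b" * 100, "c" * 10]
ctx_first = build_context(pos, others, len, max_tokens=40, position="first")
ctx_last = build_context(pos, others, len, max_tokens=40, position="last")
ctx_random = build_context(pos, others, len, max_tokens=40, position="random")
```

In the example the 100-character passage is skipped because it would exceed the 40-token budget, while the two shorter distractors are kept.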

Example usage

uv run python ./niilc/src/load_dataset.py

uv run python ./niilc/src/eval_inference.py \
--test_csv ./niilc/data/niilc_test_with_context.csv \
--dev_csv  ./niilc/data/niilc_dev_with_context.csv \
--build_model_name meta-llama/Llama-3.2-3B-Instruct \
--build_max_model_len $((1024*64 - 512)) \
--model_name meta-llama/Llama-3.2-3B-Instruct \
--max_model_len $((1024*64)) \
--tensor_parallel_size 8 \
--gpu_memory_utilization 0.86 \
--swap_space 16 \
--max_num_seqs 4 \
--max_new_tokens 512

Command-line arguments (NIILC)

Argument	Type	Default	Description
`--test_csv`	str	Required	NIILC test CSV (must contain question_id, question, context, answers)
`--dev_csv`	str	Required	NIILC dev CSV (must contain question_id, question, context, answers)
`--built_csv`	str	None	Path to save built CSV. If omitted, auto-named beside test_csv.
`--output_prefix`	str	None	Prefix for final answers CSV. If omitted, auto-named from test_csv stem.

Build options:

Argument	Type	Default	Description
`--build_model_name`	str	meta-llama/Llama-3.2-3B-Instruct	Model name for building contexts
`--build_seed`	int	42	Random seed for building contexts
`--build_max_model_len`	int	65536-512	Maximum model length for building contexts
`--build_positive_position`	str	random	Position of positive context in built context (`random`, `first`, `last`)

Inference options:

Argument	Type	Default	Description
`--model_name`	str	meta-llama/Llama-3.2-3B-Instruct	Model name for inference
`--tensor_parallel_size`	int	8	Tensor parallel size for inference
`--max_model_len`	int	65536	Maximum model length for inference
`--gpu_memory_utilization`	float	0.86	GPU memory utilization for inference
`--swap_space`	int	16	Swap space in GB for inference
`--max_num_seqs`	int	1	Maximum number of sequences for inference
`--max_new_tokens`	int	256	Maximum number of new tokens to generate
`--temperature`	float	None	Sampling temperature for inference
`--top_p`	float	None	Top-p sampling for inference
`--seed`	int	None	Random seed for inference

Evaluation options:

Argument	Type	Default	Description
`--pred_col`	str	prediction	Column name for predicted answers
`--gold_col`	str	gold	Column name for gold answers
`--add_correct_col`	flag	False	Add a column indicating correctness

Accuracy comparison of Llama models on the NIILC dataset (max_seq_len = 65536)

Model	Parameters	Correct / Total	Accuracy
meta-llama/Llama-3.2-3B-Instruct	3B	87 / 538	16.17%
meta-llama/Meta-Llama-3.1-8B-Instruct	8B	188 / 538	34.94%
meta-llama/Llama-3.3-70B-Instruct	70B	206 / 538	38.29%
meta-llama/Llama-4-Scout-17B-16E-Instruct	17B (Activated) 109B (Total)	288 / 540	53.33%

License

MIT

References

Ai Ishii, Naoya Inoue, Hisami Suzuki, and Satoshi Sekine. 2024. JEMHopQA: Dataset for Japanese Explainable Multi-Hop Question Answering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9515–9525, Torino, Italia. ELRA and ICCL.
Satoshi Sekine. 2003. Development of a question answering system focused on an encyclopedia. 9th Annual Meeting of the Association for Natural Language Processing. (in Japanese)
