This repository contains a complete KGQA system and is the home of the original paper Large Language Models Meet Knowledge Graphs to Answer Factoid Questions and its extended version Re-ranking Answers by a Large Language Model with Knowledge Graphs. The extended paper is currently under review for a journal; once the process is finished, we will update the link here.
Our KGQA pipeline is a novel framework that enhances Large Language Models' performance on complex question answering tasks by re-ranking their answer candidates with the help of Knowledge Graphs after generation. KGQA is based on the observation that, while the correct answer may not always be the model's top prediction, it often appears further down the list of generated candidates. The entire proposed KGQA pipeline is shown in the following figure.
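As a toy illustration of this observation (the candidates and question are made up), the gold answer below is not the top candidate, so greedy decoding fails, yet a Hits@k check further down the list recovers it:

```python
# Hypothetical LM-ranked answer candidates for "Who wrote Tom Sawyer?".
candidates = ["Samuel Johnson", "Ernest Hemingway", "Mark Twain", "Jack London"]
gold = "Mark Twain"

def hits_at_k(candidates, gold, k):
    """1 if the gold answer occurs among the top-k candidates, else 0."""
    return int(gold in candidates[:k])

print(hits_at_k(candidates, gold, 1))  # 0 -- the greedy answer is wrong
print(hits_at_k(candidates, gold, 3))  # 1 -- re-ranking can surface the gold answer
```

This is exactly the gap the re-ranking pipeline exploits: the candidate list already contains the right answer more often than Hits@1 suggests.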
Our KGQA pipeline includes generating answer candidates, entity linking for question entities, subgraph generation, feature extraction for subgraphs, and various ranking models. The pipeline leverages Wikidata as the Knowledge Graph and extracts subgraphs by calculating shortest paths between entities. Experiments were conducted on Mintaka and MKQA, two complex factoid question answering datasets.
In the original paper, we generated answer candidates using T5-large-ssm and T5-xl-ssm. To finetune your own T5-like models and generate a new batch of answer candidates:
```bash
python3 kbqa/seq2seq.py --mode train_eval --model_name t5-large
python3 kbqa/seq2seq.py --mode train_eval --model_name t5-3b
```
The generated candidates for the T5-like models will be in .csv format.
In the extended paper, we added Mistral and Mixtral as answer candidates generating LLMs. To finetune your own Mistral/ Mixtral models and generate a new batch of answer candidates:
```bash
python3 kbqa/mistral_mixtral.py --mode train_eval --model_name mistralai/Mixtral-8x7B-Instruct-v0.1
python3 kbqa/mistral_mixtral.py --mode train_eval --model_name mistralai/Mistral-7B-Instruct-v0.2
```
The generated candidates will be in .csv format for the T5-like models and .json format for Mixtral/Mistral.
In both seq2seq.py and mistral_mixtral.py, you can find other useful arguments, including tracking, training parameters, finetuning parameters, paths for checkpoints, etc. These arguments are detailed within the files themselves.
Supported Datasets: The seq2seq.py script supports multiple datasets including:
- AmazonScience/mintaka (default): the Mintaka dataset
- mkqa-hf: MKQA dataset in Mintaka format from Dms12/mkqa_mintaka_format_with_question_entities
- mkqa: local MKQA dataset files (mkqa_train.json and mkqa_test.json)
- s-nlp/lc_quad2: LC-QuAD 2.0 dataset
To use a specific dataset, set the --dataset_name argument accordingly. For example:
```bash
python3 seq2seq.py --mode train_eval --dataset_name mkqa-hf --model_name t5-large
```
Lastly, if you prefer to use our prepared finetuned models and generated candidates, we have uploaded them to HuggingFace.
In both of our papers, we decided to use the golden question entities provided by the Mintaka dataset. The scope of our research was solely the novelty of the subgraphs and the efficacy of different ranking methods.
The subgraph extraction process extracts subgraphs related to entity candidates from question-and-answer sets by calculating shortest paths between entities in the Wikidata Knowledge Graph. You can either 1) use our prepared dataset at HuggingFace for Mintaka or 2) extract your own dataset.
- Parsing the Wikidata dump to build our Wikidata graph via iGraph.
- Loading our iGraph representation of Wikidata and generating the subgraph dataset.
All subgraph extraction code can be found in kbqa/subgraphs_dataset_creation/.
Wikidata frequently releases and updates its dumps, which are available in various formats: JSON, RDF, XML, etc. To use our code, first download the dump in JSON format. Then, to parse the Wikidata JSON dump, run:
```bash
python3 kbqa/subgraphs_dataset_creation/dump_repacker.py --data_path /path/to/downloaded_dump --save_path /path/to/parse_dump
```
where:
- --data_path refers to the path where the JSON dump is stored.
- --save_path refers to the path where the igraph triples representation will be saved.
After running the above script, a wikidata_triples.txt file will be created in the save path given above. This triples text file can then be loaded with igraph:
```python
from igraph import Graph

# graph_path is where we stored wikidata_triples.txt
igraph_wikidata = Graph.Read_Ncol(
    graph_path, names=True, directed=True, weights=True
)
```
Since parsing the Wikidata dump takes a long time, checkpoints were implemented. If the process crashes for some unfortunate reason, simply rerun kbqa/subgraphs_dataset_creation/dump_repacker.py; the code will automatically continue parsing from where the crash happened.
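The core extraction idea, connecting question entities to answer candidates via shortest paths, can be sketched in pure Python. The triples and entity IDs below are made up, and the repository itself runs igraph over the full Wikidata graph rather than this toy BFS:

```python
from collections import deque

# Toy triple store (hypothetical IDs, not real Wikidata entities/properties).
triples = [
    ("Q1", "Q2", "P10"),
    ("Q2", "Q3", "P20"),
    ("Q1", "Q4", "P30"),
    ("Q4", "Q3", "P40"),
]

# Undirected adjacency list: shortest paths here ignore edge direction.
adj = {}
for s, o, _ in triples:
    adj.setdefault(s, []).append(o)
    adj.setdefault(o, []).append(s)

def shortest_path(start, goal):
    """Plain BFS returning one shortest node path, or None if unreachable."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# A subgraph is the union of such shortest paths between every
# question entity and every answer candidate entity.
print(shortest_path("Q1", "Q3"))  # ['Q1', 'Q2', 'Q3']
```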
After we have parsed the Wikidata dump and have our igraph triples representation, we are ready for subgraph dataset generation. Firstly, we need to pre-process Mintaka (fetch the label for each Wikidata question entity and prepare the answer candidates generated by our LLMs, all in one accessible jsonl file). To do so, please run the jupyter notebook kbqa/experiments/subgraphs_datasets_prepare_input_data/mintaka_subgraphs_preparing.ipynb. The inputs of this notebook are the answer candidates generated by our LLMs (.csv and .json format for T5-like and Mixtral/Mistral models, respectively).
Finally, to fetch the desired subgraphs, run:
```bash
python3 kbqa/subgraphs_dataset_creation/mining_subgraphs_dataset_processes.py
```
which has the following available arguments:
- --save_jsonl_path indicates the path of the final resulting jsonl file (with our subgraphs).
- --igraph_wikidata_path indicates the path of the file with our igraph triples representation.
- --subgraphs_dataset_prepared_entities_jsonl_path indicates the path of the pre-processed Mintaka dataset, the output of /experiments/subgraphs_datasets_prepare_input_data/mintaka_subgraphs_preparing.ipynb.
- --n_jobs indicates how many jobs to use in our multiprocessing scheme. ATTENTION: each process requires ~60-80GB RAM.
- --skip_lines indicates the number of lines to skip in the prepared entities jsonl file (from --subgraphs_dataset_prepared_entities_jsonl_path).
After running the above script, the final data will be a jsonl file at the path given by --save_jsonl_path.
Each entry in the final .jsonl file will represent one question-answer pair and its corresponding subgraph. One sample entry can be seen below:
{"id":"fae46b21","question":"What man was a famous American author and also a steamboat pilot on the Mississippi River?","answerEntity":["Q893594"],"questionEntity":["Q1497","Q846570"],"groundTruthAnswerEntity":["Q7245"],"complexityType":"intersection","graph":{"directed":true,"multigraph":false,"graph":{},"nodes":[{"type":"INTERNAL","name_":"Q30","id":0},{"type":"QUESTIONS_ENTITY","name_":"Q1497","id":1},{"type":"QUESTIONS_ENTITY","name_":"Q846570","id":2},{"type":"ANSWER_CANDIDATE_ENTITY","name_":"Q893594","id":3}],"links":[{"name_":"P17","source":0,"target":0},{"name_":"P17","source":1,"target":0},{"name_":"P17","source":2,"target":0},{"name_":"P527","source":2,"target":3},{"name_":"P17","source":3,"target":0},{"name_":"P279","source":3,"target":2}]}}One could simply turn these JSONL files into pandas.DataFrame format. We recommend to upload this subgraph dataset on HuggingFace first. Then, in the next section, we could leverage this subgraph dataset to extract the features and finalise the final KGQA dataset.
In the original paper, we introduced 2 main ranking methods, which utilised the raw subgraphs themselves and a simple linearisation algorithm, G2T Deterministic. In the extended paper, we extracted more features from the subgraphs, including graph, text, and more G2T features.
Firstly, we need to prepare the neural-based complex G2T sequences (G2T T5 and G2T GAP). Unlike the neural-based G2T sequences, G2T Deterministic, introduced in the original paper, is not complex (it is an unravelling of the subgraph in its adjacency-matrix representation); therefore, G2T Deterministic sequences can be prepared along with the simple graph features. With that in mind, the G2T T5 and GAP generation code is stored inside /kbqa/experiments/graph2text.
To generate the G2T T5 and GAP sequences, we must first train T5 on the WebNLG dataset, as mentioned in the extended paper. If you'd like to use our finetuned T5 WebNLG models, we have uploaded them to HuggingFace. Otherwise, to train your own T5 on WebNLG, first download the WebNLG dataset (link in the original GAP repo) and store it inside kbqa/experiments/graph2text/data/webnlg. In the config file for finetuning T5 on WebNLG, kbqa/experiments/graph2text/configs/finetune_t5_webnlg.yaml, please pay attention to the paths for the train, dev, and test splits of the WebNLG dataset. With that in mind, please run:
```bash
python3 kbqa/experiments/graph2text/main_seq2seq.py -cd configs --config-name=finetune_t5_webnlg
```
With T5 and GAP finetuned on WebNLG, we can now generate G2T T5 and G2T GAP sequences for the Mintaka dataset. Firstly, we need to convert the Mintaka dataset into WebNLG format. To do so, run the notebook kbqa/experiments/graph2text/preprocess_mintaka_webnlg.ipynb. We need to prepare this format for T5-large-ssm, T5-xl-ssm, Mixtral, and Mistral (to do this, change the ds variable to the desired dataset, and use our prepared subgraph dataset or set your own prepared subgraph dataset path). After running the notebook, the datasets for T5-like and Mixtral/Mistral models in WebNLG format will be stored in kbqa/experiments/graph2text/data.
Firstly, to generate G2T T5 sequences, we need to edit the config files in kbqa/experiments/graph2text/configs. Place the path of the T5 finetuned on WebNLG (from the previous step) in the kbqa/experiments/graph2text/configs/model/t5_xl_checkpoint.yaml file. Then, with the Mintaka subgraph dataset in WebNLG format (in kbqa/experiments/graph2text/data), configure the appropriate train, val, and test paths for the desired LLM in kbqa/experiments/graph2text/configs/dataset. Lastly, run the following command for the desired [DATASET]:
```bash
python3 kbqa/experiments/graph2text/main_seq2seq.py -cd configs --config-name=graph2text_seq2seq_[DATASET]
```
After finishing these procedures, the G2T T5 dataset for the desired answer candidate LLM will be in .yaml format.
To generate G2T GAP, we use the same Mintaka subgraph dataset in WebNLG format. Clone the GAP Repo and change the desired dataset path in kbqa/experiments/graph2text/start_g2t_gap.sh. Lastly, run:
```bash
kbqa/experiments/graph2text/start_g2t_gap.sh
```
After finishing these procedures, the G2T GAP dataset for the desired answer candidate LLM will be in .txt format.
Now that we have the complex neural-based G2T features, we can prepare the rest of our subgraph features (which include text, graph, and G2T Deterministic features). Assuming we have stored the subgraphs dataset on HuggingFace, as recommended in the "Building the Subgraphs" section, we can extract the remaining subgraph features by running:
```bash
python3 kbqa/experiments/subgraphs_reranking/graph_features_preparation.py \
    --subgraphs_dataset_path HF_subgraphs_path \
    --g2t_t5_train_path path_to_generated_g2t_t5_train \
    --g2t_t5_val_path path_to_generated_g2t_t5_val \
    --g2t_t5_test_path path_to_generated_g2t_t5_test \
    --g2t_gap_train_path path_to_generated_g2t_gap_train \
    --g2t_gap_val_path path_to_generated_g2t_gap_val \
    --g2t_gap_test_path path_to_generated_g2t_gap_test \
    --upload_dataset True \
    --hf_path HF_finalized_dataset_path
```
The output will be a .csv file in the same format as the published finalised HuggingFace dataset. Please note that one would need to repeat the "Building the Subgraphs" and "Subgraphs Feature Extraction" sections for the train, val, and test splits of T5-large-ssm, T5-xl-ssm, Mistral, and Mixtral. The finalised HuggingFace dataset already combines all data splits and LLMs into one complete dataset.
Using the finalised dataset, we devised the following reranking methods to select the most probable answers from candidate lists:
- Regression-based: Logistic and Linear Regression models using graph features and MPNet embeddings of text and G2T features.
- Gradient Boosting (CatBoost): Gradient boosting models with graph features and MPNet embeddings of text and G2T features.
- Sequence Ranker (MPNet): Semantic similarity-based ranking using MPNet embeddings of G2T features.
- RankGPT: Zero-shot LLM-based reranking using instructional permutation generation (supports both Mintaka and MKQA-hf datasets).
These methods utilize various features extracted from the mined subgraphs, including graph structural features, text embeddings, and graph-to-text (G2T) sequence embeddings.
After training/fitting, all tuned rankers (besides Graphormer) generate the list of re-ranked answer candidates with the same skeleton, outlined in /kbqa/experiments/subgraphs_reranking/ranking_model.py. This list of re-ranked answer candidates (in jsonl format) is then evaluated with Hits@N metrics using kbqa/mintaka_evaluate.py.
Graphormer: As Graphormer was introduced in the original paper, it is the only ranker that was not updated to work with kbqa/experiments/subgraphs_reranking/ranking_model.py and kbqa/mintaka_evaluate.py. We are still working to refactor the code into the unified ranking pipeline introduced in the extended paper. With that in mind, you can train the Graphormer model with:
```bash
python3 kbqa/experiments/subgraphs_reranking/graphormer/train_graphormer.py run_name graphormer_run
```
To evaluate, please load the trained model (inside the desired --output_path) and run the notebook /kbqa/experiments/subgraphs_reranking/graphormer/graphormer.ipynb.
Regression-based: As we don't "train" regression-based models but simply fit them, fitting and generating with Logistic and Linear Regression are both in ranking.ipynb. The fitted model will produce a jsonl file, which includes the list of re-ranked answer candidates.
Gradient Boosting: to train Catboost:
```bash
python3 kbqa/experiments/subgraphs_reranking/graph_features/train_catboost_regressor.py
```
For the CatBoost code, there are several available arguments, detailed within experiments/subgraphs_reranking/graph_features/train_catboost_regressor.py. The most important arguments are:
- ds_type: which answer candidate LLM to use (T5-large-ssm, T5-xl-ssm, Mistral, or Mixtral)
- use_text_features: whether to use text features while training CatBoost
- use_graph_features: whether to use graph features while training CatBoost
- sequence_type: which G2T features to use while training CatBoost. If no G2T features are wanted, choose none; otherwise state the desired G2T sequence (g2t_determ, g2t_t5, or g2t_gap)
After training CatBoost on the desired answer candidate LLM subgraph dataset, please load the path of the tuned model in ranking.ipynb to evaluate. Please pay attention to the parameters of CatboostRanker() in ranking.ipynb (the different feature sets used must be passed in accordingly). It is important to note that the tuned model will generate and rank answer candidates to produce a ranking .jsonl file.
Sequence Ranker: to train the sequence ranker:
```bash
python3 kbqa/experiments/subgraphs_reranking/sequence/train_sequence_ranker.py
```
For the sequence ranker code, there are several available arguments, detailed within experiments/subgraphs_reranking/sequence/train_sequence_ranker.py. The most important arguments are:
- sequence_type: which sequence type to rank with (g2t_determ, g2t_gap, g2t_t5, or plain question-answer concatenation)
- do_highlighting: whether to use the highlighted sequence (wrapping special tokens around the answer candidate)
- model_name: the model used to classify the sequence (default: MPNet)
After training the sequence ranker on the desired answer candidate LLM subgraph dataset and sequence, please load the path of the tuned model in ranking.ipynb to evaluate. Please pay attention to the parameters of MPNetRanker() in ranking.ipynb (the different sequences used must be passed in accordingly). It is important to note that the tuned model will generate and rank answer candidates to produce a ranking .jsonl file.
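As an illustration of the highlighting idea mentioned above, the sketch below wraps special tokens around the answer candidate inside the sequence before scoring. The exact token strings and separator here are assumptions for illustration, not the repository's actual ones:

```python
def highlight(question, sequence, candidate,
              open_tok="[unused1]", close_tok="[unused2]"):
    """Wrap special tokens around the answer candidate inside the sequence.

    NOTE: the token strings and the "[SEP]" separator are hypothetical;
    the repository's special tokens may differ.
    """
    marked = sequence.replace(candidate, f"{open_tok} {candidate} {close_tok}")
    return f"{question} [SEP] {marked}"

print(highlight("Who wrote Tom Sawyer?",
                "Mark Twain author of Tom Sawyer",
                "Mark Twain"))
```

The highlighted sequence lets the classifier attend specifically to the candidate span rather than treating the whole linearised subgraph uniformly.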
RankGPT: RankGPT is a zero-shot LLM-based reranking approach that uses instructional permutation generation. It requires no training and works with OpenAI-compatible APIs (including vLLM). To use RankGPT:
```bash
cd experiments/subgraphs_reranking/rankgpt
python3 predict.py \
    --model_name meta-llama/Llama-2-7b-chat-hf \
    --dataset mintaka \
    --ds_type t5xlssm \
    --output_path /path/to/output.jsonl \
    --window_size 20 \
    --step_size 10
```
For the MKQA-hf dataset:
```bash
python3 predict.py \
    --model_name meta-llama/Llama-2-7b-chat-hf \
    --dataset mkqa-hf \
    --ds_type t5xlssm \
    --output_path /path/to/output.jsonl
```
Key arguments:
- --model_name: LLM model name for ranking (e.g., meta-llama/Llama-2-7b-chat-hf, mistralai/Mistral-7B-Instruct-v0.2)
- --dataset: dataset to use (mintaka or mkqa-hf)
- --ds_type: answer candidate LLM type (t5largessm, t5xlssm, mistral, mixtral)
- --window_size: window size for sliding-window ranking (default: 20)
- --step_size: step size for the sliding window (default: 10)
- --graph_sequence_feature: optional graph sequence feature (highlighted_determ_sequence or no_highlighted_determ_sequence)
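The sliding-window strategy can be sketched as follows: windows of window_size candidates are re-ranked one at a time, sliding from the end of the list toward the front by step_size. This is a simplified sketch of the general RankGPT strategy, not the repository's exact code:

```python
def sliding_windows(n, window_size=20, step_size=10):
    """Return (start, end) index bounds of each re-ranking window,
    sliding from the end of an n-candidate list toward the front."""
    end = n
    windows = []
    while end > 0:
        start = max(0, end - window_size)
        windows.append((start, end))
        if start == 0:
            break
        end -= step_size
    return windows

# For 35 candidates with the default window/step sizes:
print(sliding_windows(35))  # [(15, 35), (5, 25), (0, 15)]
```

Because consecutive windows overlap by window_size - step_size candidates, good answers near the bottom of the list can "bubble up" across windows toward the top.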
The ranker requires API configuration via environment variables:
```bash
export OPENAI_BASE_URL="http://localhost:8000/v1"  # Your vLLM or OpenAI endpoint
export OPENAI_API_KEY="your-api-key"               # Can be "EMPTY" for local vLLM
```
RankGPT automatically handles answer deduplication and uses a sliding-window strategy for large answer sets. The output format is compatible with mintaka_evaluate.py.
After producing the new list of re-ranked answer candidates, you can evaluate this .jsonl file by running:
```bash
python3 kbqa/mintaka_evaluate.py --predictions_path path_to_jsonl_prediction_file
```
Running the above command produces the final evaluation of the ranker. The evaluation includes Hits@1-5 for the entire Mintaka dataset and for each question type (intersection, comparative, generic, etc.).
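Conceptually, Hits@N over such a predictions file reduces to checking whether a gold answer appears among the top N re-ranked candidates of each question. The field names below are hypothetical, chosen for illustration rather than taken from the repository's actual schema:

```python
import json

# Two hypothetical prediction lines from a ranking .jsonl file.
predictions = [
    '{"ranked_answers": ["Q7245", "Q893594"], "gold": ["Q7245"]}',
    '{"ranked_answers": ["Q30", "Q42", "Q7245"], "gold": ["Q7245"]}',
]

def hits_at_n(lines, n):
    """Fraction of questions whose top-n re-ranked candidates contain a gold answer."""
    hits = 0
    for line in lines:
        rec = json.loads(line)
        hits += any(a in rec["gold"] for a in rec["ranked_answers"][:n])
    return hits / len(lines)

print(hits_at_n(predictions, 1))  # 0.5
print(hits_at_n(predictions, 3))  # 1.0
```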
The hardware requirements vary significantly depending on which components of the pipeline you plan to use:
- CPU: Multi-core processor (4+ cores recommended)
- RAM: 32GB minimum, 120GB recommended
- GPU: Optional, but recommended for training and inference
- For T5-base/large: 12GB VRAM
- For T5-3B/XL: 24GB VRAM
- For Mistral/Mixtral: 80GB VRAM
- Storage: 100GB+ free space for datasets and models
- CPU: High-performance multi-core processor (32+ cores recommended for parallel processing)
- RAM: 60-80GB per parallel process (critical requirement)
  - The --n_jobs parameter controls parallelism
  - Example: with --n_jobs=4, you need 240-320GB total RAM
  - Consider using fewer jobs if RAM is limited
- Storage: 2000GB+ free space for Wikidata dumps and parsed graph representations
- Time: Parsing Wikidata dump can take several days depending on hardware
Recommendation: Use our pre-computed datasets from HuggingFace instead of extracting subgraphs yourself unless you have access to high-memory compute infrastructure.
We have prepared a Docker environment for all experiments outlined above. Please run:
```bash
docker build -f ./Dockerfile -t kbqa_dev ./
docker run -v $PWD:/workspace/kbqa/ --network host -ti kbqa_dev
```
As this repository was researched and implemented with Wikidata as the Knowledge Graph, we have implemented several modules to work with the Wikidata Query API. All of the implemented modules can be found at kbqa/kbqa.
Firstly, Wikidata SPARQL endpoint and engine can be configured in kbqa/kbqa/config.py. By default, please use query.wikidata.org
```python
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
SPARQL_ENGINE = "blazegraph"
```
As an alternative, to use a GraphDB instance of Wikidata, use the following config. Please do not forget to forward port 7200 to your machine.
```python
SPARQL_ENDPOINT = "http://localhost:7200/repositories/wikidata"
SPARQL_ENGINE = "graphdb"
```
All of the different Wikidata tools can be found in kbqa/kbqa/wikidata. The implemented tools include the following (all with caching implemented):
- entity to label: convert a Wikidata entity ID to its label(s)
- label to entity: convert a label to Wikidata entity ID(s)
- shortest path: return the shortest path(s) between entity1 and entity2
- property: return the properties of an entity
- redirects: return all the redirects of an entity
- relationship: return the relationships of an entity
- subgraphs retriever: same idea as "Subgraphs Extraction", but via the Wikidata Query Service
- entity similarity: return the similarity score between entity1 and entity2
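As a sketch of what such a tool sends under the hood, the snippet below builds a label-lookup SPARQL query for the configured endpoint. The function name and query shape are illustrative rather than the repository's actual implementation; the wd: and rdfs: prefixes are predefined by the Wikidata Query Service:

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def label_query(entity_id, lang="en"):
    """SPARQL query fetching the label of a Wikidata entity in one language."""
    return (
        f"SELECT ?label WHERE {{ wd:{entity_id} rdfs:label ?label . "
        f'FILTER(LANG(?label) = "{lang}") }}'
    )

# A GET request to SPARQL_ENDPOINT with these parameters returns JSON bindings
# (the request itself is not executed here, keeping the sketch offline).
params = urlencode({"query": label_query("Q42"), "format": "json"})
print(params.startswith("query=SELECT"))  # True
```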
One could use any of the above tools by:
```python
from kbqa.wikidata import (
    WikidataEntityToLabel,
    WikidataShortestPathCache,
    WikidataLabelToEntity,
    WikidataRedirectsCache,
    wikidata_entity_similarity,
    ...
)
```
Examples of the above Wikidata tools can be found in kbqa/wikidata_tools_example.ipynb.
If you find any issues, do not hesitate to add them to GitHub Issues.
For any questions please contact: Mikhail Salnikov, Hai Le, or Alexander Panchenko
```bibtex
@inproceedings{salnikov-etal-2023-large,
    title = "Large Language Models Meet Knowledge Graphs to Answer Factoid Questions",
    author = "Salnikov, Mikhail and
      Le, Hai and
      Rajput, Prateek and
      Nikishina, Irina and
      Braslavski, Pavel and
      Malykh, Valentin and
      Panchenko, Alexander",
    editor = "Huang, Chu-Ren and
      Harada, Yasunari and
      Kim, Jong-Bok and
      Chen, Si and
      Hsu, Yu-Yin and
      Chersoni, Emmanuele and
      A, Pranav and
      Zeng, Winnie Huiheng and
      Peng, Bo and
      Li, Yuxi and
      Li, Junlin",
    booktitle = "Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation",
    month = dec,
    year = "2023",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.paclic-1.63",
    pages = "635--644",
}
```