Authors:
- GitHub: manhdo249
- Email: ducmanhdo2492003@gmail.com
HUST Deep Learning and Application Project:
This project is a Japanese Wiki search engine designed to efficiently retrieve information from Japanese-language Wikipedia articles (7,000 documents). It combines indexing, querying, and ranking algorithms to return results relevant to user queries.
- Step 1: Create a Conda environment named your_env_name with Python version 3.8
conda create -n ${your_env_name} python=3.8
- Step 2: Activate the newly created environment using the following command
conda activate ${your_env_name}
- Step 3: Install Packages from requirements.txt
pip install -r requirements.txt
This project uses the fujiki/llm-japanese-dataset_wikipedia dataset:
from datasets import load_dataset
dataset = load_dataset("fujiki/llm-japanese-dataset_wikipedia")
- We use dataset['train']['output'] as the wiki documents.
- We use MeCab to tokenize the text, breaking it down into individual words (tokens). With the -Owakati option, MeCab outputs the text with words separated by spaces, which makes further analysis easy (see the sketch below).
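A minimal sketch of the wakati tokenization step, assuming the mecab-python3 bindings and an installed dictionary (e.g. unidic-lite); the tokenize helper is only for illustration:

```python
import MeCab

# -Owakati makes MeCab print the surface forms separated by single spaces.
tagger = MeCab.Tagger("-Owakati")

def tokenize(text):
    # parse() returns e.g. "東京 は 日本 の 首都 です 。\n"
    return tagger.parse(text).strip().split()

print(tokenize("東京は日本の首都です。"))
```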
The initial query is processed with a coarse-grained search based on Term Frequency-Inverse Document Frequency (TF-IDF). To speed up querying, we precompute and store the TF-IDF scores of the words in each paragraph as well as the penalty scores d_s used in the formula below.
The k texts with the highest scores are selected (see the sketch after this paragraph).
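As an illustration only, here is a simplified TF-IDF coarse ranking with scikit-learn; it omits the penalty term d_s from the project's actual formula, and the sample documents, query, and function name are assumptions for the sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Paragraphs already segmented by MeCab (-Owakati), so tokens are space-separated.
docs = [
    "東京 は 日本 の 首都 で ある 。",
    "富士山 は 日本 で 最も 高い 山 で ある 。",
]

# Split on whitespace instead of the default analyzer, since the text is pre-tokenized.
vectorizer = TfidfVectorizer(analyzer=str.split)
doc_matrix = vectorizer.fit_transform(docs)        # (n_docs, vocab_size), sparse

def coarse_search(query_tokens, k=1):
    """Return the indices and scores of the k paragraphs with the highest TF-IDF scores."""
    query_vec = vectorizer.transform([" ".join(query_tokens)])
    scores = (doc_matrix @ query_vec.T).toarray().ravel()
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]

print(coarse_search(["日本", "首都"], k=1))
```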
- We use the sentence-luke-japanese-base-lite model to embed the texts.
- In this project we use FAISS (Facebook AI Similarity Search) to search the database vectors for similar vectors quickly and accurately. Similarity scores are computed with cosine similarity.
In this case, the query time is 2.32 s.
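A minimal sketch of the embedding and FAISS cosine search, under some assumptions: the Hugging Face model id sonoisa/sentence-luke-japanese-base-lite (the README names the model without a namespace), mean pooling over the last hidden states (a common choice that may differ from the project's code), and placeholder candidate texts and query:

```python
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "sonoisa/sentence-luke-japanese-base-lite"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    # Mean-pool the last hidden states to get one vector per text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)             # (B, H)
    return emb.numpy().astype("float32")

# Index the candidate paragraphs returned by the coarse TF-IDF search.
cand_texts = ["候補の段落その1", "候補の段落その2"]          # placeholder candidates
cand_emb = embed(cand_texts)
faiss.normalize_L2(cand_emb)                               # cosine = inner product on unit vectors
index = faiss.IndexFlatIP(cand_emb.shape[1])
index.add(cand_emb)

query_emb = embed(["検索クエリ"])
faiss.normalize_L2(query_emb)
scores, ids = index.search(query_emb, 2)                   # top-2 most similar candidates
print(ids, scores)
```

Normalizing both the document and query vectors and using an inner-product index (IndexFlatIP) is the standard way to get cosine similarity out of FAISS.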


