Authors:
- GitHub: manhdo249
- Email: ducmanhdo2492003@gmail.com
HUST Deep Learning and Application Project:
This project is a Japanese Wiki search engine designed to efficiently retrieve information from Japanese-language Wikipedia articles (7,000 documents). It combines indexing, querying, and ranking algorithms to return results relevant to user queries.
- Step 1: Create a Conda environment named your_env_name with Python version 3.8
conda create -n ${your_env_name} python=3.8
- Step 2: Activate the newly created environment using the following command
conda activate ${your_env_name}
- Step 3: Install Packages from requirements.txt
pip install -r requirements.txt
This project uses the fujiki/llm-japanese-dataset_wikipedia dataset:
from datasets import load_dataset
dataset = load_dataset("fujiki/llm-japanese-dataset_wikipedia")
- We use dataset['train']['output'] as the wiki documents.
- We use MeCab to tokenize the text, breaking it down into individual words (tokens). With the -Owakati option, MeCab outputs the text with words separated by spaces, which makes further analysis easy (see the sketch below).
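A minimal sketch of the wakati tokenization step, assuming the mecab-python3 bindings and an installed dictionary (e.g. unidic-lite); the tokenize helper is only for illustration:

```python
import MeCab

# -Owakati makes MeCab print the surface forms separated by single spaces.
tagger = MeCab.Tagger("-Owakati")

def tokenize(text):
    # parse() returns e.g. "東京 は 日本 の 首都 です 。\n"
    return tagger.parse(text).strip().split()

print(tokenize("東京は日本の首都です。"))
```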
The initial query is processed with a coarse-grained search based on Term Frequency-Inverse Document Frequency (TF-IDF). To speed up querying, we precompute and store the TF-IDF scores of the words in each paragraph as well as the penalty scores d_s used in the formula below.
The k texts with the highest scores are selected (see the sketch after this paragraph).
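As an illustration only, here is a simplified TF-IDF coarse ranking with scikit-learn; it omits the penalty term d_s from the project's actual formula, and the sample documents, query, and function name are assumptions for the sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Paragraphs already segmented by MeCab (-Owakati), so tokens are space-separated.
docs = [
    "東京 は 日本 の 首都 で ある 。",
    "富士山 は 日本 で 最も 高い 山 で ある 。",
]

# Split on whitespace instead of the default analyzer, since the text is pre-tokenized.
vectorizer = TfidfVectorizer(analyzer=str.split)
doc_matrix = vectorizer.fit_transform(docs)        # (n_docs, vocab_size), sparse

def coarse_search(query_tokens, k=1):
    """Return the indices and scores of the k paragraphs with the highest TF-IDF scores."""
    query_vec = vectorizer.transform([" ".join(query_tokens)])
    scores = (doc_matrix @ query_vec.T).toarray().ravel()
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]

print(coarse_search(["日本", "首都"], k=1))
```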
- We use the sentence-luke-japanese-base-lite model to embed the texts.
- In this project we use FAISS (Facebook AI Similarity Search) to search the database vectors for similar vectors quickly and accurately. Similarity scores are computed with cosine similarity.
In this case, the query time is 2.32 s.
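A minimal sketch of the embedding and FAISS cosine search, under some assumptions: the Hugging Face model id sonoisa/sentence-luke-japanese-base-lite (the README names the model without a namespace), mean pooling over the last hidden states (a common choice that may differ from the project's code), and placeholder candidate texts and query:

```python
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "sonoisa/sentence-luke-japanese-base-lite"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    # Mean-pool the last hidden states to get one vector per text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)             # (B, H)
    return emb.numpy().astype("float32")

# Index the candidate paragraphs returned by the coarse TF-IDF search.
cand_texts = ["候補の段落その1", "候補の段落その2"]          # placeholder candidates
cand_emb = embed(cand_texts)
faiss.normalize_L2(cand_emb)                               # cosine = inner product on unit vectors
index = faiss.IndexFlatIP(cand_emb.shape[1])
index.add(cand_emb)

query_emb = embed(["検索クエリ"])
faiss.normalize_L2(query_emb)
scores, ids = index.search(query_emb, 2)                   # top-2 most similar candidates
print(ids, scores)
```

Normalizing both the document and query vectors and using an inner-product index (IndexFlatIP) is the standard way to get cosine similarity out of FAISS.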


