enricodoretto/information-retrieval

Information Retrieval Course Project

This project was built for the Information Retrieval 2020/2021 Course.

It is a Python implementation of a Boolean Information Retrieval system that answers AND, OR, and NOT queries, phrase queries using a positional index, and wildcard queries using a permuterm index.

Python dependencies

  • nltk (third-party; the remaining modules are part of the Python standard library)
  • functools
  • re
  • csv
  • time
  • pickle
  • sys
  • gc

Initialization

  • Start the program
  • Press "i" to build the index the first time
  • The index will be a Trie data structure
  • The index will be automatically saved in a file in the data/ folder
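
A minimal sketch of what a Trie-backed index saved with pickle could look like. The class names, the `save_index`/`load_index` helpers, and the idea of storing a set of document IDs at each terminal node are assumptions made for illustration, not the project's actual code.

```python
import pickle


class TrieNode:
    """One character step in the trie; postings are filled only where a term ends."""

    def __init__(self):
        self.children = {}     # next character -> TrieNode
        self.postings = set()  # IDs of documents containing the term ending here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, doc_id):
        """Walk or create the path for `term`, then record `doc_id` at its end."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.add(doc_id)

    def search(self, term):
        """Return the document IDs for `term`, or an empty set if it is absent."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.postings


def save_index(trie, path):
    """Serialize the whole trie to `path` (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump(trie, fh)


def load_index(path):
    """Load a trie previously written by save_index (hypothetical helper)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

With this layout, building the index once and reloading it on later runs avoids re-reading the whole corpus; the file name under data/ is left to the caller.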

Operations performed

  • Normalization: punctuation is removed and text is converted to lower case
  • Tokenization: the text is split into words on spaces
  • Stop word removal
  • Stemming with the Porter stemmer
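
The four steps above can be sketched as a single function. This is an illustrative sketch under assumptions: the function name `preprocess`, the tiny stop-word list, and the pluggable `stem` parameter are hypothetical. The stemmer defaults to a no-op so the block runs without nltk installed.

```python
import re

# Tiny illustrative stop-word list (an assumption); the project presumably
# relies on a much fuller list such as the one shipped with nltk.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "to", "is"})


def identity(token):
    """Placeholder stemmer that returns the token unchanged."""
    return token


def preprocess(text, stop_words=STOP_WORDS, stem=identity):
    """Normalize, tokenize, remove stop words, then stem."""
    # Normalization: drop punctuation, then lower-case.
    normalized = re.sub(r"[^\w\s]", "", text).lower()
    # Tokenization: split on whitespace.
    tokens = normalized.split()
    # Stop word removal followed by stemming.
    return [stem(token) for token in tokens if token not in stop_words]
```

To use the Porter stemmer, pass `stem=PorterStemmer().stem` after `from nltk.stem import PorterStemmer`. The same function should be applied to both documents and query terms so that they meet in the index under the same form.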

Perform queries

  • Start the program after the index is built
  • Press "q" to perform a query
  • Press "i" to automatically load the index
  • Once the index is loaded, press "q" and write the query

And, Or, Not queries

  • Write the query in the form term operator term
  • The operators can be and, or, and not
  • To retrieve all documents that do not contain a term, write the query as not term
  • You can chain more than one operator in a query, like term operator term operator term ...
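
As a rough illustration of how such queries map onto set operations, the sketch below evaluates a query strictly from left to right over a plain dictionary of posting sets. The function name `evaluate`, the dictionary-based index, the `all_docs` parameter, and the absence of operator precedence are all assumptions, not the project's documented behaviour.

```python
def evaluate(query, index, all_docs):
    """Evaluate `term op term op term ...` left to right over posting sets.

    index    -- dict mapping each term to the set of doc IDs containing it
    all_docs -- set of every doc ID, needed to compute the complement for `not`
    """
    result = None    # running result; None until the first term is seen
    operator = None  # pending binary operator: "and" or "or"
    negate = False   # True when the next term is preceded by "not"

    for token in query.lower().split():
        if token in ("and", "or"):
            operator = token
        elif token == "not":
            negate = True
        else:
            docs = index.get(token, set())
            if negate:
                docs = all_docs - docs  # NOT is the complement of the postings
                negate = False
            if result is None:
                result = docs
            elif operator == "and":
                result = result & docs  # AND is set intersection
            elif operator == "or":
                result = result | docs  # OR is set union
            else:
                raise ValueError("missing operator before term: " + token)
            operator = None

    return result if result is not None else set()
```

For example, with `index = {"space": {1, 2, 3}, "nasa": {2}}` and `all_docs = {1, 2, 3, 4}`, the query `space and not nasa` intersects `{1, 2, 3}` with the complement `{1, 3, 4}`.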

Phrase queries

  • Simply put two or more terms in order separated by a space, like term term term
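
A positional index stores, for each term, the positions at which it occurs in each document, so a phrase matches when its terms appear at consecutive positions. The sketch below shows the idea; the function names and the nested-dictionary layout are assumptions, and the text is only lower-cased and split on spaces rather than fully preprocessed.

```python
def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_query(phrase, index):
    """Return IDs of documents where the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, {}) for term in terms]

    # Only documents that contain every term can contain the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    matches = set()
    for doc_id in candidates:
        # The k-th term must appear exactly k positions after the first term.
        for start in postings[0][doc_id]:
            if all(start + offset in postings[offset][doc_id]
                   for offset in range(1, len(terms))):
                matches.add(doc_id)
                break
    return matches
```

Here a document containing "york is new" holds both terms of the phrase "new york" but is rejected, because "york" does not sit one position after "new".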

Wildcard queries

  • Put a '#' in the part of the term you don't know, like te#m
  • You can also use multiple wildcards by adding more '#', like #e#m
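
A permuterm index stores every rotation of each term followed by an end marker, so a wildcard pattern can be rotated until the wildcard sits at the end and then answered as a prefix lookup. The sketch below is illustrative only: the function names, the `$` end marker, the assumption that '#' matches zero or more characters, and the regex post-filter for multiple wildcards are all assumptions. It scans a dictionary linearly where a Trie would do a real prefix search.

```python
import re


def permuterm_rotations(term):
    """Return every rotation of term + '$' (the '$' marks the end of the term)."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm_index(vocabulary):
    """Map each rotation back to the term it came from."""
    index = {}
    for term in vocabulary:
        for rotation in permuterm_rotations(term):
            index[rotation] = term
    return index


def wildcard_lookup(pattern, permuterm_index):
    """Return the vocabulary terms matching a pattern that uses '#' as wildcard."""
    if "#" not in pattern:
        return {pattern} if pattern + "$" in permuterm_index else set()

    # Keep the text before the first '#' and after the last '#', and rotate
    # the pattern so that it reads tail + '$' + head followed by the wildcard.
    head, _, rest = pattern.partition("#")
    tail = rest.rpartition("#")[2]
    prefix = tail + "$" + head

    candidates = {term for rotation, term in permuterm_index.items()
                  if rotation.startswith(prefix)}

    # Post-filter: fragments between wildcards must also appear, in order.
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("#")))
    return {term for term in candidates if regex.fullmatch(term)}
```

For te#m the lookup prefix is `m$te`, which matches the rotation `m$ter` of `term$`. For #e#m the prefix is only `m$`, so the post-filter is what removes terms that end in "m" but contain no "e".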

All queries combined

  • You can combine all kinds of queries, like term1 and term2 term3 and not term4, where single terms can be wildcards
  • Priority will always be given to phrase queries
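
One way to read "priority to phrase queries" is that consecutive terms with no operator between them are grouped into a phrase before any Boolean operator is applied. The sketch below shows only that grouping step; the function name `group_query` and the tuple-based output are hypothetical, and the project may parse queries differently.

```python
OPERATORS = {"and", "or", "not"}


def group_query(query):
    """Split a query into ("phrase", [terms]) and ("op", operator) items.

    Consecutive non-operator tokens are collected into one phrase, so phrases
    are resolved before any operator is applied.
    """
    groups = []
    phrase = []
    for token in query.lower().split():
        if token in OPERATORS:
            if phrase:
                groups.append(("phrase", phrase))
                phrase = []
            groups.append(("op", token))
        else:
            phrase.append(token)
    if phrase:
        groups.append(("phrase", phrase))
    return groups
```

Under this reading, `car or motor# and gran# prix` becomes the single term car, the wildcard term motor#, and the two-term phrase gran# prix, joined by or and and.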

Evaluation

Test queries

  • «christmas and New York»
  • «McCallister and christmas»
  • «space and not Nasa»
  • «car or motor# and gran# prix»
