This project was built for the Information Retrieval 2020/2021 Course.
It is a Python implementation of a Boolean Information Retrieval System that can answer AND, OR, and NOT queries, as well as phrase queries using a positional index and wildcard queries using a permuterm index.
- nltk
- functools
- re
- csv
- time
- pickle
- sys
- gc
- Start the program
- Press "i" to build the index the first time
- The index will be a Trie data structure
- The index will be automatically saved to a file in the data/ folder
- Normalization: punctuation is removed and the text is converted to lower case
- Tokenization: the text is split into words on spaces
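
As a rough illustration of the steps above, here is a minimal sketch of a trie that maps terms to document IDs and is saved with `pickle`. The class and function names (`TrieNode`, `Trie`, `save_index`, `load_index`) are hypothetical and are not taken from the project's code; the actual index also stores positions and permuterm rotations.

```python
import pickle


class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.postings = set()  # IDs of the documents containing the term that ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, doc_id):
        """Walk (and create) one node per character, then record the document."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.add(doc_id)

    def search(self, term):
        """Return the set of document IDs for `term`, or an empty set."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.postings


def save_index(trie, path):
    """Serialize the index to disk (the path is chosen by the caller)."""
    with open(path, "wb") as f:
        pickle.dump(trie, f)


def load_index(path):
    """Load a previously saved index."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Loading a pickled index is usually much faster than rebuilding it from the corpus, which is presumably why the index is built only once.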
- Stop words removal
- Porter Stemmer
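
The four preprocessing steps above can be sketched as a single function. This is only an illustration: the stop-word list is a tiny hypothetical one, and the stemmer is passed in as a parameter so the sketch runs without extra downloads.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the project's list)
STOP_WORDS = {"a", "an", "in", "is", "of", "the", "to"}


def preprocess(text, stem=lambda token: token):
    """Normalize, tokenize, drop stop words, then stem each token."""
    # Normalization: lower-case the text and strip punctuation
    normalized = re.sub(r"[^\w\s]", "", text.lower())
    # Tokenization: split on whitespace
    tokens = normalized.split()
    # Stop-word removal followed by stemming
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The Lost Boy, in New York!"))
```

With nltk installed, the Porter stemmer can be plugged in as `preprocess(text, stem=PorterStemmer().stem)` after `from nltk.stem import PorterStemmer`.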
- Start the program after the index is built
- Press "q" to perform a query
- Press "i" to automatically load the index
- Once the index is loaded press "q" and write the query
- Write the query in the form term operator term
- The supported operators are "and", "or" and "not"
- To retrieve all documents that do not contain a term, write the query as not term
- You can use more than one operator in a query, like term operator term operator term ...
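
One way to picture how such queries are answered is as set operations on the document IDs of each term. The sketch below evaluates a flat query strictly left to right; that evaluation order, and the names `evaluate`, `postings` and `all_docs`, are assumptions made for illustration and may differ from the project's behaviour.

```python
def evaluate(tokens, postings, all_docs):
    """Evaluate a flat query such as ['space', 'and', 'not', 'nasa'] left to right."""
    result = None
    op = None        # pending binary operator ("and" / "or")
    negate = False   # True when the next term is preceded by "not"
    for tok in tokens:
        if tok in ("and", "or"):
            op = tok
        elif tok == "not":
            negate = True
        else:
            docs = postings.get(tok, set())
            if negate:
                # "not term": every document that does not contain the term
                docs = all_docs - docs
                negate = False
            if result is None:
                result = docs
            elif op == "or":
                result = result | docs
            else:
                # "and" (also used as the fallback in this sketch)
                result = result & docs
    return result if result is not None else set()


# Hypothetical toy data: term -> set of document IDs
postings = {"space": {1, 2, 3}, "nasa": {2}, "car": {4}}
all_docs = {1, 2, 3, 4}
print(evaluate(["space", "and", "not", "nasa"], postings, all_docs))
```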
- Simply put two or more terms in order separated by a space, like term term term
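
A phrase query relies on the positional index, which records where each term occurs in each document. The following is a minimal sketch, assuming a hypothetical index layout of the form term -> {doc_id: [positions]}; the project's internal layout may differ.

```python
def phrase_query(terms, index):
    """Return the doc IDs where `terms` occur consecutively and in order."""
    if not terms or any(t not in index for t in terms):
        return set()
    # Candidate documents must contain every term of the phrase
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    matches = set()
    for doc in candidates:
        # The phrase starts at position p if the k-th term appears at p + k
        for p in index[terms[0]][doc]:
            if all(p + k in index[t][doc] for k, t in enumerate(terms)):
                matches.add(doc)
                break
    return matches


# Hypothetical toy index: "new york" is adjacent only in document 1
index = {"new": {1: [0, 5], 2: [3]}, "york": {1: [1], 2: [7]}}
print(phrase_query(["new", "york"], index))
```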
- Put a '#' in the part of the term you don't know, like te#m
- You can also use multiple wildcards by adding more '#', like #e#m
- You can combine all kinds of queries, like term1 and term2 term3 and not term4, where single terms can be wildcards
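
Wildcard lookup with a permuterm index can be sketched as follows. Every rotation of term + '$' is stored, and a pattern X#Y is rotated to Y$X so that it becomes a prefix search. This sketch uses a plain dict with a linear scan instead of the project's trie, and the function names are hypothetical. Patterns with several '#' are handled here by searching on the first and last parts and then filtering, which is a common textbook approach and not necessarily what the project does.

```python
import re


def build_permuterm(vocabulary):
    """Map every rotation of term + '$' back to the original term."""
    index = {}
    for term in vocabulary:
        marked = term + "$"
        for i in range(len(marked)):
            index[marked[i:] + marked[:i]] = term
    return index


def wildcard_lookup(pattern, index):
    """Return the terms matching a pattern that uses '#' as the wildcard."""
    if "#" not in pattern:
        return {pattern} if pattern + "$" in index else set()
    parts = pattern.split("#")
    head, tail = parts[0], parts[-1]
    # Rotate so the wildcard falls at the end: X#Y becomes the prefix Y$X
    prefix = tail + "$" + head
    candidates = {t for rot, t in index.items() if rot.startswith(prefix)}
    # Post-filter, needed when the pattern contains more than one '#'
    regex = re.compile(".*".join(re.escape(p) for p in parts))
    return {t for t in candidates if regex.fullmatch(t)}


index = build_permuterm(["term", "team", "tram", "time", "item"])
print(sorted(wildcard_lookup("te#m", index)))
print(sorted(wildcard_lookup("#e#m", index)))
```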
- Priority will always be given to phrase queries
- «christmas and New York»
- «McCallister and christmas»
- «space and not Nasa»
- «car or motor# and gran# prix»