This is the code for the nearest-neighbor (NN) experiment in the paraphrase paper. It can embed and index 400M sentences and do fast lookup among them to establish how often paraphrases are found in each other's nearest-neighbor lists. This readme documents the order in which the steps should run and how the experiment is built.
The overall input to this pipeline is a list of unique sentences to embed and index. It is produced by the following scripts:
- gather_uniq_sents.sh produces the file all_data_pos_uniq.gz with unique, clean(-ish) sentences
- clean_data.py does elementary cleanup by dropping too short, too long, or too weird sentences
We need to be able to look up a sentence position (index) in
all_data_pos_uniq.gz by its text. This is done by calculating a 15-byte hash
of the sentence (there seem to be no collisions at this length on our 400M
sentence data).
- index_texts.sbatch is the script to run indexing
- index_sentences.py implements the indexing, and produces a single large pickle file which is a dictionary of sentence-text -> position index as integer
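The hash-based lookup described above can be sketched as follows. This is a minimal illustration, not the code in index_sentences.py: the choice of BLAKE2b as the hash function and the names `sentence_key`, `build_text2index`, and `lookup` are assumptions made for the example. Only the 15-byte digest length comes from the text.

```python
import hashlib

HASH_BYTES = 15  # digest length mentioned in the text above


def sentence_key(text):
    """Return a 15-byte digest of the sentence (BLAKE2b is an assumption)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=HASH_BYTES).digest()


def build_text2index(sentences):
    """Map digest -> line position; fail loudly if two lines share a digest."""
    index = {}
    for position, text in enumerate(sentences):
        key = sentence_key(text)
        if key in index:
            raise ValueError(f"collision or duplicate sentence at line {position}")
        index[key] = position
    return index


def lookup(index, text):
    """Return the line position of a sentence, or None if it is not indexed."""
    return index.get(sentence_key(text))
```

Keying the dictionary on a short fixed-length digest rather than on the full text keeps the memory footprint per entry roughly constant, which matters at the 400M-sentence scale.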
We will need to be able to print the sentences based on their position (index)
in all_data_pos_uniq.gz. This step implements a simple indexing for fast
random access to the sentence file. The data is stored without gzip compression, so it consumes about as much
space as the unzipped all_data_pos_uniq.gz.
- zcat all_data_pos_uniq.gz | python3 mmap_index.py --index-lines-to all_data_pos_uniq
- this produces several files all_data_pos_uniq.data, all_data_pos_uniq.index, all_data_pos_uniq.lengths, all_data_pos_uniq.meta which can be used to get sentence text based on its ID
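The idea behind this kind of random-access index can be sketched as below. It is a simplified illustration and does not reproduce the actual file layout written by mmap_index.py: here the offsets and lengths are kept in an in-memory list instead of separate .index and .lengths files, and the function names `build_line_index` and `get_line` are hypothetical.

```python
import mmap


def build_line_index(data_path, lines):
    """Write the lines uncompressed and return (byte offset, byte length) per line."""
    entries = []
    offset = 0
    with open(data_path, "wb") as out:
        for line in lines:
            raw = line.rstrip("\n").encode("utf-8")
            out.write(raw)
            entries.append((offset, len(raw)))
            offset += len(raw)
    return entries


def get_line(data_path, entries, line_id):
    """Fetch one sentence by its position without scanning the file.

    Note: mmap cannot map an empty file, so the data file must be non-empty.
    """
    offset, length = entries[line_id]
    with open(data_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length].decode("utf-8")
```

Because offsets are counted in bytes after UTF-8 encoding, sentences with non-ASCII characters are retrieved correctly.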
The embeddings of each sentence in all_data_pos_uniq.gz need to be calculated
in a distributed fashion. They are stored in iterable pickle files, one file
per parallel embedding process. The input is a single file, and each
individual embed_data.py job is told its rank and the
total number of jobs. So if there are 30 processes, the rank 1 process takes lines
0, 30, 60, ...; the rank 2 process takes lines 1, 31, 61, ...; and so on. For the 400M sentences, this
produces about 1.2TB of embeddings.
- embed.py produces BERT embeddings, embed_sbert.py produces SBERT embeddings
- embed_data.py is joint for the above two and does dynamic batching etc
- embed_sbatch.sh is a script (slurm also) which runs a single process of embed_sbert.py
- gen_embed_sbatch.sh generates a number of sbatch commands which submit the slurm jobs for the parallel runs
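The way the single input file is split among the parallel jobs can be illustrated with a short sketch. The function name `lines_for_rank` is hypothetical, and the 1-based rank follows the numbering used in the description above; the actual argument handling in embed_data.py may differ.

```python
from itertools import islice


def lines_for_rank(lines, rank, total_jobs):
    """Return an iterator over the lines handled by one job.

    Ranks are 1-based: rank 1 gets lines 0, total_jobs, 2 * total_jobs, ...
    """
    if not 1 <= rank <= total_jobs:
        raise ValueError("rank must be between 1 and total_jobs")
    return islice(lines, rank - 1, None, total_jobs)
```

Each job streams the whole input but only keeps its own lines, so no pre-splitting of the input file is needed.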
This step takes the embeddings from step 4 and pushes them into a FAISS index.
- create_faiss_index.py builds the index. This needs to happen in three steps: a) sample data on which the FAISS index's quantizer is trained; b) create an index pre-trained with this data; c) fill the index with all data
- create_faiss_sbatch.sh carries out these steps and ends up producing faiss_index_filled_{bert|sbert}.faiss files
- print_nearest_from_faiss_index.py takes as input any number of sentences, embeds them with BERT or SBERT, and uses the FAISS index from step 5 to get their nearest neighbors. These are then output as pickled tensors with the ids of the nearest neighbors. By default the code asks for 2048 NNs, which is the max allowed on GPU
- query_sents_sbatch.sh is the slurm script for this
- id2txt.py takes a) the input file for (6); b) the id2text sentence index from step (3); c) the nearest neighbor output from step (6); and prints all of these as texts, not ids; this makes it possible to check the sanity of the output and post-process it any way you like
Establishing how often a paraphrase pair is found and on what rank. Given
paraphrase data (sent1,sent2,label), this code outputs how often sent2 is
found when querying with sent1. The assumption here is that these were
included in the original data. But because the original data is produced with
sort and uniq, and there is plenty of it, we don't necessarily know at which
index sent1 and sent2 reside.
- run_nn_experiment.py is first run with these parameters: a) the query file, which contains the texts of the query sentences used in step (6), i.e. when building the nearest neighbor lists; b) the text2index index from step (2); c) paraphrase data with the pairs in the Turku Paraphrase corpus format. When run like this, it outputs the paraphrase data with additional keys in the json that contain the indices into all_data_pos_uniq.gz and the query file
- run_nn_experiment.py is then run with these parameters: a) the output of the previous step, basically a paraphrase json file with information about where to find what; b) the nearest neighbor output from step (6); c) the fast mmap id->text lookup from step (3). It outputs the statistics and prints the actual sentences for the NN experiment
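The core statistic of the experiment, how often sent2 is found among the neighbors of sent1 and at what rank, can be sketched as follows. This is an illustration under simplifying assumptions, not the logic of run_nn_experiment.py: neighbor lists are plain Python lists of ids (pickled tensors would first need converting, e.g. with `.tolist()`), each pair is given as (row of sent1 in the query file, index of sent2 in all_data_pos_uniq.gz), and the names `paraphrase_ranks` and `summarize` are hypothetical.

```python
def paraphrase_ranks(pairs, neighbor_lists):
    """Return, for each pair, the 0-based rank of the target among the query's
    neighbors, or None if the target is not in the list at all."""
    ranks = []
    for query_row, target_id in pairs:
        neighbors = neighbor_lists[query_row]
        try:
            ranks.append(neighbors.index(target_id))
        except ValueError:
            ranks.append(None)
    return ranks


def summarize(ranks, cutoffs=(1, 10, 100)):
    """Return the fraction of pairs whose target is within the top k, for each k."""
    total = len(ranks)
    if total == 0:
        return {k: 0.0 for k in cutoffs}
    return {
        k: sum(1 for r in ranks if r is not None and r < k) / total
        for k in cutoffs
    }
```

For example, with two query sentences and neighbor lists of three ids each:

```python
ranks = paraphrase_ranks([(0, 7), (1, 8)], [[5, 7, 9], [2, 4, 6]])
print(ranks)                      # [1, None]
print(summarize(ranks, (1, 2)))   # {1: 0.0, 2: 0.5}
```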