
Using the Dictionary Generating Script

maxdudek edited this page Jul 18, 2018 · 6 revisions


Getting Started

The compound dictionary was built quickly with the script dictionary_generator.py. Because this script may be useful beyond building a dictionary of chemical compounds, this page explains how to configure and use it.

How the Command Line Application Works

The dictionary generating function is simply a script that iterates through a list of elements, and prompts the user to classify an element if it is unrecognized. The user can choose to add the element as a key in a dictionary, where it is assigned a value. The user can also choose to add the element to a list of ignored elements.

There are three important data structures used by the dictionary generator, which are stored in json input files:

  • The compound dictionary is the dictionary which is being built up by the dictionary generator. It is used as an input as well, because the generator will not prompt the user to add elements which are already found in the dictionary. By default, the dictionary is located in Input\compound_dictionary.json.
  • The unknownList is a list of elements ignored by the application. They may be unrecognized compounds, or the results of errors in the detail parser. The user can choose to add elements to this list. By default, the list is located in Input\unknown_list.json.
  • The stopWords list is also a list of words ignored by the application. The difference between this list and the unknownList is that stop words are used by the structure.parseDetails() function to avoid treating plain English words as names of chemical compounds. If the dictionary generator is not being used to improve the database, then it will not be necessary to add elements to this list. By default, the list is located in Input\stop_words.json.
For more information on how these files are used in the database, see Input Files.

Specification of Input Files

The location of the input files is specified at the top of the script:

COMPOUND_DICTIONARY_FILE = "Input\\compound_dictionary.json"
UNKNOWN_LIST_FILE = "Input\\unknown_list.json"
STOP_WORDS_FILE = "Input\\stop_words.json"

The default input files are used to build the compound dictionary which standardizes the names for the database.

If you wish to start from scratch with blank input files or use custom input files, then change the filenames to the name of the file you would like to use. If the file specified does not exist, then a new blank file will be created with that name.
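
The "create a blank file if it does not exist" behavior described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: the helper name load_or_create and the choice of an empty dict or empty list as the "blank" content are assumptions.

```python
import json
import os

def load_or_create(path, default):
    """Hypothetical helper: return the JSON content stored at `path`.

    If the file is missing, write `default` to it first (an empty dict for
    the dictionary, an empty list for the two lists) and return `default`.
    """
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump(default, f)
        return default
    with open(path) as f:
        return json.load(f)
```

For example, load_or_create("my_dictionary.json", {}) would return an empty dictionary on the first run and the saved dictionary on later runs.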

IMPORTANT: The file locations specified above are for a Windows file system. For a Unix file system, you may need to change the directory syntax from backslashes to forward slashes.
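
One way to avoid editing the separators by hand is to build the paths with os.path.join, which uses the correct separator for the current operating system. The snippet below is a suggested alternative, not what the script currently contains.

```python
import os

# os.path.join inserts "\" on Windows and "/" on Unix-like systems.
COMPOUND_DICTIONARY_FILE = os.path.join("Input", "compound_dictionary.json")
UNKNOWN_LIST_FILE = os.path.join("Input", "unknown_list.json")
STOP_WORDS_FILE = os.path.join("Input", "stop_words.json")
```

Python also accepts forward slashes in paths on Windows, so "Input/compound_dictionary.json" should work on both systems.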

NOTE: Never edit these files while the application is running, or the changes will be overwritten by the program.

Configure Inputs

Below the filename specifications is the configuration for the possible inputs of the application. When the script reads an unrecognized compound (a compound that is not already in any of the data structures) it will prompt the user for an input. The following lines specify which input strings correspond to valid commands:

INPUT_SAME = "="
INPUT_UNKNOWN = "unknown"
INPUT_STOP_WORDS = "sw"
INPUT_ADD_STOP_WORD = "add"
INPUT_PASS = "pass"
INPUT_UNDO = "u"
INPUT_SAVE = "save"
INPUT_QUIT = "quit"
INPUT_QUIT_WITHOUT_SAVING = "quit no save" 

Here is an explanation of every input option:

  • INPUT_SAME: Adds the current element to the dictionary exactly as it is. For example, "sodium chloride" would map to "sodium chloride".
  • INPUT_UNKNOWN: Adds the current element to the unknownList.
  • INPUT_STOP_WORDS: Adds all words in the element to the stopWords list. For example, if the element is "well plate", then both "well" and "plate" will be added to the list.
  • INPUT_ADD_STOP_WORD: Adds a single word to the stopWords list. For example, using the default input "add", "add well" will append "well" to the stopWords list. The word added does not necessarily need to be a part of the element that the application is currently asking about.
  • INPUT_PASS: Skips all instances of the element in the current session, temporarily. Useful if you don't want to classify an element right away, and want to move on to the next one.
  • INPUT_UNDO: If the last input added to the unknownList or the dictionary, this command will undo that action and re-prompt for the last element. It will not remove elements from stopWords. This command is not 100% reliable, so the best way to undo a mistake is to EXIT THE APPLICATION and change the corresponding text/json file directly.
  • INPUT_SAVE: Saves changes to all of the data structures in their respective files. By default, this happens automatically after every step.
  • INPUT_QUIT: Saves changes and exits the script.
  • INPUT_QUIT_WITHOUT_SAVING: Quit without saving.
  • Any other input is treated as a dictionary value and added to the dictionary with the current element as the key. For example, if the application prompts the user about "nacl", and the user enters "sodium chloride", then the entry "nacl": "sodium chloride" will be added to the dictionary.
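
To make the classification commands above concrete, here is a minimal sketch of how they might be dispatched. The function apply_input is hypothetical and is not taken from dictionary_generator.py. For brevity it leaves out the pass, undo, save, and quit commands, and it uses the element itself as the key instead of normalizing it with getKey().

```python
INPUT_SAME = "="
INPUT_UNKNOWN = "unknown"
INPUT_STOP_WORDS = "sw"
INPUT_ADD_STOP_WORD = "add"

def apply_input(element, user_input, dictionary, unknown_list, stop_words):
    """Hypothetical dispatcher: apply one classification command in place."""
    if user_input == INPUT_SAME:
        # Map the element to itself.
        dictionary[element] = element
    elif user_input == INPUT_UNKNOWN:
        unknown_list.append(element)
    elif user_input == INPUT_STOP_WORDS:
        # Every word of the element becomes a stop word.
        stop_words.extend(element.split())
    elif user_input.startswith(INPUT_ADD_STOP_WORD + " "):
        # "add well" appends the single word "well".
        stop_words.append(user_input[len(INPUT_ADD_STOP_WORD) + 1:])
    else:
        # Any other input is the value for this element.
        dictionary[element] = user_input
```

Because the command strings are plain constants, changing them at the top of the script changes what the user types without touching the dispatch logic.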

Using the Dictionary Generator Application

To use the functionality of this script, either create a new script and import the module, or import the module directly from the Python command line interface:

from dictionary_generator import *

Getting the List

The dictionary generator application needs a list of elements to go through in order to build the dictionary. This can be any list of strings, but for building the compound dictionary, it needs a list of compounds from the Structure objects that have had their details parsed. This can be done with the getCompoundList() function:

structureList = loadStructures(STRUCTURES_FILE)
compoundList = getCompoundList(structureList)

First, the script loads the Structure objects into a Python list from the serialized structure file. For more about the structure file, see Structure File.

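By default, the getCompoundList() function sorts compounds by frequency, which allows more common compounds to be added to the dictionary first. If you are using your own list and would like to sort by frequency, use the following, where counts maps each element to its number of occurrences (for example, collections.Counter(yourList)):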

sortedList = sorted(yourList, key=lambda x: -counts[x])
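
A self-contained version of the frequency sort might look like this. The sample list is made up, and sorting the distinct elements rather than the full list is a choice made here for illustration; it is not necessarily what getCompoundList() does.

```python
from collections import Counter

# Made-up sample data; replace with your own list of strings.
yourList = ["nacl", "tris", "nacl", "hepes", "tris", "nacl"]

# Count how often each element occurs.
counts = Counter(yourList)

# One entry per distinct element, most frequent first.
sortedList = sorted(counts, key=lambda x: -counts[x])
```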

Starting the Command Line Application

Once you have a Python list that you want to use to generate a dictionary, run the application using the generateDictionary() function:

generateDictionary(yourList)

The application will begin iterating through the list. If it comes across an element which is not in any of the data structures (compoundDictionary, unknownList, stopWords), then it will prompt the user for classification. The data structures are built dynamically, so if an element is added to one of them, the application will not ask about it again in the same session.
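
The iteration described above can be sketched as a simple loop. This is a simplified illustration, not the real generateDictionary(): the function run_session and its ask callback are hypothetical, and every answer is treated as a dictionary value.

```python
def run_session(elements, dictionary, unknown_list, stop_words, ask):
    """Hypothetical loop: prompt only for elements not yet classified."""
    for element in elements:
        already_known = (
            element in dictionary
            or element in unknown_list
            or element in stop_words
        )
        if already_known:
            continue  # classified before, or earlier in this session
        # `ask` stands in for the interactive prompt.
        dictionary[element] = ask(element)
```

Because the dictionary is updated inside the loop, a second occurrence of the same element later in the list is skipped without another prompt.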

The getKey() Function

It is important to understand the getKey() function in order to effectively use the dictionary generator. The function removes certain characters from a string such as spaces and punctuation, so that only alphanumeric characters remain. It also converts all of the characters to lowercase. It is used in the dictionary generator because the keys in the compound dictionary (and the unknownList) only consist of lowercase alphanumeric characters, to reduce the number of keys necessary. In this way, variants of the same compound such as “NaAcetate”, “Na-acetate”, and “na acetate” can all be covered by the same “naacetate” key. Essentially, the getKey() function will take an element of the list and return a valid dictionary key:

getKey("Na-acetate") == "naacetate"

The implication of this is that if you attempt to add a key to the dictionary such as "na-cl": "sodium chloride", then you will actually add the key as "nacl". Furthermore, when iterating through the list, if the application comes across "na cl", then it will be ignored since the corresponding key is already in the dictionary. If you wish to build your own dictionary and this behavior is not desired, then you will need to modify the getKey() function yourself. It is located in the misc_functions.py script. In the function is a list of characters to remove, as well as a command to convert the string into lowercase.
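
For readers who want to experiment with the normalization before changing misc_functions.py, the following stand-in reproduces the behavior described above. It is not the actual getKey() implementation, which uses an explicit list of characters to remove; this sketch keeps every character that Python considers alphanumeric, which is an assumption.

```python
def get_key_sketch(text):
    """Stand-in for getKey(): keep alphanumeric characters, lowercase them."""
    return "".join(ch for ch in text if ch.isalnum()).lower()

# Variants of the same compound collapse to one key.
variants = ["NaAcetate", "Na-acetate", "na acetate"]
keys = {get_key_sketch(v) for v in variants}
```

Here keys contains the single key "naacetate". Note that str.isalnum() also accepts non-ASCII letters and digits, so the real function may treat some characters differently.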

Automatic Dictionary Key Additions

As you add to the dictionary, the dictionary generator will automatically add keys to speed up generation. The generator will add a key for the value that you just entered, if such a key does not yet exist. For example, if you add the value "sodium chloride" for the key "nacl", then another key will be created for "sodiumchloride" (without a space, since it's a key; see the getKey() function above) which also maps to "sodium chloride".

To turn this feature off, use the autoAdd parameter in the generateDictionary() function:

generateDictionary(yourList, autoAdd=False)
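
The effect of the autoAdd parameter can be illustrated with a small sketch. The helper add_entry is hypothetical and its key normalization is a simplified stand-in for getKey(); it only shows the behavior described above.

```python
def add_entry(dictionary, key, value, auto_add=True):
    """Hypothetical helper: add key -> value, plus a key for the value itself."""
    dictionary[key] = value
    if auto_add:
        # Simplified stand-in for getKey(): alphanumeric characters, lowercase.
        value_key = "".join(ch for ch in value if ch.isalnum()).lower()
        # Only add the extra key if it does not exist yet.
        dictionary.setdefault(value_key, value)

with_auto = {}
add_entry(with_auto, "nacl", "sodium chloride")

without_auto = {}
add_entry(without_auto, "nacl", "sodium chloride", auto_add=False)
```

With auto-add enabled, with_auto holds both "nacl" and "sodiumchloride" as keys; with it disabled, without_auto holds only "nacl".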