Using the Database Script

maxdudek edited this page Dec 19, 2019 · 22 revisions

Getting Started

The script which allows the creation and searching of the crystallization database is pdb_crystal_database.py. To use the functionality of this script, either create a new script and import the module, or import the module directly from the Python command line interface:

from pdb_crystal_database import *

Verifying File Locations

The top of the script defines the locations of input and output files, and loads them into global variables used by the functions of the script. This is done by creating Path objects for each directory using the Python module pathlib, meaning the paths are universal and should work on Windows and Unix systems. See Explanation of Files for descriptions of all of the files. Only the Input files are included in this repository, as the Output files are generated by the script. If an Input file is changed or moved, the pointer to the file must also be changed at the top of the script.
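As a rough illustration of the approach described above, the sketch below builds a file location with pathlib and reports whether it exists. Only the Structures/structures.pkl location and the STRUCTURES_FILE name come from this tutorial; STRUCTURES_DIR and the describe() helper are assumptions made for the example, so check the top of pdb_crystal_database.py for the real definitions.

```python
from pathlib import Path

# Hypothetical name for the directory constant; the script may call it something else.
STRUCTURES_DIR = Path("Structures")

# The "/" operator joins path components using the correct separator
# for whichever operating system the script is running on.
STRUCTURES_FILE = STRUCTURES_DIR / "structures.pkl"


def describe(path):
    """Return a short status string for a file location (illustrative helper)."""
    status = "found" if path.exists() else "missing"
    return f"{path}: {status}"


print(describe(STRUCTURES_FILE))
```

Running a check like this before calling any of the functions can make a moved or renamed file easier to spot.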

The only required input file not included in the Input directory is the database itself, stored in Structures/structures.pkl. The file can either be downloaded from this repository or created from scratch using the tutorial found at Downloading Structures from the PDB.

Installing NLTK

The use of the structure.parseDetails() function requires installation of the NLTK package. If you do not need to modify the structure.parseDetails() function or can't install the package, then you will still be able to search the database so long as you download the pre-built structure file in this repository. The details in that file have already been parsed, so there is no need to call the function. Skip to Standardizing Chemical Names for information on how to search through the database.

To install NLTK, follow the instructions here. Once it has been installed, you will need to download the additional 'punkt' package within nltk. From the terminal, start the Python interpreter:

$ python

Once the interpreter is running, execute the following commands to download the additional package:

>>> import nltk
>>> nltk.download('punkt')

If the second command gives an SSL error, check this thread to see if it fixes the problem.
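To confirm that the download worked, one option is to ask NLTK to locate the resource. The sketch below is not part of the database script; punkt_available() is a hypothetical helper name. It relies on nltk.data.find(), which raises LookupError when a resource is not installed.

```python
def punkt_available():
    """Return True if NLTK is importable and the 'punkt' data can be located."""
    try:
        import nltk
    except ImportError:
        # NLTK itself is not installed.
        return False
    try:
        # Raises LookupError when the resource has not been downloaded.
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        return False
    return True


print("punkt available:", punkt_available())
```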

Loading and Manipulating the Structures

NOTE: For more advanced uses of the functions mentioned in this tutorial, see the documentation in the script regarding each function.

Once a valid structure file is in the "Structures" directory, the Structure objects can be loaded into a Python list and manipulated. To do this, call:

structureList = loadStructures()

Parsing Details

The pre-built structure file has already had its crystallization details parsed into a list of compounds. If you have built a new structure file from scratch, or you modified the structure.parseDetails() function, then they will need to be parsed again. Make sure you have the nltk package properly installed.

To parse the details of every compound in the list, call:

parseAllDetails(structureList, structureFile=STRUCTURES_FILE)

The optional parameter structureFile specifies where to save the list of Structure objects after they have been parsed.

This takes about 90 seconds. It updates the compounds list of every structure and writes the results to the structure file, so unless structure.parseDetails() is modified, parseAllDetails() only needs to be called once.

To reduce the running time, parseAllDetails() accepts an optional parameter searchString, which restricts parsing to structures whose crystallization details contain a specific substring. This is useful when a change to structure.parseDetails() only affects structures that contain a certain word or character combination.
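The idea behind restricting parsing by substring can be sketched as a simple filter. This is an illustration only, not the script's implementation: the stand-in objects, the attribute names pdbid and details, the made-up IDs, and the helper structures_to_reparse() are all assumptions.

```python
from types import SimpleNamespace

# Stand-ins for Structure objects; IDs and attribute names are made up for the example.
structures = [
    SimpleNamespace(pdbid="XXX1", details="0.1M HEPES pH 7.5, 20% PEG 3350"),
    SimpleNamespace(pdbid="XXX2", details="2.0M ammonium sulfate, 0.1M Tris"),
    SimpleNamespace(pdbid="XXX3", details=None),  # no details recorded
]


def structures_to_reparse(structure_list, search_string=None):
    """Select structures whose details contain search_string (all of them if None)."""
    if search_string is None:
        return list(structure_list)
    return [s for s in structure_list
            if s.details and search_string in s.details]


print([s.pdbid for s in structures_to_reparse(structures, "PEG")])
```

Only the selected structures would then need to be parsed again, which is where the time saving comes from.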

Standardizing Chemical Names

To standardize the chemical names of the compounds based on the compound dictionary, call:

standardizeAllNames(structureList, structureFile=STRUCTURES_FILE)

Note that once again, a structure file to write the results to is specified. It is important to save the list of structures after they are processed so that they will not need to be processed the next time they are loaded. If custom changes are made to the structures, they can be written to a file by calling the following function:

writeStructures(structureList, structureFile)

Where structureList specifies the list of structures to write, and structureFile specifies the name of the file to write them to.
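The .pkl extension suggests that the structure file is a Python pickle. Assuming that is the case, saving and reloading a list amounts to a round trip like the one below. The helper names write_objects() and load_objects() and the example record are hypothetical; the real writeStructures() and loadStructures() may differ in their details.

```python
import pickle
import tempfile
from pathlib import Path


def write_objects(objects, path):
    """Serialize a list of Python objects to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(objects, f)


def load_objects(path):
    """Read back a list of Python objects from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A made-up record standing in for a processed Structure object.
records = [{"pdbid": "XXX1", "compounds": ["HEPES", "0.1 M"]}]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.pkl"
    write_objects(records, target)
    restored = load_objects(target)

print(restored == records)
```

Because loading a pickle can execute arbitrary code, only load structure files from sources you trust.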

The next step is to export all of the Output files to verify the detail parsing and standardization:

exportOutputFiles(structureList)

For an explanation of each of the Output files, see Output.

One of the exported files is sensible_structures.pkl. This file works the same way as the main structure file, except it only includes "sensible" structures, in which all compounds are recognized by the compound dictionary. The terminal output should print the number of sensible structures retrieved, which should be over 80,000.
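To make the notions of standardization and "sensible" structures concrete, here is a minimal sketch. It assumes that the compound dictionary maps raw names to standard names and that a compounds list alternates between names and concentrations. The miniature dictionary, the lowercase lookup, and the helper names are assumptions, not the script's actual logic.

```python
# Hypothetical miniature compound dictionary: raw name -> standard name.
compound_dictionary = {
    "peg 3350": "PEG 3350",
    "polyethylene glycol 3350": "PEG 3350",
    "hepes": "HEPES",
}


def standardize(compounds, dictionary):
    """Replace each name (even index) with its standard name when one is known."""
    result = list(compounds)
    for i in range(0, len(result), 2):
        # Lowercasing before lookup is an assumption made for this example.
        result[i] = dictionary.get(result[i].lower(), result[i])
    return result


def is_sensible(compounds, dictionary):
    """True when every compound name is one of the dictionary's standard names."""
    known = set(dictionary.values())
    return all(name in known for name in compounds[::2])


raw = ["Polyethylene Glycol 3350", "20 %", "hepes", "0.1 M"]
clean = standardize(raw, compound_dictionary)
print(clean, is_sensible(clean, compound_dictionary))
```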

Searching the Database

To search the database, the database first must be loaded into a Python list. Use the loadStructures() function for this, passing in a reference to the structure file you want to load. For instance, if you only wanted to search through the sensible structures, call:

structureList = loadStructures("Structures\\sensible_structures.pkl")
or, using the predefined path constant (which avoids hard-coding the Windows path separator):
structureList = loadStructures(SENSIBLE_STRUCTURES_FILE)

Once the list of structure objects is loaded, it should be straightforward to write a custom script to search them in any way desired. One tool that might be helpful is the getStructure() function, which returns a specific Structure object based on PDB ID. For example:

specificStructure = getStructure(structureList, "3NNQ")

This will return the Structure object associated with the PDB ID 3NNQ.

The compounds list of a structure alternates between compound names and concentrations. To get a list of just the compound names (and not the concentrations) of a specific structure, use Python's list slicing notation to take every other element:

compoundNames = specificStructure.compounds[::2]
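A custom search can be as short as a list comprehension over the loaded structures. The sketch below uses stand-in objects rather than real Structure instances; the pdbid attribute name, the made-up IDs, the concentration strings, and the helper structures_containing() are assumptions for illustration. Only the compounds attribute and the [::2] slicing come from this tutorial.

```python
from types import SimpleNamespace

# Stand-ins for Structure objects with made-up IDs and conditions.
example_structures = [
    SimpleNamespace(pdbid="XXX1", compounds=["HEPES", "0.1 M", "PEG 3350", "20 %"]),
    SimpleNamespace(pdbid="XXX2", compounds=["ammonium sulfate", "2.0 M"]),
]


def structures_containing(structure_list, compound_name):
    """Return the structures whose compound names include compound_name."""
    # compounds[::2] skips the concentrations, so only names are compared.
    return [s for s in structure_list if compound_name in s.compounds[::2]]


matches = structures_containing(example_structures, "PEG 3350")
print([s.pdbid for s in matches])
```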

Creating a Database Subset

Sometimes it is useful to work with a smaller subset of the database. To create a smaller list of structure objects from the larger structure list, call:

subsetList = getDatabaseSubset(structureList, pdbList)

Where pdbList is a list of PDB IDs of the structures to put in the subset.

By default, the function only includes "sensible" structures in its output - that is, all of their compounds are recognized by the compound dictionary. The optional parameter structureFile can be used to specify an output file to write the subset list to. If no file is specified, the list can be written to a file later using the writeStructures() function as demonstrated above.

An easy way to import a large list of PDB IDs is to use the fileToList() function located in misc_functions.py. This function takes every line in a text file and adds it as an element to a list. For example:

pdbList = fileToList("pdb_ids.txt")

This pdbList can then be used to create a database subset.
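For reference, reading one ID per line can be sketched as follows. This is an approximation of what fileToList() is described as doing, not its actual source: the name file_to_list() is hypothetical, and skipping blank lines and stripping whitespace are assumptions. The ID 1ABC is a made-up placeholder.

```python
import tempfile
from pathlib import Path


def file_to_list(filename):
    """Return the non-empty lines of a text file, stripped of surrounding whitespace."""
    with open(filename) as f:
        return [line.strip() for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Write a small example file with one PDB ID per line.
    id_file = Path(tmp) / "pdb_ids.txt"
    id_file.write_text("3NNQ\n1ABC\n\n")
    pdb_list = file_to_list(id_file)

print(pdb_list)
```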

Exporting the Database to a CSV File

If you wish to export a list of structures to a tab-delimited CSV file (some compound names contain commas, making comma-delimited files unusable), you can do so with the exportCsv() function:

exportCsv(myStructureList, "my_csv_file.csv")

For more information on the format of the exported CSV file, see Explanation of Files.
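When reading the exported file back into Python, the csv module needs to be told that the delimiter is a tab. The sketch below shows why tabs are the safer choice: a compound name containing commas survives the round trip intact. The column layout and the row contents are made up for the example and do not reflect the real export format.

```python
import csv
import io

rows = [
    ["pdbid", "compound", "concentration"],        # hypothetical columns
    ["XXX1", "2-methyl-2,4-pentanediol", "10 %"],  # name contains commas
]

# Write tab-delimited text to an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)

# Read it back, again specifying the tab delimiter.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="\t"))
print(parsed[1][1])
```

For a file on disk, the same csv.reader call applies to a file opened with open(filename, newline="").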

Advanced Functionality

Debugging the structure.parseDetails() Function

If you want to modify the structure.parseDetails() function to include more structures, or you want to see why the function produced a certain compound list, you can run the function in debug mode for a specific structure. Doing so prints every step of the function, which helps to identify the step that has gone wrong or is producing an undesirable outcome. To run the function in debug mode for a specific structure, you'll need to know the PDB ID of that structure:

getStructure(structureList, "3NNQ").parseDetails(debug=True) 

Analyzing the Compound Dictionary

To get a list of every unique compound in the dictionary:

uniqueCompounds = list(set(compoundDictionary.values())) 

To get a dictionary which maps every unique compound to its frequency in a structureList:

freqDictionary = getCompoundFrequencies(structureList)

The optional parameter outputFilename in this function will specify a text file to export the frequencies to.
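Conceptually, a frequency dictionary of this kind can be built with collections.Counter. The sketch below is an assumption about the general approach, not the source of getCompoundFrequencies(): it counts each compound at most once per structure, and the example compound lists are made up.

```python
from collections import Counter

# Made-up compounds lists, alternating name and concentration.
compound_lists = [
    ["HEPES", "0.1 M", "PEG 3350", "20 %"],
    ["HEPES", "0.1 M", "ammonium sulfate", "2.0 M"],
]


def compound_frequencies(lists):
    """Count how many compounds lists each compound name appears in."""
    counts = Counter()
    for compounds in lists:
        # set() ensures a compound is counted once per structure.
        counts.update(set(compounds[::2]))
    return counts


freq = compound_frequencies(compound_lists)
print(freq.most_common(1))
```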

Analyzing Set Frequency

Every structure in the database contains a set of compounds used to crystallize that structure. These sets can be analyzed like sets in discrete mathematics, which are unordered groupings of elements. The function getSetFrequencies() returns a dictionary which maps every unique set of compounds (as a Python frozenset) to its frequency in the database. Here's an example of how it's used:

setFrequency = getSetFrequencies(myStructureList, requiredCompounds=['HEPES', 'ammonium sulfate'])
s = frozenset(['HEPES', 'ammonium sulfate', 'PEG 3350'])
# Structures containing exactly the compounds in s (0 if that exact set never occurs)
pegFreq = setFrequency.get(s, 0)
# A list of every set containing all of the elements in s
listOfSupersets = [t for t in setFrequency if t.issuperset(s)]
# Structures containing at least the compounds in s
pegFreq2 = sum(setFrequency[t] for t in listOfSupersets)

The function searches through myStructureList and finds the frequency of every unique set of compounds. The requiredCompounds parameter specifies a list of compounds which must be included in every set, so the resulting dictionary setFrequency only contains sets of compounds which contain ammonium sulfate and HEPES. The variable pegFreq is an integer giving the number of structures in myStructureList which contain HEPES, ammonium sulfate, and PEG 3350, but no other compounds. pegFreq2 gives the number of structures which contain at least HEPES, ammonium sulfate, and PEG 3350, because it also includes the frequency of every superset of the given set.

What if you didn't want to go through all of this every time you wanted to look at subset frequency instead of complete set frequency? You can generate a dictionary which finds the frequency of all subsets of a certain length using:

subset3frequency = getSetFrequencies(myStructureList, requiredCompounds=['HEPES'], csvFilename='subset_frequency.csv', subsetLength=3)

Here the subsetLength parameter specifies that you don't want to look for complete sets, but instead for subsets of length 3. This outputs a dictionary which maps every possible 3-compound subset (which contains 'HEPES') to its frequency in the structure list. It doesn't only count structures which contain exactly those three compounds, but all structures which contain at least those three. The csvFilename parameter specifies a tab-delimited CSV file to write this data to. Another optional parameter, textFilename, specifies the location of a text file output, which is easier to read.
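The counting rule described above can be sketched with itertools.combinations: every structure contributes one count to each fixed-length subset of its compounds. This is a simplified illustration under that assumption, not the code of getSetFrequencies(); the function name subset_frequencies() and the example sets are made up.

```python
from collections import Counter
from itertools import combinations

# Made-up compound sets, one per structure.
compound_sets = [
    frozenset(["HEPES", "ammonium sulfate", "PEG 3350"]),
    frozenset(["HEPES", "ammonium sulfate", "PEG 3350", "glycerol"]),
    frozenset(["HEPES", "PEG 3350"]),
]


def subset_frequencies(sets, subset_length, required=()):
    """Count, for each subset of the given length, how many sets contain it."""
    required = frozenset(required)
    counts = Counter()
    for compounds in sets:
        if not required <= compounds:
            continue  # skip structures missing a required compound
        for combo in combinations(sorted(compounds), subset_length):
            subset = frozenset(combo)
            if required <= subset:
                counts[subset] += 1
    return counts


freq = subset_frequencies(compound_sets, 3, required=["HEPES"])
target = frozenset(["HEPES", "ammonium sulfate", "PEG 3350"])
print(freq[target])
```

Here the target subset is counted for the first two structures, since both contain at least those three compounds, while the third structure has too few compounds to contribute any 3-compound subset.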

For more about how frozensets work in Python, see frozenset.

To read more about the default set frequency files, see set_frequency.txt.

If you want to perform your own analysis on sets of compounds, here is how to convert the list of compounds into a hashable set data structure which can be put into dictionaries:

for structure in structureList:
	setOfCompounds = frozenset(structure.compounds[::2])
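Because a frozenset is hashable and ignores element order, it can serve directly as a dictionary key. As a small sketch of one possible analysis (the compounds lists below are made up), a Counter keyed by frozenset tallies how often each exact set of compounds occurs:

```python
from collections import Counter

# Made-up compounds lists, alternating name and concentration.
compound_lists = [
    ["HEPES", "0.1 M", "PEG 3350", "20 %"],
    ["PEG 3350", "25 %", "HEPES", "0.05 M"],  # same compounds, different order
    ["ammonium sulfate", "2.0 M"],
]

# Each key is the unordered set of compound names for one structure.
set_counts = Counter(frozenset(compounds[::2]) for compounds in compound_lists)

print(set_counts[frozenset(["PEG 3350", "HEPES"])])
```

The first two lists map to the same key even though their order and concentrations differ, which is the behavior needed for set-based analysis.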