Overview

maxdudek edited this page Jan 8, 2021 · 21 revisions

Overview

The HWI Crystallization Conditions Database is a Python-based tool for searching crystallization conditions uploaded to the PDB. It includes a method that parses the crystallization details in the PDB into a formatted list of compounds and their concentrations, as well as a dictionary that standardizes the names of the chemical compounds. This module also contains a tool to aid in expanding the dictionary.

Running the scripts in this module requires Python 3.6 or greater. Other required packages are explained as they are needed.

NOTE: Any additional scripts should be created in src/, but run in the root directory of the repo.

The Structure Class

A Structure object is created for every structure in the PDB which includes crystallization details. Each Structure object contains the following attributes:

  • String pdbid: The unique 4-character PDB ID code
  • String pmcid: The PubMed Central ID (PMCID) of the paper associated with the structure (pmcid=None if no paper is available)
  • String details: The crystallization details of the structure, found under the "Experiment" tab of the PDB page.
  • float pH: The pH of the crystallization experiment, found under the "Experiment" tab of the PDB page (pH=None if no pH value is found).
  • float temperature: The temperature of the crystallization experiment in K, found under the "Experiment" tab of the PDB page (temperature=None if no temperature is found).
  • String method: The method of crystallization, found under the "Experiment" tab of the PDB page (method=None if no method is found).
  • float resolution: The High Overall resolution in ångströms, found under "Data Collection" in the "Experiment" tab of the PDB page (resolution=None if no resolution is found).
  • String[] sequences: A list of all sequences associated with the structure, in the order they appear in the PDB.
  • String[] compounds: A formatted list of the chemical compounds in the crystallization details, followed by their concentration. This list is initialized as empty when the structure is created, but gets filled by the structure.parseDetails() function, which pulls out the compounds from the crystallization details and adds them to the list. See the section on parsing details below for more info.

The script pdb_crystal_database.py can generate and search the database of Structures. The database is stored as a list of Structure objects serialized to a file called structures.pkl (see Structure Files). A Structure object has already been created for every structure in the PDB with crystallization details (as of 12-9-2019), and the prebuilt structure file can be found here. If you want to update the database with new information from the PDB, or if downloading the file is not an option, the file can also be created from scratch by scraping data from the PDB, which takes about 5-10 hours. For a tutorial on how to create the structures.pkl file from scratch, see Creating the Structure File from Scratch.
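As a rough illustration of the data layout, the attributes listed above could be modelled as shown here. StructureSketch and load_structures are hypothetical names used only for this sketch; the real Structure class lives in the repository and may differ. The sketch uses dataclasses, which requires Python 3.7 or newer.

```python
import pickle
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StructureSketch:
    """Hypothetical stand-in mirroring the attributes documented above."""
    pdbid: str
    pmcid: Optional[str] = None
    details: str = ""
    pH: Optional[float] = None
    temperature: Optional[float] = None  # kelvin
    method: Optional[str] = None
    resolution: Optional[float] = None   # angstroms
    sequences: List[str] = field(default_factory=list)
    compounds: List[str] = field(default_factory=list)  # empty until details are parsed


def load_structures(path):
    """Deserialize a pickled list, such as structures.pkl, from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Unpickling instances of a class only works when the class definition used to write the file is importable, so loading the prebuilt file most likely needs to happen from within the repository. Only unpickle files from a source you trust, since pickle can execute arbitrary code while loading.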

Parsing Crystallization Details into Compounds List

This database is built by parsing the inconsistently formatted crystallization details in the PDB into a consistently formatted list of compounds for computational analysis. This is done by the structure.parseDetails() function, which extracts compound names, and the structure.standardizeNames() function, which standardizes the names of compounds based on a dictionary of the possible representations of each compound (see compound_dictionary.json for more info).
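The idea behind dictionary-based name standardization can be sketched as a reverse lookup. This is not the repository's implementation: build_lookup, standardize, and the toy dictionary layout (standard name mapped to a list of alternative spellings) are assumptions made for illustration, and the actual layout of compound_dictionary.json may differ.

```python
def build_lookup(compound_dictionary):
    """Map every known representation (lower-cased) to its standard name."""
    lookup = {}
    for standard_name, representations in compound_dictionary.items():
        lookup[standard_name.lower()] = standard_name
        for representation in representations:
            lookup[representation.lower()] = standard_name
    return lookup


def standardize(name, lookup):
    """Return the standard name, or the input unchanged if it is not recognized."""
    return lookup.get(name.strip().lower(), name)


# Toy dictionary with an assumed layout: standard name -> alternative spellings.
toy_dictionary = {
    "magnesium chloride": ["mgcl2"],
    "ammonium chloride": ["nh4cl"],
    "DTT": ["dithiothreitol"],
}
lookup = build_lookup(toy_dictionary)
print(standardize("mgcl2", lookup))  # magnesium chloride
print(standardize("dtt", lookup))    # DTT
```

Building the reverse lookup once keeps each name check to a single dictionary access, which matters when the same dictionary is applied to every structure in the database.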

IMPORTANT: The use of the structure.parseDetails() function requires installation of the NLTK package. If you do not wish to modify the structure.parseDetails() function or can't install the package, remove the nltk import in the script and be sure to download the prebuilt structure file using the link above. The details in that file have already been parsed, so there is no need to call the function.

Below is an example of the detail parsing for one structure (PDB ID 3DZU):

exampleStructure.details == "15-18% PEG 3350, 25mM MgCl2, 100mM NH4Cl, 5mM DTT and 0.1M MES, pH 6.5, VAPOR DIFFUSION, HANGING DROP, temperature 277K"

After parsing details:

exampleStructure.compounds == ['PEG 3350', '16.5%', 'mgcl2', '25', 'nh4cl', '100', 'dtt', '5', 'mes', '100.0']

The function pulls out each individual compound and places its concentration (in mM or %) directly after it in the list. If no concentration is found for a compound, the concentration entry is null (None in Python).
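Because names and concentrations alternate in one flat list, it can be convenient to regroup them into pairs before analysis. The helper below is a minimal sketch; pair_compounds is a hypothetical name and is not part of the repository.

```python
def pair_compounds(compounds):
    """Group a flat [name, concentration, ...] list into (name, concentration) tuples."""
    if len(compounds) % 2 != 0:
        raise ValueError("expected alternating name/concentration entries")
    # Even positions hold names, odd positions hold concentrations.
    return list(zip(compounds[0::2], compounds[1::2]))


parsed = ['PEG 3350', '16.5%', 'mgcl2', '25', 'nh4cl', '100', 'dtt', '5', 'mes', '100.0']
for name, concentration in pair_compounds(parsed):
    print(name, concentration)
```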

After standardizing names:

exampleStructure.compounds == ['PEG 3350', '16.5%', 'magnesium chloride', '25', 'ammonium chloride', '100', 'DTT', '5', 'MES', '100.0']

This example demonstrates several features of the detail parsing function, which include:

  • The reservoir solution is isolated from the protein solution/cryoprotectant/soaking
  • Molar (M) and micromolar (uM) concentrations are converted to mM
  • Percent (%) concentrations can be associated with (w/v) or (v/v) specifications (e.g. '20% w/v')
  • Ranges of concentrations are averaged to produce one concentration for each compound
  • Names of chemical compounds are standardized (see compound_dictionary.json for explanation of how names are standardized)
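Two of the features above, unit conversion to mM and averaging of ranges, can be illustrated with a small sketch. It is an assumption-laden simplification, not the logic inside structure.parseDetails(): normalize_concentration, its regular expression, and its return format are invented for this example, and it does not handle w/v or v/v suffixes or the separation of reservoir and protein solutions.

```python
import re

# Multipliers that convert each molar unit to millimolar (mM).
TO_MM = {"M": 1000.0, "mM": 1.0, "uM": 0.001}

# A number, an optional "-number" range, then a unit.
CONCENTRATION = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*(mM|uM|M|%)\s*$"
)


def normalize_concentration(text):
    """Return (value, unit) with molar units converted to mM, or None if unparseable."""
    match = CONCENTRATION.match(text)
    if match is None:
        return None
    low, high, unit = match.groups()
    value = float(low)
    if high is not None:
        # A range such as "15-18" is replaced by its midpoint.
        value = (value + float(high)) / 2
    if unit == "%":
        return value, "%"
    return value * TO_MM[unit], "mM"


print(normalize_concentration("15-18%"))  # (16.5, '%')
print(normalize_concentration("0.1M"))    # (100.0, 'mM')
print(normalize_concentration("25mM"))    # (25.0, 'mM')
```

The three calls correspond to the PEG 3350, MES, and MgCl2 entries in the 3DZU example.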

Building the Compound Dictionary

The script dictionary_generator.py is an application that lets the user add to the compound dictionary interactively. By default, the application iterates through a list of unrecognized 'compounds' pulled out by the structure.parseDetails() function and prompts the user to take one of three actions for each:

  • Add the compound as a new key in the compound dictionary
  • Add the compound to the unknown list
  • Add the words in the compound to the stop words list

The application sorts the unrecognized 'compounds' by frequency by default, so the most common strings are dealt with first. For a more advanced tutorial on using the dictionary generator, see Using the Dictionary Generator. Note that the dictionary included in this repository completely recognizes the compounds in about 70% of all structures in the PDB with crystallization details.
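Frequency-first ordering of this kind can be sketched with collections.Counter. The function name rank_unrecognized and the sample strings are made up for illustration; the dictionary generator's actual code may work differently.

```python
from collections import Counter


def rank_unrecognized(strings):
    """Return (string, count) pairs, most frequent first."""
    return Counter(strings).most_common()


# Made-up sample of strings the parser could not match to the dictionary.
unrecognized = ["tris hcl", "peg", "tris hcl", "buffer", "peg", "tris hcl"]
print(rank_unrecognized(unrecognized))
# [('tris hcl', 3), ('peg', 2), ('buffer', 1)]
```

Handling the most frequent strings first means each decision resolves as many structures as possible, which is presumably why this ordering is the default.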
