Commit 31de56b

chore: readme
1 parent b795f62 commit 31de56b

1 file changed

Lines changed: 165 additions & 5 deletions

src/term_normalization/README.md

@@ -1,13 +1,173 @@
11
# Term Normalization
22

33
## Goal
4-
Take the incoming variant annotation and replace all of the terms with normalized terms that map to current entries in ClinPGx
4+
Take incoming variant annotations and replace all terms with normalized identifiers that map to current entries in ClinPGx and PharmGKB. This ensures consistent terminology across pharmacogenomic data.
55

6-
## Drug
6+
## Overview
77

8-
## Gene
8+
The term normalization module provides automated lookup and normalization for:
9+
- **Variants/Alleles**: rsIDs and star alleles
10+
- **Drugs**: Drug names, generic names, and trade names
911

10-
## Variant / Allele
12+
## Architecture
1113

12-
## Phenotype
14+
### Main Components
1315

16+
1. **`term_lookup.py`**: Main entry point providing `TermLookup` class and `normalize_annotation()` function
17+
2. **`variant_search.py`**: Handles variant/allele normalization via `VariantLookup` class
18+
3. **`drug_search.py`**: Handles drug normalization via `DrugLookup` class
19+
4. **`search_utils.py`**: Shared utilities for similarity matching
20+
21+
## Usage
22+
23+
### Normalizing an Annotation File
24+
25+
```python
26+
from pathlib import Path
27+
from src.term_normalization.term_lookup import normalize_annotation
28+
29+
input_path = Path("data/example_annotation.json")
30+
output_path = Path("data/example_annotation_normalized.json")
31+
32+
normalize_annotation(input_path, output_path)
33+
```
34+
35+
This will:
36+
1. Load the annotation JSON file
37+
2. Normalize every `Variant/Haplotypes` and `Drug(s)` field in the `var_pheno_ann`, `var_fa_ann`, and `var_drug_ann` annotation types
38+
3. Add normalized fields with `_normalized` suffix (e.g., `Variant/Haplotypes_normalized`)
39+
4. Include a `term_mappings` section with details about each normalized term
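
The steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: `add_normalized_fields` and the `lookup` callable are hypothetical names, and the sketch assumes each annotation type is a list of objects at the root of the JSON document and that unmatched terms are left untouched.

```python
import copy
from typing import Callable, Optional

ANNOTATION_TYPES = ("var_pheno_ann", "var_fa_ann", "var_drug_ann")
FIELDS = ("Variant/Haplotypes", "Drug(s)")


def add_normalized_fields(
    doc: dict, lookup: Callable[[str, str], Optional[dict]]
) -> dict:
    """Return a copy of `doc` with `_normalized` fields and `term_mappings`.

    `lookup(raw_term, field_name)` is a hypothetical stand-in for the real
    search; it returns a dict with at least an "id" key, or None on no match.
    """
    out = copy.deepcopy(doc)  # leave the caller's document unmodified
    mappings = {}
    for ann_type in ANNOTATION_TYPES:
        for ann in out.get(ann_type, []):
            for field in FIELDS:
                raw = ann.get(field)
                if not raw:
                    continue  # field missing or empty
                hit = lookup(raw, field)
                if hit is None:
                    continue  # assumption: unmatched terms are skipped
                ann[f"{field}_normalized"] = hit["id"]
                mappings[raw] = hit
    out["term_mappings"] = mappings
    return out
```

Passing the lookup in as a callable keeps the sketch testable without network access or the local TSV files.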
40+
41+
### Using TermLookup Directly
42+
43+
```python
44+
from src.term_normalization.term_lookup import TermLookup, TermType
45+
46+
lookup = TermLookup()
47+
48+
# Search for a variant
49+
variant_results = lookup.search("rs12345", term_type=TermType.VARIANT, threshold=0.8, top_k=1)
50+
51+
# Search for a drug
52+
drug_results = lookup.search("aspirin", term_type=TermType.DRUG, threshold=0.8, top_k=1)
53+
```
54+
55+
## Variant Normalization
56+
57+
The `VariantLookup` class handles variant normalization with the following features:
58+
59+
### Search Strategy
60+
61+
1. **rsID Lookup** (for variants starting with "rs"):
62+
- Queries PharmGKB API (`/v1/data/variant`)
63+
- Searches local ClinPGx variant database (`data/term_lookup_info/variants.tsv`)
64+
- Searches variant names and synonyms
65+
66+
2. **Star Allele Lookup** (for variants like `*1`, `*2`):
67+
- Queries PharmGKB API (`/v1/data/haplotype`)
68+
- Searches local ClinPGx variant database
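
One way to picture the routing between the two strategies is shown below. The endpoint paths come from the list above; the `pick_endpoint` helper and both regular expressions are assumptions made for illustration, and the real `VariantLookup` may classify inputs differently.

```python
import re
from typing import Optional

# Assumed patterns: an rsID is "rs" followed by digits; a star allele
# contains "*" followed by digits (e.g. "*1", "CYP2D6*4").
RSID_RE = re.compile(r"^rs\d+$", re.IGNORECASE)
STAR_RE = re.compile(r"\*\d+[A-Za-z]*")


def pick_endpoint(term: str) -> Optional[str]:
    """Return the PharmGKB API path to query first, or None if unrecognized."""
    term = term.strip()
    if RSID_RE.match(term):
        return "/v1/data/variant"
    if STAR_RE.search(term):
        return "/v1/data/haplotype"
    return None  # neither an rsID nor a star allele


print(pick_endpoint("rs6539870"))
print(pick_endpoint("CYP2D6*4"))
```

In either case the local ClinPGx variant database would still be searched; this sketch only covers the choice of API path.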
69+
70+
### Return Format
71+
72+
```python
73+
VariantSearchResult(
74+
raw_input="rs6539870",
75+
id="PA166154595",
76+
normalized_term="rs6539870",
77+
url="https://www.clinpgx.org/variant/PA166154595",
78+
score=1.0
79+
)
80+
```
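
For readers who want to work with results of this shape, the five fields can be modeled as a small dataclass. The field names are taken from the example above; the `SearchResultSketch` class is a hypothetical stand-in, since the actual definition of `VariantSearchResult` is not shown here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SearchResultSketch:
    """Hypothetical mirror of the fields shown in the return format."""

    raw_input: str        # term as it appeared in the annotation
    id: str               # PharmGKB accession ID
    normalized_term: str  # canonical name for the term
    url: str              # ClinPGx page for the entry
    score: float          # similarity score in [0.0, 1.0]


result = SearchResultSketch(
    raw_input="rs6539870",
    id="PA166154595",
    normalized_term="rs6539870",
    url="https://www.clinpgx.org/variant/PA166154595",
    score=1.0,
)

# asdict() yields a plain dict that can be serialized to JSON directly.
entry = asdict(result)
print(entry["id"])
```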
81+
82+
## Drug Normalization
83+
84+
The `DrugLookup` class handles drug normalization with the following features:
85+
86+
### Search Strategy
87+
88+
1. **ClinPGx Lookup** (primary):
89+
- Searches drug name in local database (`data/term_lookup_info/drugs.tsv`)
90+
- Searches generic names and trade names
91+
- Returns PharmGKB Accession IDs
92+
93+
2. **RxNorm Lookup** (fallback):
94+
- Queries RxNorm API when ClinPGx search yields no results
95+
- Converts RxCUI to PharmGKB Accession ID using local mapping
96+
- Provides broader drug name coverage
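
The primary/fallback order described above could look something like this. It is a sketch under stated assumptions: `lookup_drug`, `local_index`, `rxcui_to_pa`, and `rxnorm_lookup` are hypothetical names, the local database is reduced to a dict keyed by lowercase name, and the RxNorm call is injected as a function so that no real API signature is implied.

```python
from typing import Callable, Dict, Optional


def lookup_drug(
    name: str,
    local_index: Dict[str, str],
    rxcui_to_pa: Dict[str, str],
    rxnorm_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a PharmGKB accession ID for `name`, or None if nothing matches."""
    key = name.strip().lower()

    # 1. Primary: local ClinPGx data (names, generic names, trade names).
    pa_id = local_index.get(key)
    if pa_id is not None:
        return pa_id

    # 2. Fallback: ask RxNorm for an RxCUI, then map it to a PharmGKB ID
    #    using the local mapping.
    rxcui = rxnorm_lookup(name)
    if rxcui is None:
        return None
    return rxcui_to_pa.get(rxcui)
```

The fallback only runs when the local search finds nothing, which keeps network calls off the common path.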
97+
98+
### Return Format
99+
100+
```python
101+
DrugSearchResult(
102+
raw_input="etoposide",
103+
id="PA449552",
104+
normalized_term="etoposide",
105+
url="https://www.clinpgx.org/chemical/PA449552",
106+
score=1.0
107+
)
108+
```
109+
110+
## Data Requirements
111+
112+
The module requires local TSV files in the `data/term_lookup_info/` directory:
113+
114+
- `variants.tsv`: Variant names, IDs, and synonyms from ClinPGx
115+
- `drugs.tsv`: Drug names, generic names, trade names, RxNorm IDs, and PharmGKB IDs
116+
117+
## Configuration
118+
119+
### Parameters
120+
121+
- **`threshold`** (default: 0.8): Minimum similarity score for fuzzy matching (0.0-1.0)
122+
- **`top_k`** (default: 1): Number of top results to return
123+
- **`data_dir`** (default: "data"): Base directory for lookup TSV files
124+
125+
### Similarity Matching
126+
127+
The module uses string similarity (via `calc_similarity` in `search_utils.py`) to match input terms against database entries, allowing for:
128+
- Typos and spelling variations
129+
- Case insensitivity
130+
- Partial matches
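
To illustrate how `threshold` and `top_k` interact with a similarity score, here is a small sketch. It uses `difflib.SequenceMatcher` from the standard library as an assumed stand-in; the actual metric behind `calc_similarity` is not specified here and may behave differently.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0.0, 1.0] (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(
    query: str,
    candidates: Sequence[str],
    threshold: float = 0.8,
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return up to `top_k` (candidate, score) pairs scoring >= `threshold`."""
    scored = [(c, similarity(query, c)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return kept[:top_k]


names = ["etoposide", "warfarin"]
print(best_matches("Etoposide", names))  # exact match apart from case
print(best_matches("etoposid", names))   # one-character typo
```

Raising `threshold` trades recall for precision: fewer typos are tolerated, but unrelated terms are less likely to be matched.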
131+
132+
## Output Format
133+
134+
The `normalize_annotation()` function adds:
135+
136+
1. **Normalized fields** in each annotation object:
137+
- `Variant/Haplotypes_normalized`: PharmGKB variant ID
138+
- `Drug(s)_normalized`: PharmGKB drug ID
139+
140+
2. **Term mappings section** at the root level:
141+
```json
142+
{
143+
"term_mappings": {
144+
"rs6539870": {
145+
"raw_input": "rs6539870",
146+
"id": "PA166154595",
147+
"normalized_term": "rs6539870",
148+
"url": "https://www.clinpgx.org/variant/PA166154595",
149+
"score": 1.0
150+
},
151+
"etoposide": {
152+
"raw_input": "etoposide",
153+
"id": "PA449552",
154+
"normalized_term": "etoposide",
155+
"url": "https://www.clinpgx.org/chemical/PA449552",
156+
"score": 1.0
157+
}
158+
}
159+
}
160+
```
161+
162+
## Future Work
163+
164+
### Gene Normalization
165+
Currently not implemented. Will require:
166+
- HGNC gene symbol lookup
167+
- Gene ID normalization
168+
- Alias resolution
169+
170+
### Phenotype Normalization
171+
Currently not implemented. Will require:
172+
- Ontology mapping (HPO, MeSH, etc.)
173+
- Phenotype term standardization
