Commit 627619f

feat: Full annotation pipeline with reasonable schema for app
Merge pull request #13 from shloknatarajan/main
2 parents cdbb399 + a6d05de commit 627619f

24 files changed

Lines changed: 4426 additions & 441 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,7 @@ __pycache__
 # environments
 .pyenv
 .env
+.envrc
 
 # data
 data/articles/
@@ -25,6 +26,7 @@ data/unique_pmcids.json
 data/pmid_list.json
 data/downloaded_pmcids.json
 data/markdown
+data/extractions/
 
 *.zip
 *.tar.gz

README.MD

Lines changed: 5 additions & 0 deletions
@@ -44,3 +44,8 @@ We manage a few repos externally:
 ## System Overview
 ![Annotations Diagram](assets/annotations_diagram.svg)
 
+## Downloading the data
+```
+pixi run gdown --id 1qtQWvi0x_k5_JofgrfsgkWzlIdb6isr9
+unzip autogkb-data.zip
+```

benchmark_example.py

Lines changed: 0 additions & 274 deletions
This file was deleted.

docs/EFFICIENCY_ANALYSIS.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# AutoGKB Efficiency Analysis Report

## Overview
This report documents efficiency issues identified in the AutoGKB codebase and provides recommendations for improvements.

## Critical Efficiency Issues

### 1. Inefficient JSON File Loading (HIGH PRIORITY)
**Location**: `src/utils.py:79-84` - `get_true_variants()` function

**Issue**: The function opens and parses a JSON file on every call, causing unnecessary disk I/O operations.

```python
def get_true_variants(pmcid):
    true_variant_list = json.load(open("data/benchmark/true_variant_list.json"))
    return true_variant_list[pmcid]
```

**Impact**:
- Repeated file I/O operations for each function call
- JSON parsing overhead on every access
- Potential file handle leaks (file not properly closed)
- Poor performance when processing multiple PMCIDs

**Solution**: Implement module-level caching with lazy loading to load the JSON file only once.

### 2. Type Annotation Issues (MEDIUM PRIORITY)
**Locations**: Multiple files with incorrect type annotations

**Issues**:
- `src/utils.py`: Functions use `str = None` instead of `Optional[str]`
- `src/inference.py`: Multiple functions with incorrect None type annotations
- `src/article_parser.py`: Type mismatches in function parameters
- `src/components/`: Similar type annotation issues across component files

**Impact**:
- Static type checking failures
- Potential runtime errors
- Poor code maintainability
- IDE/tooling issues

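The `str = None` pattern from the list above can be illustrated with a small sketch. The function names here are hypothetical and not taken from the AutoGKB source; they only contrast the flagged annotation with the `Optional[str]` form.

```python
from typing import Optional


# Hypothetical example functions; not taken from the AutoGKB source.

# Flagged pattern: the annotation says `str`, but the default value is None.
# Type checkers that disallow implicit Optional report this as an error.
def describe_before(pmcid: str = None) -> str:
    return pmcid or "unknown"


# Corrected pattern: the annotation states that None is an accepted value.
def describe_after(pmcid: Optional[str] = None) -> str:
    return pmcid if pmcid is not None else "unknown"
```

Both functions behave the same at runtime; the difference is only in what static tooling is told about the parameter.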
### 3. Redundant Data Processing (MEDIUM PRIORITY)
**Location**: `src/components/variant_association_pipeline.py`

**Issue**: The pipeline calls `get_article_text()` multiple times for the same article across different processing steps.

**Impact**:
- Redundant file I/O operations
- Unnecessary string processing
- Memory inefficiency

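One possible way to avoid the repeated reads is to memoize the article loader. The sketch below is an assumption-laden illustration: `get_article_text_cached`, `ARTICLE_DIR`, and the one-file-per-PMCID layout are hypothetical and do not reflect the real signature of `get_article_text()`.

```python
from functools import lru_cache
from pathlib import Path

# Assumed layout: one markdown file per PMCID under this directory.
ARTICLE_DIR = Path("data/markdown")


@lru_cache(maxsize=128)
def get_article_text_cached(pmcid: str, article_dir: Path = ARTICLE_DIR) -> str:
    """Read an article from disk once; repeated calls with the same arguments hit the cache."""
    return (article_dir / f"{pmcid}.md").read_text(encoding="utf-8")
```

An alternative with no hidden state is to load the text once at the start of the pipeline and pass it to each processing step as an argument.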
### 4. Inefficient List Iteration Patterns (LOW PRIORITY)
**Location**: `src/utils.py:55-66` - `compare_lists()` function

**Issue**: Multiple iterations over the same lists for coloring operations.

**Impact**:
- Multiple O(n) operations that could be combined
- Redundant set membership checks

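The report does not show the body of `compare_lists()`, so the following is only a sketch of the general idea: build the lookup set once and classify each item in a single pass. The function name `classify_items` and its return shape are hypothetical.

```python
from typing import Iterable, List, Tuple


def classify_items(
    predicted: Iterable[str], expected: Iterable[str]
) -> Tuple[List[Tuple[str, bool]], List[str]]:
    """Flag each predicted item as matched or not, and list expected items that were missed."""
    expected_list = list(expected)
    expected_set = set(expected_list)  # built once, so each membership check is O(1)
    seen = set()
    marked = []
    for item in predicted:  # single pass over the predictions
        marked.append((item, item in expected_set))
        seen.add(item)
    missed = [item for item in expected_list if item not in seen]
    return marked, missed
```

A caller could then apply coloring to `marked` and `missed` without re-scanning either input list.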
## Implemented Fix

### JSON Caching Optimization
**File**: `src/utils.py`
**Function**: `get_true_variants()`

**Changes**:
- Added module-level cache variable `_true_variant_cache`
- Implemented lazy loading pattern
- Added proper error handling for missing files
- Used context manager for safe file handling

**Benefits**:
- JSON file is loaded once, on first use, instead of on every call
- Significant performance improvement for repeated calls
- Proper resource management
- Thread-safe implementation

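The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the lock, the `DEFAULT_PATH` name, the optional `path` parameter, and the choice to re-raise `FileNotFoundError` with a clearer message are all assumptions. Only `_true_variant_cache`, `get_true_variants()`, and the JSON path come from the report.

```python
import json
import threading
from typing import Any, Dict, Optional

DEFAULT_PATH = "data/benchmark/true_variant_list.json"

# Module-level cache, filled lazily on the first call.
_true_variant_cache: Optional[Dict[str, Any]] = None
# Assumption: the report does not say how thread safety is achieved; a lock is one option.
_cache_lock = threading.Lock()


def get_true_variants(pmcid: str, path: str = DEFAULT_PATH) -> Any:
    """Return the true variants for one PMCID, reading the JSON file at most once."""
    global _true_variant_cache
    if _true_variant_cache is None:  # skipped entirely once the cache is filled
        with _cache_lock:
            if _true_variant_cache is None:  # re-check: another thread may have loaded it
                try:
                    with open(path, encoding="utf-8") as f:  # context manager closes the handle
                        _true_variant_cache = json.load(f)
                except FileNotFoundError as exc:
                    # Assumed behavior: surface a clearer message and leave the cache empty.
                    raise FileNotFoundError(f"True-variant file not found: {path}") from exc
    return _true_variant_cache[pmcid]
```

Because the cache holds a single file, the `path` argument only matters until the first successful load; later calls reuse the data already in memory.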
## Recommendations for Future Improvements

1. **Type Annotations**: Fix all type annotation issues across the codebase
2. **Article Text Caching**: Implement caching for article text loading
3. **Batch Processing**: Optimize variant processing to handle multiple variants more efficiently
4. **Memory Management**: Review large data structure usage and implement streaming where appropriate
5. **Database Integration**: Consider using a database instead of JSON files for better performance

## Testing Recommendations

1. Create performance benchmarks for the JSON loading optimization
2. Add unit tests for the caching mechanism
3. Implement integration tests to ensure functionality is preserved
4. Add memory usage monitoring for large dataset processing

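A benchmark for the first recommendation might look like the sketch below. It is self-contained and does not touch the project's data: it writes a synthetic JSON file to a temporary location, and the loader functions are hypothetical stand-ins for the uncached and cached versions of `get_true_variants()`.

```python
import json
import os
import tempfile
import timeit


def load_uncached(path: str, key: str):
    """Stand-in for the original behavior: parse the file on every call."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)[key]


def make_cached_loader(path: str):
    """Stand-in for the fixed behavior: parse the file on the first call only."""
    cache = {}

    def load(key: str):
        if "data" not in cache:
            with open(path, encoding="utf-8") as f:
                cache["data"] = json.load(f)
        return cache["data"][key]

    return load


def run_benchmark(n_keys: int = 1000, calls: int = 200):
    """Return (uncached_seconds, cached_seconds, results_match) for a synthetic file."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({f"PMC{i}": [f"rs{i}"] for i in range(n_keys)}, f)
        load_cached = make_cached_loader(path)
        t_uncached = timeit.timeit(lambda: load_uncached(path, "PMC0"), number=calls)
        t_cached = timeit.timeit(lambda: load_cached("PMC0"), number=calls)
        results_match = load_uncached(path, "PMC0") == load_cached("PMC0")
    finally:
        os.remove(path)
    return t_uncached, t_cached, results_match
```

The `results_match` flag doubles as a basic check that caching preserves the returned values; actual timings will depend on file size and hardware.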
## Conclusion

The most critical efficiency issue was the repeated JSON file loading in `get_true_variants()`. This fix provides immediate performance benefits with minimal risk. The type annotation issues should be addressed in a follow-up PR to improve code quality and maintainability.
