An end-to-end machine learning project for predicting patient triage priority from free-text symptom descriptions using both traditional ML models and modern transformer-based approaches.
This project demonstrates a complete ML pipeline for healthcare triage classification, featuring:
- Goal: Predict patient triage priority (Immediate vs Urgent vs Routine) from symptom descriptions
- Data: Triage dataset with 2,000 patient records (synthetic, generated for demonstration; see Limitations)
- Models: Logistic Regression (98.8%), Random Forest, Naive Bayes, and BERT
- Evaluation: Comprehensive metrics including accuracy, F1-score, precision, recall, and confusion matrices
- Demo: Interactive Streamlit web application with all 4 models
- GPU Support: Train BERT 10-20x faster with NVIDIA GPU (RTX 3080 tested)
```bash
# 1. Install dependencies (with GPU support for NVIDIA users)
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 2. Train all models
python train_models.py

# 3. Launch the app
streamlit run app.py
```

That's it! The app will open in your browser with all 4 trained models ready to use.
```
Triage Text Classifier/
├── real_triage_data.csv       # Triage dataset (2,000 records)
├── preprocessing.py           # Text preprocessing module
├── baseline_models.py         # Traditional ML models (LR, RF, NB)
├── bert_model.py              # BERT transformer model
├── train_models.py            # Easy training script for all models
├── app.py                     # Streamlit demo application
├── notebook.ipynb             # Complete analysis notebook (optional)
├── requirements.txt           # Python dependencies
├── models/                    # Saved baseline models
│   ├── logistic_regression.joblib
│   ├── random_forest.joblib
│   ├── naive_bayes.joblib
│   └── tfidf_vectorizer.joblib
├── bert_model/                # Saved BERT model
│   ├── model.safetensors
│   ├── config.json
│   ├── tokenizer files
│   └── label_encoder.joblib
└── README.md                  # This file
```
```bash
# Clone or download the project
cd "Triage Text Classifier"

# Install dependencies
pip install -r requirements.txt

# For GPU support (NVIDIA GPU users - RECOMMENDED for BERT):
# Uninstall CPU-only PyTorch first
pip uninstall torch torchvision torchaudio -y

# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Download required NLTK data (also done automatically on first run)
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"
```

💡 GPU Note: If you have an NVIDIA GPU, installing PyTorch with CUDA support makes BERT training 10-20x faster (~50 seconds vs 5-15 minutes).
```bash
python train_models.py
```

This script will:
- Load the real triage data (2,000 records)
- Preprocess the text
- Train all baseline models (Logistic Regression, Random Forest, Naive Bayes)
- Optionally train BERT (you'll be prompted)
- Save all models for use in the app
Expected Results:
- Logistic Regression: ~98% accuracy
- Naive Bayes: ~94% accuracy
- Random Forest: ~90% accuracy
- BERT: High accuracy with deep contextual understanding (exact score depends on the training run)
```bash
streamlit run app.py
```

This launches an interactive web application where you can:
- Enter symptom descriptions and get priority predictions
- Compare all 4 models (Logistic Regression, Random Forest, Naive Bayes, BERT)
- View confidence scores and probability distributions
- Explore data analysis and model performance
The app will automatically load all trained models.
For detailed analysis and experimentation:
```bash
jupyter notebook notebook.ipynb
```

The notebook includes:
- Data exploration and visualization
- Text preprocessing pipeline
- Training of baseline models
- BERT model training and fine-tuning
- Comprehensive model evaluation and comparison
- Performance visualizations
- Total Samples: 2,000 triage records
- Priority Distribution:
- Routine: 1,100 samples (55%)
- Urgent: 600 samples (30%)
- Immediate: 300 samples (15%)
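To double-check these counts yourself, a one-line sketch (the column name follows the sample records below):

```python
# Print the class balance of the triage labels
import pandas as pd

df = pd.read_csv("real_triage_data.csv")
print(df["triage_category"].value_counts(normalize=True))
```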
| patient_id | age | gender | presenting_complaint | triage_category | arrival_time | disposition |
|---|---|---|---|---|---|---|
| 1 | 69 | M | Chest pain radiating to left arm with diaphoresis | Immediate | 2025-10-05 | Admitted |
| 2 | 32 | F | Mild chest pain, musculoskeletal | Routine | 2025-09-19 | Discharged |
- Average text length: ~50 characters
- Average word count: ~8 words
- Text preprocessing: Lowercasing, tokenization, stopword removal, lemmatization
Logistic Regression:
- Purpose: Fast, interpretable baseline
- Features: TF-IDF vectorization
- Strengths: Fast training, good interpretability
- Use case: Quick predictions, feature importance analysis

Random Forest:
- Purpose: Robust ensemble method
- Features: TF-IDF vectorization
- Strengths: Resists overfitting, provides feature importance
- Use case: Stable predictions, feature analysis

Naive Bayes:
- Purpose: Probabilistic text classifier
- Features: TF-IDF vectorization
- Strengths: Fast, well suited to text classification
- Use case: Quick text classification
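All three baselines share the same TF-IDF front end. As a minimal, self-contained sketch of that combination (assuming scikit-learn and the `presenting_complaint` / `triage_category` columns from the sample records above; the project's actual `BaselineModels` class wraps this logic):

```python
# Minimal TF-IDF + Logistic Regression sketch. Column names follow the
# dataset sample above; the project's BaselineModels class is authoritative.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("real_triage_data.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["presenting_complaint"], df["triage_category"],
    test_size=0.2, stratify=df["triage_category"], random_state=42,
)

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))  # assumed settings
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)
print("Test accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```

Swapping in `RandomForestClassifier` or `MultinomialNB` on the same features gives the other two baselines.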
BERT:
- Model: `bert-base-uncased`
- Purpose: State-of-the-art text understanding
- Features: Pre-trained embeddings, fine-tuned for triage
- Strengths: Deep contextual understanding, high accuracy
- Use case: Best performance, complex text understanding
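For reference, loading `bert-base-uncased` for a 3-class fine-tune with HuggingFace transformers typically looks like the sketch below; the project's `BertTriageClassifier` wraps this pattern and may configure details differently.

```python
# Sketch: bert-base-uncased set up for 3-way sequence classification.
# The project's BertTriageClassifier wraps this; details may differ.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # Immediate / Urgent / Routine
)

# Tokenize a complaint and get class probabilities from the classification head
batch = tokenizer(
    ["Chest pain radiating to left arm with diaphoresis"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
print(model(**batch).logits.softmax(dim=-1))
```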
Baseline Models:
- Logistic Regression: 98.8% accuracy, 98.7% F1 score ⭐ Best baseline
- Naive Bayes: 94.0% accuracy, 93.9% F1 score
- Random Forest: 90.0% accuracy, 89.9% F1 score
Advanced Model:
- BERT: High accuracy with deep contextual understanding
- Training time: ~50 seconds (with RTX 3080 GPU)
- Training time: 5-15 minutes (CPU only)
- Accuracy: Overall correct predictions
- F1 Score: Harmonic mean of precision and recall
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- Confusion Matrix: Detailed classification breakdown
All models are evaluated using 5-fold cross-validation to ensure robust performance estimates.
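Continuing the TF-IDF sketch from the model section, those metrics and the 5-fold protocol can be reproduced with scikit-learn as follows (again a sketch, not the project's own evaluation code):

```python
# Sketch: the metrics above, computed for the clf/vectorizer from the
# earlier TF-IDF example; not the project's own evaluation code.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import cross_val_score

y_pred = clf.predict(vectorizer.transform(X_test))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("F1 macro :", f1_score(y_test, y_pred, average="macro"))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print(confusion_matrix(y_test, y_pred))

# 5-fold cross-validation on the full dataset
scores = cross_val_score(clf, vectorizer.transform(df["presenting_complaint"]),
                         df["triage_category"], cv=5, scoring="f1_macro")
print(f"5-fold F1: {scores.mean():.3f} ± {scores.std():.3f}")
```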
- Text Cleaning: Remove special characters, normalize case
- Tokenization: Split text into individual words
- Stopword Removal: Remove common words (with medical-specific additions)
- Lemmatization: Reduce words to their base forms
- Vectorization: Convert to numerical features (TF-IDF or BERT embeddings)
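Roughly, these steps translate to the NLTK sketch below (illustrative only; the project's `TextPreprocessor` in `preprocessing.py` is the authoritative version and also adds medical-specific stopwords).

```python
# Illustrative NLTK version of the preprocessing steps above;
# TextPreprocessor in preprocessing.py is the authoritative one.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"[^a-z\s]", " ", text.lower())         # clean + normalize case
    tokens = word_tokenize(text)                          # tokenize
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # lemmatize

print(preprocess("Patient experiencing severe chest pain"))
```

The training pipeline then proceeds as follows: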
- Data Splitting: 80% training, 20% testing with stratification
- Feature Engineering: TF-IDF for baseline models, BERT embeddings for transformer
- Model Training: Optimized hyperparameters for each model type
- Evaluation: Comprehensive metrics and visualizations
- Model Persistence: Save trained models for deployment
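The persistence step, for instance, is plain joblib serialization into the paths shown in the project layout (a sketch; `train_models.py` is the real entry point):

```python
# Sketch of the model-persistence step; paths mirror the project layout.
import joblib

# After training: save the fitted vectorizer and a model
joblib.dump(vectorizer, "models/tfidf_vectorizer.joblib")
joblib.dump(clf, "models/logistic_regression.joblib")

# Later (e.g. in app.py): reload them for predictions
vectorizer = joblib.load("models/tfidf_vectorizer.joblib")
clf = joblib.load("models/logistic_regression.joblib")
```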
The project includes comprehensive visualizations:
- Data Distribution: Priority levels, text length distributions
- Model Performance: Accuracy, F1-score comparisons
- Confusion Matrices: Detailed classification results
- Feature Importance: Most important words/features
- Cross-Validation: Model stability analysis
The interactive web application provides:
- Text input for symptom descriptions
- Model selection (Logistic Regression, Random Forest, Naive Bayes, BERT)
- Real-time priority prediction
- Confidence scores and probability distributions
- Sample examples to try
- Dataset overview and statistics
- Priority distribution visualizations
- Text characteristics analysis
- Interactive data exploration
- Performance metrics comparison
- Cross-validation results
- Detailed evaluation tables
- Project overview and features
- Technical specifications
- Usage guidelines and disclaimers
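At its core, the prediction tab reduces to a few Streamlit calls; here is a heavily simplified sketch (the real `app.py` handles all four models, the BERT path, and the analysis tabs):

```python
# Heavily simplified sketch of the prediction tab; the real app.py
# handles all four models, the BERT path, and the analysis tabs.
import joblib
import pandas as pd
import streamlit as st

st.title("Triage Text Classifier")
text = st.text_area("Symptom description",
                    "Patient experiencing severe chest pain")

if st.button("Predict priority"):
    vectorizer = joblib.load("models/tfidf_vectorizer.joblib")
    clf = joblib.load("models/logistic_regression.joblib")
    proba = clf.predict_proba(vectorizer.transform([text]))[0]
    st.write("Prediction:", clf.classes_[proba.argmax()])
    st.bar_chart(pd.Series(proba, index=clf.classes_))  # probability distribution
```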
```
pandas==2.1.4           # Data manipulation
numpy==1.24.3           # Numerical computing
scikit-learn==1.3.2     # Machine learning
transformers==4.36.2    # HuggingFace transformers
torch>=2.5.0            # PyTorch for BERT (CUDA version recommended)
accelerate>=0.26.0      # Required for BERT training
nltk==3.8.1             # Natural language processing
spacy==3.7.2            # Advanced NLP
matplotlib==3.8.2       # Plotting
seaborn==0.13.0         # Statistical visualization
streamlit==1.29.0       # Web application
jupyter==1.0.0          # Notebook environment
plotly==5.17.0          # Interactive plots
wordcloud==1.9.2        # Word cloud generation
tqdm==4.66.1            # Progress bars
joblib==1.3.2           # Model serialization
```
For NVIDIA GPU users:
BERT training is 10-20x faster with GPU support:
- With GPU (RTX 3080): ~50 seconds for 3 epochs
- Without GPU (CPU only): 5-15 minutes for 3 epochs
To enable GPU support, install PyTorch with CUDA:

```bash
# Uninstall CPU-only version
pip uninstall torch torchvision torchaudio -y

# Install CUDA-enabled version (for CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Check GPU availability:

```python
import torch
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
```
```python
from preprocessing import TextPreprocessor
from baseline_models import BaselineModels
import joblib

# Load trained models and the fitted TF-IDF vectorizer
# (path follows the project layout above)
models = BaselineModels()
models.load_models('models')
vectorizer = joblib.load('models/tfidf_vectorizer.joblib')

# Preprocess text
preprocessor = TextPreprocessor()
text = "Patient experiencing severe chest pain"
processed_text = preprocessor.preprocess_text(text)

# Make prediction
prediction = models.predict(vectorizer.transform([processed_text]))
print(f"Predicted priority: {prediction}")
```
```python
from bert_model import BertTriageClassifier

# Load BERT model
bert_model = BertTriageClassifier()
bert_model.load_model('bert_model')

# Make prediction
text = "Patient experiencing severe chest pain"
prediction, probabilities = bert_model.predict([text])
print(f"BERT prediction: {prediction[0]}")
print(f"Confidence: {max(probabilities[0]):.3f}")
```
```python
# Run the complete training and evaluation pipeline
from preprocessing import load_and_preprocess_data
from baseline_models import BaselineModels
from bert_model import BertTriageClassifier

# Load and preprocess data
X_train, X_test, y_train, y_test, vectorizer = load_and_preprocess_data()

# Train baseline models
baseline_models = BaselineModels()
baseline_models.train_models(X_train, y_train)
results = baseline_models.evaluate_models(X_test, y_test)

# Train BERT model
bert_model = BertTriageClassifier()
train_dataset, test_dataset = bert_model.prepare_data(X_train, y_train, X_test, y_test)
bert_model.train(train_dataset, test_dataset)
bert_results = bert_model.evaluate(test_dataset)
```

This is a demonstration project using synthetic data. It should NOT be used for actual medical triage decisions without:
- Proper validation with real medical data
- Regulatory approval from healthcare authorities
- Extensive testing in clinical environments
- Integration with proper medical protocols
- The dataset is synthetic and generated for demonstration purposes
- Real-world triage data would likely have different characteristics
- Model performance on synthetic data may not reflect real-world performance
- BERT model requires significant computational resources
- Model accuracy depends on data quality and preprocessing
- Performance may vary with different text styles or medical terminology
- Real Data Integration
  - Collect actual triage data (with proper permissions)
  - Validate models on real-world scenarios
- Advanced Models
  - Experiment with medical-specific BERT variants
  - Implement ensemble methods
  - Try few-shot learning approaches
- Enhanced Features
  - Multi-modal inputs (text + structured data)
  - Temporal features for symptom progression
  - Integration with electronic health records
- Deployment Improvements
  - Model versioning and A/B testing
  - Real-time monitoring and alerting
  - API development for integration
- User Experience
  - Mobile application
  - Voice input capabilities
  - Integration with existing healthcare systems
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes: Follow the existing code style
- Add tests: Ensure your changes work correctly
- Submit a pull request: Describe your changes clearly
- Follow PEP 8 style guidelines
- Add docstrings to all functions and classes
- Include type hints where appropriate
- Write comprehensive tests
- Update documentation for new features
- Issues: Report bugs or request features via GitHub issues
- Documentation: Check this README and code comments
- Community: Join discussions in the project repository
- Model Loading Errors
  - Error: "Could not load BERT model" or "No such file or directory: 'models/tfidf_vectorizer.joblib'"
  - Solution: Train the models first by running `python train_models.py`
  - Models must be trained before running the Streamlit app
- PyTorch CUDA Issues
  - Error: PyTorch not detecting GPU
  - Solution: Install CUDA-enabled PyTorch:
    ```bash
    pip uninstall torch torchvision torchaudio -y
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    ```
  - Verify with: `python -c "import torch; print(torch.cuda.is_available())"`
- BERT Training Errors
  - Error: "accelerate not found" or "WANDB_DISABLED" warnings
  - Solution: Install the required package with `pip install "accelerate>=0.26.0"` (quote the spec so the shell does not treat `>=` as redirection)
  - Wandb warnings can be safely ignored
- Memory Issues
  - BERT requires 8GB+ RAM for training
  - Baseline models work fine with 4GB+ RAM
  - If out of memory, train only the baseline models (skip BERT)
- Dependency Conflicts
  - Use virtual environments to isolate dependencies
  - Recommended: create a fresh conda/venv environment
This project is for educational and demonstration purposes. Please ensure compliance with local regulations and healthcare data protection laws if adapting for real-world use.
- HuggingFace: For the transformers library and pre-trained BERT models
- scikit-learn: For comprehensive machine learning tools
- Streamlit: For the intuitive web application framework
- NHS: For inspiration in healthcare triage systems
Built with ❤️ for healthcare innovation and education