Triage Text Classifier

An end-to-end machine learning project for predicting patient triage priority from free-text symptom descriptions using both traditional ML models and modern transformer-based approaches.

🎯 Project Overview

This project demonstrates a complete ML pipeline for healthcare triage classification, featuring:

  • Goal: Predict patient triage priority (Immediate vs Urgent vs Routine) from symptom descriptions
  • Data: Real-world triage dataset with 2,000 patient records
  • Models: Logistic Regression (98.8%), Random Forest, Naive Bayes, and BERT
  • Evaluation: Comprehensive metrics including accuracy, F1-score, precision, recall, and confusion matrices
  • Demo: Interactive Streamlit web application with all 4 models
  • GPU Support: Train BERT 10-20x faster with NVIDIA GPU (RTX 3080 tested)

⚡ Quick 3-Step Setup

# 1. Install dependencies (second line is optional: CUDA-enabled PyTorch for NVIDIA GPU users)
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 2. Train all models
python train_models.py

# 3. Launch the app
streamlit run app.py

That's it! The app will open in your browser with all 4 trained models ready to use.

๐Ÿ“ Project Structure

Triage Text Classifier/
├── real_triage_data.csv          # Real triage dataset (2,000 records)
├── preprocessing.py              # Text preprocessing module
├── baseline_models.py            # Traditional ML models (LR, RF, NB)
├── bert_model.py                 # BERT transformer model
├── train_models.py               # 🆕 Easy training script for all models
├── app.py                        # Streamlit demo application
├── notebook.ipynb                # Complete analysis notebook (optional)
├── requirements.txt              # Python dependencies
├── models/                       # 📁 Saved baseline models
│   ├── logistic_regression.joblib
│   ├── random_forest.joblib
│   ├── naive_bayes.joblib
│   └── tfidf_vectorizer.joblib
├── bert_model/                   # 📁 Saved BERT model
│   ├── model.safetensors
│   ├── config.json
│   ├── tokenizer files
│   └── label_encoder.joblib
└── README.md                     # This file

🚀 Quick Start

1. Installation

# Clone or download the project
cd "Triage Text Classifier"

# Install dependencies
pip install -r requirements.txt

# For GPU support (NVIDIA GPU users - RECOMMENDED for BERT):
# Uninstall CPU-only PyTorch first
pip uninstall torch torchvision torchaudio -y
# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Download required NLTK data (also handled automatically on first run)
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"

💡 GPU Note: If you have an NVIDIA GPU, installing PyTorch with CUDA support will make BERT training 10-20x faster (50 seconds vs 5-15 minutes).

2. Train All Models (Recommended - Easiest Method)

python train_models.py

This script will:

  • Load the real triage data (2,000 records)
  • Preprocess the text
  • Train all baseline models (Logistic Regression, Random Forest, Naive Bayes)
  • Optionally train BERT (you'll be prompted)
  • Save all models for use in the app

Expected Results:

  • Logistic Regression: ~98% accuracy
  • Naive Bayes: ~94% accuracy
  • Random Forest: ~90% accuracy
  • BERT: High accuracy with deep understanding

3. Launch the Demo App

streamlit run app.py

This launches an interactive web application where you can:

  • Enter symptom descriptions and get priority predictions
  • Compare all 4 models (Logistic Regression, Random Forest, Naive Bayes, BERT)
  • View confidence scores and probability distributions
  • Explore data analysis and model performance

The app will automatically load all trained models.

4. (Alternative) Run the Complete Pipeline in Jupyter

For detailed analysis and experimentation:

jupyter notebook notebook.ipynb

The notebook includes:

  • Data exploration and visualization
  • Text preprocessing pipeline
  • Training of baseline models
  • BERT model training and fine-tuning
  • Comprehensive model evaluation and comparison
  • Performance visualizations

📊 Dataset Information

Real-World Triage Data Characteristics

  • Total Samples: 2,000 triage records
  • Priority Distribution:
    • Routine: 1,100 samples (55%)
    • Urgent: 600 samples (30%)
    • Immediate: 300 samples (15%)

Sample Data Format

patient_id  age  gender  presenting_complaint                               triage_category  arrival_time  disposition
1           69   M       Chest pain radiating to left arm with diaphoresis  Immediate        2025-10-05    Admitted
2           32   F       Mild chest pain, musculoskeletal                   Routine          2025-09-19    Discharged
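Assuming the column names shown above, a quick pandas sketch for inspecting the class balance (the two sample rows here stand in for the full real_triage_data.csv):

```python
import pandas as pd

# Two illustrative rows mirroring the sample format above
rows = [
    (1, 69, "M", "Chest pain radiating to left arm with diaphoresis",
     "Immediate", "2025-10-05", "Admitted"),
    (2, 32, "F", "Mild chest pain, musculoskeletal",
     "Routine", "2025-09-19", "Discharged"),
]
columns = ["patient_id", "age", "gender", "presenting_complaint",
           "triage_category", "arrival_time", "disposition"]
df = pd.DataFrame(rows, columns=columns)

# Class balance, as reported in the priority distribution above
counts = df["triage_category"].value_counts()
print(counts)
```

Running the same `value_counts()` on the full dataset should reproduce the 55/30/15 split listed above.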

Text Characteristics

  • Average text length: ~50 characters
  • Average word count: ~8 words
  • Text preprocessing: Lowercasing, tokenization, stopword removal, lemmatization

🤖 Models Implemented

1. Baseline Models (scikit-learn)

Logistic Regression

  • Purpose: Fast, interpretable baseline
  • Features: TF-IDF vectorization
  • Strengths: Fast training, good interpretability
  • Use case: Quick predictions, feature importance analysis

Random Forest

  • Purpose: Robust ensemble method
  • Features: TF-IDF vectorization
  • Strengths: Handles overfitting well, feature importance
  • Use case: Stable predictions, feature analysis

Naive Bayes

  • Purpose: Probabilistic text classifier
  • Features: TF-IDF vectorization
  • Strengths: Fast, good for text classification
  • Use case: Quick text classification

2. Advanced Model (Transformers)

BERT (Bidirectional Encoder Representations from Transformers)

  • Model: bert-base-uncased
  • Purpose: State-of-the-art text understanding
  • Features: Pre-trained embeddings, fine-tuned for triage
  • Strengths: Deep contextual understanding, high accuracy
  • Use case: Best performance, complex text understanding

📈 Performance Metrics

Actual Model Performance (on 2,000 real triage records)

Baseline Models:

  • Logistic Regression: 98.8% accuracy, 98.7% F1 score ⭐ Best baseline
  • Naive Bayes: 94.0% accuracy, 93.9% F1 score
  • Random Forest: 90.0% accuracy, 89.9% F1 score

Advanced Model:

  • BERT: High accuracy with deep contextual understanding
    • Training time: ~50 seconds (with RTX 3080 GPU)
    • Training time: 5-15 minutes (CPU only)

Evaluation Metrics

  • Accuracy: Overall correct predictions
  • F1 Score: Harmonic mean of precision and recall
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • Confusion Matrix: Detailed classification breakdown
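These metrics map directly onto scikit-learn calls. A minimal sketch with illustrative labels (not the project's actual predictions):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels only -- not the project's actual predictions
y_true = ["Routine", "Urgent", "Immediate", "Routine", "Urgent"]
y_pred = ["Routine", "Urgent", "Routine",   "Routine", "Urgent"]

acc = accuracy_score(y_true, y_pred)                       # 4 of 5 correct
f1 = f1_score(y_true, y_pred, average="weighted")
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec = recall_score(y_true, y_pred, average="weighted", zero_division=0)
# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=["Immediate", "Urgent", "Routine"])
```

Weighted averaging is one reasonable choice given the imbalanced priority classes; the project's own evaluation code may average differently.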

Cross-Validation

All models are evaluated using 5-fold cross-validation to ensure robust performance estimates.
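A minimal sketch of 5-fold cross-validation for a TF-IDF baseline, on a tiny made-up corpus standing in for real_triage_data.csv (hyperparameters here are placeholders, not the project's settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus -- stands in for real_triage_data.csv
texts = ["severe chest pain", "mild headache", "deep laceration",
         "chest tightness and sweating", "sore throat", "crushing chest pain",
         "minor cut", "possible fracture", "persistent vomiting",
         "itchy rash"] * 2
labels = ["Immediate", "Routine", "Urgent", "Immediate", "Routine",
          "Immediate", "Routine", "Urgent", "Urgent", "Routine"] * 2

# Putting the vectorizer inside the pipeline means each fold fits
# TF-IDF on its own training split, avoiding leakage into validation folds
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=5)  # one accuracy per fold
```

For classifiers, `cv=5` uses stratified folds by default, so each fold preserves the priority-class proportions.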

๐Ÿ› ๏ธ Technical Implementation

Text Preprocessing Pipeline

  1. Text Cleaning: Remove special characters, normalize case
  2. Tokenization: Split text into individual words
  3. Stopword Removal: Remove common words (with medical-specific additions)
  4. Lemmatization: Reduce words to their base forms
  5. Vectorization: Convert to numerical features (TF-IDF or BERT embeddings)
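Steps 1-4 can be sketched in plain Python (the project itself uses NLTK's tokenizer, stopword list, and WordNet lemmatizer; the stopword set below is only an illustrative subset):

```python
import re

# Illustrative subset -- the project uses NLTK's full stopword list
# plus medical-specific additions
STOPWORDS = {"the", "a", "an", "is", "of", "to", "with", "and", "in", "on"}

def preprocess(text: str) -> list[str]:
    # 1. Clean: lowercase and strip non-alphabetic characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # 2. Tokenize: split on whitespace
    tokens = text.split()
    # 3. Remove stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Lemmatization would go here (WordNetLemmatizer in the project)
    return tokens

print(preprocess("Patient is experiencing severe chest pain!"))
# → ['patient', 'experiencing', 'severe', 'chest', 'pain']
```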

Model Training Process

  1. Data Splitting: 80% training, 20% testing with stratification
  2. Feature Engineering: TF-IDF for baseline models, BERT embeddings for transformer
  3. Model Training: Optimized hyperparameters for each model type
  4. Evaluation: Comprehensive metrics and visualizations
  5. Model Persistence: Save trained models for deployment
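Steps 1-3 can be sketched with scikit-learn on a tiny illustrative corpus (the data and hyperparameters below are placeholders, not the project's actual settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny illustrative corpus -- stands in for real_triage_data.csv
texts = ["severe chest pain", "mild headache", "deep laceration",
         "chest tightness and sweating", "sore throat", "crushing chest pain",
         "minor cut", "possible fracture", "persistent vomiting",
         "itchy rash"] * 2
labels = ["Immediate", "Routine", "Urgent", "Immediate", "Routine",
          "Immediate", "Routine", "Urgent", "Urgent", "Routine"] * 2

# Step 1: stratified 80/20 split keeps class proportions in both sets
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Step 2: fit TF-IDF on training text only, then reuse its vocabulary
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# Steps 3-4: train and evaluate
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Step 5: persist for the app, as the project does with joblib, e.g.
# joblib.dump(clf, "models/logistic_regression.joblib")
```

Fitting the vectorizer on the training split only (and merely transforming the test split) is what keeps the reported test metrics honest.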

🎨 Visualizations

The project includes comprehensive visualizations:

  • Data Distribution: Priority levels, text length distributions
  • Model Performance: Accuracy, F1-score comparisons
  • Confusion Matrices: Detailed classification results
  • Feature Importance: Most important words/features
  • Cross-Validation: Model stability analysis

๐ŸŒ Web Application Features

Streamlit Demo App

The interactive web application provides:

🔮 Prediction Page

  • Text input for symptom descriptions
  • Model selection (Logistic Regression, Random Forest, Naive Bayes, BERT)
  • Real-time priority prediction
  • Confidence scores and probability distributions
  • Sample examples to try

📊 Data Analysis Page

  • Dataset overview and statistics
  • Priority distribution visualizations
  • Text characteristics analysis
  • Interactive data exploration

🤖 Model Performance Page

  • Performance metrics comparison
  • Cross-validation results
  • Detailed evaluation tables

โ„น๏ธ About Page

  • Project overview and features
  • Technical specifications
  • Usage guidelines and disclaimers

🔧 Dependencies

Core Libraries

pandas==2.1.4              # Data manipulation
numpy==1.24.3               # Numerical computing
scikit-learn==1.3.2         # Machine learning
transformers==4.36.2        # HuggingFace transformers
torch>=2.5.0               # PyTorch for BERT (CUDA version recommended)
accelerate>=0.26.0         # Required for BERT training
nltk==3.8.1                 # Natural language processing
spacy==3.7.2                # Advanced NLP
matplotlib==3.8.2           # Plotting
seaborn==0.13.0             # Statistical visualization
streamlit==1.29.0           # Web application
jupyter==1.0.0              # Notebook environment
plotly==5.17.0              # Interactive plots
wordcloud==1.9.2            # Word cloud generation
tqdm==4.66.1                # Progress bars
joblib==1.3.2               # Model serialization

GPU Support (Optional but Highly Recommended)

For NVIDIA GPU users:

BERT training is 10-20x faster with GPU support:

  • With GPU (RTX 3080): ~50 seconds for 3 epochs
  • Without GPU (CPU only): 5-15 minutes for 3 epochs

To enable GPU support, install PyTorch with CUDA:

# Uninstall CPU-only version
pip uninstall torch torchvision torchaudio -y

# Install CUDA-enabled version (for CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Check GPU availability:

import torch
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

📋 Usage Examples

1. Basic Prediction

import joblib

from preprocessing import TextPreprocessor
from baseline_models import BaselineModels

# Load trained models and the fitted TF-IDF vectorizer
models = BaselineModels()
models.load_models('models')
vectorizer = joblib.load('models/tfidf_vectorizer.joblib')

# Preprocess text
preprocessor = TextPreprocessor()
text = "Patient experiencing severe chest pain"
processed_text = preprocessor.preprocess_text(text)

# Make prediction
prediction = models.predict(vectorizer.transform([processed_text]))
print(f"Predicted priority: {prediction}")

2. BERT Prediction

from bert_model import BertTriageClassifier

# Load BERT model
bert_model = BertTriageClassifier()
bert_model.load_model('bert_model')

# Make prediction
text = "Patient experiencing severe chest pain"
prediction, probabilities = bert_model.predict([text])
print(f"BERT prediction: {prediction[0]}")
print(f"Confidence: {max(probabilities[0]):.3f}")

3. Complete Pipeline

# Run the complete training and evaluation pipeline
from preprocessing import load_and_preprocess_data
from baseline_models import BaselineModels
from bert_model import BertTriageClassifier

# Load and preprocess data
X_train, X_test, y_train, y_test, vectorizer = load_and_preprocess_data()

# Train baseline models
baseline_models = BaselineModels()
baseline_models.train_models(X_train, y_train)
results = baseline_models.evaluate_models(X_test, y_test)

# Train BERT model
bert_model = BertTriageClassifier()
train_dataset, test_dataset = bert_model.prepare_data(X_train, y_train, X_test, y_test)
bert_model.train(train_dataset, test_dataset)
bert_results = bert_model.evaluate(test_dataset)

โš ๏ธ Important Disclaimers

Medical Use Disclaimer

This is a demonstration project using synthetic data. It should NOT be used for actual medical triage decisions without:

  • Proper validation with real medical data
  • Regulatory approval from healthcare authorities
  • Extensive testing in clinical environments
  • Integration with proper medical protocols

Data Disclaimer

  • The dataset is synthetic and generated for demonstration purposes
  • Real-world triage data would likely have different characteristics
  • Model performance on synthetic data may not reflect real-world performance

Technical Limitations

  • BERT model requires significant computational resources
  • Model accuracy depends on data quality and preprocessing
  • Performance may vary with different text styles or medical terminology

🔮 Future Enhancements

Potential Improvements

  1. Real Data Integration

    • Collect actual triage data (with proper permissions)
    • Validate models on real-world scenarios
  2. Advanced Models

    • Experiment with medical-specific BERT variants
    • Implement ensemble methods
    • Try few-shot learning approaches
  3. Enhanced Features

    • Multi-modal inputs (text + structured data)
    • Temporal features for symptom progression
    • Integration with electronic health records
  4. Deployment Improvements

    • Model versioning and A/B testing
    • Real-time monitoring and alerting
    • API development for integration
  5. User Experience

    • Mobile application
    • Voice input capabilities
    • Integration with existing healthcare systems

๐Ÿค Contributing

How to Contribute

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes: Follow the existing code style
  4. Add tests: Ensure your changes work correctly
  5. Submit a pull request: Describe your changes clearly

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add docstrings to all functions and classes
  • Include type hints where appropriate
  • Write comprehensive tests
  • Update documentation for new features

📞 Support

Getting Help

  • Issues: Report bugs or request features via GitHub issues
  • Documentation: Check this README and code comments
  • Community: Join discussions in the project repository

Common Issues

  1. Model Loading Errors

    • Error: "Could not load BERT model" or "No such file or directory: 'models/tfidf_vectorizer.joblib'"
    • Solution: Train the models first by running python train_models.py
    • Models must be trained before running the Streamlit app
  2. PyTorch CUDA Issues

    • Error: PyTorch not detecting GPU
    • Solution: Install CUDA-enabled PyTorch:
      pip uninstall torch torchvision torchaudio -y
      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    • Verify with: python -c "import torch; print(torch.cuda.is_available())"
  3. BERT Training Errors

    • Error: "accelerate not found" or "WANDB_DISABLED" warnings
    • Solution: Install the required package: pip install "accelerate>=0.26.0" (quoted so the shell does not treat > as a redirect)
    • Wandb warnings can be safely ignored
  4. Memory Issues

    • BERT requires 8GB+ RAM for training
    • Baseline models work fine with 4GB+ RAM
    • If out of memory, train only baseline models (skip BERT)
  5. Dependency Conflicts

    • Use virtual environments to isolate dependencies
    • Recommended: Create a fresh conda/venv environment

📄 License

This project is for educational and demonstration purposes. Please ensure compliance with local regulations and healthcare data protection laws if adapting for real-world use.

๐Ÿ™ Acknowledgments

  • HuggingFace: For the transformers library and pre-trained BERT models
  • scikit-learn: For comprehensive machine learning tools
  • Streamlit: For the intuitive web application framework
  • NHS: For inspiration in healthcare triage systems

Built with ❤️ for healthcare innovation and education
