Triage Text Classifier

An end-to-end machine learning project for predicting patient triage priority from free-text symptom descriptions using both traditional ML models and modern transformer-based approaches.

🎯 Project Overview

This project demonstrates a complete ML pipeline for healthcare triage classification, featuring:

  • Goal: Predict patient triage priority (Immediate vs Urgent vs Routine) from symptom descriptions
  • Data: Real-world triage dataset with 2,000 patient records
  • Models: Logistic Regression (98.8%), Random Forest, Naive Bayes, and BERT
  • Evaluation: Comprehensive metrics including accuracy, F1-score, precision, recall, and confusion matrices
  • Demo: Interactive Streamlit web application with all 4 models
  • GPU Support: Train BERT 10-20x faster with NVIDIA GPU (RTX 3080 tested)

⚡ Quick 3-Step Setup

# 1. Install dependencies (second line is optional: CUDA-enabled PyTorch for NVIDIA GPU users)
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 2. Train all models
python train_models.py

# 3. Launch the app
streamlit run app.py

That's it! The app will open in your browser with all 4 trained models ready to use.

๐Ÿ“ Project Structure

Triage Text Classifier/
├── real_triage_data.csv          # Real triage dataset (2,000 records)
├── preprocessing.py              # Text preprocessing module
├── baseline_models.py            # Traditional ML models (LR, RF, NB)
├── bert_model.py                 # BERT transformer model
├── train_models.py               # 🆕 Easy training script for all models
├── app.py                        # Streamlit demo application
├── notebook.ipynb                # Complete analysis notebook (optional)
├── requirements.txt              # Python dependencies
├── models/                       # 📁 Saved baseline models
│   ├── logistic_regression.joblib
│   ├── random_forest.joblib
│   ├── naive_bayes.joblib
│   └── tfidf_vectorizer.joblib
├── bert_model/                   # 📁 Saved BERT model
│   ├── model.safetensors
│   ├── config.json
│   ├── tokenizer files
│   └── label_encoder.joblib
└── README.md                     # This file

🚀 Quick Start

1. Installation

# Clone or download the project
cd "Triage Text Classifier"

# Install dependencies
pip install -r requirements.txt

# For GPU support (NVIDIA GPU users - RECOMMENDED for BERT):
# Uninstall CPU-only PyTorch first
pip uninstall torch torchvision torchaudio -y
# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Download required NLTK data (also handled automatically on first run)
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"

💡 GPU Note: If you have an NVIDIA GPU, installing PyTorch with CUDA support will make BERT training 10-20x faster (50 seconds vs 5-15 minutes).

2. Train All Models (Recommended - Easiest Method)

python train_models.py

This script will:

  • Load the real triage data (2,000 records)
  • Preprocess the text
  • Train all baseline models (Logistic Regression, Random Forest, Naive Bayes)
  • Optionally train BERT (you'll be prompted)
  • Save all models for use in the app

Expected Results:

  • Logistic Regression: ~98% accuracy
  • Naive Bayes: ~94% accuracy
  • Random Forest: ~90% accuracy
  • BERT: High accuracy with deep understanding

3. Launch the Demo App

streamlit run app.py

This launches an interactive web application where you can:

  • Enter symptom descriptions and get priority predictions
  • Compare all 4 models (Logistic Regression, Random Forest, Naive Bayes, BERT)
  • View confidence scores and probability distributions
  • Explore data analysis and model performance

The app will automatically load all trained models.

4. (Alternative) Run the Complete Pipeline in Jupyter

For detailed analysis and experimentation:

jupyter notebook notebook.ipynb

The notebook includes:

  • Data exploration and visualization
  • Text preprocessing pipeline
  • Training of baseline models
  • BERT model training and fine-tuning
  • Comprehensive model evaluation and comparison
  • Performance visualizations

📊 Dataset Information

Real-World Triage Data Characteristics

  • Total Samples: 2,000 triage records
  • Priority Distribution:
    • Routine: 1,100 samples (55%)
    • Urgent: 600 samples (30%)
    • Immediate: 300 samples (15%)

Sample Data Format

patient_id  age  gender  presenting_complaint                               triage_category  arrival_time  disposition
1           69   M       Chest pain radiating to left arm with diaphoresis  Immediate        2025-10-05    Admitted
2           32   F       Mild chest pain, musculoskeletal                   Routine          2025-09-19    Discharged
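Assuming the column names shown above, a quick pandas sketch for inspecting the class balance (the two sample rows here stand in for the full real_triage_data.csv):

```python
import pandas as pd

# Two illustrative rows mirroring the sample format above
rows = [
    (1, 69, "M", "Chest pain radiating to left arm with diaphoresis",
     "Immediate", "2025-10-05", "Admitted"),
    (2, 32, "F", "Mild chest pain, musculoskeletal",
     "Routine", "2025-09-19", "Discharged"),
]
columns = ["patient_id", "age", "gender", "presenting_complaint",
           "triage_category", "arrival_time", "disposition"]
df = pd.DataFrame(rows, columns=columns)

# Class balance, as reported in the priority distribution above
counts = df["triage_category"].value_counts()
print(counts)
```

Running the same `value_counts()` on the full dataset should reproduce the 55/30/15 split listed above.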

Text Characteristics

  • Average text length: ~50 characters
  • Average word count: ~8 words
  • Text preprocessing: Lowercasing, tokenization, stopword removal, lemmatization

🤖 Models Implemented

1. Baseline Models (scikit-learn)

Logistic Regression

  • Purpose: Fast, interpretable baseline
  • Features: TF-IDF vectorization
  • Strengths: Fast training, good interpretability
  • Use case: Quick predictions, feature importance analysis

Random Forest

  • Purpose: Robust ensemble method
  • Features: TF-IDF vectorization
  • Strengths: Handles overfitting well, feature importance
  • Use case: Stable predictions, feature analysis

Naive Bayes

  • Purpose: Probabilistic text classifier
  • Features: TF-IDF vectorization
  • Strengths: Fast, good for text classification
  • Use case: Quick text classification

2. Advanced Model (Transformers)

BERT (Bidirectional Encoder Representations from Transformers)

  • Model: bert-base-uncased
  • Purpose: State-of-the-art text understanding
  • Features: Pre-trained embeddings, fine-tuned for triage
  • Strengths: Deep contextual understanding, high accuracy
  • Use case: Best performance, complex text understanding

📈 Performance Metrics

Actual Model Performance (on 2,000 real triage records)

Baseline Models:

  • Logistic Regression: 98.8% accuracy, 98.7% F1 score ⭐ Best baseline
  • Naive Bayes: 94.0% accuracy, 93.9% F1 score
  • Random Forest: 90.0% accuracy, 89.9% F1 score

Advanced Model:

  • BERT: High accuracy with deep contextual understanding
    • Training time: ~50 seconds (with RTX 3080 GPU)
    • Training time: 5-15 minutes (CPU only)

Evaluation Metrics

  • Accuracy: Overall correct predictions
  • F1 Score: Harmonic mean of precision and recall
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • Confusion Matrix: Detailed classification breakdown
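These metrics map directly onto scikit-learn calls. A minimal sketch with illustrative labels (not the project's actual predictions):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels only -- not the project's actual predictions
y_true = ["Routine", "Urgent", "Immediate", "Routine", "Urgent"]
y_pred = ["Routine", "Urgent", "Routine",   "Routine", "Urgent"]

acc = accuracy_score(y_true, y_pred)                       # 4 of 5 correct
f1 = f1_score(y_true, y_pred, average="weighted")
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec = recall_score(y_true, y_pred, average="weighted", zero_division=0)
# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=["Immediate", "Urgent", "Routine"])
```

Weighted averaging is one reasonable choice given the imbalanced priority classes; the project's own evaluation code may average differently.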

Cross-Validation

All models are evaluated using 5-fold cross-validation to ensure robust performance estimates.
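A minimal sketch of 5-fold cross-validation for a TF-IDF baseline, on a tiny made-up corpus standing in for real_triage_data.csv (hyperparameters here are placeholders, not the project's settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus -- stands in for real_triage_data.csv
texts = ["severe chest pain", "mild headache", "deep laceration",
         "chest tightness and sweating", "sore throat", "crushing chest pain",
         "minor cut", "possible fracture", "persistent vomiting",
         "itchy rash"] * 2
labels = ["Immediate", "Routine", "Urgent", "Immediate", "Routine",
          "Immediate", "Routine", "Urgent", "Urgent", "Routine"] * 2

# Putting the vectorizer inside the pipeline means each fold fits
# TF-IDF on its own training split, avoiding leakage into validation folds
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=5)  # one accuracy per fold
```

For classifiers, `cv=5` uses stratified folds by default, so each fold preserves the priority-class proportions.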

๐Ÿ› ๏ธ Technical Implementation

Text Preprocessing Pipeline

  1. Text Cleaning: Remove special characters, normalize case
  2. Tokenization: Split text into individual words
  3. Stopword Removal: Remove common words (with medical-specific additions)
  4. Lemmatization: Reduce words to their base forms
  5. Vectorization: Convert to numerical features (TF-IDF or BERT embeddings)
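Steps 1-4 can be sketched in plain Python (the project itself uses NLTK's tokenizer, stopword list, and WordNet lemmatizer; the stopword set below is only an illustrative subset):

```python
import re

# Illustrative subset -- the project uses NLTK's full stopword list
# plus medical-specific additions
STOPWORDS = {"the", "a", "an", "is", "of", "to", "with", "and", "in", "on"}

def preprocess(text: str) -> list[str]:
    # 1. Clean: lowercase and strip non-alphabetic characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # 2. Tokenize: split on whitespace
    tokens = text.split()
    # 3. Remove stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Lemmatization would go here (WordNetLemmatizer in the project)
    return tokens

print(preprocess("Patient is experiencing severe chest pain!"))
# → ['patient', 'experiencing', 'severe', 'chest', 'pain']
```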

Model Training Process

  1. Data Splitting: 80% training, 20% testing with stratification
  2. Feature Engineering: TF-IDF for baseline models, BERT embeddings for transformer
  3. Model Training: Optimized hyperparameters for each model type
  4. Evaluation: Comprehensive metrics and visualizations
  5. Model Persistence: Save trained models for deployment
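Steps 1-3 can be sketched with scikit-learn on a tiny illustrative corpus (the data and hyperparameters below are placeholders, not the project's actual settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny illustrative corpus -- stands in for real_triage_data.csv
texts = ["severe chest pain", "mild headache", "deep laceration",
         "chest tightness and sweating", "sore throat", "crushing chest pain",
         "minor cut", "possible fracture", "persistent vomiting",
         "itchy rash"] * 2
labels = ["Immediate", "Routine", "Urgent", "Immediate", "Routine",
          "Immediate", "Routine", "Urgent", "Urgent", "Routine"] * 2

# Step 1: stratified 80/20 split keeps class proportions in both sets
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Step 2: fit TF-IDF on training text only, then reuse its vocabulary
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# Steps 3-4: train and evaluate
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Step 5: persist for the app, as the project does with joblib, e.g.
# joblib.dump(clf, "models/logistic_regression.joblib")
```

Fitting the vectorizer on the training split only (and merely transforming the test split) is what keeps the reported test metrics honest.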

🎨 Visualizations

The project includes comprehensive visualizations:

  • Data Distribution: Priority levels, text length distributions
  • Model Performance: Accuracy, F1-score comparisons
  • Confusion Matrices: Detailed classification results
  • Feature Importance: Most important words/features
  • Cross-Validation: Model stability analysis

๐ŸŒ Web Application Features

Streamlit Demo App

The interactive web application provides:

🔮 Prediction Page

  • Text input for symptom descriptions
  • Model selection (Logistic Regression, Random Forest, Naive Bayes, BERT)
  • Real-time priority prediction
  • Confidence scores and probability distributions
  • Sample examples to try

📊 Data Analysis Page

  • Dataset overview and statistics
  • Priority distribution visualizations
  • Text characteristics analysis
  • Interactive data exploration

🤖 Model Performance Page

  • Performance metrics comparison
  • Cross-validation results
  • Detailed evaluation tables

โ„น๏ธ About Page

  • Project overview and features
  • Technical specifications
  • Usage guidelines and disclaimers

🔧 Dependencies

Core Libraries

pandas==2.1.4              # Data manipulation
numpy==1.24.3               # Numerical computing
scikit-learn==1.3.2         # Machine learning
transformers==4.36.2        # HuggingFace transformers
torch>=2.5.0               # PyTorch for BERT (CUDA version recommended)
accelerate>=0.26.0         # Required for BERT training
nltk==3.8.1                 # Natural language processing
spacy==3.7.2                # Advanced NLP
matplotlib==3.8.2           # Plotting
seaborn==0.13.0             # Statistical visualization
streamlit==1.29.0           # Web application
jupyter==1.0.0              # Notebook environment
plotly==5.17.0              # Interactive plots
wordcloud==1.9.2            # Word cloud generation
tqdm==4.66.1                # Progress bars
joblib==1.3.2               # Model serialization

GPU Support (Optional but Highly Recommended)

For NVIDIA GPU users:

BERT training is 10-20x faster with GPU support:

  • With GPU (RTX 3080): ~50 seconds for 3 epochs
  • Without GPU (CPU only): 5-15 minutes for 3 epochs

To enable GPU support, install PyTorch with CUDA:

# Uninstall CPU-only version
pip uninstall torch torchvision torchaudio -y

# Install CUDA-enabled version (for CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Check GPU availability:

import torch
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

📋 Usage Examples

1. Basic Prediction

import joblib

from preprocessing import TextPreprocessor
from baseline_models import BaselineModels

# Load trained models and the fitted TF-IDF vectorizer
models = BaselineModels()
models.load_models('models')
vectorizer = joblib.load('models/tfidf_vectorizer.joblib')

# Preprocess text
preprocessor = TextPreprocessor()
text = "Patient experiencing severe chest pain"
processed_text = preprocessor.preprocess_text(text)

# Make prediction
prediction = models.predict(vectorizer.transform([processed_text]))
print(f"Predicted priority: {prediction}")

2. BERT Prediction

from bert_model import BertTriageClassifier

# Load BERT model
bert_model = BertTriageClassifier()
bert_model.load_model('bert_model')

# Make prediction
text = "Patient experiencing severe chest pain"
prediction, probabilities = bert_model.predict([text])
print(f"BERT prediction: {prediction[0]}")
print(f"Confidence: {max(probabilities[0]):.3f}")

3. Complete Pipeline

# Run the complete training and evaluation pipeline
from preprocessing import load_and_preprocess_data
from baseline_models import BaselineModels
from bert_model import BertTriageClassifier

# Load and preprocess data
X_train, X_test, y_train, y_test, vectorizer = load_and_preprocess_data()

# Train baseline models
baseline_models = BaselineModels()
baseline_models.train_models(X_train, y_train)
results = baseline_models.evaluate_models(X_test, y_test)

# Train BERT model
bert_model = BertTriageClassifier()
train_dataset, test_dataset = bert_model.prepare_data(X_train, y_train, X_test, y_test)
bert_model.train(train_dataset, test_dataset)
bert_results = bert_model.evaluate(test_dataset)

โš ๏ธ Important Disclaimers

Medical Use Disclaimer

This is a demonstration project using synthetic data. It should NOT be used for actual medical triage decisions without:

  • Proper validation with real medical data
  • Regulatory approval from healthcare authorities
  • Extensive testing in clinical environments
  • Integration with proper medical protocols

Data Disclaimer

  • The dataset is synthetic and generated for demonstration purposes
  • Real-world triage data would likely have different characteristics
  • Model performance on synthetic data may not reflect real-world performance

Technical Limitations

  • BERT model requires significant computational resources
  • Model accuracy depends on data quality and preprocessing
  • Performance may vary with different text styles or medical terminology

🔮 Future Enhancements

Potential Improvements

  1. Real Data Integration

    • Collect actual triage data (with proper permissions)
    • Validate models on real-world scenarios
  2. Advanced Models

    • Experiment with medical-specific BERT variants
    • Implement ensemble methods
    • Try few-shot learning approaches
  3. Enhanced Features

    • Multi-modal inputs (text + structured data)
    • Temporal features for symptom progression
    • Integration with electronic health records
  4. Deployment Improvements

    • Model versioning and A/B testing
    • Real-time monitoring and alerting
    • API development for integration
  5. User Experience

    • Mobile application
    • Voice input capabilities
    • Integration with existing healthcare systems

๐Ÿค Contributing

How to Contribute

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes: Follow the existing code style
  4. Add tests: Ensure your changes work correctly
  5. Submit a pull request: Describe your changes clearly

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add docstrings to all functions and classes
  • Include type hints where appropriate
  • Write comprehensive tests
  • Update documentation for new features

📞 Support

Getting Help

  • Issues: Report bugs or request features via GitHub issues
  • Documentation: Check this README and code comments
  • Community: Join discussions in the project repository

Common Issues

  1. Model Loading Errors

    • Error: "Could not load BERT model" or "No such file or directory: 'models/tfidf_vectorizer.joblib'"
    • Solution: Train the models first by running python train_models.py
    • Models must be trained before running the Streamlit app
  2. PyTorch CUDA Issues

    • Error: PyTorch not detecting GPU
    • Solution: Install CUDA-enabled PyTorch:
      pip uninstall torch torchvision torchaudio -y
      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    • Verify with: python -c "import torch; print(torch.cuda.is_available())"
  3. BERT Training Errors

    • Error: "accelerate not found" or "WANDB_DISABLED" warnings
    • Solution: Install the required package: pip install "accelerate>=0.26.0" (quoted so the shell does not treat > as a redirect)
    • Wandb warnings can be safely ignored
  4. Memory Issues

    • BERT requires 8GB+ RAM for training
    • Baseline models work fine with 4GB+ RAM
    • If out of memory, train only baseline models (skip BERT)
  5. Dependency Conflicts

    • Use virtual environments to isolate dependencies
    • Recommended: Create a fresh conda/venv environment

📄 License

This project is for educational and demonstration purposes. Please ensure compliance with local regulations and healthcare data protection laws if adapting for real-world use.

๐Ÿ™ Acknowledgments

  • HuggingFace: For the transformers library and pre-trained BERT models
  • scikit-learn: For comprehensive machine learning tools
  • Streamlit: For the intuitive web application framework
  • NHS: For inspiration in healthcare triage systems

Built with ❤️ for healthcare innovation and education
