A comprehensive machine learning pipeline for predicting loan defaults using ensemble stacking with multiple feature engineering approaches and diverse model architectures.
- Multiple Feature Engineering Pipelines: 3 different feature sets (v1, v2, v3)
- Diverse Base Models: LightGBM, XGBoost, CatBoost, MLP, TabM
- Out-of-Fold (OOF) Predictions: Used as meta-features for stacking
- Level-2 Stacking Models: Trained on meta-features
- Weighted Ensemble: Final predictions from all models
.
├── data/
│ ├── processed/ # Preprocessed data (v1, v2, v3 feature sets)
│ ├── feature_selection/ # Feature selection results
│ ├── models/ # Trained models and predictions
│ ├── stacking/ # Stacking dataset with meta features
│ └── final_submission/ # Final ensemble predictions
├── main.py # Main workflow orchestrator (run complete pipeline)
├── predict_real_test.py # Standalone prediction pipeline for real test data
├── src/
│ ├── data/ # Data processing modules
│ │ ├── load_data.py # Data loading utilities
│ │ ├── clean_data.py # Data cleaning functions
│ │ ├── preprocessing.py # Preprocessing pipeline
│ │ ├── data_insights.py # Data analysis and insights
│ │ ├── prepare_data.py # Single feature set preparation
│ │ └── prepare_data_multi.py # Multiple feature sets preparation
│ ├── features/ # Feature engineering modules
│ │ ├── feature_engineering.py # Base feature engineering (v1)
│ │ ├── feature_engineering_v2.py # Aggregation-heavy features (v2)
│ │ ├── feature_engineering_v3.py # Domain-specific simple features (v3)
│ │ ├── feature_selection.py # Feature selection functions
│ │ └── run_feature_selection.py # Feature selection orchestration
│ ├── models/ # Model definitions
│ │ └── nn_models.py # Neural network architectures
│ ├── training/ # Training utilities and scripts
│ │ ├── model_training.py # Tree model training functions
│ │ ├── nn_training.py # Neural network training functions
│ │ ├── optuna_optimization.py # Optuna optimization for tree models
│ │ ├── nn_optimization.py # Optuna optimization for NN models
│ │ ├── train_models.py # Train tree models (single feature set)
│ │ ├── train_nn_models.py # Train NN models (single feature set)
│ │ └── train_all_feature_sets.py # Train all models on all feature sets
│ └── ensemble/ # Stacking and ensemble modules
│     ├── create_meta_features.py # Create meta features from OOF predictions
│     ├── train_stacking_models.py # Train Level-2 stacking models
│     ├── ensemble.py # Weighted ensemble creation
│     └── generate_final_predictions.py # Final submission generation
└── requirements.txt # Python dependencies
- Clone the repository (or navigate to the project directory)
- Create a virtual environment:
python -m venv .venv
.venv\Scripts\activate # Windows
# or
source .venv/bin/activate # Linux/Mac
- Install dependencies:
pip install -r requirements.txt
The complete pipeline consists of several phases:
python -m src.data.prepare_data_multi
This script:
- Loads raw data from the data/ directory
- Cleans and merges datasets
- Applies 3 different feature engineering pipelines:
- v1: Comprehensive features (temporal, interaction, aggregation, geographic, loan-specific, rank, advanced ratios, binned categoricals, population statistics)
- v2: Aggregation-heavy features (customer/loan aggregations, interaction aggregations)
- v3: Domain-specific simple features (simple ratios, binary flags, categorical counts)
- Preprocesses each feature set (standardization, one-hot encoding)
- Saves processed data to data/processed/v1/, data/processed/v2/, data/processed/v3/
Outputs:
- data/processed/v[1-3]/X_train.csv, X_val.csv, X_test.csv
- data/processed/v[1-3]/y_train.csv, y_val.csv, y_test.csv
- data/processed/v[1-3]/preprocessing_pipeline.pkl
- data/processed/v[1-3]/feature_names.txt
- data/processed/v[1-3]/data_report.txt
python -m src.features.run_feature_selection
This script:
- Loads preprocessed data
- Performs feature selection using:
- Zero importance filtering
- Permutation importance filtering
- Stepped permutation selection
- Forward feature selection (optional)
- Saves selected features to data/feature_selection/selected_features.txt
Note: Feature selection is performed on v1 feature set. Selected features are then applied to all feature sets.
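As a rough illustration of the permutation importance filtering step, the sketch below shuffles one column at a time and records how much accuracy drops. It uses only the standard library; the function names, the toy data, and the `predict` stand-in are assumptions for illustration and are not taken from `feature_selection.py`.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop per column when that column alone is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: the label equals column 0; column 1 is noise the model ignores.
X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [row[0] for row in X]
predict = lambda row: row[0]  # hypothetical stand-in for a trained model

importances = permutation_importance(predict, X, y)
# Keep only features whose shuffling hurts the score.
selected = [col for col, imp in enumerate(importances) if imp > 0]
```

A feature the model never reads gets an importance of exactly zero here, which is the case the zero importance filter is meant to catch.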
Tree Models:
python -m src.training.train_models
Neural Network Models:
python -m src.training.train_nn_models
These scripts:
- Load preprocessed data and selected features
- Perform Optuna hyperparameter optimization (5-fold CV)
- Train final models with best hyperparameters
- Generate Out-of-Fold (OOF) predictions
- Save models, metrics, and predictions
Outputs:
- data/models/[model_name]/model.pkl or model.pth
- data/models/[model_name]/best_params.json
- data/models/[model_name]/metrics.json
- data/models/[model_name]/oof_predictions.csv
- data/models/[model_name]/[train|val|test]_predictions.csv
- data/models/[model_name]/optuna_study.pkl
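To make the Out-of-Fold idea concrete, here is a minimal sketch in which every training row is predicted by a model that never saw that row. The contiguous, unshuffled folds and the `fit`/`predict` callables are simplifying assumptions; the training scripts may well use stratified or shuffled folds.

```python
def kfold_indices(n_rows, n_folds=5):
    """Split row indices into n_folds contiguous validation folds (no shuffling)."""
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0) for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def oof_predictions(X, y, fit, predict, n_folds=5):
    """Predict each row with a model trained on the other folds only."""
    oof = [None] * len(X)
    for val_idx in kfold_indices(len(X), n_folds):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in val_idx:
            oof[i] = predict(model, X[i])
    return oof

# Toy "model" (hypothetical): predicts the default rate seen in its training rows.
fit = lambda X_train, y_train: sum(y_train) / len(y_train)
predict = lambda model, row: model

X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y = [1, 0, 0, 0, 0]
oof = oof_predictions(X, y, fit, predict)
```

Because row 0 is the only positive, its own OOF prediction is 0.0: the model that scores it was trained without it. That absence of leakage is what makes OOF predictions usable as meta-features.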
python -m src.training.train_all_feature_sets
This script:
- Trains all 5 models (LightGBM, XGBoost, CatBoost, MLP, TabM) on all 3 feature sets
- Reuses best hyperparameters from v1 training (or optimizes if not available)
- Generates OOF predictions for each model-feature combination
- Saves results to data/models/[version]_[model]/
Outputs:
- 15 model directories (3 feature sets × 5 models)
- Each directory contains: model, metrics, predictions, OOF predictions
Resume Capability:
- Script automatically skips already completed models
- Can resume from interruptions (KeyboardInterrupt, errors)
- Checks for existing Optuna studies and continues optimization
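The skip-if-done behaviour can be pictured with the sketch below, which enumerates the 15 model-feature combinations and lists those without a completion marker. Two things are assumptions: that v1 models live in directories named after the bare model (as in the output tree of this README), and that the presence of `metrics.json` marks a finished run. The actual check in `train_all_feature_sets.py` may differ.

```python
import tempfile
from pathlib import Path

FEATURE_SETS = ["v1", "v2", "v3"]
MODELS = ["lightgbm", "xgboost", "catboost", "mlp", "tabm"]

def model_dir(root, version, model):
    """Directory for one combination (assumed naming: v1 has no version prefix)."""
    name = model if version == "v1" else f"{version}_{model}"
    return Path(root) / name

def pending_runs(root, marker="metrics.json"):
    """Combinations whose directory has no marker file yet (hypothetical check)."""
    return [
        (version, model)
        for version in FEATURE_SETS
        for model in MODELS
        if not (model_dir(root, version, model) / marker).exists()
    ]

# Demonstration in a throwaway directory: mark one run as finished.
with tempfile.TemporaryDirectory() as root:
    before = pending_runs(root)
    finished = model_dir(root, "v2", "xgboost")
    finished.mkdir(parents=True)
    (finished / "metrics.json").write_text("{}")
    after = pending_runs(root)
```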
python -m src.ensemble.create_meta_features
This script:
- Loads OOF predictions from all base models across all feature sets
- Creates aggregated meta features:
- Per model type: mean, std, min, max, last predictions
- Individual predictions from each model-feature combination
- Overall statistics across all models
- Combines base features (v1) with meta features
- Saves stacking dataset to data/stacking/
Outputs:
- data/stacking/X_train.csv, X_val.csv, X_test.csv
- data/stacking/y_train.csv, y_val.csv, y_test.csv
- data/stacking/feature_names.txt
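The aggregated meta features can be sketched as follows: for each row, the predictions of one model type across the feature sets are reduced to mean, std, min, max and last, plus overall statistics across all models. The column names are hypothetical, population standard deviation is an arbitrary choice, and "last" is assumed here to mean the prediction from the final feature set (v3).

```python
from statistics import mean, pstdev

def aggregate_meta_features(preds_by_model):
    """preds_by_model maps model type -> list of per-feature-set prediction lists."""
    n_rows = len(next(iter(preds_by_model.values()))[0])
    rows = []
    for i in range(n_rows):
        features, everything = {}, []
        for model, per_set in preds_by_model.items():
            values = [preds[i] for preds in per_set]  # one value per feature set
            everything.extend(values)
            features[f"{model}_mean"] = mean(values)
            features[f"{model}_std"] = pstdev(values)
            features[f"{model}_min"] = min(values)
            features[f"{model}_max"] = max(values)
            features[f"{model}_last"] = values[-1]  # assumed: last feature set
        features["all_mean"] = mean(everything)
        features["all_std"] = pstdev(everything)
        rows.append(features)
    return rows

# Toy OOF predictions for two rows, ordered v1, v2, v3.
oof = {
    "lightgbm": [[0.2, 0.9], [0.4, 0.8], [0.6, 0.7]],
    "mlp": [[0.1, 0.5], [0.1, 0.5], [0.1, 0.5]],
}
meta = aggregate_meta_features(oof)
```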
python -m src.ensemble.train_stacking_models
This script:
- Trains Level-2 models (LightGBM, XGBoost) on meta features
- Uses simple hyperparameters to prevent overfitting
- Generates OOF predictions for stacking models
- Saves results to data/models/stacking/[model]/
Outputs:
- data/models/stacking/[model]/model.pkl
- data/models/stacking/[model]/metrics.json
- data/models/stacking/[model]/oof_predictions.csv
- data/models/stacking/[model]/[train|val|test]_predictions.csv
python -m src.ensemble.ensemble
This script:
- Loads predictions from all base models and stacking models
- Computes ensemble weights based on validation AUC-ROC performance
- Creates weighted ensemble predictions
- Evaluates ensemble performance
- Saves ensemble results to data/final_submission/
Outputs:
- data/final_submission/ensemble_val_predictions.csv
- data/final_submission/ensemble_test_predictions.csv
- data/final_submission/ensemble_metrics.json
- data/final_submission/ensemble_report.txt
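One plausible way to turn validation AUC-ROC into ensemble weights is to normalise the AUC values so they sum to one, as sketched below. The README does not state the exact weighting formula, so the proportional scheme and all function names here are assumptions.

```python
def auc_roc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_weights(y_val, val_preds):
    """Weights proportional to each model's validation AUC (assumed scheme)."""
    aucs = {name: auc_roc(y_val, preds) for name, preds in val_preds.items()}
    total = sum(aucs.values())
    return {name: auc / total for name, auc in aucs.items()}

def blend(weights, preds):
    """Row-wise weighted average of the models' predictions."""
    n_rows = len(next(iter(preds.values())))
    return [sum(weights[m] * preds[m][i] for m in weights) for i in range(n_rows)]

y_val = [0, 0, 1, 1]
val_preds = {
    "lightgbm": [0.1, 0.2, 0.8, 0.9],  # ranks every positive above every negative
    "mlp": [0.1, 0.8, 0.2, 0.9],       # gets one of the four pairs wrong
}
weights = auc_weights(y_val, val_preds)
ensemble_val = blend(weights, val_preds)
```

The same weights would then be applied to the test predictions, which is why they need to be saved alongside the ensemble metrics.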
python -m src.ensemble.generate_final_predictions
This script:
- Compares performance of all individual models and ensemble
- Generates final submission file
- Creates comprehensive performance report
Outputs:
- data/final_submission/submission.csv
- data/final_submission/final_report.txt
python predict_real_test.py --data-dir data_test --output predictions/submission_real_test.csv
This standalone script:
- Loads real test data from a specified directory (default: data_test/)
- Applies the same data cleaning and feature engineering pipelines used in training
- Generates predictions from all trained models (15 base + 2 stacking)
- Creates weighted ensemble predictions using saved ensemble weights
- Outputs final predictions in the required format: customer_id, default
Arguments:
- --data-dir: Directory containing test data files (default: data_test)
- --output: Output path for submission file (default: predictions/submission_real_test.csv)
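For reference, a command-line interface with these two arguments and defaults could be declared with `argparse` roughly as follows. This is a sketch of the documented interface, not a copy of `predict_real_test.py`; the `build_parser` name and help strings are assumptions.

```python
import argparse

def build_parser():
    """Parser mirroring the two documented arguments and their defaults."""
    parser = argparse.ArgumentParser(description="Predict on real test data")
    parser.add_argument("--data-dir", default="data_test",
                        help="Directory containing test data files")
    parser.add_argument("--output", default="predictions/submission_real_test.csv",
                        help="Output path for the submission file")
    return parser

# With no flags, the documented defaults apply.
defaults = build_parser().parse_args([])
# Flags override them; argparse exposes --data-dir as the attribute data_dir.
custom = build_parser().parse_args(["--data-dir", "my_data", "--output", "out.csv"])
```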
Test Data Requirements:
Place your test data files in the specified directory with the same structure as training data:
data_test/
├── application_metadata.csv
├── credit_history.parquet
├── demographics.csv
├── financial_ratios.jsonl
├── geographic_data.xml
└── loan_details.xlsx
Note: The test data should NOT include the default column (target variable).
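Before running the prediction script, it can help to confirm that all six files are present. The check below is a small standalone sketch; `missing_files` is a hypothetical helper and is not part of the repository.

```python
import tempfile
from pathlib import Path

# File names as listed in the directory layout above.
REQUIRED_FILES = [
    "application_metadata.csv",
    "credit_history.parquet",
    "demographics.csv",
    "financial_ratios.jsonl",
    "geographic_data.xml",
    "loan_details.xlsx",
]

def missing_files(data_dir):
    """Return the required file names that are absent from data_dir."""
    return [name for name in REQUIRED_FILES if not (Path(data_dir) / name).is_file()]

# Demonstration: a directory holding only one of the six files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demographics.csv").write_text("customer_id\n")
    missing = missing_files(tmp)
```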
What the Script Does:
- Data Loading: Loads all test data files and merges them on customer_id
- Data Cleaning: Applies the same cleaning pipeline as training data
- Feature Engineering: Generates features for all 3 versions (v1, v2, v3)
- Preprocessing: Loads saved preprocessing pipelines and transforms test data
- Model Loading: Loads all 17 trained models:
- 15 base models (v1/v2/v3 × lightgbm/xgboost/catboost/mlp/tabm)
- 2 stacking models (lightgbm/xgboost)
- Prediction Generation:
- Generates predictions from all base models
- Creates stacking meta-features from base predictions
- Generates predictions from stacking models
- Ensemble: Creates weighted ensemble using saved weights from training
- Output: Saves final predictions as CSV with columns: customer_id, default
Outputs:
- predictions/submission_real_test.csv (or specified output path)
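The submission layout is a two-column CSV. A minimal writer using the standard `csv` module might look like the sketch below. Whether `default` holds a probability or a 0/1 label is not stated here, so the example values are illustrative only, and `write_submission` is a hypothetical helper.

```python
import csv
import io

def write_submission(handle, customer_ids, predictions):
    """Write a header row followed by one customer_id, default pair per line."""
    writer = csv.writer(handle)
    writer.writerow(["customer_id", "default"])
    for customer_id, prediction in zip(customer_ids, predictions):
        writer.writerow([customer_id, prediction])

# Write to an in-memory buffer; a real run would open the --output path instead.
buffer = io.StringIO()
write_submission(buffer, ["C001", "C002"], [0.12, 0.87])
lines = buffer.getvalue().splitlines()
```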
Example Usage:
# Use default data directory and output path
python predict_real_test.py
# Specify custom data directory and output path
python predict_real_test.py --data-dir /path/to/test/data --output /path/to/output/submission.csv
Important Notes:
- Ensure all models are trained and saved before running this script
- The script automatically handles feature alignment with training data
- Missing features are filled with zeros, extra features are removed
- Stacking features are automatically created from base model predictions
- Ensemble weights are loaded from data/final_submission/ensemble_metrics.json
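The feature alignment rule (missing features filled with zeros, extra features removed) amounts to re-indexing each test row by the training feature list. The sketch below shows the idea with plain dictionaries; the feature names are made up for illustration.

```python
def align_features(row, training_features):
    """Values in training order; missing features become 0.0, extras are dropped."""
    return [row.get(name, 0.0) for name in training_features]

training_features = ["income", "loan_amount", "age"]  # hypothetical names
test_row = {"age": 41, "income": 52000.0, "unseen_flag": 1}

aligned = align_features(test_row, training_features)
# loan_amount is absent from the test row, so it is zero-filled;
# unseen_flag was not seen in training, so it is dropped.
```

With pandas, `DataFrame.reindex(columns=training_features, fill_value=0)` has the same effect on a whole table.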
Execute all steps in sequence with a single command:
python main.py
This will run the complete pipeline:
- Data preparation (multiple feature sets)
- Feature selection (optional)
- Model training (all models on all feature sets)
- Meta features creation
- Stacking models training
- Ensemble creation
- Final predictions generation
Skip Optional Steps:
# Skip feature selection (optional step)
python main.py --skip-feature-selection
# Resume from a specific step (e.g., skip data prep and training)
python main.py --skip-data-prep --skip-training
See all options:
python main.py --help
If you prefer to run each step manually:
# Step 1: Prepare multiple feature sets
python -m src.data.prepare_data_multi
# Step 2: (Optional) Feature selection
python -m src.features.run_feature_selection
# Step 3: Train all models on all feature sets
python -m src.training.train_all_feature_sets
# Step 4: Create meta features
python -m src.ensemble.create_meta_features
# Step 5: Train stacking models
python -m src.ensemble.train_stacking_models
# Step 6: Create weighted ensemble
python -m src.ensemble.ensemble
# Step 7: Generate final submission
python -m src.ensemble.generate_final_predictions
All training scripts support resuming from interruptions:
- Completed models: Automatically skipped
- Incomplete Optuna studies: Resumed with remaining trials
- Partial training: Continues from last checkpoint
Tree Models:
- LightGBM: Gradient boosting with leaf-wise tree growth
- XGBoost: Gradient boosting with regularization
- CatBoost: Gradient boosting with categorical handling
Neural Networks:
- MLP: Multi-layer perceptron with BatchNorm, Dropout, residual connections
- TabM: Tabular model with multiplicative interactions and feature embeddings
All models are evaluated using:
- AUC-ROC (primary metric)
- Log Loss
- Accuracy
- Precision
- Recall
- F1 Score
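The threshold-based metrics and Log Loss in this list can be computed from labels and predicted probabilities as in the sketch below. The 0.5 threshold and the clipping constant are assumptions, and AUC-ROC is left out to keep the example short.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5, eps=1e-15):
    """Accuracy, precision, recall, F1 and log loss for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Clip probabilities so log(0) cannot occur.
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
    ) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]
metrics = classification_metrics(y_true, y_prob)
```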
- Framework: Optuna
- Strategy: K-fold cross-validation (5 folds)
- Trials: 10 for tree models, 5 for neural networks
- Metric: AUC-ROC (maximized)
Place your raw training data files in the data/ directory with the following structure:
data/
├── application_metadata.csv
├── credit_history.parquet
├── demographics.csv
├── financial_ratios.jsonl
├── geographic_data.xml
└── loan_details.xlsx
Required columns:
- customer_id: Customer identifier (will be normalized)
- default: Target variable (binary: 0/1)
For generating predictions on new test data, place files in a separate directory (e.g., data_test/) with the same structure:
data_test/
├── application_metadata.csv
├── credit_history.parquet
├── demographics.csv
├── financial_ratios.jsonl
├── geographic_data.xml
└── loan_details.xlsx
Required columns:
- customer_id: Customer identifier (will be normalized)
- Note: Test data should NOT include the default column
data/
├── processed/
│ ├── v1/ # Feature set v1
│ ├── v2/ # Feature set v2
│ └── v3/ # Feature set v3
├── feature_selection/
│ └── selected_features.txt
├── models/
│ ├── lightgbm/ # v1 LightGBM
│ ├── xgboost/ # v1 XGBoost
│ ├── catboost/ # v1 CatBoost
│ ├── mlp/ # v1 MLP
│ ├── tabm/ # v1 TabM
│ ├── v2_lightgbm/ # v2 LightGBM
│ ├── ... # All other combinations
│ └── stacking/ # Level-2 stacking models
├── stacking/ # Stacking dataset
└── final_submission/ # Final predictions
Error: "No models loaded!"
- Ensure all models are trained first by running the complete training pipeline
- Check that model files exist in the data/models/ directories
Error: "Feature mismatch"
- The script automatically handles feature alignment
- Missing features are filled with zeros
- Extra features are removed
- If issues persist, ensure preprocessing pipelines are saved correctly
Error: "Stacking features mismatch"
- Ensure base model predictions are generated successfully
- Check that data/stacking/feature_names.txt exists
- Verify that meta-features are created correctly
Simply re-run the training script:
python -m src.training.train_all_feature_sets
The script will automatically:
- Skip completed models
- Resume incomplete Optuna studies
- Continue from where it left off
If you encounter memory issues:
- Reduce n_trials in optimization scripts
- Reduce n_folds (though 5 is recommended)
- Process feature sets sequentially instead of in parallel
- Use GPU: Neural network training benefits significantly from GPU acceleration
- Parallel Processing: Optuna can use multiple workers (set the n_jobs parameter)
- Early Stopping: All models use early stopping to prevent overfitting
- Feature Selection: Reduces training time and memory usage
Each model saves comprehensive metrics:
- Cross-validation metrics (5-fold CV)
- Validation set metrics
- Test set metrics
- Best hyperparameters
- Optuna optimization history
View metrics:
import json
with open('data/models/lightgbm/metrics.json', 'r') as f:
    metrics = json.load(f)
print(f"AUC-ROC: {metrics['test_metrics']['auc_roc']:.6f}")
# Complete training pipeline
python main.py
# Or step by step:
python -m src.data.prepare_data_multi # Prepare data
python -m src.training.train_all_feature_sets # Train models
python -m src.ensemble.create_meta_features # Create meta features
python -m src.ensemble.train_stacking_models # Train stacking
python -m src.ensemble.ensemble # Create ensemble
# Generate predictions for real test data
python predict_real_test.py --data-dir data_test --output predictions/submission.csv
[Add your license here]
[Add contact information here]