Executive Summary

I developed a machine learning classifier that predicts fraudulent accounts with an 11% false positive rate (FPR). The model outperformed the baseline but has not yet reached the target FPR of 5%. Further model development is recommended to improve predictive ability before considering deployment.

Problem Statement

To reduce financial losses caused by synthetic identities, a detection algorithm was built to flag fraudulent personas, reducing manual decision-making and improving operational efficiency in approving account openings. The false positive rate (FPR) was the metric of interest because denied legitimate accounts (false positives) lead to a loss of trust in the company, which harms its reputation. Company reputation is valued more than company losses, so false positives should be minimized, with a target FPR of 5%.
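As a minimal illustration of the metric, the FPR is the share of legitimate accounts that get flagged as fraud: FP / (FP + TN). The sketch below assumes labels are coded 1 for fraud and 0 for legitimate; the function name and the toy labels are made up for the example and are not taken from the notebooks.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), assuming 1 = fraud (positive), 0 = legitimate."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("FPR is undefined without legitimate (negative) cases")
    return fp / (fp + tn)


# Toy labels: four legitimate accounts, two fraudulent ones.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0]

# One of the four legitimate accounts is wrongly flagged.
print(false_positive_rate(y_true, y_pred))  # 0.25
```

Note that the missed fraud case in the toy labels (a false negative) does not affect the FPR at all, which is why the metric only captures the "denied legitimate accounts" concern.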

Data Acquisition

The dataset comes from the Bank Account Fraud suite of datasets by Jesus et al. (2022), published at NeurIPS 2022. The base variant was chosen because it best represents the original data used for fraud detection. The dataset is already in a structured format with no empty values. Full details can be found on the Kaggle page.

Data Preparation

Month-based Train-test Split: As recommended by the paper, the first 6 months served as the training set and the remaining 2 months as the test set.
Imputation using Medians: The training set has no empty values; however, missing values in numeric columns are coded as "-1". As this encoding affects model estimates, these values are replaced by the column-wise median, computed while ignoring the "-1" values. The median was chosen because every column using this encoding has a skewed distribution.
Removal of Redundant Columns: Checking column-wise variability, device_fraud_count has only a single value across the entire training set. I therefore dropped this column to reduce the number of features considered by the machine learning models.
Fraud & Non-Fraud Balancing: There is a severe imbalance between fraud and non-fraud cases. To address this, I used randomized undersampling to match the number of fraud and non-fraud cases. I acknowledge the information loss this method entails, and its effects are seen later on, but it was chosen over alternatives (randomized oversampling or SMOTE) due to computational constraints.
Scaling of Continuous Columns: Lastly, with the balanced training set, numeric columns are scaled using StandardScaler() for better model performance.
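The preparation steps above can be sketched end to end on a toy table. This is an illustration under assumptions, not the code from 1_preprocessing.ipynb: the column names month, fraud_bool, credit_score, and months_at_address are stand-ins, months are assumed to be numbered 0 to 7, the training median is assumed to be reused for the test set, and undersampling is done with plain pandas sampling rather than any particular library.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the dataset; all column names here are assumptions.
rng = np.random.default_rng(0)
n = 800
df = pd.DataFrame({
    "month": np.repeat(np.arange(8), n // 8),
    "credit_score": rng.normal(130, 40, n),
    "months_at_address": rng.choice([-1.0, 5.0, 12.0, 40.0, 200.0], n),  # -1 = missing
    "device_fraud_count": 0,  # constant column
    "fraud_bool": rng.choice([0, 1], n, p=[0.9, 0.1]),
})

# 1. Month-based split: first 6 months train, last 2 months test.
train = df[df["month"] < 6].copy()
test = df[df["month"] >= 6].copy()

# 2. Replace the -1 code with the median of the observed training values.
col = "months_at_address"
median = train.loc[train[col] != -1, col].median()
for part in (train, test):
    part[col] = part[col].replace(-1, median)

# 3. Drop columns that hold a single value in the training set.
constant = [c for c in train.columns if train[c].nunique() == 1]
train = train.drop(columns=constant)
test = test.drop(columns=constant)

# 4. Randomized undersampling: keep all fraud rows, sample as many non-fraud rows.
fraud = train[train["fraud_bool"] == 1]
legit = train[train["fraud_bool"] == 0].sample(n=len(fraud), random_state=0)
balanced = pd.concat([fraud, legit]).sample(frac=1, random_state=0)  # shuffle

# 5. Fit the scaler on the balanced training set only, then apply it to both sets.
numeric = ["credit_score", "months_at_address"]
scaler = StandardScaler().fit(balanced[numeric])
X_train = scaler.transform(balanced[numeric])
X_test = scaler.transform(test[numeric])
y_train = balanced["fraud_bool"].to_numpy()
y_test = test["fraud_bool"].to_numpy()

print(constant)                               # columns dropped as constant
print(balanced["fraud_bool"].value_counts())  # equal class counts
```

Only the training set is balanced in this sketch; the test set keeps its natural class ratio so that test FPRs reflect the imbalance seen in practice.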

Model Building

Utilization of Various Algorithms: With many classifiers available in the scikit-learn package, I chose to build base models with default parameters using logistic regression, stochastic gradient descent (SGD), k-nearest neighbors (KNN), decision trees, AdaBoost, random forests, gradient boosted trees, and support vector machines (SVM). For this project, I will refer to the untuned logistic regression classifier as the reference classifier due to its simplicity and interpretability over other models.
Evaluation of Fit of Base Models: 5-fold cross-validation was done to determine the in-sample false positive rate (FPR) with AdaBoost showing the lowest median FPR.
Evaluation of Predictive Ability of Base Models: Test FPRs were also computed for the models, with gradient boosting showing the best performance. The SVM model also appears to have overfit, as indicated by the large gap in percentage points between its cross-validation performance and its test performance.
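One way to obtain a cross-validated FPR in scikit-learn is to wrap an FPR function with make_scorer and pass it to cross_val_score. The sketch below is an assumption about how this could be done, not the notebook's code: it uses synthetic data from make_classification and only three of the classifiers listed above, all with default parameters apart from a fixed random_state.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score


def fpr(y_true, y_pred):
    """False positive rate, FP / (FP + TN), with 1 as the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn)


# Synthetic stand-in for the balanced, scaled training set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

models = {
    "logistic_regression": LogisticRegression(),
    "adaboost": AdaBoostClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

cv_fpr = {}
for name, model in models.items():
    # Five FPR values per model, one per fold; the median summarizes them.
    scores = cross_val_score(model, X, y, cv=5, scoring=make_scorer(fpr))
    cv_fpr[name] = float(np.median(scores))
    print(f"{name}: median 5-fold CV FPR = {cv_fpr[name]:.3f}")
```

The same fpr function can be applied to test-set predictions, which makes the gap between cross-validation and test FPR (the overfitting signal mentioned for the SVM) directly comparable.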

Hyperparameter Tuning

Selection of Hyperparameters: Using RandomizedSearchCV, various parameter combinations were explored, and the combination with the lowest 5-fold cross-validation FPR was selected. RandomizedSearchCV was chosen for its computational efficiency, since with multiple algorithms in the project, GridSearchCV would drastically increase running time.
Addition of Voting Classifier: With the best performing tuned models, a voting ensemble classifier was also built using the predictions from each tuned model.
Evaluation of Fit of Tuned Models: AdaBoost remains the best performing model in 5-fold cross-validation.
Evaluation of Predictive Ability of Tuned Models: Gradient boosting remains the best performing model on the test set.
Comparisons between Tuned & Base Model Performance: Surprisingly, the base gradient boosted model performed better than the tuned model, although the two were close in FPR.
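The tuning and ensembling steps can be sketched as follows. This is a hedged illustration on synthetic data: the search spaces, n_iter=4, and the use of hard (majority) voting are assumptions made for the example, since the actual grids and voting mode are only in 2_modeling.ipynb.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import RandomizedSearchCV


def fpr(y_true, y_pred):
    """False positive rate, FP / (FP + TN), with 1 as the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn)


# greater_is_better=False negates the score, so the search minimizes FPR.
neg_fpr = make_scorer(fpr, greater_is_better=False)

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search spaces, chosen only to keep the example small.
search_spaces = {
    "adaboost": (
        AdaBoostClassifier(random_state=0),
        {"n_estimators": [25, 50, 100], "learning_rate": [0.1, 0.5, 1.0]},
    ),
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    ),
}

tuned = {}
best_cv_fpr = {}
for name, (model, params) in search_spaces.items():
    search = RandomizedSearchCV(
        model, params, n_iter=4, cv=5, scoring=neg_fpr, random_state=0
    )
    search.fit(X, y)
    tuned[name] = search.best_estimator_
    best_cv_fpr[name] = -search.best_score_  # undo the negation for reporting
    print(name, search.best_params_, f"CV FPR = {best_cv_fpr[name]:.3f}")

# Voting ensemble over the tuned models; it refits clones of each estimator.
ensemble = VotingClassifier(estimators=list(tuned.items()), voting="hard")
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```

Selecting purely on FPR can favor configurations that rarely predict fraud at all, so it may be worth reporting recall alongside the FPR when comparing tuned models.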

Final Model

With gradient boosted decision trees showing the best performance, this is the best learning algorithm for this project. For the parameters, I chose the tuned parameters despite the higher FPR due to the small differences that can be seen between the tuned and test metrics.
Even though the 5% target FPR was not achieved by any of the classifiers, a model was built that reduced the FPR of the reference classifier (base logistic regression) by approximately 40%.

Recommendations

Explore other Time Splits: The project uses the split from the paper by the dataset authors. Other time splits (different split points, rolling window, expanding window) could be explored.
Feature Engineering: Since the only modifications made to the columns were scaling and dummy encoding, additional features could be created, e.g. via PCA.
Use other balancing methods: Since randomized undersampling was used to balance the positive and negative cases, which discarded a large amount of information, other balancing methods could be considered.
Use other machine learning algorithms: All classifiers used are from the scikit-learn package and are ones I am familiar with. More complex models from other packages (e.g. neural networks) could be utilized.
Perform more exhaustive tuning: Since tuning used RandomizedSearchCV over only the most commonly tuned hyperparameters of each algorithm, GridSearchCV and a wider set of hyperparameters could be considered.
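As a rough sketch of the first recommendation, expanding-window and rolling-window splits can be generated from the month index alone. The helper name month_splits and its parameters are hypothetical, and months are assumed to be numbered 0 to 7; each yielded pair would then be used to filter rows into training and evaluation sets.

```python
def month_splits(months, min_train=3, test_size=1, window=None):
    """Yield (train_months, test_months) pairs in chronological order.

    window=None gives an expanding window (all earlier months are used);
    window=k gives a rolling window (only the k most recent months are used).
    """
    months = sorted(set(months))
    for end in range(min_train, len(months) - test_size + 1):
        start = 0 if window is None else max(0, end - window)
        yield months[start:end], months[end:end + test_size]


print("expanding window")
for train_months, test_months in month_splits(range(8)):
    print(train_months, "->", test_months)

print("rolling window of 3 months")
for train_months, test_months in month_splits(range(8), window=3):
    print(train_months, "->", test_months)
```

Because every evaluation month comes after all of its training months, both schemes keep the temporal ordering that the original month-based split relies on.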

Dataset Citation

Jesus, S., Pombal, J., Alves, D., Cruz, A., Saleiro, P., Ribeiro, R., Gama, J., & Bizarro, P. (2022). Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation. Advances in Neural Information Processing Systems.
