This repository contains all relevant code and information to reproduce the results in the master's thesis:
Comparing Bayesian Post-estimation Variable Selection Methods: Projection Predictive Variable Selection Versus Stochastic Search Variable Selection.
We compare the performance of two Bayesian variable selection methods Projection Predictive Variable Selection (PPVS) and Stochastic Search Variable Selection (SSVS) through two simulation studies: one in a low-dimensional setting and another in a high-dimensional setting. Two empirical datasets are also included to illustrate practical application. The full thesis is provided in this repository.
├── data
│ ├── p50
│ └── simdata
├── LICENSE
├── output
│ ├── figures
│ ├── output_study1
│ │ ├── ppvs
│ │ │ ├── cvvs4suggest_size_failed
│ │ │ ├── df_out
│ │ │ ├── prj_pred
│ │ │ ├── ppvs_out
│ │ │ ├── time_taken
│ │ │ └── summ_ref
│ │ └── ssvs
│ │ │ ├── df_out
│ │ │ ├── time_taken
│ │ └── ssvs_pred_results
│ ├── output_study2
│ │ ├── ppvs
│ │ │ ├── cvvs4suggest_size_failed
│ │ │ ├── df_out
│ │ │ ├── prj_pred
│ │ │ ├── ppvs_out
│ │ │ ├── time_taken
│ │ │ └── summ_ref
│ │ └── ssvs
│ │ │ ├── df_out
│ │ │ ├── time_taken
│ │ └── ssvs_pred_results
│ ├── result_objects_study1.RData
│ └── result_objects_study2.RData
├── master_thesis_5041333.pdf
├── projpred_vs_spikeslab.Rproj
├── README.md
├── renv
├── renv.lock
└── scripts
├── data_generation.R
├── empirical_examples.qmd
├── ssvs_run.R
├── study1Analysis.qmd
├── study2Analysis.qmd
├── workflow_functions.R
└── workflow_run.R
data_generation.R: Generates the simulation datasets.- Running this without changes will create the datasets in
data/p50(Study 1) anddata/simdata(Study 2).
- Running this without changes will create the datasets in
workflow_run.R: Runs the full PPVS pipeline.ssvs_run.R: Runs the SSVS method.workflow_functions.R: Contains helper functions used in the scripts above.
Note: These scripts contain clearly annotated blocks to indicate which parts to run for Study 1 vs. Study 2. You may need to manually comment/uncomment sections depending on the study you're running.
- Simulation results will be saved in
output/output_study1oroutput/output_study2, depending on the run. - The full result folders are excluded from the repository due to size limits, but can be available upon request, and key post-processed results are available:
result_objects_study1.RDataresult_objects_study2.RData
These files contain aggregated results after post-processing and can be used to regenerate all figures without rerunning the simulations.
-
Running the full simulations, especially the PPVS method, can be extremely resource-intensive. Although personal laptops may have many CPU cores, PPVS is highly memory-demanding, which limits how many cores can be used effectively, and may cause RStudio to crash. As a result, the full pipeline can take more than two weeks to complete on a typical personal machine.
-
Additionally, the resulting .RData files for each condition and replication can be very large. This is one reason why the raw simulation output is not included in this repository—only post-processed summaries are provided.
study1Analysis.qmdandstudy2Analysis.qmd: These scripts combine post-processing and plotting. Running them will:- Load raw simulation results (if available),
- Generate post-processed result objects, and
- Create and save the corresponding plots.
- The code is modular and clearly commented. If you only want to regenerate plots from the result_objects_*.RData files, you can skip the post-processing section and run only the plotting code.
empirical_examples.qmd: Reproduces the empirical data analysis and figures.- Datasets are publicly available and loaded directly within this script.
The R environment and package versions used are managed via renv:
renv/renv.lock
Running renv::restore() will install the correct package versions to reproduce the results.
Both simulation studies and empirical datasets used in the project are approved by the Ethical Review Board of the Faculty of Social and Behavioural Sciences of Utrecht University.
- Simulation studies: FETC 24-2042
- Empirical datasets: FETC 25-1375
- The approval is valid through 12 May 2025.
This repository is licensed under the MIT License. You are free to use, modify, and distribute this work with appropriate attribution.
For any questions or issues, please contact: Zhipei (Kim) Wang via email: wzpskye@gmail.com.