Code for the paper 'FLAIM: AIM-based synthetic data generation in the federated setting'
@inproceedings{flaim,
  author    = {Maddock, Samuel and Cormode, Graham and Maple, Carsten},
  title     = {FLAIM: AIM-based Synthetic Data Generation in the Federated Setting},
  year      = {2024},
  isbn      = {9798400704901},
  url       = {https://doi.org/10.1145/3637528.3671990},
  doi       = {10.1145/3637528.3671990},
  booktitle = {Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages     = {2165--2176},
  numpages  = {12},
  location  = {Barcelona, Spain},
  series    = {KDD '24}
}

Install the required Python environment via conda and pip:
conda create -n "flaim" python=3.9
conda activate flaim
pip install -r ./requirements.txt
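Optionally, you can verify the environment is active and uses the interpreter version this repo targets before continuing (a simple sanity check, not part of the original setup steps):
conda activate flaim
python --version    # should report Python 3.9.x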
AutoDP dependency: download the AutoDP repo here and run the following in the root of the autodp directory:
pip install .
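A minimal sketch of the full AutoDP install, assuming the repository is the autodp project by Yu-Xiang Wang on GitHub (the URL below is an assumption; prefer the link above if it differs):
git clone https://github.com/yuxiangw/autodp.git    # assumed repository URL
cd autodp
pip install .    # installs AutoDP into the active flaim environment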
All datasets are downloaded automatically via PMLB or the Synthetic Data Vault during the first run, except for the Adult, Covtype, and Marketing datasets, which require manual downloading:
- Download 'adult.csv' from here and place it under synth_fl/data/
- Download 'covtype.csv' from here and place it under synth_fl/data/
- Download 'marketing.csv' from here and place it under synth_fl/data/
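As an illustrative sketch (the ~/Downloads location and the exact names of your downloaded files are assumptions), the files can be moved into place from a terminal as follows:
mkdir -p synth_fl/data    # create the data directory if it does not exist
mv ~/Downloads/adult.csv ~/Downloads/covtype.csv ~/Downloads/marketing.csv synth_fl/data/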
To run experiments, the partitions of client data for the federated setting must first be generated. These partitions are created and stored under synth_fl/data.
Run the following command to produce non-IID partitions of the benchmark datasets:
python3.9 launcher.py --sweep-name paper/cache_answers/cache_split_answers1.json --sweep-manager-type local --sweep-backend local
and to produce the SynthFS synthetic dataset:
python3.9 launcher.py --sweep-name paper/cache_answers/cache_split_answers2.json --sweep-manager-type local --sweep-backend local
Note: this may take a while (around 10-20 minutes) and will save the client partition splits to synth_fl/data.
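As a quick sanity check (assuming the splits are written as files under synth_fl/data, as noted above), you can list the directory after both commands finish:
ls synth_fl/data    # the generated client partition splits should appear here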
Further configs for experiments are contained within sweep_configs/paper/.
These can be run locally (using 4 CPU threads) as follows:
python3.9 launcher.py --sweep-backend local --sweep-manager-type local --sweep-name paper/SWEEP_NAME --workers 4
Data from experiments is saved under slurm/job_results and slurm/sweep_results.
SWEEP_NAME should be one of the following config files contained within sweep_configs/paper:
- varying_eps - Used to produce Figure 3(a) in the main paper and Figure 6 in the Appendix.
- varying_feature_skew - Used to produce Figure 1.
- varying_local_rounds - Used to produce Figures 3(e,f) in the main paper and Figure 10 in the Appendix.
- varying_p - Used to produce Figure 3(c) in the main paper and Figure 8 in the Appendix.
- varying_t - Used to produce Figure 3(b) and Table 1 in the main paper and Figure 7 and Table 6 in the Appendix.
- varying_beta - Used to produce Figure 3(d) in the main paper and Figure 9 in the Appendix.
- baselines.json - Trains the FLAIM baseline methods, used to produce Table 1.
- communication_tracking - Used to produce Table 3 in the main paper and Table 7 in the Appendix.
- appendix_non_iid_split - Used to produce Table 5 in the Appendix.
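For example, a concrete instance of the command template above for the varying epsilon sweep (whether the .json extension must be appended to the sweep name is an assumption; match how the config file is named under sweep_configs/paper/):
python3.9 launcher.py --sweep-backend local --sweep-manager-type local --sweep-name paper/varying_eps --workers 4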
We would like to acknowledge the following code that is used by this repo:
- PMLB - For dataset loading
- Synthetic Data Vault - For additional datasets
- Private-PGM by Ryan McKenna
- AutoDP by Yu-Xiang Wang
- FLSim by Facebook Research (for running federated CTGAN examples)