This repository contains all code and notebooks used to generate the figures in the RAD51D/XRCC2 SGE paper (Casadei et al. 2026).
The required supplementary data tables are included and their paths are already set in all notebooks, so cloning this repository is enough to regenerate the notebook figures. For figures generated through Python scripts, update the paths to match your local machine.
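For the Python scripts, one way to keep path edits in one place is to resolve every file relative to a single root variable. The sketch below is illustrative only: `REPO_ROOT`, `data_path`, and the placeholder clone location are hypothetical names, and it assumes the `supp_table_inputs` sub-directory sits directly under the data folder, which may not match the actual layout.

```python
from pathlib import Path

# Hypothetical placeholder: point this at your local clone of the repository.
REPO_ROOT = Path.home() / "path" / "to" / "clone"


def data_path(*parts, root=REPO_ROOT):
    """Build a path under <root>/data from folder and file names."""
    return Path(root).joinpath("data", *parts)


# Assumed layout: data/supp_table_inputs/<file>. Adjust if your clone differs.
scores_file = data_path("supp_table_inputs", "20260407_RAD51Dallscores.tsv")
print(scores_file)
```

With this pattern, only `REPO_ROOT` changes between machines.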
The data folder contains sub-directories holding all supplementary tables and the data needed to generate figures. The data used to recompile the provided supplementary data tables is also included.
Contains all external data used for analysis and figure generation.
- BARD1_SGE_data.xlsx - BARD1 SGE data from Woo et al. 2026 (https://doi.org/10.1101/2025.11.03.25339440)
- BRCA1_SGE_data.xlsx - BRCA1 SGE data from Findlay et al. 2018 (https://doi.org/10.1038/s41586-018-0461-z) and Dace et al. 2025 (https://doi.org/10.1101/2025.08.11.25333423)
- Darrah_RAD51D_supplementarytables.xlsx - Supplementary tables from Darrah et al. 2025 (PMID: 41542474)
- RAD51C_SGE_data.csv - RAD51C SGE data from Olvera-Leon et al. 2024 (https://doi.org/10.1016/j.cell.2024.08.039)
- RAD51D_XRCC2_InterestingResidues.xlsx - Curated list of potentially biochemically active residues in RAD51D and XRCC2 from Rawal et al. 2023 (https://doi.org/10.1038/s41586-023-06219-w) and Greenhough et al. 2023 (https://doi.org/10.1038/s41586-023-06179-1)
- VHL_SGE_data.xlsx - VHL SGE data from Buckley et al. 2024 (https://doi.org/10.1038/s41588-024-01800-z)
The case_control_data sub-folder contains data accessed from the BRIDGES and CARRIERS breast cancer case-control studies. Data for each study is organized into its respective folder:
- BRIDGES_data - Contains all data files from the BRIDGES study
  - 20250815_BRIDGES_missense_all.xlsx - All missense variants sequenced in the BRIDGES study
  - 20250815_BRIDGES_missense_population.xlsx - Missense variants only from population-based studies that were part of the BRIDGES study
  - 20250815_BRIDGES_PTVs_all.xlsx - All PTVs sequenced in the BRIDGES study
  - 20250815_BRIDGES_PTVs_pop.xlsx - PTVs only from population-based studies that were part of the BRIDGES study
- CARRIERS_data - Contains the data file from the CARRIERS study
  - 20250303_CARRIERS_data.xlsx - Contains all variants sequenced in the CARRIERS case-control study
Contains all final supplementary tables. Includes final tables for figure generation and key residue analysis.
- supplementary_file_1_RAD51D_SGE_final_table_20260407.xlsx - Final score file for RAD51D. Contains additional tabs with orthogonal data and metadata (i.e. variant counts and editing rates)
- supplementary_file_1_XRCC2_SGE_final_table_20260122.xlsx - Analogous final score file for XRCC2.
- supplementary_file_RAD51D_XRCC2_keyresis_20260407.xlsx - Residues from RAD51D_XRCC2_InterestingResidues.xlsx merged with final SGE scores
Files used to create the RAD51D and XRCC2 final supplementary tables are found in the supp_table_inputs sub-directory. Files are dated based on when they were originally created/when data was accessed.
RAD51D files:
- 20251106_RAD51D_RegeneronMAF.csv - MAF data from Regeneron's Million Exome study for RAD51D (accessed 2025/11/06)
- 20251106_RAD51D_gnomAD_v4.1.0.csv - MAF data for RAD51D SNVs accessed from gnomAD v4.1.0 (accessed 2025/11/06)
- 20251107_RAD51D_PhyloP.xlsx - PhyloP scores across RAD51D
- 20260102_RAD51Dsnvs_VEP.xlsx - Ensembl VEP annotated SNV file for RAD51D
- 20260102_RAD51Dsnvscores.vcf - .VCF file input for Ensembl VEP annotation of RAD51D SNVs
- 20260102_RAD51Dvep.txt - Raw VEP output for RAD51D SNVs
- 20260407_RAD51D.editrates.tsv - Editing rates generating useable reads for each RAD51D SGE target and replicate
- 20260407_RAD51Dallscores.tsv - Raw RAD51D score file
- 20260407_RAD51Ddelcounts.tsv - Raw counts for RAD51D deletions
- 20260407_RAD51Dmodelparams.tsv - Output parameters from GMM modeling for RAD51D
- 20260407_RAD51Dsnvcounts.tsv - Raw counts for RAD51D SNVs
- RAD51D_unpublished.json - Points thresholds for running ExCALIBR (PMID: 40654914) calibration on RAD51D SGE data
XRCC2 files:
- 20251107_XRCC2_RegeneronMAF.csv - MAF data from Regeneron's Million Exome study for XRCC2 (accessed 2025/11/07)
- 20251107_XRCC2_gnomAD_v4.1.0.csv - MAF data for XRCC2 SNVs accessed from gnomAD v4.1.0 (accessed 2025/11/07)
- 20251107_XRCC2_PhyloP.xlsx - PhyloP scores across XRCC2
- 20251107_XRCC2.editrates.tsv - Editing rates generating useable reads for each XRCC2 SGE target and replicate
- 20251202_XRCC2allscores.tsv - Raw XRCC2 score file
- 20251202_XRCC2delcounts.tsv - Raw counts for XRCC2 deletions
- 20251202_XRCC2modelparams.tsv - Output parameters from GMM modeling for XRCC2
- 20251202_XRCC2snvcounts.tsv - Raw counts for XRCC2 SNVs
- 20260102_XRCC2snvs_VEP.xlsx - Ensembl VEP annotated SNV file for XRCC2
- 20260102_XRCC2snvscores.vcf - .VCF file input for Ensembl VEP annotation of XRCC2 SNVs
- 20260102_XRCC2vep.txt - Raw VEP output for XRCC2 SNVs
- XRCC2_unpublished.json - Points thresholds for running ExCALIBR (PMID: 40654914) calibration on XRCC2 SGE data
Shared files:
- 20251231_DX2_OrthogonalData.xlsx - Curated list of variants previously assayed in orthogonal assays
- 20260101_SGEsubset.xlsx - Subset of SGE data from the annotated dataframe used in https://doi.org/10.64898/2026.02.14.705848
The Notebooks folder contains Python notebooks that create the individual panels used in the final figures, as well as the supplementary tables. Each notebook's name indicates the visualization it creates. Notebook code is annotated, and each notebook contains a Markdown header describing the figure or table it produces.
These notebooks are:
- BCDX2_MakeVCF - Generates .VCF file used as the input to Ensembl's VEP tool to get variant effect predictor annotations for AlphaMissense, REVEL, CADD, and SpliceAI.
- BCDX2_CalibrationFig - Generates plot highlighting number of variants in each evidence points bin after ExCALIBR calibration (Fig. 5e-f and Extended Data Fig. 11)
- BCDX2_ClinVar_analysis - Generates strip plots and ROC-AUC plots for benchmarking generated SGE data against cataloged ClinVar variants (Fig. 5a-d)
- BCDX2_Correlation_analysis - Pearson r correlation heatmap of counts (Extended Data Fig. 1c & f)
- BCDX2_Darrah_Heatmap - Heatmap of MAVE scores from Darrah et al. 2025 for biochemically interesting variants (Extended Data Fig. 8)
- BCDX2_EditRate_BarPlot - Bar plots displaying proportion of usable reads (Extended Data Fig. 1a & d)
- BCDX2_InteractingResidues - Builds heatmaps at biochemically key RAD51D and XRCC2 residues (Fig. 4e, Extended Data Fig. 8a)
- BCDX2_MakeFinalDataTable - Builds the RAD51D and XRCC2 final supplementary tables, which are the required input for all figure-generating notebooks.
- BCDX2_OrthogonalAnalysis - Strip plots comparing variant function in orthogonal biochemical assays performed by Darrah et al. 2025 to SGE fitness scores (Fig. 2c-d)
- BCDX2_RAD51D_XRCC2Heatmap - Stacked amino acid-level heatmap for RAD51D and XRCC2 (Fig. 2a-b)
- BCDX2_ScoresAcrossGene - Scatter plot of fitness scores across the coding sequence of the gene (Extended Data Fig. 2)
- BCDX2_StackedHistos - Stacked histogram and strip plots (Fig. 1d-h)
- BCDX2_VEPs_vs_SGE - Scatter plot of VEP scores vs. fitness score (Extended Data Fig. 4)
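To illustrate the kind of file BCDX2_MakeVCF produces for Ensembl VEP, here is a minimal sketch of writing SNVs in VCF format. It is not the notebook's code: `write_minimal_vcf` is a hypothetical helper, the coordinates are made-up placeholders rather than real RAD51D or XRCC2 variants, and the notebook may set additional header lines or fields.

```python
import os
import tempfile


def write_minimal_vcf(variants, out_path):
    """Write (chrom, pos, ref, alt) tuples as a minimal VCF file.

    Only the fileformat line and the eight fixed VCF columns are written;
    ID, QUAL, FILTER, and INFO are left as missing values (".").
    """
    header = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by chromosome label, then position, so records are ordered.
    ordered = sorted(variants, key=lambda v: (v[0], v[1]))
    rows = [f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t."
            for chrom, pos, ref, alt in ordered]
    with open(out_path, "w") as handle:
        handle.write("\n".join(header + rows) + "\n")
    return len(rows)


# Placeholder variants for illustration only (not real data).
example_variants = [("17", 200, "G", "A"), ("17", 100, "C", "T")]
out_file = os.path.join(tempfile.gettempdir(), "example_snvs.vcf")
n_written = write_minimal_vcf(example_variants, out_file)
print(f"Wrote {n_written} variants to {out_file}")
```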
The Scripts folder contains scripts used to generate figures in PyMOL and ChimeraX. Scripts are labeled by the figure that will be generated and code is annotated.
These scripts are:
- BCDX2_colorChimeraX_MIS_only - Colors ribbon cartoon protein structures using missense SGE scores in ChimeraX.
- BCDX2_RAD51C_PyMOL - Generates colored ribbon cartoon or surface for RAD51C using data from Olvera-Leon et al. 2024 in PyMOL.
- BCDX2_RAD51C_ChimeraX - Generates the same RAD51C figure as BCDX2_RAD51C_PyMOL, but in ChimeraX.
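As a rough illustration of score-based structure coloring, the sketch below turns per-residue scores into ChimeraX `color` commands that could be saved to a command file. It is an assumption-laden example, not the scripts' actual logic: `score_to_hex`, `chimerax_color_commands`, the red-to-white gradient, the score range of -1.0 to 0.0, the chain ID, and the residue numbers are all hypothetical.

```python
def score_to_hex(score, low=-1.0, high=0.0):
    """Linearly map a score to a hex color from red (low) to white (high).

    The default range is a hypothetical choice; scores outside it are clamped.
    """
    frac = min(max((score - low) / (high - low), 0.0), 1.0)
    level = round(255 * frac)
    return f"#ff{level:02x}{level:02x}"


def chimerax_color_commands(residue_scores, chain="A"):
    """Return one ChimeraX 'color' command per residue, in residue order."""
    return [
        f"color /{chain}:{resnum} {score_to_hex(score)}"
        for resnum, score in sorted(residue_scores.items())
    ]


# Placeholder residue numbers and scores for illustration only.
example_scores = {25: 0.0, 10: -1.0}
for command in chimerax_color_commands(example_scores):
    print(command)
```

The printed lines can be pasted into the ChimeraX command line or written to a file and opened in ChimeraX. Check the residue specifiers against the chain IDs in your structure first.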