This repository hosts the analysis scripts and pipeline associated with the paper published in:
Jia Zhang, Ho Yi Wong, Lingchen Liu, Lambros T Koufariotis, Scott Wood, Nadine Fitzpatrick, Jenny Quiatchon, Paul Collins, John V Pearson, Nicola Waddell, Cancer genome standards for long-read sequencing using cancer cell line mixtures, GigaScience, Volume 15, 2026, giag037, https://doi.org/10.1093/gigascience/giag037
In this study, we evaluated the performance of long-read sequencing (LRS) for detecting somatic variants across a range of tumor purities and sequencing depths, comparing results to short-read sequencing. We generated 22 whole-genome sequencing datasets from controlled mixtures of cancer and matched normal cell lines (0%–100% tumor purity). This design enabled benchmarking of LRS-based somatic variant detection under realistic scenarios.
- Base calling and read alignment for cell line mixtures
- Sequencing depth combinations
- Variant calling
- QC and purity check: Rmd file purity_and_qc.Rmd
- Construct gold standard: Rmd file gold_standard.Rmd
- Tumour purity affects SNV and indel calling: Rmd file benchmark_snv_calling.Rmd
- Tumour purity affects SV calling: Rmd file benchmark_sv_calling.Rmd
- Sequencing depth affects SNV and indel calling: Rmd file benchmark_snv_calling.Rmd
- Sequencing depth affects SV calling: Rmd file benchmark_sv_calling.Rmd
- Mutational signature analysis: Rmd file snv_mutational_signature.Rmd
- Genomic regions of variants: Rmd file genome_regions.Rmd
- Germline leakage against tumour purity and read depth: Rmd file germline_leakage.Rmd
- SV type and length in LRS: Rmd file lr_unique_length_type.Rmd
- Sequencing depth check: Rmd file depth_check.Rmd
- Methylation analysis: Rmd file methylation_analysis.Rmd
- Circos plot: Rmd file circos_plots.Rmd
- IGV check
- SNV calling intersection: Rmd file ensembl_snv.Rmd
- Align with CHM13-T2T assembly: Rmd file t2t_snv.Rmd
- Simulation from synthetic genome: Rmd file simulation.Rmd
Each section above is available as a processed Markdown (.md) file.
Clicking on the links will open web-readable pages that include
explanatory text, selected commands, plots, and tables. The underlying
code used to generate these outputs is provided in the corresponding R
Markdown (.Rmd) files. Specifically, code and input data used to
generate each figure in the manuscript are listed here.
For R environment details, see the R Session Info page.
All data required to run these notebooks are available via GigaDB. This dataset is dedicated to the public domain under the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, allowing unrestricted reuse.
Note: The BAM file is not distributed under CC0, as it contains human germline genomic information and may be subject to ethical or legal restrictions. However, other shared files, including processed results and somatic variant calls, are free to reuse without restriction.
1. Clone the repository:
git clone https://github.com/bakeronit/nanopore_celllines_benchmark.git
2. Download the processed data from GigaDB:
Click data_essential.zip to download all files needed to make the tables and figures.
3. Extract the data
Unzip the file in the same folder as the cloned repository. The files will populate the folders expected by the R Markdown analysis.
unzip data_essential.zip
4. Render the analysis notebooks
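One way to carry out the rendering step from the command line is sketched below. This is a minimal sketch and not necessarily how the authors render the notebooks: it assumes that the .Rmd files sit in a single directory, that R with the `rmarkdown` package and pandoc are installed, and that each notebook's YAML header defines its output format. The function name `render_notebooks` and the `DRY_RUN` variable are hypothetical helpers introduced for illustration.

```shell
# Render every R Markdown notebook found in a directory.
# Assumptions (not stated in this README): notebooks live directly in the
# given directory, and Rscript, the rmarkdown package, and pandoc are available.
# render_notebooks and DRY_RUN are hypothetical names used for this sketch.
render_notebooks() {
  nb_dir="${1:-.}"                      # default to the current directory
  for rmd in "$nb_dir"/*.Rmd; do
    [ -e "$rmd" ] || continue           # skip when the glob matches nothing
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only report what would be rendered; do not call R.
      echo "would render: $rmd"
    else
      # The notebook path is passed as a trailing argument to Rscript.
      Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$rmd"
    fi
  done
}
```

Setting `DRY_RUN=1` before calling `render_notebooks .` from the repository root lists the notebooks without invoking R, which can help confirm that the data were extracted next to the expected files before starting a long render.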
Notes:
- The shared data are processed outputs from the Snakemake pipeline, which requires substantial computational resources.
- Raw sequencing data for full reprocessing are available via EGA study EGAS00001008107.
⚠️ Refer to data_LICENSE.txt for licensing terms and data use conditions.
This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.
