Cancer genome standards for long-read sequencing using cancer cell line mixtures

Summary

This repository hosts the analysis scripts and pipeline associated with the paper published in:

Jia Zhang, Ho Yi Wong, Lingchen Liu, Lambros T Koufariotis, Scott Wood, Nadine Fitzpatrick, Jenny Quiatchon, Paul Collins, John V Pearson, Nicola Waddell, Cancer genome standards for long-read sequencing using cancer cell line mixtures, GigaScience, Volume 15, 2026, giag037, https://doi.org/10.1093/gigascience/giag037

In this study, we evaluated the performance of long-read sequencing (LRS) for detecting somatic variants across a range of tumor purities and sequencing depths, comparing results to short-read sequencing. We generated 22 whole-genome sequencing datasets from controlled mixtures of cancer and matched normal cell lines (0%–100% tumor purity). This design enabled benchmarking of LRS-based somatic variant detection under realistic scenarios.

Table of Contents

Construct gold standard: Rmd file gold_standard.Rmd
Tumour purity affects SNV and indel calling: Rmd file benchmark_snv_calling.Rmd
Tumour purity affects SV calling: Rmd file benchmark_sv_calling.Rmd
Sequencing depth affects SNV and indel calling: Rmd file benchmark_snv_calling.Rmd
Sequencing depth affects SV calling: Rmd file benchmark_sv_calling.Rmd
Mutational signature analysis: Rmd file snv_mutational_signature.Rmd
Genomic regions of variants: Rmd file genome_regions.Rmd
Germline leakage against tumour purity and read depth: Rmd file germline_leakage.Rmd
SV type and length in LRS: Rmd file lr_unique_length_type.Rmd

Miscellaneous

Sequencing depth check: Rmd file depth_check.Rmd
Methylation analysis: Rmd file methylation_analysis.Rmd
Circos plot: Rmd file circos_plots.Rmd
IGV check
SNV calling intersection: Rmd file ensembl_snv.Rmd
Align with CHM13-T2T assembly: Rmd file t2t_snv.Rmd
Simulation from synthetic genome: Rmd file simulation.Rmd

Reproducibility

Each section above is available as a processed Markdown (.md) file. The links open web-readable pages that include explanatory text, selected commands, plots, and tables. The underlying code used to generate these outputs is provided in the corresponding R Markdown (.Rmd) files. The code and input data used to generate each figure in the manuscript are listed here.

For R environment details, see the R Session Info page.

All data required to run these notebooks are available via GigaDB. This dataset is dedicated to the public domain under the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, allowing unrestricted reuse.

Note: The BAM file is not distributed under CC0, as it contains human germline genomic information and may be subject to ethical or legal restrictions. However, other shared files, including processed results and somatic variant calls, are free to reuse without restriction.

Getting started

1. Clone the repository:

git clone https://github.com/bakeronit/nanopore_celllines_benchmark.git

2. Download the processed data from GigaDB:

Download data_essential.zip, which contains all files needed to reproduce the tables and figures.

3. Extract the data:

Unzip the file inside the cloned repository folder. The files will populate the folders expected by the R Markdown analysis.

unzip data_essential.zip

4. Render the analysis notebooks
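
This step does not list a command. One way to render a notebook from the shell is sketched below, assuming R and the rmarkdown package are installed and Rscript is on the PATH. The helper names list_notebooks and render_notebook are hypothetical and not part of the repository. rmarkdown::render is the standard rendering function, but the authors may have rendered the notebooks differently (for example, from RStudio).

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the repository.
# Assumes R with the rmarkdown package is installed and Rscript is on PATH.

# Print every .Rmd file under the given directory (default: current directory).
list_notebooks() {
  find "${1:-.}" -type f -name '*.Rmd' | sort
}

# Render a single notebook; the path is passed to R as a trailing argument.
render_notebook() {
  local rmd="$1"
  if [ ! -f "$rmd" ]; then
    echo "No such notebook: $rmd" >&2
    return 1
  fi
  Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$rmd" < /dev/null
}
```

Running `list_notebooks .` from the repository root shows where each notebook lives, and `render_notebook` followed by one of the printed paths renders it. By default, rmarkdown::render evaluates code chunks from the notebook's own folder, so relative data paths are resolved from there.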

Notes:

The shared data are processed outputs from the Snakemake pipeline, which requires substantial computational resources.
Raw sequencing data for full reprocessing are available via EGA study EGAS00001008107.

⚠️ Refer to data_LICENSE.txt for licensing terms and data use conditions.

License

This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.
