bakeronit/nanopore_celllines_benchmark

Cancer genome standards for long-read sequencing using cancer cell line mixtures

Summary

This repository hosts the analysis scripts and pipeline associated with the paper published in:

Jia Zhang, Ho Yi Wong, Lingchen Liu, Lambros T Koufariotis, Scott Wood, Nadine Fitzpatrick, Jenny Quiatchon, Paul Collins, John V Pearson, Nicola Waddell, Cancer genome standards for long-read sequencing using cancer cell line mixtures, GigaScience, Volume 15, 2026, giag037, https://doi.org/10.1093/gigascience/giag037

In this study, we evaluated the performance of long-read sequencing (LRS) for detecting somatic variants across a range of tumor purities and sequencing depths, comparing results to short-read sequencing. We generated 22 whole-genome sequencing datasets from controlled mixtures of cancer and matched normal cell lines (0%–100% tumor purity). This design enabled benchmarking of LRS-based somatic variant detection under realistic scenarios.

Table of Contents

Raw data processing

Downstream analysis and plots

Miscellaneous

Reproducibility

Each section above is available as a processed Markdown (.md) file. Clicking a link opens a web-readable page with explanatory text, selected commands, plots, and tables. The code used to generate these outputs is provided in the corresponding R Markdown (.Rmd) files. The code and input data used to generate each figure in the manuscript are listed here.

For R environment details, see the R Session Info page.

All necessary data required to run these notebooks are available via GigaDB. This dataset is dedicated to the public domain under the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, allowing unrestricted reuse.

Note: The BAM file is not distributed under CC0, as it contains human germline genomic information and may be subject to ethical or legal restrictions. However, other shared files, including processed results and somatic variant calls, are free to reuse without restriction.

Getting started

1. Clone the repository:

git clone https://github.com/bakeronit/nanopore_celllines_benchmark.git

2. Download the processed data from GigaDB:

Download data_essential.zip, which contains all files needed to reproduce the tables and figures.

3. Extract the data

Unzip the file in the same folder as the cloned repository; the files will populate the folders expected by the R Markdown analysis.

unzip data_essential.zip

4. Render the analysis notebooks
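No render command is listed for this step. As a minimal sketch, assuming the notebooks are `.Rmd` files somewhere under the repository root and that R with the `rmarkdown` package is installed, they could be rendered in a loop. The helper name `render_notebooks` and the `DRY_RUN` switch are hypothetical and not part of this repository; by default the helper only prints what it would render.

```shell
# Hypothetical helper (not part of this repository): render every .Rmd file
# found under a directory with rmarkdown::render().
# DRY_RUN=1 (the default) only lists the files that would be rendered.
render_notebooks() {
  dir="${1:-.}"
  find "$dir" -type f -name '*.Rmd' | sort | while IFS= read -r rmd; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf 'would render: %s\n' "$rmd"
    else
      # Pass the path as a trailing argument so R reads it verbatim.
      Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$rmd"
    fi
  done
}
```

Running `render_notebooks .` from the repository root previews the list; `DRY_RUN=0 render_notebooks .` performs the actual rendering. Individual notebooks may need to be rendered in a particular order or from a particular working directory, so check the R Session Info page and each notebook's setup chunk first.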

Notes:

  • The shared data are processed outputs from the Snakemake pipeline, which requires substantial computational resources to run.
  • Raw sequencing data for full reprocessing are available via EGA study EGAS00001008107.

⚠️ Refer to data_LICENSE.txt for licensing terms and data use conditions.

License

This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.
