A modular pipeline for genomic data analysis, covering next-generation sequencing (NGS) data processing, multi-omics analysis, and result visualization. The project implements reproducible workflows for DNA-seq, RNA-seq, single-cell RNA-seq, and ChIP-seq analysis, built on widely used bioinformatics tools.
- Overview
- Features
- Technologies Used
- Project Structure
- Installation
- Usage
- Workflows
- Visualizations
- Examples
- Contributing
- License
- Contact
This pipeline processes and analyzes several types of next-generation sequencing (NGS) data: DNA-seq, RNA-seq, single-cell RNA-seq, and ChIP-seq. It is modular and designed to scale, running on HPC (high-performance computing) clusters and in the cloud, with support for parallel and distributed processing.
Each processing step, from initial quality control to final result visualization, uses established tools and follows common bioinformatics best practices. All workflows are implemented in workflow management systems (Nextflow, Snakemake, and CWL), which supports reproducibility and portability.
Supported analyses:
- DNA-seq: Variant calling (SNPs, indels, CNVs, SVs), functional annotation, impact analysis
- RNA-seq: Gene expression quantification, differential expression analysis, alternative splicing
- Single-cell RNA-seq: Cell clustering, differentiation trajectories, marker identification
- ChIP-seq: Peak identification, motif analysis, integration with expression data

Workflow management:
- Implementation in multiple workflow systems (Nextflow, Snakemake, CWL)
- Complete data provenance tracking
- Reproducibility through containerization (Docker, Singularity)
- Support for execution in HPC and cloud environments (AWS, GCP, Azure)

Machine learning:
- Deep learning models for phenotype prediction
- Genome-wide association studies (GWAS)
- Multi-omic integration via machine learning techniques
- Selection of relevant biological features

Visualization:
- Interactive dashboards with R Shiny
- Genomic visualizations with IGV.js
- Circular plots with Circos
- Heatmaps, PCA, t-SNE, UMAP for exploratory analysis
Languages:
- R: Statistical analysis, visualization, Bioconductor packages
- Python: Data processing, machine learning, pipelines
- Bash: Automation and integration scripts
- Nextflow/Groovy: Main workflow definitions
- CWL/YAML: Alternative workflow definitions

Libraries and frameworks:
- Bioconductor: DESeq2, edgeR, limma, GenomicRanges
- scikit-learn/TensorFlow/PyTorch: Machine learning models
- Scanpy/Seurat: Single-cell data analysis
- Biopython/BioPerl: Sequence processing

Bioinformatics tools:
- BWA/Bowtie2/STAR: Sequence alignment
- GATK/FreeBayes/Strelka2: Variant calling
- Salmon/Kallisto: RNA quantification
- MACS2/HOMER: ChIP-seq analysis
- VEP/SnpEff/ANNOVAR: Variant annotation

Infrastructure:
- Docker/Singularity: Containerization
- Kubernetes: Container orchestration
- AWS Batch/GCP/Azure: Cloud computing
- Slurm/PBS/SGE: HPC job management
genomic-data-analysis-pipeline/
├── src/
│   ├── preprocessing/      # Preprocessing and QC modules
│   ├── alignment/          # Alignment modules
│   ├── variant_calling/    # Variant calling modules
│   ├── annotation/         # Annotation modules
│   ├── visualization/      # Visualization modules
│   └── workflows/          # Workflow definitions
├── scripts/                # Utility scripts
├── workflows/
│   ├── nextflow/           # Nextflow workflows
│   ├── snakemake/          # Snakemake workflows
│   └── cwl/                # CWL workflows
├── containers/             # Container definitions
├── config/                 # Configuration files
├── data/                   # Example data
├── docs/                   # Documentation
├── results/                # Directory for results
├── tests/                  # Automated tests
├── environment.yml         # Conda environment
├── nextflow.config         # Nextflow configuration
├── Snakefile               # Main Snakemake file
└── README.md               # This file
Prerequisites:
- Git
- Conda/Miniconda
- Docker or Singularity (optional, but recommended)
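- Java 8+ (for Nextflow; recent Nextflow releases require a newer Java, so check the requirement for the version you install)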
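
Before installing, it can help to confirm that the prerequisites above are on your `PATH`. The following is a minimal sketch, not part of the repository; the command names in `REQUIRED_TOOLS` and `OPTIONAL_TOOLS` are assumptions about how the tools are installed on your system.

```python
import shutil

# Command names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["git", "conda", "java"]
OPTIONAL_TOOLS = ["docker", "singularity"]


def missing_tools(tools):
    """Return the entries of `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("All required tools found.")
    # Containers are optional, so only report when neither runtime is present.
    if len(missing_tools(OPTIONAL_TOOLS)) == len(OPTIONAL_TOOLS):
        print("Neither Docker nor Singularity found (optional, but recommended).")
```

The check only tests that a command exists; it does not verify versions.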
# Clone the repository
git clone https://github.com/galafis/genomic-data-analysis-pipeline.git
cd genomic-data-analysis-pipeline
# Create and activate the Conda environment
conda env create -f environment.yml
conda activate genomic-pipeline
# Install Nextflow (the installer creates a ./nextflow launcher; move it to a directory on your PATH)
curl -s https://get.nextflow.io | bash
# Pull the Docker image
docker pull galafis/genomic-pipeline:latest
# Run the container
docker run -it -v "$(pwd)":/data galafis/genomic-pipeline:latest
Running with Nextflow:
# DNA-seq workflow
nextflow run workflows/nextflow/dna_seq.nf \
--reads "data/samples/*/fastq/*.fastq.gz" \
--genome "data/reference/genome.fa" \
--outdir "results/dna_seq"
# RNA-seq workflow
nextflow run workflows/nextflow/rna_seq.nf \
--reads "data/samples/*/fastq/*.fastq.gz" \
--genome "data/reference/genome.fa" \
--annotation "data/reference/genes.gtf" \
--outdir "results/rna_seq"
# Single-cell RNA-seq workflow
nextflow run workflows/nextflow/scrna_seq.nf \
--reads "data/samples/*/fastq/*.fastq.gz" \
--genome "data/reference/genome.fa" \
--annotation "data/reference/genes.gtf" \
--outdir "results/scrna_seq"
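
The `--reads` pattern in the commands above expects one directory per sample, each with a `fastq/` subdirectory. As a rough sketch (a hypothetical helper, not part of the pipeline), the snippet below lists which files that pattern would pick up for each sample, which is useful for checking the layout before a run.

```python
from collections import defaultdict
from pathlib import Path


def fastq_by_sample(samples_dir):
    """Group FASTQ files by sample, mirroring the glob <samples_dir>/*/fastq/*.fastq.gz."""
    grouped = defaultdict(list)
    for path in sorted(Path(samples_dir).glob("*/fastq/*.fastq.gz")):
        sample = path.parent.parent.name  # layout: <sample>/fastq/<file>
        grouped[sample].append(path.name)
    return dict(grouped)


if __name__ == "__main__":
    # "data/samples" matches the example commands; an empty result means no files matched.
    for sample, files in fastq_by_sample("data/samples").items():
        print(f"{sample}: {len(files)} file(s): {', '.join(files)}")
```

Whether the workflows treat the files as single-end or paired-end reads is not stated here, so the helper only groups files and does not try to pair them.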
Running with Snakemake:
# DNA-seq workflow
snakemake --configfile config/dna_seq_config.yaml --cores 8
# RNA-seq workflow
snakemake --configfile config/rna_seq_config.yaml --cores 8
# ChIP-seq workflow
snakemake --configfile config/chip_seq_config.yaml --cores 8
Running on HPC or in the cloud:
# Running on a Slurm cluster
nextflow run workflows/nextflow/dna_seq.nf \
-profile slurm \
--reads "data/samples/*/fastq/*.fastq.gz" \
--genome "data/reference/genome.fa" \
--outdir "results/dna_seq"
# Running on AWS
nextflow run workflows/nextflow/dna_seq.nf \
-profile aws \
--reads "s3://my-bucket/samples/*/fastq/*.fastq.gz" \
--genome "s3://my-bucket/reference/genome.fa" \
--outdir "s3://my-bucket/results/dna_seq"
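
The commands above mix two kinds of options: single-dash options such as `-profile` are read by Nextflow itself, while double-dash options such as `--reads` are passed to the workflow as parameters. If you launch runs from a script, a small helper can keep the two apart. This is a sketch only; `nextflow_command` is a hypothetical function, not something the repository provides.

```python
import shlex


def nextflow_command(script, params, profile=None):
    """Build a `nextflow run` argument list.

    Single-dash options (e.g. -profile) are consumed by Nextflow itself;
    double-dash options (e.g. --reads) become workflow parameters.
    """
    cmd = ["nextflow", "run", script]
    if profile:
        cmd += ["-profile", profile]
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = nextflow_command(
        "workflows/nextflow/dna_seq.nf",
        {
            "reads": "data/samples/*/fastq/*.fastq.gz",
            "genome": "data/reference/genome.fa",
            "outdir": "results/dna_seq",
        },
        profile="slurm",
    )
    # shlex.join quotes the glob so a shell would pass it to Nextflow unexpanded.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; because no shell is involved in that case, the glob reaches Nextflow unchanged.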
DNA-seq workflow:
- Quality control (FastQC)
- Adapter trimming (Trimmomatic/fastp)
- Alignment to reference genome (BWA-MEM)
- Alignment processing (SAMtools, Picard)
- Variant calling (GATK HaplotypeCaller, FreeBayes)
- Variant annotation (VEP, SnpEff)
- Functional impact analysis
- Visualization and reporting
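
To make the variant-calling output more concrete, here is a minimal sketch of how a record in a VCF file can be classified as a SNP or an indel from its REF and ALT columns. It is an illustration with made-up records, not the pipeline's annotation logic; real analyses should use a dedicated VCF library or the annotation tools listed above.

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT allele pair from a VCF record."""
    if alt == ".":
        return "no_variant"  # site without an alternate allele
    if alt == "*" or alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic"  # e.g. <DEL>, breakends, spanning deletion
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_vcf_line(line):
    """Yield (chrom, pos, ref, alt, kind) for each ALT allele on a VCF data line."""
    # The first five VCF columns are CHROM, POS, ID, REF, ALT.
    chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
    for alt in alts.split(","):
        yield chrom, int(pos), ref, alt, classify_variant(ref, alt)


if __name__ == "__main__":
    example_lines = [  # made-up records, for illustration only
        "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
        "chr1\t22222\t.\tAT\tA\t60\tPASS\t.",
        "chr2\t33333\t.\tC\tCGG,T\t45\tPASS\t.",
    ]
    for line in example_lines:
        for record in parse_vcf_line(line):
            print(record)
```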
RNA-seq workflow:
- Quality control (FastQC)
- Adapter trimming (Trimmomatic/fastp)
- Alignment to genome/transcriptome (STAR, Salmon)
- Gene expression quantification
- Differential expression analysis (DESeq2, edgeR)
- Functional enrichment analysis (GO, KEGG)
- Visualization and reporting
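
As a rough illustration of what the quantification and differential expression steps compare, the sketch below scales raw counts to counts per million (CPM) and computes a per-gene log2 fold change. This is deliberately simplified and is not what DESeq2 or edgeR do: those tools fit count models with dispersion estimates and report statistical significance. The gene names and counts are made up.

```python
import math


def cpm(counts):
    """Scale raw read counts to counts per million (CPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with zero total counts")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2(treated / control) on CPM values.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    t, c = cpm(treated), cpm(control)
    return {
        gene: math.log2((t[gene] + pseudocount) / (c[gene] + pseudocount))
        for gene in t
        if gene in c
    }


if __name__ == "__main__":
    treated = {"GENE_A": 300, "GENE_B": 100, "GENE_C": 0}  # made-up counts
    control = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 0}
    for gene, lfc in log2_fold_change(treated, control).items():
        print(f"{gene}: log2FC = {lfc:+.2f}")
```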
Single-cell RNA-seq workflow:
- Quality control (FastQC)
- Cell demultiplexing
- Per-cell expression quantification
- Filtering and normalization
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Clustering and cell type identification
- Cell trajectory analysis
- Visualization and reporting
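
The filtering and normalization step can be sketched in a few lines. The example below drops cells with too few detected genes, then scales each remaining cell to a fixed total and applies a log transform. The thresholds (`min_genes=200`, `target_sum=1e4`) are assumed illustrative values, not settings taken from this pipeline; in practice Scanpy or Seurat would handle this on sparse matrices.

```python
import math


def filter_cells(cells, min_genes=200):
    """Keep cells with at least `min_genes` detected genes (count > 0)."""
    return {
        cell: counts
        for cell, counts in cells.items()
        if sum(1 for n in counts.values() if n > 0) >= min_genes
    }


def normalize_cell(counts, target_sum=1e4):
    """Scale one cell's counts to `target_sum` total, then apply log(1 + x)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a cell with zero total counts")
    return {gene: math.log1p(target_sum * n / total) for gene, n in counts.items()}


if __name__ == "__main__":
    cells = {  # made-up counts; a real matrix has thousands of genes per cell
        "cell_1": {"GENE_A": 5, "GENE_B": 0, "GENE_C": 5},
        "cell_2": {"GENE_A": 0, "GENE_B": 0, "GENE_C": 1},
    }
    kept = filter_cells(cells, min_genes=2)  # low threshold to suit the toy data
    for cell, counts in kept.items():
        print(cell, normalize_cell(counts))
```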
ChIP-seq workflow:
- Quality control (FastQC)
- Adapter trimming (Trimmomatic/fastp)
- Alignment to genome (Bowtie2)
- Peak calling (MACS2)
- Motif analysis (HOMER)
- Integration with expression data
- Visualization and reporting
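
One common way to relate peaks to expression data is to assign each peak to the gene with the nearest transcription start site (TSS). The sketch below shows that idea only; it is an assumption about how such an integration could start, not a description of this pipeline's method, and the gene names and coordinates are made up.

```python
def nearest_gene(peak, genes):
    """Return (gene_name, distance) for the TSS closest to a peak summit.

    `peak` is (chrom, summit); `genes` is a list of (name, chrom, tss).
    Returns None when no gene lies on the peak's chromosome.
    """
    chrom, summit = peak
    candidates = [
        (abs(tss - summit), name)
        for name, gene_chrom, tss in genes
        if gene_chrom == chrom
    ]
    if not candidates:
        return None
    distance, name = min(candidates)
    return name, distance


if __name__ == "__main__":
    genes = [  # made-up annotations
        ("GENE_A", "chr1", 1000),
        ("GENE_B", "chr1", 5000),
        ("GENE_C", "chr2", 1200),
    ]
    for peak in [("chr1", 1400), ("chr1", 4000), ("chr3", 10)]:
        print(peak, "->", nearest_gene(peak, genes))
```

A linear scan is fine for a handful of peaks; for genome-scale data an interval-based approach (for example GenomicRanges, listed above) is the better fit.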
The pipeline generates various interactive and static visualizations:
- Shiny Dashboards: Interactive exploration of results
- Genomic Visualizations: Navigation of variants and annotations
- Heatmaps: Gene expression, correlations
- Dimensionality Reduction Plots: PCA, t-SNE, UMAP
- Circular Plots: Visualization of variants across the genome
- Interaction Networks: Gene-gene, protein-protein interactions
# Somatic variant calling (tumor/normal pair)
nextflow run workflows/nextflow/somatic_variant_calling.nf \
--tumor "data/samples/tumor/fastq/*.fastq.gz" \
--normal "data/samples/normal/fastq/*.fastq.gz" \
--genome "data/reference/genome.fa" \
--outdir "results/somatic_variants"
# Differential expression (treatment vs. control)
nextflow run workflows/nextflow/differential_expression.nf \
--condition1 "data/samples/treatment/fastq/*.fastq.gz" \
--condition2 "data/samples/control/fastq/*.fastq.gz" \
--genome "data/reference/genome.fa" \
--annotation "data/reference/genes.gtf" \
--outdir "results/diff_expression"
# Single-cell RNA-seq of a tumor sample
nextflow run workflows/nextflow/tumor_scrna_seq.nf \
--reads "data/samples/tumor_10x/fastq/*.fastq.gz" \
--genome "data/reference/genome.fa" \
--annotation "data/reference/genes.gtf" \
--outdir "results/tumor_scrna"
Contributions are welcome! Please feel free to submit pull requests, create issues, or suggest improvements.
- Fork the project
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Gabriel Demetrios Lafis - GitHub
Project Link: https://github.com/galafis/genomic-data-analysis-pipeline