Genomic TE Simulation Toolkit

A comprehensive toolkit for simulating the insertion, amplification, and evolution of transposable elements (TEs) in genomic sequences.

Overview

This toolkit provides a collection of Python scripts for computational genomics research focused on transposable elements. It enables researchers to simulate how TEs insert into genomes, amplify over time, and evolve through various mutation events, producing realistic genomic sequences with annotated TE content for downstream analysis.

Features

- Generate structured random genomes with realistic GC content distribution and chromosome features
- Simulate TE insertions with preferences for specific genomic regions or GC content
- Model genome evolution through SNPs and structural variations (deletions, insertions, inversions)
- Analyze TE sequence similarity and estimate evolutionary time
- Generate comprehensive statistics and visualizations of TE distribution
- Parallel processing support for handling large genomes efficiently
- Multi-stage evolutionary simulation with configurable parameters

Requirements

Python 3.6 or higher
Required Python packages:
- NumPy
- Matplotlib
- BioPython
- Pandas (for some visualization scripts)
Optional: R (for advanced visualizations)

Installation

Clone this repository:

```
git clone https://github.com/username/genomic-te-simulation.git
cd genomic-te-simulation
```

Create the conda environment, which installs the required Python packages:

```
conda env create -f env/environment.yml
```

Make the main script executable:
```
chmod +x run_simulation.sh
```

Usage

Quick Start

Run a complete simulation with default parameters:

```
./run_simulation.sh -t path/to/te_library.fa -o output_directory
```

Main Workflow

The complete workflow consists of four main steps:

1. Generate a base genome with realistic chromosome structure
2. Insert TEs into the genome according to specified preferences
3. Simulate evolution by introducing mutations and structural variations
4. Analyze the results to generate annotations, statistics, and visualizations
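
The first two steps can be illustrated with a small in-memory sketch. It is not the toolkit's implementation: `random_genome` and `insert_tes` are hypothetical names, insertion sites are chosen uniformly (no region or GC preference), and annotations are assumed to use 0-based, half-open coordinates as in BED.

```python
import random


def random_genome(length, gc=0.5, rng=None):
    """Return a random DNA string whose expected GC fraction is `gc`."""
    if rng is None:
        rng = random.Random()
    # Split the GC probability evenly over C and G, the rest over A and T.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def insert_tes(genome, te_library, target_pct, rng=None):
    """Insert TE copies until they make up at least `target_pct` percent.

    Returns the new genome and a list of (start, end, name) annotations.
    Assumes 0 <= target_pct < 100.
    """
    if rng is None:
        rng = random.Random()
    frac = target_pct / 100.0
    # TE bases T needed so that T / (len(genome) + T) reaches `frac`.
    needed = frac * len(genome) / (1.0 - frac)
    names = sorted(te_library)
    picks, total = [], 0
    while total < needed:
        name = rng.choice(names)
        picks.append(name)
        total += len(te_library[name])
    # Choose sites in the original genome and process them left to right,
    # so earlier insertions shift the coordinates of later ones.
    sites = sorted((rng.randint(0, len(genome)), name) for name in picks)
    pieces, annotations, prev, offset = [], [], 0, 0
    for pos, name in sites:
        te = te_library[name]
        pieces.append(genome[prev:pos])
        pieces.append(te)
        start = pos + offset
        annotations.append((start, start + len(te), name))
        offset += len(te)
        prev = pos
    pieces.append(genome[prev:])
    return "".join(pieces), annotations


if __name__ == "__main__":
    rng = random.Random(42)
    base = random_genome(10_000, gc=0.45, rng=rng)
    library = {"TE_A": "ACGT" * 50, "TE_B": "GGCC" * 25}  # toy TE library
    genome, annotations = insert_tes(base, library, target_pct=25, rng=rng)
    te_bases = sum(end - start for start, end, _ in annotations)
    print(len(genome), len(annotations), round(100 * te_bases / len(genome), 2))
```

Choosing all sites in the original genome before building the result avoids nested insertions, which keeps the coordinate bookkeeping to a single running offset.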

Command-line Options

```
./run_simulation.sh [options]
```

Main options:

- -l, --length GENOME_LENGTH: Initial genome length (default: 50,000,000)
- -p, --percentage TE_PERCENTAGE: Target TE percentage (default: 30)
- -e, --evolution EVOLUTION_LEVEL: Evolution level (low/medium/high, default: medium)
- -s, --stages STAGES: Number of evolutionary stages (default: 3)
- -t, --te_library TE_LIBRARY: TE library file (required)
- -o, --output OUTPUT_DIR: Output directory (default: te_simulation_output)
- -d, --distribution TE_DISTRIBUTION: TE insertion stage distribution (comma-separated percentages, default: 60,30,10)
- --seed SEED: Random seed (default: current timestamp)
- --gc GC_CONTENT: Base GC content (default: 0.5)
- --gc_gradient: Enable GC content gradient
- --mutation_rate MUTATION_RATE: Per base per generation mutation rate (default: 1e-8)
- --num_processes NUM_PROCESSES: Number of processes for parallel processing (default: automatic)
- --fast: Use fast TE insertion algorithm (ignores region preferences and GC content)
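
How -p and -d might combine can be sketched in Python. This is a minimal sketch under stated assumptions: each value passed to -d is taken to be the share of the total TE insertions assigned to one stage, there is one value per stage, and the values sum to 100. The function names are hypothetical, and the toolkit's own checks in parameter_validation.py may differ.

```python
def parse_distribution(text, stages):
    """Parse a comma-separated percentage string such as "60,30,10"."""
    shares = [float(part) for part in text.split(",")]
    if len(shares) != stages:
        raise ValueError(
            "expected %d values, got %d" % (stages, len(shares))
        )
    if any(share < 0 for share in shares):
        raise ValueError("shares must be non-negative")
    if abs(sum(shares) - 100.0) > 1e-6:
        raise ValueError("shares must sum to 100")
    return shares


def per_stage_te_percentage(te_percentage, shares):
    """Split the overall TE percentage across stages by their shares."""
    return [te_percentage * share / 100.0 for share in shares]


if __name__ == "__main__":
    shares = parse_distribution("60,30,10", stages=3)
    print(per_stage_te_percentage(30, shares))
```

Under these assumptions, the defaults (-p 30, -d 60,30,10) would place 18, 9, and 3 percentage points of TE content in stages one to three.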

For full options list:

```
./run_simulation.sh --help
```

Using Individual Scripts

You can also run individual components of the workflow:

Generate a base genome:

```
python generate_base_genome.py -l 50000000 -o base_genome.fa --gc 0.5
```

Insert TEs:

```
python optimized_te_insertion.py -g base_genome.fa -t te_library.fa -p 30 -o te_inserted
```

Simulate evolution:

```
python simulate_evolution.py -g te_inserted_genome.fa -a te_inserted_te_annotations.bed -l medium -o evolved
```

Generate annotations and statistics:

```
python generate_annotation.py -g evolved_genome.fa -a evolved_annotations.bed -o te_annotation
```

Analyze TE similarity:

```
python te_similarity.py -t te_library.fa -s evolved_te_sequences.fa -a evolved_annotations.bed -o te_similarity
```
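
The idea of turning sequence similarity into an evolutionary time estimate can be illustrated as follows. This is only a sketch, not the method used by te_similarity.py: it assumes the TE copy is already aligned to its library sequence without gaps, applies the Jukes-Cantor correction (a standard model swapped in for illustration), and divides by a per-generation mutation rate such as the --mutation_rate default of 1e-8. All function names are hypothetical.

```python
import math


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def jukes_cantor_distance(p):
    """Substitutions per site for an observed mismatch fraction `p`."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def estimate_generations(p, mutation_rate=1e-8):
    """Generations since insertion, assuming only the copy has mutated."""
    return jukes_cantor_distance(p) / mutation_rate


if __name__ == "__main__":
    library_te = "ACGTACGTACGTACGTACGT"
    evolved_te = "ACGTACGAACGTACGTACCT"  # two substitutions
    p = 1.0 - identity(library_te, evolved_te)
    print(p, jukes_cantor_distance(p), estimate_generations(p))
```

Because the library sequence is treated as the unmutated ancestor, the distance is divided by the mutation rate once rather than twice, as it would be when comparing two diverging copies.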

Script Descriptions

- generate_base_genome.py: Generates an initial random genome sequence with specific GC content and chromosome structures
- optimized_te_insertion.py: Inserts TEs into a genome with optimized performance for large genomes
- simulate_evolution.py: Introduces SNPs and structural variations to simulate genomic evolution
- generate_annotation.py: Generates annotation files and statistical reports
- te_similarity.py: Analyzes TE sequence similarity and estimates evolutionary time
- time_evolution.py: Simulates multiple evolutionary stages including TE insertion and mutation
- genomic_partition.py: Provides parallel genomic partition processing
- parallel_processing.py: Provides reusable parallel processing functions
- parameter_validation.py: Validates input parameters for different scripts
- visualization_tools.py: Generates visualizations of chromosome structure and TE distribution
- run_simulation.sh: Main script to run the complete simulation workflow
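
To make the kinds of events described for simulate_evolution.py concrete, here is a small sketch of SNPs, a deletion, and an inversion applied to a plain string. The function names are hypothetical and the script's actual event model is not shown in this README; in particular, this sketch does not update TE annotations after each event.

```python
import random

# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def apply_snps(seq, rate, rng):
    """Replace each base, with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def delete(seq, start, end):
    """Remove seq[start:end] (0-based, half-open)."""
    return seq[:start] + seq[end:]


def invert(seq, start, end):
    """Replace seq[start:end] by its reverse complement."""
    segment = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + segment + seq[end:]


if __name__ == "__main__":
    rng = random.Random(7)
    genome = "ACGTTGCAACGTTGCAACGT"
    genome = apply_snps(genome, rate=0.1, rng=rng)
    genome = delete(genome, 2, 5)
    genome = invert(genome, 4, 10)
    print(genome)
```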

Output Files

For each simulation, the toolkit produces:

- Genome sequences (FASTA format)
- TE annotations (BED and GFF3 formats)
- TE sequences extracted from the genome
- Statistical reports on TE content, distribution, and evolution
- Visualizations of TE distribution, similarity, and evolutionary time
- Evolution logs showing mutation events
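
For downstream analysis, the BED annotations can be summarized with a few lines of Python. The sketch below assumes the standard BED column order (chromosome, start, end, name) with tab separators and non-overlapping intervals; the exact columns written by the toolkit are not documented here, and `te_content_from_bed` is a hypothetical helper.

```python
from collections import defaultdict


def te_content_from_bed(bed_lines, genome_length):
    """Return bases per TE name and the overall TE percentage.

    Overlapping intervals would be counted twice; this sketch assumes
    the annotations do not overlap.
    """
    per_name = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        start, end = int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "unknown"
        per_name[name] += end - start  # BED intervals are half-open
    total = sum(per_name.values())
    return dict(per_name), 100.0 * total / genome_length


if __name__ == "__main__":
    example = [
        "chr1\t100\t400\tTE_A",
        "chr1\t1000\t1200\tTE_B",
        "chr1\t5000\t5300\tTE_A",
    ]
    per_name, pct = te_content_from_bed(example, genome_length=10_000)
    print(per_name, pct)
```

To read a real file, pass an open file handle as `bed_lines`, for example `with open("evolved_annotations.bed") as handle: ...`.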

Example

Here's a basic example of running a simulation with a custom TE library:

```
./run_simulation.sh \
  -t example_data/te_library.fa \
  -l 10000000 \
  -p 25 \
  -e medium \
  -s 2 \
  -o example_output \
  --gc 0.45 \
  --gc_gradient \
  --seed 12345
```

This will:

- Generate a 10 Mb random genome with 45% GC content and a GC content gradient
- Perform 2 stages of TE insertion, targeting 25% overall TE content
- Apply medium-level evolutionary changes
- Save all results to the example_output directory
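
When the same simulation needs to be repeated with different seeds, the command can be assembled from Python. The sketch below only uses options listed under Command-line Options; `build_command` and `run_all` are hypothetical helpers, and the output directory naming is an arbitrary choice for illustration.

```python
import subprocess


def build_command(te_library, output_dir, length=None, percentage=None,
                  evolution=None, stages=None, gc=None, gc_gradient=False,
                  seed=None):
    """Build the argument list for run_simulation.sh from keyword options."""
    cmd = ["./run_simulation.sh", "-t", te_library, "-o", output_dir]
    optional = [("-l", length), ("-p", percentage), ("-e", evolution),
                ("-s", stages), ("--gc", gc)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if gc_gradient:
        cmd.append("--gc_gradient")  # boolean flag, takes no value
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    commands = [
        build_command("example_data/te_library.fa",
                      "example_output_seed%d" % seed,
                      length=10000000, percentage=25, evolution="medium",
                      stages=2, gc=0.45, gc_gradient=True, seed=seed)
        for seed in (1, 2, 3)
    ]
    for cmd in commands:
        print(" ".join(cmd))
    # run_all(commands)  # uncomment to launch the simulations
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.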

Citation

If you use this toolkit in your research, please cite:

PanTE: A Comprehensive Framework for Transposable Element Discovery in Graph-based Pangenomes

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
