diff --git a/docs/gtars/core.md b/docs/gtars/core.md index 5d2a2612..71aef461 100644 --- a/docs/gtars/core.md +++ b/docs/gtars/core.md @@ -1,60 +1,257 @@ # gtars-core -Core library providing fundamental data structures and utilities for genomic interval operations. This is the foundation that all other gtars modules build upon. +Core library providing the fundamental data structures and utilities that every other gtars crate builds on. If you're working directly with genomic regions in Rust, everything starts here. -## Features +!!! info "What's new" + Recent additions to `gtars-core` that pre-existing doc snippets don't cover: -- Common genomic data structures (Region, RegionSet) -- BED file parsing utilities -- Shared constants and helper functions -- Foundation for all gtars modules + - **`RegionSetList`** — a `GRangesList`-style collection of named `RegionSet`s with `concat`, `iter`, and set-level identifier computation. Consumed by `gtars-genomicdist` and `gtars-lola`. + - **`CoordinateMode`** enum — selects BED (0-based half-open) vs. GRanges (1-based closed) conventions for midpoint calculations. + - **`Fragment`** — fragment-file record type for single-cell ATAC-seq workflows. + - **`Interval`** — generic interval with payload, used internally by overlap indexes. + - **Typed error enum** `RegionSetError` — replaces the previous panicking parse paths. + - **`to_polars`** + `dataframe` feature flag — zero-copy conversion to a Polars DataFrame. + - **`bigbed`** and **`http`** feature flags — optional bigBed writing and URL-backed `RegionSet::try_from`. -## Core Data Types +## Core data types + +The `gtars_core::models` module re-exports all six core types at the top level, so you typically import them directly: + +```rust +use gtars_core::models::{ + Region, RegionSet, RegionSetList, Interval, Fragment, CoordinateMode, +}; +``` + +### `Region` + +A single genomic interval. `start` and `end` are 0-based half-open by BED convention; `rest` carries any trailing tab-separated fields from the original line verbatim. -### Region -Represents a genomic interval with chromosome, start, and end coordinates: ```rust use gtars_core::models::Region; -// Create a region -let region = Region::new("chr1", 1000, 2000); +let region = Region { + chr: "chr1".to_string(), + start: 1000, + end: 2000, + rest: None, +}; -// Access properties -println!("Chr: {}", region.chr); -println!("Start: {}", region.start); -println!("End: {}", region.end); +assert_eq!(region.width(), 1000); +println!("{}", region.as_string()); // chr1\t1000\t2000 ``` -### RegionSet -Collection of genomic regions: +Methods: + +| method | returns | notes | +|---|---|---| +| `width()` | `u32` | `end - start` | +| `as_string()` | `String` | tab-separated BED line | +| `digest()` | `String` | MD5 digest of `"chr,start,end"` | +| `mid_point()` | `u32` | `start + width() / 2` (BED/floor) | +| `mid_point_with_mode(mode)` | `u32` | BED or GRanges convention — see `CoordinateMode` below | + +`Region` implements `Display` (as tab-separated text), `Clone`, `Debug`, `Eq`, `Hash`, and — under the `serde` feature — `Serialize`/`Deserialize`. + +### `RegionSet` + +An ordered collection of `Region`s. `RegionSet::try_from` accepts `&Path`, `&str`, `String`, `PathBuf`, or `Vec`, auto-detects gzip by extension, and with the `http` feature will fetch from URLs. Construction always sorts by `(chr, start)`. + ```rust use gtars_core::models::RegionSet; use std::path::Path; -// Load from BED file +// From a local BED (or BED.gz) file let rs = RegionSet::try_from(Path::new("peaks.bed"))?; -// Access regions -println!("Number of regions: {}", rs.regions.len()); +assert!(!rs.is_empty()); +println!("{} regions, {} bp total", rs.len(), rs.nucleotides_length()); -// Iterate over regions -for region in &rs.regions { +for region in &rs { println!("{}: {}-{}", region.chr, region.start, region.end); } +# Ok::<(), gtars_core::errors::RegionSetError>(()) +``` + +Key methods: + +**Construction** + +- `RegionSet::try_from(path)` — accepts `&Path`, `&str`, `String`, `PathBuf`. Sorts on load. +- `RegionSet::from(regions: Vec)` — in-memory construction. +- `RegionSet::from(bytes: &[u8])` — parse from an in-memory byte slice (no gzip handling). + +**Iteration** + +- `for region in &rs { ... }` — `IntoIterator` is implemented for `&RegionSet`. +- `iter_chroms()` → unique chromosomes in insertion order (post-sort). +- `iter_chr_regions(chr)` → all regions on a specific chromosome. + +**Summaries** + +- `len()`, `is_empty()`, `nucleotides_length()` — count and total bp. +- `region_widths()` → `Vec`. +- `mean_region_width()` → `f64`, rounded to 2 decimals. +- `get_max_end_per_chr()` → `HashMap`. +- `calc_mid_points()` → `HashMap>` (BED convention). +- `calc_mid_points_with_mode(CoordinateMode)` → same, with mode control. + +**Identifiers** + +- `identifier()` — MD5 digest over the first-three-column layout; the canonical BEDbase identifier. +- `file_digest()` — MD5 digest over the full serialized file content. + +**I/O** + +- `to_bed(path)` / `to_bed_gz(path)` — write plain or gzipped BED. +- `to_bigbed(path, chrom_sizes)` — under the `bigbed` feature, write a bigBed file. +- `to_polars()` — under the `dataframe` feature, return a `PolarsResult`. + +**Mutation** + +- `sort()` — in-place sort by `(chr, start)`. + +### `RegionSetList` + +A collection of `RegionSet`s — the gtars equivalent of Bioconductor's `GRangesList`. This is the type that downstream crates (genomicdist, lola) use to pass multiple region sets across FFI boundaries without paying N × clone costs. + +```rust +use gtars_core::models::{RegionSet, RegionSetList}; + +let peaks1 = RegionSet::try_from("peaks_rep1.bed")?; +let peaks2 = RegionSet::try_from("peaks_rep2.bed")?; +let peaks3 = RegionSet::try_from("peaks_rep3.bed")?; + +// Unnamed — names are optional +let rsl = RegionSetList::new(vec![peaks1, peaks2, peaks3]); + +// Or with explicit names +let rsl = RegionSetList::with_names( + rsl.region_sets, + vec!["rep1".into(), "rep2".into(), "rep3".into()], +); + +// Iterate +for rs in &rsl { + println!("{} regions", rs.len()); +} + +// Flatten all regions into a single RegionSet (no merge/dedup) +let combined: RegionSet = rsl.concat(); + +// Stable identifier over the full set (order-independent) +let id = rsl.identifier(); +# Ok::<(), gtars_core::errors::RegionSetError>(()) +``` + +`RegionSetList::try_from` also accepts: + +- A path to a **bedset manifest file** — a text file listing one BED path per line (`read_bedset_file` under the hood). +- A `Vec<&Path>`, `Vec<&str>`, `Vec`, or `Vec` — each is loaded as its own `RegionSet`. + +`concat()` flattens without merging; if you need a reduced union, call `.reduce()` on the result (the `reduce` method lives in `gtars-genomicdist` via the `IntervalRanges` trait). + +Key methods: `new`, `with_names`, `add`, `get(i)`, `iter`, `len`, `is_empty`, `concat`, `identifier`. + +### `CoordinateMode` + +Switches between BED (0-based half-open, floor division) and GRanges (1-based closed, banker's rounding) conventions for midpoint calculation. BED is the default and matches Python/numpy conventions; GRanges exists for exact bit-compatibility with R GenomicDistributions output. + +```rust +use gtars_core::models::{Region, CoordinateMode}; + +let r = Region { chr: "chr1".into(), start: 100, end: 206, rest: None }; + +assert_eq!(r.mid_point_with_mode(CoordinateMode::Bed), 153); // 100 + 53 +assert_eq!(r.mid_point_with_mode(CoordinateMode::GRanges), 152); // banker's rounding +``` + +For widths `w` where `w % 4 == 2`, BED and GRanges midpoints differ by 1 bp — this affects approximately 2.6% of typical peak distance calculations. Use `GRanges` mode only when you need byte-for-byte parity with R output. + +### `Fragment` + +A fragment-file record (chromatin accessibility / scATAC-seq). Implements `FromStr` for parsing lines from 10x-style `fragments.tsv` files and `From for Region` for dropping the barcode/support metadata when you only care about coordinates. + +```rust +use gtars_core::models::{Fragment, Region}; +use std::str::FromStr; + +let f = Fragment::from_str("chr1\t1000\t1200\tAAACCTGAGAAACCAT-1\t3")?; +assert_eq!(f.barcode, "AAACCTGAGAAACCAT-1"); +assert_eq!(f.read_support, 3); + +let as_region: Region = f.into(); +# Ok::<(), anyhow::Error>(()) +``` + +### `Interval` + +A generic `[start, end)` range with a payload `T`, parameterized over any unsigned integer type. This is primarily a building block for overlap indexes (consumed by `gtars-overlaprs` and `gtars-igd`); most user code should prefer `Region`/`RegionSet`. + +```rust +use gtars_core::models::Interval; + +let a: Interval = Interval { start: 10, end: 50, val: 0 }; +let b: Interval = Interval { start: 40, end: 80, val: 1 }; + +assert!(a.overlap(b.start, b.end)); +assert_eq!(a.intersect(&b), 10); // overlap width in bp +``` + +## Error handling + +Parse and I/O errors from `RegionSet::try_from` come back as the typed `RegionSetError` enum (no panics on malformed input): + +```rust +use gtars_core::errors::RegionSetError; +use gtars_core::models::RegionSet; + +match RegionSet::try_from("not_a_real_file.bed") { + Ok(rs) => println!("loaded {} regions", rs.len()), + Err(RegionSetError::FileReadError(msg)) => eprintln!("read failed: {msg}"), + Err(RegionSetError::RegionParseError(msg)) => eprintln!("parse failed: {msg}"), + Err(RegionSetError::EmptyRegionSet(path)) => eprintln!("no regions in {path}"), + Err(e) => eprintln!("other: {e}"), +} ``` -## Available Modules +Variants: `FileReadError`, `InvalidPathOrUrl`, `InvalidBedbaseIdentifier`, `BedbaseFetchError`, `RegionParseError`, `EmptyRegionSet`, `HttpFeatureDisabled`, `BigBedError`, and a transparent `Io` wrapper around `std::io::Error`. -- `models` - Core data structures (Region, RegionSet) -- `utils` - Utility functions for file handling and parsing -- `consts` - Shared constants +## Feature flags + +| flag | effect | +|---|---| +| *(default)* | Pure in-memory BED reading/writing, no optional dependencies. | +| `serde` | Derives `Serialize`/`Deserialize` on `Region`, `RegionSet`. | +| `http` | Enables `RegionSet::try_from(&Path)` to fetch from HTTP(S) URLs via `ureq`. Without this feature, non-file paths return `HttpFeatureDisabled`. | +| `dataframe` | Enables `RegionSet::to_polars()` (pulls in `polars`). Required transitively by `gtars-genomicdist`'s `bedclassifier` feature. | +| `bigbed` | Enables `RegionSet::to_bigbed()` via `bigtools`. | + +Enable them in `Cargo.toml` like so: + +```toml +[dependencies] +gtars-core = { version = "0.5", features = ["serde", "dataframe"] } +``` -## Dependencies +## Available modules -Minimal external dependencies: +- **`models`** — all core data types (`Region`, `RegionSet`, `RegionSetList`, `Interval`, `Fragment`, `CoordinateMode`). Re-exported at the crate root. +- **`errors`** — `RegionSetError` enum. +- **`utils`** — readers, file-type detection, chromosome-sizes parsing, and `Region` ↔ id hash-map helpers: + - `get_dynamic_reader(&Path)` / `get_dynamic_reader_w_stdin(&str)` — transparent gzip/stdin handling. + - `get_dynamic_reader_from_url(&Path)` — under the `http` feature. + - `get_file_info(&Path) -> FileInfo` — detect type (BED, BAM, NARROWPEAK, UNKNOWN) and gzip. + - `parse_bedlike_file(line)` → `(chr, start, end)` tuple from a single line. + - `get_chrom_sizes(path)` → `HashMap`. + - `read_bedset_file(path)` → `Vec` of BED paths from a bedset manifest. + - `generate_region_to_id_map` / `generate_id_to_region_map` and string variants — stable id assignment for tokenizer vocabularies. + - `remove_all_extensions(&Path)` → stem with *all* extensions stripped (handles `.bed.gz`). +- **`consts`** — column-name constants (`CHR_COL_NAME`, `START_COL_NAME`, `END_COL_NAME`, `DELIMITER`) and file-extension constants (`BED_FILE_EXTENSION`, `BAM_FILE_EXTENSION`, `GZ_FILE_EXTENSION`, `IGD_FILE_EXTENSION`, `GTOK_EXT`). -- `anyhow` - Error handling -- `flate2` - Gzip compression support -- Other standard bioinformatics libraries +## Where to go next -This module serves as the foundation for all other gtars modules and maintains backward compatibility within major versions. \ No newline at end of file +- **[Core models tour](regionSet.md)** — a cross-language (Python + Rust) walkthrough of `Region`, `RegionSet`, and friends. +- **[gtars-overlaprs](overlaprs.md)** — high-performance overlap queries that operate on `RegionSet`. +- **[gtars-genomicdist](genomicdist.md)** — the `IntervalRanges` and `GenomicIntervalSetStatistics` traits extend `RegionSet` with R GenomicDistributions–style set algebra and summary stats. +- **[gtars-lola](lola.md)** — LOLA enrichment built on top of the IGD index and `RegionSetList`. diff --git a/docs/gtars/genomicdist.md b/docs/gtars/genomicdist.md new file mode 100644 index 00000000..e57c66f6 --- /dev/null +++ b/docs/gtars/genomicdist.md @@ -0,0 +1,509 @@ +# gtars-genomicdist + +Rust port of the R [GenomicDistributions](https://code.databio.org/GenomicDistributions/) package — plus a handful of extra calculations and binary serialization formats used by BEDbase. The crate provides: + +- **Summary statistics** and per-chromosome distributions (`GenomicIntervalSetStatistics` trait). +- **Interval set algebra** in the GRanges/IRanges idiom (`IntervalRanges` trait, 26+ methods). +- **Genomic partitioning** using a gene model (promoter / UTR / exon / intron / intergenic). +- **Signal matrix overlap and summary** (`calc_summary_signal`, TSV + packed-binary I/O). +- **Consensus region calling** across multiple input sets. +- **Nearest-TSS / nearest-feature distance** calculations (`TssIndex`). +- **GC content** and **dinucleotide frequency** against a reference genome. +- **BED file classification** (optional `bedclassifier` feature, polars-backed). +- The **GDA** binary gene-model format and the **.fab** binary FASTA format for fast mmap-backed sequence lookup. + +All operations take a `&RegionSet` (from `gtars-core`) as input and either extend it via trait implementations or run as free functions. If you're new to the crate, start with the `IntervalRanges` and `GenomicIntervalSetStatistics` traits — those cover the GRanges/GenomicDistributions surface area most users need. + +!!! tip "Coming from R GenomicDistributions?" + Most function names follow the pattern `calc_*` where R used `calc*`, and behavior is matched where practical. A few deliberate divergences are documented inline — the most notable is that midpoint calculations default to BED (floor) conventions rather than GRanges banker's rounding, and spacing/nearest-neighbor calculations *exclude* single-region chromosomes rather than emitting sentinel values. + +## Installation + +```toml +[dependencies] +gtars-genomicdist = "0.7" +``` + +### Feature flags + +| flag | effect | +|---|---| +| *(default)* | All traits, statistics, partitions, signal, consensus, and TSS calculations. | +| `bedclassifier` | Enables `classify_bed()` and pulls in `polars`. Transitively enables `gtars-core/dataframe`. | + +Enable the classifier feature: + +```toml +[dependencies] +gtars-genomicdist = { version = "0.7", features = ["bedclassifier"] } +``` + +## Core types + +All of these are re-exported at the crate root, so you typically import them directly: + +```rust +use gtars_genomicdist::{ + // Statistics + GenomicIntervalSetStatistics, + // Interval algebra + IntervalRanges, RegionSetListOps, pairwise_jaccard, + // Strand-aware wrappers + SortedRegionSet, Strand, StrandedRegionSet, + // Partitions + GeneModel, PartitionList, PartitionResult, + ExpectedPartitionResult, ExpectedPartitionRow, + calc_expected_partitions, calc_partitions, genome_partition_list, + // Signal + SignalMatrix, SignalSummaryResult, ConditionStats, + calc_summary_signal, + // Consensus + ConsensusRegion, consensus, + // Sequence stats + calc_gc_content, calc_dinucl_freq, DINUCL_ORDER, + // Utilities + chrom_karyotype_key, median_abs_distance, + // Gene model serialization + GenomicDistAnnotation, + // BED classifier (feature-gated) + #[cfg(feature = "bedclassifier")] classify_bed, + // Re-export from gtars-core + CoordinateMode, +}; +``` + +The submodules (`statistics`, `interval_ranges`, `partitions`, `signal`, `consensus`, `models`, `asset`, `bed_classifier`, `utils`) are all `pub` as well, so both `gtars_genomicdist::calc_partitions` and `gtars_genomicdist::partitions::calc_partitions` resolve to the same symbol. + +## Statistics + +The `GenomicIntervalSetStatistics` trait extends `RegionSet` with per-region and per-pair quantities — direct ports of the R GenomicDistributions functions (`calcWidth`, `calcNeighborDist`, `calcNearestNeighbors`, `calcChromBins`). These return vectors or per-chromosome tables, one number per peak or per gap. Use them when you want to *see* each value — plot a histogram, feed them to a downstream test, render a per-chromosome heatmap. + +Bring the trait into scope and the methods appear on any `RegionSet`. + +### Per-region and per-pair quantities + +Direct ports of R GenomicDistributions. Each returns a vector or per-chromosome table. + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::GenomicIntervalSetStatistics; + +let peaks = RegionSet::try_from("peaks.bed")?; + +// Per-chromosome summary: count, min/max/mean/median width, chr bounds +for (chr, stats) in peaks.chromosome_statistics() { + println!( + "{chr}: {} regions, median width {}", + stats.number_of_regions, + stats.median_region_length + ); +} + +// Region-width distribution (one value per region) +let widths: Vec = peaks.calc_widths(); + +// Signed inter-region gaps on each chromosome (only positive gaps are returned) +let gaps: Vec = peaks.calc_neighbor_distances()?; + +// Nearest-neighbor distance per region (multi-region chromosomes only) +let nn: Vec = peaks.calc_nearest_neighbors()?; +# Ok::<(), Box>(()) +``` + +| method | question it answers | returns | +|---|---|---| +| `chromosome_statistics()` | "What do the per-chromosome counts and width distributions look like?" | `HashMap` | +| `region_distribution_with_bins(n_bins)` | "How many peaks fall in each of `n_bins` bins sized by the longest chromosome?" | `HashMap` | +| `region_distribution_with_chrom_sizes(n_bins, chrom_sizes)` | Same, but bins are sized per-chromosome by actual chromosome length — matches R `getGenomeBins` | `HashMap` | +| `calc_widths()` | "How wide is each peak?" — port of R `calcWidth()` | `Vec` | +| `calc_neighbor_distances()` | "What's the bp gap between every pair of consecutive peaks on each chromosome?" | `Result>` | +| `calc_nearest_neighbors()` | "How far is each peak from its closest neighbor?" — port of R `calcNearestNeighbors()` | `Result>` | + +### Output structs + +```rust +pub struct ChromosomeStatistics { + pub chromosome: String, + pub number_of_regions: u32, + pub start_nucleotide_position: u32, // leftmost start + pub end_nucleotide_position: u32, // rightmost end + pub minimum_region_length: u32, + pub maximum_region_length: u32, + pub mean_region_length: f64, + pub median_region_length: f64, +} + +pub struct RegionBin { pub chr: String, pub start: u32, pub end: u32, pub n: u32, pub rid: u32 } +``` + +!!! warning "Output length for spacing/nearest calculations" + `calc_neighbor_distances` and `calc_nearest_neighbors` **skip single-region chromosomes** (matching R GenomicDistributions behavior). The output length is therefore **not 1:1 with the input region count** — it's the number of multi-region chromosomes' regions. No sentinel values are emitted. If you need 1:1 alignment, filter your input to multi-region chromosomes first. + +!!! warning "`n_bins` is a target, not a total" + In `region_distribution_with_chrom_sizes`, `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so shorter chromosomes get fewer bins and the total bin count is `sum(ceil(chrom_size / bin_width))`, which can exceed `n_bins` substantially. To target a specific bin width in bp, set `n_bins = max_chrom_len / desired_bp`. + +## Interval set algebra + +`IntervalRanges` is a second trait on `RegionSet` that provides GRanges/IRanges-style set algebra. All operations return new `RegionSet`s (immutable pattern) and are strand-unaware by default — use `StrandedRegionSet` if you need strand-aware promoters, reduce, or setdiff. + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::IntervalRanges; + +let a = RegionSet::try_from("peaks_a.bed")?; +let b = RegionSet::try_from("peaks_b.bed")?; + +let merged = a.reduce(); // merge overlapping intervals +let common = a.intersect(&b); // range-level intersection +let a_only = a.setdiff(&b); // remove b from a +let combined = a.union(&b); // union + reduce +let jac = a.jaccard(&b); // bp-Jaccard similarity +let closest = a.closest(&b); // Vec<(a_idx, b_idx, dist)> +let clustered = a.cluster(1000); // per-region cluster id +# Ok::<(), Box>(()) +``` + +### Method reference + +| method | description | +|---|---| +| `trim(chrom_sizes)` | clamp regions to `[0, chrom_size)`; drop regions on unknown chromosomes | +| `promoters(upstream, downstream)` | `[start - upstream, start + downstream)` per region | +| `reduce()` | merge overlapping/adjacent intervals per chromosome | +| `setdiff(other)` / `subtract(other)` | remove `other` from self | +| `pintersect(other)` | *pairwise* (by index) intersection | +| `intersect(other)` | range-level intersection | +| `intersect_all(other)` | all-vs-all pairwise intersection fragments (AIList-backed) | +| `concat(other)` | concatenate without merging | +| `union(other)` | `concat(other).reduce()` | +| `jaccard(other)` | bp-level Jaccard `|A ∩ B| / |A ∪ B|` | +| `coverage(other)` | fraction of `self` bp covered by `other` | +| `overlap_coefficient(other)` | `|A ∩ B| / min(|A|, |B|)` | +| `shift(offset)` | translate by signed bp offset (saturating at 0) | +| `flank(width, use_start, both)` | upstream/downstream/both-side flanks | +| `resize(width, fix)` | fixed width anchored at `"start"`, `"end"`, or `"center"` | +| `narrow(start, end, width)` | GRanges-style narrow with 1-based relative coords | +| `disjoin()` | tile into non-overlapping pieces at every boundary | +| `gaps(chrom_sizes)` | peak-free regions, including leading/trailing/whole-chrom gaps | +| `closest(other)` | `Vec<(self_idx, other_idx, signed_dist)>` | +| `cluster(max_gap)` | `Vec` cluster ids in original order | + +!!! note "`rest` fields are dropped" + Operations that merge or synthesize new intervals (reduce, setdiff, promoters, etc.) produce regions with `rest: None`. There is no unambiguous way to carry the original metadata through a merge, so the contract is: use `IntervalRanges` methods for coordinate-only work. + +### `pairwise_jaccard` + +A standalone helper for computing the full N×N Jaccard matrix over a slice of region sets, optimized to pre-reduce each set once and walk pairs linearly: + +```rust +use gtars_genomicdist::pairwise_jaccard; + +let sets = vec![rs1, rs2, rs3]; +let jac = pairwise_jaccard(&sets); // Vec of length 9 (row-major) + +for i in 0..3 { + for j in 0..3 { + print!("{:.3} ", jac[i * 3 + j]); + } + println!(); +} +``` + +### RegionSetListOps + +`RegionSetListOps` implements the same set-algebra operations on a `RegionSetList` by index — useful for bindings (wasm/Python/R) that want to operate on pairs without copying whole `RegionSet`s across an FFI boundary. + +```rust +use gtars_core::models::RegionSetList; +use gtars_genomicdist::RegionSetListOps; + +let rsl = RegionSetList::try_from(vec!["a.bed", "b.bed", "c.bed"])?; + +let jac_01 = rsl.jaccard_at(0, 1); // Option +let union_02 = rsl.union_at(0, 2); // Option +let drop_one = rsl.union_except(1); // union of all but index 1 +let (full_union, loo) = rsl.bulk_union_except() // all leave-one-out unions in O(n) + .ok_or("empty list")?; +# Ok::<(), Box>(()) +``` + +Methods: `pintersect_at`, `pintersect_count`, `jaccard_at`, `union_at`, `setdiff_at`, `region_count`, `union_except`, `bulk_union_except`, `union_all`, `intersect_all`. + +## Partitions + +Ports of `genomePartitionList()`, `calcPartitions()`, and `calcExpectedPartitions()` from R GenomicDistributions. You load a gene model from BED files or GTF, build an ordered `PartitionList` of promoter / UTR / exon / intron categories, and classify your query regions into mutually exclusive buckets (`intergenic` is added as the implicit remainder). + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{ + GeneModel, calc_partitions, calc_expected_partitions, genome_partition_list, +}; +use std::collections::HashMap; + +// 1. Load the gene model (choose one) +let model = GeneModel::from_bed_files( + "genes.bed", + "exons.bed", + Some("three_utr.bed"), // optional + Some("five_utr.bed"), // optional +)?; + +// Or, from a GTF: +// let model = GeneModel::from_gtf("gencode.v47.gtf.gz", true, true)?; + +// 2. Build partition list (core/prox promoter sizes in bp; chrom_sizes optional) +let partitions = genome_partition_list(&model, 100, 2000, None); + +// 3. Classify query regions +let query = RegionSet::try_from("peaks.bed")?; + +let result = calc_partitions(&query, &partitions, /* bp_proportion = */ false); +for (name, count) in &result.counts { + let pct = *count as f64 / result.total as f64 * 100.0; + println!("{:<15} {:>6} ({:.1}%)", name, count, pct); +} + +// 4. Observed vs expected, with chi-square p-value +let chrom_sizes: HashMap = [("chr1".into(), 248_956_422)].into_iter().collect(); +let expected = calc_expected_partitions(&query, &partitions, &chrom_sizes, false); +for row in &expected.rows { + println!( + "{:<15} obs={:>6} exp={:>10.1} log10(O/E)={:+.2} p={:.2e}", + row.partition, row.observed, row.expected, row.log10_oe, row.chi_sq_pval + ); +} +# Ok::<(), Box>(()) +``` + +### Priority order + +`genome_partition_list` produces partitions in this order, which is also the priority for classification: + +1. **`promoterCore`** — `core_prom_size` bp upstream of each gene start (strand-aware). +2. **`promoterProx`** — `prox_prom_size` bp upstream, minus core. +3. **`threeUTR`** — if UTR files are provided. +4. **`fiveUTR`** — minus 3'UTR (R gives 3'UTR priority). +5. **`exon`** — minus UTRs. +6. **`intron`** — gene bodies minus UTRs and exons. +7. **`intergenic`** — implicit remainder, emitted by `calc_partitions`. + +`calc_partitions` has two modes: + +- **Priority mode (`bp_proportion = false`)** — each query region is assigned to the first partition it overlaps; remainders are counted as `intergenic`. Mutually exclusive. +- **BP proportion mode (`bp_proportion = true`)** — for each partition, computes total overlapping base pairs of query. Not mutually exclusive: a query region overlapping multiple partitions contributes bp to each (matching R behavior). + +!!! note "Chi-square p-values differ from R" + `calc_expected_partitions` uses a goodness-of-fit `(O-E)²/E` formula. R's `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R GenomicDistributions output byte-for-byte. The `log10_oe` column is directly comparable. + +### `GeneModel::from_gtf` + +GTF loading handles GENCODE-style files that use an undifferentiated `UTR` feature type by parsing `CDS` records to infer which UTRs are 5' vs 3'. Two flags: + +- `filter_protein_coding` — keep only features with `gene_biotype "protein_coding"` or `gene_type "protein_coding"` in the attributes column. +- `convert_ensembl_ucsc` — prepend `chr` to chromosome names that don't already have it (so Ensembl `1` becomes UCSC `chr1`). + +## Signal matrix overlap + +`SignalMatrix` holds a matrix of signal values across genomic regions × conditions (e.g. a peak × cell-type matrix of ChIP-seq intensities). `calc_summary_signal` overlaps a query region set against the matrix, aggregates by MAX per query region, and returns Tukey boxplot statistics per condition. + +```rust +use gtars_core::models::{RegionSet, CoordinateMode}; +use gtars_genomicdist::{SignalMatrix, calc_summary_signal}; + +// Load from a TSV where col 0 is "chr_start_end", remaining cols are condition values +let sm = SignalMatrix::from_tsv("signal_matrix.tsv")?; + +// Or load from the packed binary format (produced by sm.save_bin()) +let sm = SignalMatrix::load_bin("signal_matrix.sigm")?; +// ...or from an in-memory byte slice (for wasm): +// let sm = SignalMatrix::load_bin_from_bytes(&bytes)?; + +let query = RegionSet::try_from("peaks.bed")?; +let summary = calc_summary_signal(&query, &sm, CoordinateMode::Bed)?; + +for stats in &summary.matrix_stats { + println!( + "{}: median={:.2} [{:.2}–{:.2}]", + stats.condition, stats.median, stats.lower_hinge, stats.upper_hinge + ); +} +# Ok::<(), Box>(()) +``` + +`SignalSummaryResult` contains per-query-region max signal vectors (`signal_matrix`), per-condition Tukey stats (`matrix_stats`), and `condition_names`. The packed binary format (`.sigm`, magic `SIGM`) is produced by `sm.save_bin()` — substantially faster to load than TSV for large matrices. + +## Consensus regions + +Given N region sets, `consensus` produces the reduced union annotated with how many input sets overlap each union region. Useful for filtering by replicate support threshold: + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{consensus, ConsensusRegion}; + +let reps = [ + RegionSet::try_from("rep1.bed")?, + RegionSet::try_from("rep2.bed")?, + RegionSet::try_from("rep3.bed")?, +]; + +let cons: Vec = consensus(&reps); + +// Keep regions present in ≥ 2/3 replicates +let robust: Vec<_> = cons.into_iter().filter(|c| c.count >= 2).collect(); +println!("{} robust regions", robust.len()); +# Ok::<(), Box>(()) +``` + +`ConsensusRegion` has `chr`, `start`, `end`, and `count` — the number of input sets overlapping that union region. + +## TSS / nearest-feature distances + +`TssIndex` builds a per-chromosome sorted vector of TSS midpoints from a `RegionSet`, enabling O(R · log M) distance queries instead of the naive O(R · M): + +```rust +use gtars_core::models::{RegionSet, CoordinateMode}; +use gtars_genomicdist::models::TssIndex; + +let tss = TssIndex::try_from("gencode_tss.bed")?; + +let peaks = RegionSet::try_from("peaks.bed")?; + +// Unsigned nearest-feature distance per query region +let dists: Vec = tss.calc_tss_distances(&peaks, CoordinateMode::Bed)?; + +// Signed distances (positive = feature downstream, negative = upstream) +let signed: Vec = tss.calc_feature_distances(&peaks, CoordinateMode::Bed)?; +# Ok::<(), Box>(()) +``` + +Regions on chromosomes missing from the TSS index are padded with `u32::MAX` / `i64::MAX` sentinels (one per query region). Use `median_abs_distance(&signed)` from the `utils` module to summarize while filtering those sentinels. + +## GC content and dinucleotide frequencies + +Both functions take anything implementing `SequenceAccess` — currently `GenomeAssembly` (in-memory HashMap, built from a FASTA) or `BinaryGenomeAssembly` (mmap-backed `.fab` binary). + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{calc_gc_content, calc_dinucl_freq, DINUCL_ORDER}; +use gtars_genomicdist::models::{GenomeAssembly, BinaryGenomeAssembly}; + +// In-memory loader — ~2s to build for hg38 but then gives zero-copy slices +let genome = GenomeAssembly::try_from("hg38.fa")?; + +// Or the mmap .fab loader — instant construction, zero-copy per region +let genome = BinaryGenomeAssembly::from_file("hg38.fab".as_ref())?; + +let peaks = RegionSet::try_from("peaks.bed")?; + +// GC content per region, 0.0–1.0 +let gc: Vec = calc_gc_content(&peaks, &genome, /* ignore_unk_chroms = */ true)?; + +// Dinucleotide frequencies: 16 columns per region in DINUCL_ORDER +// raw_counts=false → percentages (matches R default) +let (labels, matrix) = calc_dinucl_freq(&peaks, &genome, false, true)?; +# Ok::<(), Box>(()) +``` + +Create a `.fab` file from a regular FASTA once up-front: + +```rust +use gtars_genomicdist::models::BinaryGenomeAssembly; + +BinaryGenomeAssembly::write_from_fasta( + "hg38.fa".as_ref(), + "hg38.fab".as_ref(), +)?; +# Ok::<(), Box>(()) +``` + +Or via the CLI: `gtars prep --fasta hg38.fa` (see the [gtars-cli](cli.md) page). + +## BED classification + +Under the `bedclassifier` feature, `classify_bed` inspects the column layout of a BED file and assigns one of several UCSC / ENCODE format classifications (BED3, BED6+, narrowPeak, broadPeak, gappedPeak, RNA elements, etc.). + +```rust +#[cfg(feature = "bedclassifier")] +{ + use gtars_core::models::RegionSet; + use gtars_genomicdist::bed_classifier::classify_bed; + + let rs = RegionSet::try_from("unknown_format.bed").unwrap(); + let c = classify_bed(&rs).unwrap(); + println!( + "format={} bed_compliance={} compliant_cols={}", + c.data_format, c.bed_compliance, c.compliant_columns + ); +} +``` + +## GDA binary format + +`GenomicDistAnnotation` is a serializable wrapper around `GeneModel` in the GDA (Genomic Dist Annotation) binary format. Chrom sizes are **not** stored in the GDA file — they come from a separate source (refgenie, `.chrom.sizes` file, or an API). + +```rust +use gtars_genomicdist::GenomicDistAnnotation; + +// Build from GTF once, save as GDA +let gda = GenomicDistAnnotation::from_gtf("gencode.v47.gtf.gz")?; +gda.save_bin("gencode.v47.gda")?; + +// Load from disk +let gda = GenomicDistAnnotation::load_bin("gencode.v47.gda")?; + +// Or from memory (wasm / API) +// let gda = GenomicDistAnnotation::load_bin_from_bytes(&bytes)?; + +let partitions = gtars_genomicdist::genome_partition_list(&gda.gene_model, 100, 2000, None); +# Ok::<(), Box>(()) +``` + +## Utilities + +From `gtars_genomicdist::utils`: + +- **`partition_genome_into_bins(chrom_sizes, n_bins)`** — tile all chromosomes with fixed-width bins (bin width = longest chrom / n_bins, floored, min 1 bp). Returns a `RegionSet`. +- **`median_abs_distance(&[i64])`** — median absolute value, filtering `i64::MAX` sentinels produced by `TssIndex::calc_feature_distances`. +- **`chrom_karyotype_key(chr)`** — sort key producing the standard karyotype order: `chr1, chr2, …, chr22, chrX, chrY, chrM, chrUn_*`. Works with or without the `chr` prefix. + +## Strand-aware wrappers + +Two wrappers extend `RegionSet` with stronger invariants: + +- **`SortedRegionSet`** — a newtype guaranteeing `(chr, start)` sort order. Constructed via `SortedRegionSet::new(rs)`, which sorts in place (move, no clone). Downstream code that requires sorted input can accept `&SortedRegionSet` to avoid re-sorting on every call. +- **`StrandedRegionSet`** — pairs a `RegionSet` with a parallel `Vec`. Strand-aware `promoters_stranded`, `reduce`, `setdiff`, and `trim` are methods on this type and are used internally by `genome_partition_list` to produce correct partitions for minus-strand genes. + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{SortedRegionSet, StrandedRegionSet, Strand}; + +let rs = RegionSet::try_from("peaks.bed")?; +let sorted = SortedRegionSet::new(rs); // in-place sort + +let stranded = StrandedRegionSet::new( + sorted.0.clone(), + vec![Strand::Plus; sorted.0.regions.len()], +); +# Ok::<(), Box>(()) +``` + +`Strand::from_char('+' | '-' | _)` converts a single character to `Plus` / `Minus` / `Unstranded`. + +## Errors + +All operations that can fail return `Result`. Variants: + +- `CustomError(String)` — general I/O and parsing wrapper. +- `GCContentError(chr, start, end, msg)` — sequence lookup failed during GC calculation. +- `TSSContentError(msg)` — no TSS features found for the region set. +- `SignalMatrixError(msg)` — signal-matrix parse failure. +- `Io(std::io::Error)` — transparent I/O error. + +The `bedclassifier` feature has its own `BedClassifierError` enum for format-classification failures. + +## Where to go next + +- **[gtars-core](core.md)** — `RegionSet`, `RegionSetList`, and `CoordinateMode`, which this crate consumes. +- **[gtars-lola](lola.md)** — LOLA enrichment is built on `IntervalRanges` (for universe construction) and the IGD index. +- **[gtars-overlaprs](overlaprs.md)** — the overlap-detection engine used internally by `calc_partitions` and `consensus`. +- **[gtars CLI](cli.md)** — `gtars genomicdist` subcommands for running these analyses from the command line. diff --git a/docs/gtars/lola.md b/docs/gtars/lola.md new file mode 100644 index 00000000..88e7809c --- /dev/null +++ b/docs/gtars/lola.md @@ -0,0 +1,358 @@ +# gtars-lola + +Rust port of the R [LOLA](http://lola.databio.org/) (Locus Overlap Analysis) package: region-set enrichment testing against a database of reference region sets, using Fisher's exact test on 2×2 contingency tables. + +Given a **user set** of regions (your peaks), a **universe** (the background you consider sampleable — typically a union of publicly available peak sets), and a **region database** (LOLA core or custom), `gtars-lola` computes for every database set the probability that your user set's overlap with that database set is larger (or smaller) than expected under the null. + +The crate is built on top of: + +- **[gtars-igd](igd.md)** — the interval overlap index. A LOLA database is exposed as a single `Igd` index over all its files, so one call can score a user set against thousands of reference sets simultaneously. +- **[gtars-genomicdist](genomicdist.md)** — the `IntervalRanges` trait (for `concat`, `disjoin`, `union`, etc.) is used to build restricted universes and redefine user sets in terms of universe regions. +- **[gtars-core](core.md)** — all region types come from here; `RegionSetList` is used as the FFI-friendly return type for extracting region sets from a database. + +## Installation + +```toml +[dependencies] +gtars-lola = "0.2" +``` + +No feature flags — the crate has a single default configuration. + +## Module layout + +`gtars-lola` does not re-export symbols at the crate root; import from the submodules directly: + +- **`gtars_lola::database`** — `RegionDB`, `CollectionAnno`, `RegionSetAnno` +- **`gtars_lola::enrichment`** — `run_lola`, `ContingencyTable` impls +- **`gtars_lola::universe`** — `check_universe_appropriateness`, `redefine_user_sets`, `build_restricted_universe`, `UniverseReport`, `UserSetReport` +- **`gtars_lola::output`** — `annotate_results`, `apply_fdr_correction`, `results_to_columns`, `write_results_tsv`, `LolaColumnar` +- **`gtars_lola::models`** — `Direction`, `LolaConfig`, `ContingencyTable`, `LolaResult` +- **`gtars_lola::errors`** — `LolaError` + +## End-to-end workflow + +```rust +use std::fs::File; +use std::io::BufWriter; +use std::path::Path; + +use gtars_core::models::RegionSet; +use gtars_lola::database::RegionDB; +use gtars_lola::enrichment::run_lola; +use gtars_lola::models::LolaConfig; +use gtars_lola::output::{annotate_results, apply_fdr_correction, write_results_tsv}; + +// 1. Load the region database from a LOLA-format folder +let db = RegionDB::from_lola_folder( + Path::new("LOLACore/hg38"), + None, // collections filter (None = all) + None, // per-collection file limit (None = all) +)?; + +// 2. Load the user set(s) and universe +let user_sets = vec![ + RegionSet::try_from("peaks_condition_a.bed")?, + RegionSet::try_from("peaks_condition_b.bed")?, +]; +let universe = RegionSet::try_from("universe.bed")?; + +// 3. Run Fisher's exact test for every (user_set, db_set) pair +let mut results = run_lola( + &db.igd, + &user_sets, + &universe, + &LolaConfig::default(), +)?; + +// 4. Attach database metadata (collection, cell type, tissue, …) +annotate_results(&mut results, &db); + +// 5. Benjamini-Hochberg FDR correction, per user set +apply_fdr_correction(&mut results); + +// 6. Write results to TSV (format matches R LOLA's writeCombinedEnrichment) +let mut out = BufWriter::new(File::create("lola_results.tsv")?); +write_results_tsv(&mut out, &results)?; +# Ok::<(), Box>(()) +``` + +The steps correspond 1:1 to the R LOLA workflow: `loadRegionDB` → `runLOLA` → `getLolaResults` → `writeCombinedEnrichment`. + +## Loading a region database + +`RegionDB::from_lola_folder` expects a directory laid out in the standard R LOLA format: + +```text +db_path/ +├── collection1/ +│ ├── collection.txt # collector, date, source, description (TSV) +│ ├── index.txt # per-file annotations (TSV or CSV) +│ └── regions/ +│ ├── file1.bed +│ └── file2.bed +├── collection2/ +│ ├── collection.txt +│ ├── index.txt +│ └── regions/ +│ └── … +``` + +`collection.txt` and `index.txt` are both auto-detected TSV or CSV (R's `fread` convention). Recognized columns in `index.txt`: `filename`, `cellType`, `description`, `tissue`, `dataSource`, `antibody`, `treatment`. Unknown columns are ignored; missing optional columns are preserved as `None`. + +### Filters and limits + +```rust +// Load only specific collections +let db = RegionDB::from_lola_folder( + path, + Some(&["encode_segmentation", "roadmap_epigenomics"]), + None, +)?; + +// Limit files per collection (useful for smoke tests) +let db = RegionDB::from_lola_folder(path, None, Some(10))?; +# Ok::<(), Box>(()) +``` + +### In-memory construction (for API integrations) + +If you're serving region sets from an API rather than a folder, build the `RegionDB` directly from a pre-constructed `Igd`: + +```rust +use gtars_lola::database::{RegionDB, RegionSetAnno}; + +let db = RegionDB::from_igd_with_regions(igd, region_sets, region_annos); +``` + +This is the path BEDbase uses to expose LOLA over the web — the IGD is built server-side from the bedset collection and shipped to the client. + +### Convenience accessors + +- **`db.num_region_sets()`** — total region sets in the database. +- **`db.list_region_sets(collections)`** → `Vec` — filenames, optionally filtered by collection. +- **`db.get_region_set(filenames, collections)`** → `Vec<&RegionSet>` — look up by filename. +- **`db.get_region_set_list(&[usize])`** → `RegionSetList` — extract indexed region sets as a named `RegionSetList` (for FFI, plotting, or downstream analysis). Out-of-bounds indices are silently skipped. +- **`RegionDB::merge(a, b)`** — combine two databases into one; rebuilds the IGD. + +## Checking and shaping the universe + +LOLA results are only meaningful if your universe **contains** your user set: every user region should overlap at least one universe region, with no many-to-many mappings. The `universe` module provides three diagnostics/fixers: + +### `check_universe_appropriateness` + +```rust +use gtars_lola::universe::check_universe_appropriateness; +use gtars_igd::igd::Igd; + +// Pre-build the universe IGD once — you can reuse it across calls +let universe_igd = Igd::from_single_region_set(&universe); + +let report = check_universe_appropriateness(&user_sets, &universe_igd); + +for r in &report.user_set_reports { + println!( + "user set {}: {:.1}% coverage ({} of {} regions), {} many-to-many", + r.user_set_index, + r.coverage * 100.0, + r.regions_in_universe, + r.total_regions, + r.many_to_many_count, + ); + for w in &r.warnings { + eprintln!(" ⚠ {w}"); + } +} +``` + +Warnings fire when coverage is under 50% (severe) or under 90% (moderate), or when any user region overlaps more than one universe region (many-to-many). + +### `redefine_user_sets` + +Rewrite each user set in terms of universe regions: for every user region, find the universe regions it overlaps and emit those as the new user set. This eliminates many-to-many mapping artifacts and is the Rust equivalent of R LOLA's `redefineUserSets()`. + +```rust +use gtars_lola::universe::redefine_user_sets; + +let universe_igd = Igd::from_single_region_set(&universe); +let redefined: Vec = + redefine_user_sets(&user_sets, &universe, &universe_igd); + +// Now run enrichment on `redefined` instead of the raw user_sets +let results = run_lola(&db.igd, &redefined, &universe, &LolaConfig::default())?; +# Ok::<(), Box>(()) +``` + +### `build_restricted_universe` + +For differential enrichment analysis — build a universe that is exactly the union of all user sets, disjoined into non-overlapping pieces. This is R LOLA's `disjoin(unlist(userSets))`. + +```rust +use gtars_lola::universe::build_restricted_universe; + +let restricted = build_restricted_universe(&user_sets); +// Use `restricted` as the universe in a differential run +``` + +## Running enrichment + +`run_lola(igd, user_sets, universe, config)` is the core engine. For every `(user_set, db_set)` pair it builds a 2×2 contingency table and runs a one-sided Fisher exact test: + +```text + In DB set Not in DB set +In user set a c +Not in user set b d +``` + +- **a** = overlap count between user set and DB set (the *support* — shows up as `support` in results). +- **b** = universe-DB overlap − user-DB overlap. +- **c** = user set size − user-DB overlap. +- **d** = universe size − a − b − c. + +### `LolaConfig` + +```rust +pub struct LolaConfig { + pub min_overlap: i32, // default 1 + pub direction: Direction, // default Enrichment +} + +pub enum Direction { + Enrichment, // P(X ≥ a), alternative = "greater" + Depletion, // P(X ≤ a), alternative = "less" +} +``` + +```rust +use gtars_lola::models::{LolaConfig, Direction}; + +let config = LolaConfig { + min_overlap: 1, + direction: Direction::Enrichment, +}; +``` + +### p-values and odds ratios + +The p-value is computed from the hypergeometric distribution using the survival function (`sf`) rather than `1 - cdf` to avoid catastrophic cancellation when the tail probability is very small. It is reported as `-log10(p)` (capped at ~322 via a `1e-322` floor, matching R LOLA). + +The odds ratio is the **conditional maximum likelihood estimate** — the noncentrality parameter of Fisher's noncentral hypergeometric distribution — solved by Brent's method. This matches `fisher.test()$estimate` in R exactly, not the simple `(a·d)/(b·c)` point estimate. + +### Negative contingency values + +If your user set contains regions *outside* the universe, the contingency table can produce negative `b`, `c`, or `d`. `run_lola` matches R LOLA's behavior: it prints a warning to stderr, stores the signed values on the `LolaResult` row, and returns `p_value_log = 0.0`, `odds_ratio = NaN` for that row. To avoid this entirely, pre-process your inputs with `redefine_user_sets`. + +## Results + +Each `(user_set, db_set)` pair produces a `LolaResult`: + +```rust +pub struct LolaResult { + pub user_set: usize, // 0-based user set index + pub db_set: usize, // 0-based db set index + pub p_value_log: f64, // -log10(p), ≥ 0 + pub odds_ratio: f64, // CMLE, matches R fisher.test()$estimate + pub support: u64, // contingency table cell 'a' + pub rnk_pv: usize, // rank by p-value (1-based, ascending) + pub rnk_or: usize, // rank by odds ratio (1-based, descending) + pub rnk_sup: usize, // rank by support (1-based, descending) + pub max_rnk: usize, // max(rnk_pv, rnk_or, rnk_sup) + pub mean_rnk: f64, // mean of the three ranks + pub b: i64, pub c: i64, pub d: i64, // signed contingency values + pub q_value: Option, // BH-adjusted p (None until apply_fdr_correction) + pub filename: String, + pub collection: Option, + pub description: Option, + pub cell_type: Option, + pub tissue: Option, + pub antibody: Option, + pub treatment: Option, + pub data_source: Option, + pub db_set_size: u64, +} +``` + +`run_lola` returns results sorted by descending `p_value_log`, then ascending `mean_rnk` — identical to R LOLA's output order. Ranks are assigned per user set so the rank columns within a user set are independent. + +## Annotation and FDR + +Two post-processing steps, both in `gtars_lola::output`: + +- **`annotate_results(&mut results, &db)`** — fills in `collection`, `description`, `cell_type`, `tissue`, `antibody`, `treatment`, `data_source`, and `db_set_size` from the database. `description` is truncated to 80 chars to match R LOLA. +- **`apply_fdr_correction(&mut results)`** — Benjamini-Hochberg q-value computation, applied **independently per user set**. Writes to the `q_value: Option` field. + +## Output + +### TSV (R LOLA–compatible) + +```rust +use gtars_lola::output::write_results_tsv; +use std::fs::File; +use std::io::BufWriter; + +let mut out = BufWriter::new(File::create("lola_results.tsv")?); +write_results_tsv(&mut out, &results)?; +# Ok::<(), Box>(()) +``` + +The header exactly matches R LOLA's `writeCombinedEnrichment`: + +```text +userSet dbSet collection pValueLog oddsRatio support +rnkPV rnkOR rnkSup maxRnk meanRnk b c d +description cellType tissue antibody treatment dataSource +filename qValue size +``` + +`userSet` and `dbSet` are 1-based in the TSV (R convention), even though they are 0-based in memory. + +### Columnar (for FFI bindings) + +`results_to_columns(&results) -> LolaColumnar` pivots the row-oriented `Vec` into parallel column vectors. This is what the Python, WASM, and R bindings consume to build native data frames without reimplementing the row→column transpose: + +```rust +use gtars_lola::output::{results_to_columns, LolaColumnar}; + +let cols: LolaColumnar = results_to_columns(&results); +println!("{} rows", cols.p_value_log.len()); +// cols.p_value_log[i], cols.odds_ratio[i], cols.cell_type[i], … all align on row i +``` + +Every field of `LolaResult` has a corresponding `Vec` on `LolaColumnar`; the ordering is preserved from the input slice. + +## Direct contingency-table use + +If you want to compute a single Fisher test outside of the full `run_lola` pipeline: + +```rust +use gtars_lola::models::{ContingencyTable, Direction}; + +let ct = ContingencyTable { a: 45, b: 155, c: 155, d: 9645 }; + +let p = ct.fisher_pvalue(Direction::Enrichment); +let log_p = ct.p_value_log(Direction::Enrichment); +let or = ct.odds_ratio(); + +println!("p = {p:.2e} (-log10 = {log_p:.2}) odds = {or:.2}"); +``` + +`fisher_pvalue`, `p_value_log`, and `odds_ratio` are `impl ContingencyTable` methods — the same functions `run_lola` calls internally. + +## Errors + +```rust +pub enum LolaError { + EmptyUniverse, // universe has 0 regions + EmptyDatabase, // IGD has 0 files + Other(anyhow::Error), // wrapped I/O etc. +} +``` + +`run_lola` returns early with `EmptyUniverse` / `EmptyDatabase` so you can distinguish configuration problems from per-row negative-contingency warnings (which print to stderr and continue). + +## Where to go next + +- **[gtars-igd](igd.md)** — the overlap index that backs `RegionDB.igd`. Worth reading if you're tuning universe construction or wrapping your own database. +- **[gtars-genomicdist](genomicdist.md)** — `IntervalRanges` operations used during `build_restricted_universe` and `redefine_user_sets`. +- **[gtars-core](core.md)** — `RegionSet` and `RegionSetList`, the input/output types throughout this page. +- **R LOLA** ([docs](http://lola.databio.org/)) — the reference implementation this port targets. diff --git a/docs/gtars/modules.md b/docs/gtars/modules.md index 47ce2f9a..c30ea86c 100644 --- a/docs/gtars/modules.md +++ b/docs/gtars/modules.md @@ -6,7 +6,7 @@ gtars is organized as a workspace of independent Rust crates, each providing spe ## Core Infrastructure -- **[gtars-core](core.md)** - Fundamental data structures and utilities +- **[gtars-core](core.md)** - Fundamental data structures (`Region`, `RegionSet`, `RegionSetList`, `Interval`, `Fragment`, `CoordinateMode`) and utilities - **[gtars-io](io.md)** - I/O operations and file format parsers ## Genomic Analysis @@ -17,6 +17,11 @@ gtars is organized as a workspace of independent Rust crates, each providing spe - **[gtars-scoring](scoring.md)** - Fragment overlap scoring - **[gtars-fragsplit](fragsplit.md)** - Fragment file splitting for pseudobulk +## Distribution Analysis & Enrichment + +- **[gtars-genomicdist](genomicdist.md)** - Rust port of R GenomicDistributions: summary stats, interval set algebra, partitions, signal matrices, consensus +- **[gtars-lola](lola.md)** - LOLA (Locus Overlap Analysis) enrichment testing built on IGD + genomicdist + ## Machine Learning - **[gtars-tokenizers](tokenizers.md)** - Genomic region tokenizers for ML diff --git a/docs/gtars/python-overview.md b/docs/gtars/python-overview.md index cad45506..2ebb2e31 100644 --- a/docs/gtars/python-overview.md +++ b/docs/gtars/python-overview.md @@ -1,6 +1,6 @@ # gtars-python -Python bindings for gtars functionality, providing native Python API for genomic interval analysis. +Python bindings for gtars — native Python API for genomic interval analysis, built on the Rust core. ## Installation @@ -8,52 +8,93 @@ Python bindings for gtars functionality, providing native Python API for genomic pip install gtars ``` -## Available Modules +For development installs from source, `uv pip install -e ./gtars-python` from the repo root. -### tokenizers -```python -from gtars.tokenizers import Tokenizer +## Submodules -# Create tokenizer from BED file -tokenizer = Tokenizer.from_bed("regions.bed") +The `gtars` Python package exposes several submodules, each wrapping a gtars Rust crate: -# Or from configuration -tokenizer = Tokenizer.from_config("config.toml") +| submodule | Rust crate | purpose | dedicated doc | +|---|---|---|---| +| `gtars.models` | `gtars-core` + `gtars-genomicdist::models` | `Region`, `RegionSet`, `RegionSetList`, reference genomes, gene models, partition lists, signal matrices — plus all the per-`RegionSet` statistics and interval algebra methods | [models](python/models.md) | +| `gtars.genomic_distributions` | `gtars-genomicdist` | GC content, dinucleotide freq, partition classification, signal summaries, consensus regions (free functions that need a reference genome or a partition list) | [genomic_distributions](python/genomic_distributions.md) | +| `gtars.lola` | `gtars-lola` | LOLA enrichment testing: `RegionDB`, `run_lola`, universe helpers | [lola](python/lola.md) | +| `gtars.tokenizers` | `gtars-tokenizers` | Genomic region tokenizers for ML | *(see in-package docs)* | +| `gtars.refget` | `gtars-refget` | GA4GH refget protocol client + local store | [digests](python/digests.md), [refget API](python/refget-api.md), [RefgetStore](python/refget-store.md) | +| `gtars.utils` | `gtars-core::utils` | File-reading and parsing helpers | *(see in-package docs)* | -# Tokenize regions -tokens = tokenizer.tokenize(["chr1:1000-2000", "chr2:3000-4000"]) -``` +## Quick start -### models (RegionSet) ```python from gtars.models import RegionSet -# Create from BED file -rs = RegionSet.from_bed("peaks.bed") +# Load a BED file (local path, or URL if the http feature is enabled) +rs = RegionSet("peaks.bed") -# Access properties -print(f"Number of regions: {len(rs)}") -print(f"Total coverage: {rs.coverage()}") +print(len(rs)) # region count +print(rs.identifier) # canonical bedbase MD5 id +print(rs.mean_region_width()) # average width -# Operations -rs.sort() -merged = rs.merge() +# Iterate +for region in rs: + print(region.chr, region.start, region.end) + +# Set algebra +merged = rs.reduce() # merge overlapping +promoters = rs.promoters(2000, 0) # 2kb upstream ``` -### refget +See [`gtars.models`](python/models.md) for the full `RegionSet` surface — including `reduce`, `setdiff`, `union`, `intersect_all`, `jaccard`, `coverage`, `neighbor_distances`, `nearest_neighbors`, and more. + +## Extended example — genomic distributions + ```python -from gtars.refget import RefgetStore, compute_digest +from gtars.models import RegionSet, BinaryGenomeAssembly, PartitionList +from gtars.genomic_distributions import calc_gc_content, calc_expected_partitions + +peaks = RegionSet("peaks.bed") +genome = BinaryGenomeAssembly("hg38.fab") +pl = PartitionList.from_gtf("gencode.v47.gtf.gz", core_prom=100, prox_prom=2000) +chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529} + +# GC content per region +gc = calc_gc_content(peaks, genome, ignore_unk_chroms=True) +print(f"Mean GC: {sum(gc) / len(gc):.3f}") + +# Observed vs expected partition enrichment +result = calc_expected_partitions(peaks, pl, chrom_sizes) +for p, o, e, lo, pv in zip( + result["partition"], result["observed"], + result["expected"], result["log10OE"], result["pvalue"], +): + print(f" {p:<15} obs={o:>6.0f} exp={e:>10.1f} log10(O/E)={lo:+.2f} p={pv:.2e}") +``` -# Compute sequence digest -digest = compute_digest(b"ACGTACGT") +## Extended example — LOLA enrichment -# Use RefgetStore for sequence management -store = RefgetStore() +```python +import pandas as pd +from gtars.models import RegionSet +from gtars.lola import RegionDB, run_lola + +def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + +db = RegionDB.from_folder("LOLACore/hg38") + +user_sets = [to_tuples(RegionSet("peaks.bed"))] +universe = to_tuples(RegionSet("universe.bed")) + +results = run_lola(user_sets, universe, db) +df = pd.DataFrame(results).sort_values("pValueLog", ascending=False) +print(df.head(20)) ``` -## Performance +See [`gtars.lola`](python/lola.md) for universe preparation helpers, `RegionDB` accessors, and the full result schema. + +## Performance notes -- Native Rust performance -- Zero-copy data transfer where possible -- NumPy array integration -- Parallel processing support +- `RegionSet` construction sorts on load; repeated operations don't re-sort. +- Overlap queries (`count_overlaps`, `find_overlaps`, `subset_by_overlaps`) go through an AIList index for O(log n) per-query lookups. +- `BinaryGenomeAssembly` uses `mmap` for zero-copy sequence access — use it over `GenomeAssembly` for large genomes unless you specifically need the in-memory version. +- Most numeric-result methods return Python lists; for very large outputs, consider iterating or using numpy wrappers at the call site. diff --git a/docs/gtars/python/genomic_distributions.md b/docs/gtars/python/genomic_distributions.md new file mode 100644 index 00000000..ab892057 --- /dev/null +++ b/docs/gtars/python/genomic_distributions.md @@ -0,0 +1,225 @@ +# `gtars.genomic_distributions` + +Python bindings for [`gtars-genomicdist`](../genomicdist.md) — the Rust port of R GenomicDistributions plus extras. This submodule contains the **free functions** that operate on regions-plus-something-else (a reference genome, a partition list, a signal matrix). Methods that operate only on a `RegionSet` (summary stats, interval algebra, peak clustering) live directly on `RegionSet` in [`gtars.models`](models.md). + +!!! info "Where to look for what" + - Per-region summaries, nearest-neighbor distances, density vectors, set algebra → `RegionSet` methods in [`gtars.models`](models.md). + - GC content / dinucleotide frequencies → **this module** (need a reference genome). + - Partition classification → **this module** (need a `PartitionList`). + - Signal matrix overlap → **this module** (need a `SignalMatrix`). + - Consensus regions across replicates → **this module**. + - LOLA enrichment → [`gtars.lola`](lola.md). + +The Rust reference page, [gtars-genomicdist](../genomicdist.md), has the algorithmic detail, caveats about bin-width semantics, divergences from R GenomicDistributions, and so on. This page is a Python-focused reference. + +## Sequence statistics + +These both take a `RegionSet` and a reference genome (`GenomeAssembly` for in-memory or `BinaryGenomeAssembly` for mmap-backed `.fab` files — see [`gtars.models`](models.md#genomeassembly)). + +### `calc_gc_content` + +```python +from gtars.models import RegionSet, BinaryGenomeAssembly +from gtars.genomic_distributions import calc_gc_content + +peaks = RegionSet("peaks.bed") +genome = BinaryGenomeAssembly("hg38.fab") + +gc = calc_gc_content(peaks, genome, ignore_unk_chroms=True) +# gc: list[float] — one value per region, 0.0–1.0 +``` + +Setting `ignore_unk_chroms=False` raises if any region lives on a chromosome missing from the assembly. Set to `True` (default) to skip them silently. + +### `calc_dinucl_freq` + +```python +from gtars.genomic_distributions import calc_dinucl_freq + +result = calc_dinucl_freq( + peaks, genome, + raw_counts=False, # False (default) → percentages 0–100; True → integer counts + ignore_unk_chroms=True, +) + +# result is a dict: +result["region_labels"] # list[str] — "chr_start_end" per region +result["dinucleotides"] # list[str] — 16 dinucleotide names in canonical order +result["frequencies"] # list[list[float]] — one row per region, 16 values per row +``` + +The 16 columns in `frequencies` correspond one-for-one with the 16 names in `dinucleotides`. For pooled global counts across all regions, run with `raw_counts=True` and sum each column of the `frequencies` matrix. + +This matches R GenomicDistributions `calcDinuclFreq` column-for-column, including the canonical order. + +## Partition classification + +Builds on the `PartitionList` type from [`gtars.models`](models.md#partitionlist). + +### `calc_partitions` + +```python +from gtars.models import RegionSet, PartitionList +from gtars.genomic_distributions import calc_partitions + +peaks = RegionSet("peaks.bed") +pl = PartitionList.from_gtf("gencode.v47.gtf.gz", core_prom=100, prox_prom=2000) + +result = calc_partitions(peaks, pl, bp_proportion=False) + +# result is a dict with: +result["partition"] # list[str] — partition names + "intergenic" +result["count"] # list[int] — region (or bp) count per partition +result["total"] # int — sum of all counts +``` + +- `bp_proportion=False` (priority mode): each query region is assigned to the first partition it overlaps. Mutually exclusive. The `intergenic` bucket collects everything not matched. +- `bp_proportion=True`: for each partition, sum the overlapping base pairs from the query set. Not mutually exclusive — a region that straddles two partitions contributes bp to each (matching R behavior). + +### `calc_expected_partitions` + +Observed vs expected enrichment with chi-square test, given chromosome sizes: + +```python +from gtars.genomic_distributions import calc_expected_partitions + +chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529, ...} + +result = calc_expected_partitions( + peaks, pl, chrom_sizes, bp_proportion=False, +) + +# result is a dict with parallel lists: +result["partition"] # list[str] +result["observed"] # list[float] +result["expected"] # list[float] +result["log10OE"] # list[float] — log10(observed / expected) +result["pvalue"] # list[float] — chi-square goodness-of-fit p-value +``` + +!!! warning "p-values will differ slightly from R" + `calc_expected_partitions` uses a goodness-of-fit `(O-E)²/E` formula. R's `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R GenomicDistributions output byte-for-byte. The `log10OE` column is directly comparable. + +## Signal matrix overlap + +### `calc_summary_signal` + +Overlap a query region set against a pre-loaded `SignalMatrix` and compute Tukey boxplot statistics per condition: + +```python +from gtars.models import RegionSet, SignalMatrix +from gtars.genomic_distributions import calc_summary_signal + +peaks = RegionSet("peaks.bed") +sm = SignalMatrix.from_tsv("signal_matrix.tsv") +# or: sm = SignalMatrix.load_bin("signal_matrix.sigm") + +result = calc_summary_signal(peaks, sm) + +result["condition_names"] # list[str] — columns of the signal matrix +result["region_labels"] # list[str] — "chr_start_end" per matched query +result["signal_matrix"] # list[list[float]] — row per matched query, col per condition (MAX) +result["matrix_stats"] # list of dicts — Tukey boxplot stats per condition +``` + +Each entry in `matrix_stats` has: `condition`, `lower_whisker`, `lower_hinge`, `median`, `upper_hinge`, `upper_whisker`. + +For each query region, the function finds all overlapping rows in the signal matrix and takes the **max** per condition — answering "what's the peak signal at this region in each cell type?". + +## Consensus regions + +### `consensus` + +Given N region sets (e.g. replicates), produce the reduced union with per-region replicate support count: + +```python +from gtars.models import RegionSet +from gtars.genomic_distributions import consensus + +reps = [ + RegionSet("rep1.bed"), + RegionSet("rep2.bed"), + RegionSet("rep3.bed"), +] + +cons = consensus(reps) +# cons: list of dicts with keys: chr, start, end, count + +# Keep regions present in ≥ 2/3 replicates +robust = [c for c in cons if c["count"] >= 2] +print(f"{len(robust)} robust regions") +``` + +The `count` field is how many input sets overlap each union region — not the number of replicate *regions*, but the number of input *sets*. + +## Utilities + +### `median_abs_distance` + +Median of absolute values, filtering sentinels. Pairs well with `TssIndex.feature_distances`: + +```python +from gtars.models import TssIndex, RegionSet +from gtars.genomic_distributions import median_abs_distance + +tss = TssIndex("gencode.v47.tss.bed") +peaks = RegionSet("peaks.bed") + +signed = tss.feature_distances(peaks) # list[float | None] +# Drop None values (regions on chromosomes not in the TSS index) +clean = [d for d in signed if d is not None] + +median = median_abs_distance(clean) # Optional[float] +``` + +Returns `None` if the input is empty or contains only missing/sentinel values. + +## End-to-end example + +```python +from gtars.models import ( + RegionSet, BinaryGenomeAssembly, PartitionList, SignalMatrix, TssIndex, +) +from gtars.genomic_distributions import ( + calc_gc_content, calc_partitions, calc_expected_partitions, + calc_summary_signal, consensus, +) + +# 1. Load inputs once +peaks = RegionSet("peaks.bed") +genome = BinaryGenomeAssembly("hg38.fab") +pl = PartitionList.from_gtf("gencode.v47.gtf.gz", core_prom=100, prox_prom=2000) +chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529} + +# 2. Peak-level statistics — live on RegionSet itself +print(f"{len(peaks)} regions, mean width {peaks.mean_region_width():.1f}") +print(f"Per-chromosome: {peaks.chromosome_statistics()}") + +# 3. GC content per region +gc = calc_gc_content(peaks, genome, ignore_unk_chroms=True) +print(f"Mean GC: {sum(gc) / len(gc):.3f}") + +# 4. Partition enrichment (observed vs expected) +exp = calc_expected_partitions(peaks, pl, chrom_sizes) +for p, o, e, lo, pv in zip( + exp["partition"], exp["observed"], exp["expected"], exp["log10OE"], exp["pvalue"] +): + print(f" {p:<15} obs={o:>6.0f} exp={e:>10.1f} log10(O/E)={lo:+.2f} p={pv:.2e}") + +# 5. Nearest-TSS distances +tss = TssIndex("gencode.v47.tss.bed") +tss_dists = tss.calc_tss_distances(peaks) +print(f"Median distance to nearest TSS: {sorted(tss_dists)[len(tss_dists)//2]:,}") + +# 6. Consensus across replicates +reps = [RegionSet(f"rep{i}.bed") for i in (1, 2, 3)] +cons = consensus(reps) +robust = [c for c in cons if c["count"] >= 2] +print(f"{len(robust)} of {len(cons)} union regions present in ≥2 replicates") +``` + +## See also + +- **[`gtars.models`](models.md)** — `RegionSet`, `GenomeAssembly`, `PartitionList`, `SignalMatrix`, and the peak-stats methods attached to `RegionSet`. +- **[`gtars.lola`](lola.md)** — LOLA enrichment testing. +- **[gtars-genomicdist](../genomicdist.md)** — full Rust API reference, algorithmic notes, and caveats that apply identically in Python. diff --git a/docs/gtars/python/lola.md b/docs/gtars/python/lola.md new file mode 100644 index 00000000..4f269732 --- /dev/null +++ b/docs/gtars/python/lola.md @@ -0,0 +1,269 @@ +# `gtars.lola` + +Python bindings for [`gtars-lola`](../lola.md) — LOLA (Locus Overlap Analysis) enrichment testing. Given a user set of regions, a universe (background), and a region database, computes Fisher's exact test enrichment of the user set against every database region set. + +See the [Rust gtars-lola page](../lola.md) for the full statistical reference (Fisher's exact test, CMLE odds ratio, contingency table semantics, R LOLA compatibility notes). This page is a Python-focused reference. + +!!! warning "LOLA uses tuples, not `RegionSet`" + The Python `gtars.lola` API takes **`list[tuple[str, int, int]]`** — lists of `(chr, start, end)` tuples — rather than `RegionSet` objects. This is a short-term quirk of the binding. If you have a `RegionSet` in hand, convert with: + + ```python + def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + ``` + +## End-to-end workflow + +```python +from gtars.models import RegionSet +from gtars.lola import RegionDB, run_lola + +# 1. Load the region database from a LOLA-format folder +db = RegionDB.from_folder( + "LOLACore/hg38", + collections=None, # None = all; or pass a list of collection names + limit=None, # None = all files; or pass a per-collection cap +) + +# 2. Prepare user sets and universe as lists of (chr, start, end) tuples +peaks = RegionSet("peaks.bed") +universe = RegionSet("universe.bed") + +def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + +user_sets = [to_tuples(peaks)] # list of user sets; each is a list of tuples +universe_tuples = to_tuples(universe) + +# 3. Run the enrichment +results = run_lola( + user_sets, + universe_tuples, + db, + min_overlap=1, + direction="enrichment", # or "depletion" +) + +# results is a column-oriented dict — DataFrame-friendly +import pandas as pd +df = pd.DataFrame(results) +df = df.sort_values("pValueLog", ascending=False) +print(df.head(20)) +``` + +The 4-step workflow mirrors R LOLA: load database → convert inputs → run enrichment → pandas-ify. + +## `RegionDB` + +Wraps a `gtars-lola` `RegionDB` — an IGD overlap index plus original region sets plus per-file annotations, loaded from the standard LOLA directory layout: + +```text +db_path/ +├── collection1/ +│ ├── collection.txt # collector, date, source, description (TSV/CSV) +│ ├── index.txt # per-file annotations (TSV/CSV) +│ └── regions/ +│ ├── file1.bed +│ └── file2.bed +├── collection2/ +│ └── … +``` + +### `RegionDB.from_folder` + +```python +RegionDB.from_folder(db_path, collections=None, limit=None) -> RegionDB +``` + +- `db_path` — path to the LOLA database root. +- `collections` — optional list of collection names to include (others are skipped). +- `limit` — optional per-collection file cap (useful for smoke tests). + +`collection.txt` and `index.txt` are auto-detected TSV or CSV. Recognized `index.txt` columns: `filename`, `cellType`, `description`, `tissue`, `dataSource`, `antibody`, `treatment`. Unknown columns are ignored; missing optional columns become `None`. + +### `RegionDB.from_bed_files` + +Build a database from a list of BED file paths without the full LOLA layout: + +```python +db = RegionDB.from_bed_files( + ["file1.bed", "file2.bed", "file3.bed"], + filenames=None, # optional; defaults to basenames +) +``` + +Useful for ad-hoc analyses where you have a list of peak files but don't want to set up a collection directory. + +### Accessors + +```python +db.num_region_sets # int — total files in the database + +db.list_region_sets() # list[str] — all filenames +db.list_region_sets(collections=["enc3"]) # filtered + +# Extract region sets as a gtars.models.RegionSetList (names populated from filenames) +rsl = db.get_region_sets() # all sets +rsl = db.get_region_sets(indices=[0, 5, 12]) + +# Annotations as lists of dicts +db.region_anno # per-file dicts with filename/cellType/description/tissue/... +db.collection_anno # per-collection dicts with collector/date/source/description +``` + +`get_region_sets()` returns a [`gtars.models.RegionSetList`](models.md#regionsetlist), which has `names` populated from the database filenames. This is the one path in Python where `RegionSetList.names` is non-None — normal `RegionSetList(...)` construction can't set names. + +## Running enrichment + +### `run_lola` + +```python +run_lola( + user_sets: list[list[tuple[str, int, int]]], + universe: list[tuple[str, int, int]], + region_db: RegionDB, + min_overlap: int = 1, + direction: str = "enrichment", +) -> dict +``` + +**Parameters:** + +- `user_sets` — a **list of user sets**, where each user set is a list of `(chr, start, end)` tuples. Pass one-element list `[peaks_tuples]` for a single-user-set analysis. +- `universe` — a single list of `(chr, start, end)` tuples representing the background. +- `region_db` — a `RegionDB`. +- `min_overlap` — minimum bp overlap to count as overlapping (default 1). +- `direction` — `"enrichment"` (default, P(X ≥ a), alternative "greater") or `"depletion"` (P(X ≤ a), alternative "less"). The strings `"greater"` / `"less"` are accepted as aliases. + +**Returns** a column-oriented dict with these parallel lists (one entry per `(user_set, db_set)` pair): + +| key | type | meaning | +|---|---|---| +| `userSet` | `list[int]` | 0-based user-set index | +| `dbSet` | `list[int]` | 0-based db-set index | +| `collection` | `list[str | None]` | from `index.txt` | +| `pValueLog` | `list[float]` | `-log10(p)` from Fisher's exact test, capped at ~322 | +| `oddsRatio` | `list[float]` | CMLE odds ratio (matches R `fisher.test()$estimate`) | +| `support` | `list[int]` | overlap count between user set and db set — the contingency `a` | +| `rnkPV`, `rnkOR`, `rnkSup` | `list[int]` | 1-based per-metric ranks within the user set | +| `maxRnk` | `list[int]` | max of the three ranks | +| `meanRnk` | `list[float]` | mean of the three ranks | +| `b`, `c`, `d` | `list[int]` | signed contingency values (can be negative if user set extends outside universe) | +| `qValue` | `list[float | None]` | BH-adjusted p-value (applied automatically inside `run_lola`) | +| `description`, `cellType`, `tissue`, `antibody`, `treatment`, `dataSource` | `list[str | None]` | from `index.txt` | +| `filename` | `list[str]` | db set file name | +| `size` | `list[int]` | number of regions in the db set | + +The Python `run_lola` already calls `annotate_results` + `apply_fdr_correction` internally, so `qValue` and the annotation columns come back populated — you don't need to post-process. + +Rows are sorted by descending `pValueLog`, then ascending `meanRnk` (matching R LOLA output order). + +!!! note "Negative contingency values" + If your user set contains regions *outside* the universe, the contingency table produces negative `b`/`c`/`d`. The binding prints a warning to stderr, stores the signed values, and returns `pValueLog = 0.0`, `oddsRatio = NaN` for that row. To avoid this, preprocess with `redefine_user_sets` (below). + +## Universe preparation + +LOLA results are only meaningful when your universe contains your user sets. Three helpers for checking and shaping the universe: + +### `check_universe` + +```python +from gtars.lola import check_universe + +report = check_universe( + user_sets, # list[list[tuple[str, int, int]]] + universe_tuples, # list[tuple[str, int, int]] +) + +# report is a dict with parallel lists: +report["userSet"] # list[int] +report["totalRegions"] # list[int] +report["regionsInUniverse"] # list[int] +report["coverage"] # list[float] — 0.0 to 1.0 +report["manyToMany"] # list[int] — number of user regions overlapping >1 universe region +report["warnings"] # list[str] — free-form warning messages +``` + +Warnings fire when coverage is under 50% (severe) or under 90% (moderate), and when any user region overlaps more than one universe region. + +### `redefine_user_sets` + +Rewrite each user set in terms of universe regions — for every user region, find the universe regions it overlaps and emit *those* as the new user set. Eliminates many-to-many mapping artifacts. This is the Python equivalent of R LOLA's `redefineUserSets()`. + +```python +from gtars.lola import redefine_user_sets + +redefined = redefine_user_sets(user_sets, universe_tuples) +# redefined is list[list[tuple[str, int, int]]] — same shape as input, but rewritten + +# Now run enrichment on the redefined version +results = run_lola(redefined, universe_tuples, db) +``` + +### `build_restricted_universe` + +For differential enrichment analysis — build a universe that is exactly the union of all user sets, disjoined into non-overlapping pieces. This is R LOLA's `disjoin(unlist(userSets))`. + +```python +from gtars.lola import build_restricted_universe + +restricted = build_restricted_universe(user_sets) +# restricted: list[tuple[str, int, int]] + +# Use it as the universe in a differential run +results = run_lola(user_sets, restricted, db) +``` + +## Complete example with pandas + +```python +import pandas as pd +from gtars.models import RegionSet +from gtars.lola import RegionDB, run_lola, check_universe, redefine_user_sets + +# Convert a RegionSet to the tuple form lola expects +def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + +# Load inputs +db = RegionDB.from_folder("LOLACore/hg38") +peaks_a = to_tuples(RegionSet("condition_a.bed")) +peaks_b = to_tuples(RegionSet("condition_b.bed")) +universe = to_tuples(RegionSet("universe.bed")) + +user_sets = [peaks_a, peaks_b] + +# Universe sanity check +report = check_universe(user_sets, universe) +for us, cov, mm in zip( + report["userSet"], report["coverage"], report["manyToMany"] +): + print(f"user set {us}: {cov:.1%} coverage, {mm} many-to-many") +for w in report["warnings"]: + print(f" ⚠ {w}") + +# Optional: eliminate many-to-many artifacts +user_sets = redefine_user_sets(user_sets, universe) + +# Run enrichment +results = run_lola(user_sets, universe, db, direction="enrichment") + +df = pd.DataFrame(results) +print(df.shape) # (n_user_sets * n_db_sets, 23) + +# Top 10 per user set +for us in df["userSet"].unique(): + top = ( + df[df["userSet"] == us] + .sort_values("pValueLog", ascending=False) + .head(10) + ) + print(f"\n=== User set {us} — top 10 ===") + print(top[["filename", "cellType", "pValueLog", "oddsRatio", "support", "qValue"]]) +``` + +## See also + +- **[`gtars.models`](models.md)** — `RegionSet`, `RegionSetList`, and the types returned by `RegionDB.get_region_sets()`. +- **[`gtars.genomic_distributions`](genomic_distributions.md)** — set algebra on `RegionSet` (which internally powers universe redefinition). +- **[gtars-lola](../lola.md)** — full Rust API reference, including contingency table math, p-value computation via hypergeometric survival function, CMLE odds ratio, and the R LOLA TSV output format. diff --git a/docs/gtars/python/models.md b/docs/gtars/python/models.md new file mode 100644 index 00000000..c7a5c791 --- /dev/null +++ b/docs/gtars/python/models.md @@ -0,0 +1,351 @@ +# `gtars.models` + +The `gtars.models` submodule exposes the core genomic data types and — unlike the Rust layout, where statistics and set algebra live in separate crates — attaches almost the entire `gtars-genomicdist` method surface directly to `RegionSet`. In Python, you generally interact with `RegionSet` and the rest is built up on top of it. + +!!! info "Python ↔ Rust correspondence" + `gtars.models` combines types from three Rust crates: + + - `gtars-core::models` → `Region`, `Interval`, `RegionSet`, `RegionSetList` + - `gtars-genomicdist::models` → `ChromosomeStatistics`, `GenomeAssembly`, `BinaryGenomeAssembly`, `TssIndex`, `GeneModel`, `PartitionList`, `SignalMatrix`, `GenomicDistAnnotation` + - `gtars-genomicdist::{IntervalRanges, GenomicIntervalSetStatistics}` → methods on `RegionSet` + + Free functions that need a reference genome or a partition list live in [`gtars.genomic_distributions`](genomic_distributions.md). + +## Classes + +### `Region` + +A single genomic interval. + +```python +from gtars.models import Region + +r = Region(chr="chr1", start=100, end=200, rest="peak1") +print(r) # chr1\t100\t200\tpeak1 +print(len(r)) # 100 (width) +print(r.chr, r.start, r.end, r.rest) +``` + +Constructor: `Region(chr, start, end, rest=None)`. +Attributes: `chr: str`, `start: int`, `end: int`, `rest: str | None`. +Supports `==`, `!=`, `len()`, `str()`, `repr()`. + +### `RegionSet` + +The main workhorse — a collection of regions loaded from a BED file (or a URL, via the `http` feature on the underlying crate), and the attachment point for ~50 methods covering load/save, iteration, summaries, interval set algebra, overlap queries, and peak statistics. + +#### Construction + +```python +from gtars.models import Region, RegionSet + +# From a local file (or URL if the http feature is compiled in) +rs = RegionSet("peaks.bed") +rs = RegionSet("peaks.bed.gz") +rs = RegionSet("https://example.org/peaks.bed.gz") + +# From a list of Region objects +regions = [ + Region("chr1", 100, 200), + Region("chr1", 300, 400), +] +rs = RegionSet.from_regions(regions) + +# From parallel column vectors (useful when coming from pandas / numpy) +rs = RegionSet.from_vectors( + chrs=["chr1", "chr1", "chr2"], + starts=[100, 300, 500], + ends=[200, 400, 600], +) + +# Optional strand tracking +rs = RegionSet.from_regions(regions, strands=["+", "-"]) +``` + +Regions are auto-sorted by `(chr, start)` on construction. + +#### Properties and iteration + +```python +len(rs) # number of regions +list(rs) # iterate; yields Region objects +rs[0] # indexed access (supports negatives) +rs.is_empty() + +rs.identifier # MD5 of first three columns — canonical bedbase id +rs.file_digest # MD5 of full serialized content +rs.header # original BED header (may be None) +rs.path # source path (str, raises if constructed in-memory) +rs.strands # parallel list of strand strings +``` + +#### Summaries + +```python +rs.widths() # list[int], one width per region +rs.region_widths() # alias for widths() +rs.mean_region_width() # float, rounded to 2 decimals +rs.get_nucleotide_length() # int, total bp across all regions +rs.get_max_end_per_chr() # dict[str, int] +rs.chromosome_statistics() # dict[str, ChromosomeStatistics] +rs.distribution(n_bins=250, chrom_sizes=None) + # list of dicts: chr/start/end/n/rid +``` + +!!! warning "`chrom_sizes` matters" + Without `chrom_sizes`, `distribution()` derives bin size from the BED file's own observed max end — so outputs are **not comparable across files**. Pass `chrom_sizes` to get reference-genome-aligned bins that are comparable. + +#### I/O + +```python +rs.to_bed("out.bed") +rs.to_bed_gz("out.bed.gz") +rs.to_bigbed("out.bb", "chrom.sizes") # requires the bigbed feature +rs.sort() # in-place +``` + +#### Interval set algebra + +Methods that mirror R GRanges/IRanges — all return a new `RegionSet`: + +```python +rs.trim(chrom_sizes) # clamp to [0, chrom_size) +rs.promoters(upstream=2000, downstream=0) +rs.reduce() # merge overlapping/adjacent +rs.disjoin() # tile at every boundary into disjoint pieces +rs.setdiff(other) # remove other from self +rs.subtract(other) # alias for setdiff +rs.pintersect(other) # pairwise (index-aligned) intersect +rs.intersect_all(other) # all-vs-all intersection fragments +rs.concat(other) # combine without merging +rs.union(other) # concat + reduce +rs.gaps(chrom_sizes) # peak-free regions bounded by chrom sizes +``` + +And similarity / coverage metrics: + +```python +rs.jaccard(other) # nucleotide Jaccard |A∩B| / |A∪B| +rs.coverage(other) # fraction of self bp covered by other +rs.overlap_coefficient(other) # |A∩B| / min(|A|, |B|) +``` + +!!! note "`rest` metadata is dropped" + Operations that merge or synthesize new intervals (`reduce`, `setdiff`, `promoters`, etc.) produce regions with `rest = None`. There's no unambiguous way to carry metadata through a merge. + +#### Overlap queries + +These route through the `gtars-overlaprs` AIList index under the hood: + +```python +rs.subset_by_overlaps(other) # RegionSet containing only self-regions that overlap other +rs.count_overlaps(other) # list[int], per-region overlap count +rs.any_overlaps(other) # list[bool], per-region "overlaps anything?" +rs.find_overlaps(other) # list[list[int]], indices into other per self-region +rs.closest(other) # list[(self_idx, other_idx, distance)] +rs.cluster(max_gap=0) # list[int], cluster id per region +``` + +#### Peak statistics + +Ports of the R GenomicDistributions functions, exposed via the Rust `GenomicIntervalSetStatistics` trait as `RegionSet` methods in Python: + +```python +rs.neighbor_distances() # list[int], signed gaps per chromosome (multi-region chroms only) +rs.nearest_neighbors() # list[int], per-region min neighbor distance +``` + +!!! warning "Output length for `neighbor_distances` / `nearest_neighbors`" + Both methods **skip chromosomes with only one region** (matching R GenomicDistributions). The returned list is therefore **not aligned 1:1 with input regions** — it's shorter than `len(rs)` whenever any chromosome has exactly one peak. No sentinel values are emitted. + +!!! warning "`n_bins` is a target, not a total" + In `distribution(chrom_sizes=...)`, `n_bins` is the target bin count for the **longest** chromosome. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total window count is `sum(ceil(size / bin_width))` across chromosomes, often much larger than `n_bins`. + +### `RegionSetList` + +A collection of `RegionSet`s — the Python wrapper for `gtars-core::models::RegionSetList`. Substantially thinner than the Rust type: only construction from a list of `RegionSet`s, iteration/indexing, `concat`, and pairwise Jaccard. + +```python +from gtars.models import RegionSet, RegionSetList + +rsl = RegionSetList([ + RegionSet("rep1.bed"), + RegionSet("rep2.bed"), + RegionSet("rep3.bed"), +]) + +len(rsl) # 3 +rsl[0] # RegionSet +list(rsl) # iterate + +combined = rsl.concat() # flatten into a single RegionSet (no merge/dedup) +rsl.names # None unless produced by RegionDB.get_region_sets() + +# Full N×N Jaccard similarity matrix +jac = rsl.pairwise_jaccard() # list[list[float]], symmetric, 1.0 on diagonal +``` + +!!! note "Names come from `RegionDB`" + The Python `RegionSetList.__new__` does not accept a `names=` parameter — names are only populated when a `RegionSetList` is produced by `RegionDB.get_region_sets()` (see [`gtars.lola`](lola.md)). If you need named sets from scratch, that's currently only available in the Rust API. + +### `Interval` + +A generic integer interval with a payload. Primarily a helper for overlap indexes — most user code should use `Region` / `RegionSet` instead. + +### Statistics result types + +These are returned by `RegionSet` methods. All are read-only dataclass-like types with named fields. + +#### `ChromosomeStatistics` + +Returned by `rs.chromosome_statistics()` → `dict[str, ChromosomeStatistics]`. + +| field | type | meaning | +|---|---|---| +| `chromosome` | `str` | chromosome name | +| `number_of_regions` | `int` | region count on this chromosome | +| `start_nucleotide_position` | `int` | leftmost start | +| `end_nucleotide_position` | `int` | rightmost end | +| `minimum_region_length` | `int` | | +| `maximum_region_length` | `int` | | +| `mean_region_length` | `float` | | +| `median_region_length` | `float` | | + +## Reference genome and annotation types + +These types are constructed from files on disk and passed into the functions in [`gtars.genomic_distributions`](genomic_distributions.md). + +### `GenomeAssembly` + +In-memory FASTA loader. Slow to construct (~2s for hg38) but fast for per-region lookups. + +```python +from gtars.models import GenomeAssembly + +genome = GenomeAssembly("hg38.fa") +``` + +### `BinaryGenomeAssembly` + +mmap-backed `.fab` binary FASTA — instant construction, zero-copy per-region access. + +```python +from gtars.models import BinaryGenomeAssembly + +genome = BinaryGenomeAssembly("hg38.fab") +``` + +Create a `.fab` file once up front with the `gtars prep --fasta hg38.fa` CLI command (or the Rust `BinaryGenomeAssembly.write_from_fasta`). + +### `TssIndex` + +Sorted-per-chromosome TSS midpoint index for fast nearest-TSS queries. + +```python +from gtars.models import TssIndex, RegionSet + +tss = TssIndex("gencode.v47.tss.bed") +# or: +# tss = TssIndex.from_regionset(RegionSet("gencode.v47.tss.bed")) + +peaks = RegionSet("peaks.bed") + +dists = tss.calc_tss_distances(peaks) # list[int] +signed = tss.feature_distances(peaks) # list[float | None] +``` + +`feature_distances` returns `None` for regions on chromosomes that have no features in the index. + +### `GeneModel` + +Loaded from a GTF file. Used to build a `PartitionList`. + +```python +from gtars.models import GeneModel + +model = GeneModel.from_gtf( + "gencode.v47.annotation.gtf.gz", + filter_protein_coding=True, + convert_ensembl_ucsc=True, +) +print(model.n_genes, model.n_exons) +``` + +- `filter_protein_coding` — keep only features with `gene_biotype "protein_coding"` (or `gene_type`). +- `convert_ensembl_ucsc` — prepend `chr` to chromosome names that don't already have it. + +### `PartitionList` + +Ordered, priority-based genomic partition list: `promoterCore` → `promoterProx` → `threeUTR` → `fiveUTR` → `exon` → `intron` → `intergenic`. + +```python +from gtars.models import GeneModel, PartitionList + +model = GeneModel.from_gtf("gencode.v47.gtf.gz") + +# From an existing GeneModel +pl = PartitionList.from_gene_model( + model, + core_prom=100, + prox_prom=2000, + chrom_sizes=None, # optional; used for trimming to chrom boundaries +) + +# Or directly from GTF in one call +pl = PartitionList.from_gtf( + "gencode.v47.gtf.gz", + core_prom=100, + prox_prom=2000, + filter_protein_coding=True, + convert_ensembl_ucsc=True, +) + +print(pl.partition_names()) # ['promoterCore', 'promoterProx', 'exon', 'intron', ...] +print(len(pl)) # number of partitions +``` + +Pass the resulting `PartitionList` to `calc_partitions` / `calc_expected_partitions` in [`gtars.genomic_distributions`](genomic_distributions.md). + +### `SignalMatrix` + +A region × condition matrix of signal values (e.g. a peak × cell-type ChIP intensity matrix). Load from TSV or the packed binary `.sigm` format: + +```python +from gtars.models import SignalMatrix + +sm = SignalMatrix.from_tsv("signal_matrix.tsv") +sm = SignalMatrix.load_bin("signal_matrix.sigm") + +print(sm.condition_names) # list[str] +print(sm.n_conditions) # int +print(sm.n_regions) # int +print(len(sm)) # n_regions +``` + +Pass it to `calc_summary_signal` in [`gtars.genomic_distributions`](genomic_distributions.md). + +### `GenomicDistAnnotation` + +A serializable wrapper around `GeneModel` in the GDA binary format. Chrom sizes are not stored in a GDA file — they come from a separate source. + +```python +from gtars.models import GenomicDistAnnotation + +# Build once from GTF, save as GDA (save via Rust / CLI) +gda = GenomicDistAnnotation.from_gtf("gencode.v47.gtf.gz") + +# Load from disk (fast) +gda = GenomicDistAnnotation.load_bin("gencode.v47.gda") + +# Derived views +model = gda.gene_model() +pl = gda.partition_list(core_prom=100, prox_prom=2000, chrom_sizes=None) +tss = gda.tss_index() +``` + +## See also + +- **[`gtars.genomic_distributions`](genomic_distributions.md)** — free functions for GC content, dinucleotide frequencies, partition classification, signal summaries, consensus regions. +- **[`gtars.lola`](lola.md)** — LOLA enrichment testing. +- **[gtars-core](../core.md)** — the Rust crate `gtars.models` wraps. Useful if you want to know which methods are "free" vs. which come from `gtars-genomicdist`. +- **[Core models tour](../regionSet.md)** — cross-language walkthrough of `Region` / `RegionSet` / `RegionSetList`. diff --git a/docs/gtars/r.md b/docs/gtars/r.md index 36bdefe9..9b1c8edf 100644 --- a/docs/gtars/r.md +++ b/docs/gtars/r.md @@ -1,44 +1,135 @@ # gtars-r -R bindings for gtars functionality, providing an R API for genomic interval analysis. +R bindings for gtars — high-performance genomic interval analysis in R, backed by Rust via `extendr`. The package is `gtars`. ## Installation -To install the development version, use the `remotes` package: - -``` R +```r install.packages("remotes") -remotes::install_github("databio/gtars", ref = "dev", subdir = "gtars-r") +remotes::install_github("databio/gtars", subdir = "gtars-r") ``` -You can also install the R package locally from the repo directory: +Or install from a local clone of the repo: -``` console +```console R CMD INSTALL gtars-r ``` -## Available Modules +Building the package requires a Rust toolchain (`cargo`, `rustc`). See the [gtars-r README](https://github.com/databio/gtars/blob/dev/gtars-r/README.md) for detailed setup notes. + +## Loading + +```r +library(gtars) +``` + +## Submodule pages + +| page | covers | +|---|---| +| [**RegionSet & RegionSetList**](r/regionset.md) | S4 classes for genomic regions, interval set algebra, overlap queries, conversion to/from `GRanges`, pairwise Jaccard. | +| [**GenomicDistributions**](r/genomicdist.md) | Drop-in replacements for R GenomicDistributions — `calcWidth`, `calcGCContent`, `calcDinuclFreq`, `genomePartitionList`, `calcPartitions`, `calcExpectedPartitions`, `calcSummarySignal`, `calcFeatureDist`, etc. | +| [**LOLA enrichment**](r/lola.md) | Drop-in replacement for R LOLA — `loadRegionDB`, `runLOLA`, `checkUniverseAppropriateness`, `redefineUserSets`, `buildRestrictedUniverse`. | +| [**IGD indexing**](r/igd.md) | Low-level `igd_create` and `igd_search` for building and querying IGD databases directly. | +| [**RefgetStore**](r/refgetstore.md) | GA4GH refget protocol client. | + +## Polymorphic inputs + +A defining feature of the R gtars API: nearly every function that takes a query accepts **any** of these input types interchangeably: + +- a `RegionSet` S4 object, +- a `GRanges` object (from `GenomicRanges`), +- a file path to a BED / BED.gz / narrowPeak file, +- a `data.frame` with `chr` / `start` / `end` columns (and optional `strand`). + +This means you can drop gtars into an existing Bioconductor workflow without explicit conversion: + +```r +library(GenomicRanges) +library(gtars) + +gr <- GRanges(...) + +widths <- calcWidth(gr) # GRanges works directly +merged <- reduce(gr) # overridden S4 method +jac <- jaccard(gr, "other_peaks.bed") # mix GRanges with file path +``` + +## Quick start + +```r +library(gtars) + +# 1. Load regions +peaks <- RegionSet("peaks.bed") +length(peaks) +show(peaks) + +# 2. Summary statistics +calcWidth(peaks) |> summary() +chromosomeStatistics(peaks) + +# 3. Interval set algebra +merged <- reduce(peaks) +proms <- promoters(peaks, upstream = 2000L, downstream = 200L) + +# 4. Convert back to GRanges for downstream Bioconductor work +# (GenomicRanges only needed here — it's a bridge, not a core dep) +library(GenomicRanges) +gr <- as_granges(peaks) +``` + +## LOLA enrichment quick start -### refget -``` R +```r library(gtars) -# Compute sequence digest -readable <- 'ACGTACGT' +regionDB <- loadRegionDB("LOLACore/hg38") + +results <- runLOLA( + userSets = list(RegionSet("peaks_a.bed"), RegionSet("peaks_b.bed")), + userUniverse = "universe.bed", + regionDB = regionDB, + redefineUserSets = TRUE, +) + +head(results[order(-pValueLog)], 20) +``` + +## refget quick start + +```r +library(gtars) + +readable <- "ACGTACGT" gtars::sha512t24u_digest(readable) gtars::md5_digest(readable) -# Use RefgetStore for sequence management -store <- global_refget_store('raw') +store <- global_refget_store("raw") ``` -### IGD -``` R +## IGD quick start + +```r library(gtars) -### Building an Index -igd_create(output_path = '/home/igd_output/', filelist = '/home/my_bedfiles/') +# Build an IGD index from a directory of BED files +igd_create(output_path = "./igd_out", filelist = "./my_beds/", db_name = "my_peaks") -### Querying with a single bed file -igd_search(database_path = 'my_igd_database.igd', query_path = 'my_query.bed') +# Query the index +hits <- igd_search( + database_path = "./igd_out/my_peaks.igd", + query_path = "query.bed", +) ``` + +## Coordinate-system notes + +The R package uses **0-based half-open** coordinates internally (BED convention) to match the Rust core. When you pass a `GRanges` object to a gtars function, coordinates are converted from 1-based closed to 0-based half-open automatically. Converting back via `as_granges(rs)` does the reverse. + +If you're building a `RegionSet` from a `data.frame` directly, the `start` and `end` columns are expected to already be in 0-based half-open form. + +## See also + +- **[gtars-core](core.md)** and **[gtars-genomicdist](genomicdist.md)** — Rust reference pages with the algorithmic detail and caveats that apply identically in R. +- **[R GenomicDistributions](https://code.databio.org/GenomicDistributions/)** and **[R LOLA](https://www.bioconductor.org/packages/release/bioc/html/LOLA.html)** — the original packages gtars R is a port of. diff --git a/docs/gtars/r/genomicdist.md b/docs/gtars/r/genomicdist.md new file mode 100644 index 00000000..bd678746 --- /dev/null +++ b/docs/gtars/r/genomicdist.md @@ -0,0 +1,224 @@ +# GenomicDistributions in R (gtars) + +The `gtars` R package ships **drop-in replacements** for the core functions in the [R GenomicDistributions](https://code.databio.org/GenomicDistributions/) package, backed by the Rust [`gtars-genomicdist`](../genomicdist.md) crate. The function names, signatures, and return shapes match R GenomicDistributions closely enough that most analyses should port over with one-line changes. + +Every function accepts polymorphic query input — `RegionSet`, `GRanges`, file path, or `data.frame` — so you can mix gtars with existing Bioconductor workflows without explicit conversion. + +For algorithmic detail and caveats that apply identically in R, see [gtars-genomicdist](../genomicdist.md). + +## Drop-in replacements for GenomicDistributions + +| gtars R function | R GenomicDistributions equivalent | notes | +|---|---|---| +| `calcWidth(query)` | `calcWidth()` | delegates to `widths()` S4 method | +| `calcNeighborDist(query)` | `calcNeighborDist()` | signed inter-region gaps per chromosome | +| `calcNearestNeighbors(query)` | `calcNearestNeighbors()` | per-region min neighbor distance | +| `regionDistribution(query, nBins, chromSizes)` | `calcChromBins()` | bin counts; pass `chromSizes` for reference-aligned bins | +| `calcGCContent(query, ref, ignoreUnkChroms)` | `calcGCContent()` | requires a `GenomeAssembly` pointer, not BSgenome | +| `calcDinuclFreq(query, ref, rawCounts)` | `calcDinuclFreq()` | returns a data.frame with region + 16 dinucleotide columns | +| `genomePartitionList(...)` | `genomePartitionList()` | build partition list from gene model components | +| `partitionListFromGTF(path, ...)` | *(not in original)* | convenience — loads gene model from GTF in one step | +| `calcPartitions(query, partitionList, bpProportion)` | `calcPartitions()` | priority-based (or bp-proportion) partition classification | +| `calcExpectedPartitions(query, partitionList, genomeSize)` | `calcExpectedPartitions()` | observed vs. expected with chi-square p-values | +| `calcSummarySignal(query, signalMatrix)` | `calcSummarySignal()` | overlap query regions with a signal matrix | +| `calcFeatureDist(query, features)` | `calcFeatureDist()` | signed distance to nearest feature | +| `calcTSSDist(query, features)` | *(not in original)* | absolute distance variant | +| `loadGenomeAssembly(fasta_path)` | — | replaces BSgenome loading (FASTA directly) | + +## Porting an analysis + +Most GenomicDistributions scripts port by adding `library(gtars)` after `library(GenomicDistributions)` — gtars masks the functions with faster implementations: + +```r +library(GenomicDistributions) # optional, for plotting helpers +library(gtars) # masks calcWidth, calcGCContent, etc. +library(GenomicRanges) + +query <- GRanges(...) # or a file path, or a data.frame +widths <- calcWidth(query) # now backed by Rust + +# Use existing plotGenomicDist* helpers on the results +plotChromBins(regionDistribution(query, chromSizes = hg38)) +``` + +## Basic statistics + +```r +library(gtars) + +# Works with any query input: RegionSet, GRanges, file path, or data.frame +query <- "peaks.bed" + +w <- calcWidth(query) # numeric vector of widths +nd <- calcNeighborDist(query) # signed gaps per chromosome +nn <- calcNearestNeighbors(query) # min neighbor distance per region +``` + +!!! warning "`calcNeighborDist` / `calcNearestNeighbors` output length" + Both functions **skip chromosomes with only one region** (matching R GenomicDistributions). The returned vector is shorter than `length(query)` whenever any chromosome has a single peak, and is **not aligned 1:1** with input regions. + +## Region distribution + +```r +# Without chromSizes: bin size derived from the query itself +# (not comparable across files) +rd <- regionDistribution(query, nBins = 250L) + +# With chromSizes: reference-aligned bins (comparable across files, +# aligned with reference genome positions) +chromSizes <- c(chr1 = 248956422L, chr2 = 242193529L) +rd <- regionDistribution(query, nBins = 250L, chromSizes = chromSizes) +``` + +Returns a data.table compatible with R GenomicDistributions' `plotChromBins`. + +!!! warning "`nBins` is a target, not a total" + When `chromSizes` is provided, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. + +## GC content and dinucleotides + +Unlike the original R GenomicDistributions (which uses BSgenome), gtars loads a reference genome directly from a FASTA file: + +```r +genome <- loadGenomeAssembly("hg38.fa") + +gc <- calcGCContent(query, genome, ignoreUnkChroms = TRUE) +# numeric vector of GC content per region, 0.0 to 1.0 + +dinucl <- calcDinuclFreq(query, genome, rawCounts = FALSE) +# data.frame with columns: region + 16 dinucleotide columns +# rawCounts = FALSE → percentages (each row sums to 100) +# rawCounts = TRUE → integer counts + +# Pooled global counts across all regions +pooled <- colSums(calcDinuclFreq(query, genome, rawCounts = TRUE)[, -1]) +``` + +## Partitions + +Build a `PartitionList` of promoter / UTR / exon / intron / intergenic categories from gene model components, then classify your query. You can either supply each component explicitly or load them in one call from a GTF. + +### From gene model components + +```r +# From GRanges, file paths, or data.frames — any combination works +pl <- genomePartitionList( + genesGR = "genes.bed", + exonsGR = "exons.bed", + threeUTRGR = "three_utr.bed", + fiveUTRGR = "five_utr.bed", + corePromSize = 100L, + proxPromSize = 2000L, + chromSizes = c(chr1 = 248956422L, chr2 = 242193529L), # optional +) +``` + +Strand information from the inputs (GRanges strand column, or the `strand` column of a data.frame) is used for strand-aware promoter and reduce operations — critical for getting correct promoters on minus-strand genes. + +### From a GTF + +```r +# Single-call convenience — loads gene model and builds partitions +pl <- partitionListFromGTF( + "gencode.v47.annotation.gtf.gz", + filterProteinCoding = TRUE, + convertEnsemblUCSC = FALSE, + corePromSize = 100L, + proxPromSize = 2000L, + chromSizes = chromSizes, +) +``` + +### Classifying query regions + +```r +# Priority mode — each query region assigned to first matching partition +counts <- calcPartitions(query, pl, bpProportion = FALSE) +# data.frame with partition, Freq columns + attr(counts, "total") + +total <- attr(counts, "total") +counts$pct <- counts$Freq / total * 100 +print(counts) + +# Observed vs expected with chi-square p-values +chromSizes <- c(chr1 = 248956422L, chr2 = 242193529L) +expected <- calcExpectedPartitions(query, pl, chromSizes, bpProportion = FALSE) +# data.frame with partition, observed, expected, log10OE, pvalue columns +``` + +!!! warning "p-values differ slightly from R GenomicDistributions" + `calcExpectedPartitions` uses a goodness-of-fit `(O-E)²/E` formula. R `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R GenomicDistributions output byte-for-byte. The `log10OE` column is directly comparable. + +## Signal matrix overlap + +```r +# signalMatrix is a data.frame / data.table where column 1 is the region ID +# in "chr_start_end" format and the remaining columns are condition names +# with numeric signal values. +signalMatrix <- data.table::fread("signal_matrix.tsv") + +result <- calcSummarySignal(query, signalMatrix) +# result is a list compatible with GenomicDistributions' plotSummarySignal: +# $signalSummaryMatrix — data.table with queryPeak + one column per condition +# $matrixStats — data.frame with 5 rows (Tukey stats) and one column per condition + +# Row names of matrixStats are the Tukey boxplot quantities: +# lowerWhisker, lowerHinge, median, upperHinge, upperWhisker +print(result$matrixStats) +``` + +Plug directly into R GenomicDistributions' `plotSummarySignal()` without further conversion. + +## Distances to TSS / nearest features + +```r +tss <- "gencode.v47.tss.bed" + +# Signed distance (positive = feature downstream, negative = upstream) +d_signed <- calcFeatureDist(query, tss) + +# Absolute distance +d_abs <- calcTSSDist(query, tss) +``` + +Both accept RegionSet, GRanges, file path, or data.frame for either argument. + +## End-to-end example + +```r +library(gtars) + +# 1. Load inputs once +peaks <- RegionSet("peaks.bed") +genome <- loadGenomeAssembly("hg38.fa") +chromSizes <- c(chr1 = 248956422L, chr2 = 242193529L) +pl <- partitionListFromGTF("gencode.v47.gtf.gz", corePromSize = 100L, + proxPromSize = 2000L, chromSizes = chromSizes) + +# 2. Widths and GC content +cat(sprintf("%d peaks, median width %.0f\n", + length(peaks), median(calcWidth(peaks)))) + +gc <- calcGCContent(peaks, genome, ignoreUnkChroms = TRUE) +cat(sprintf("Mean GC content: %.3f\n", mean(gc))) + +# 3. Partition enrichment +expected <- calcExpectedPartitions(peaks, pl, chromSizes, bpProportion = FALSE) +print(expected) + +# 4. Distance to nearest TSS +tssDist <- calcTSSDist(peaks, "gencode.v47.tss.bed") +cat(sprintf("Median distance to nearest TSS: %d bp\n", median(tssDist))) + +# 5. Convert to GRanges for downstream Bioconductor work +# (only step that needs GenomicRanges — it's a bridge, not a core dep) +library(GenomicRanges) +gr <- as_granges(peaks) +``` + +## See also + +- **[RegionSet & RegionSetList (R)](regionset.md)** — S4 class reference and method list. +- **[R LOLA interface](lola.md)** — enrichment testing built on top of these types. +- **[gtars-genomicdist](../genomicdist.md)** — Rust reference with all the algorithmic detail. +- **[R GenomicDistributions](https://code.databio.org/GenomicDistributions/)** — the original package gtars is a port of. Plotting helpers from GenomicDistributions still work on gtars return values. diff --git a/docs/gtars/r/igd.md b/docs/gtars/r/igd.md new file mode 100644 index 00000000..948653c8 --- /dev/null +++ b/docs/gtars/r/igd.md @@ -0,0 +1,101 @@ +# IGD in R (gtars) + +Low-level R bindings for IGD ([Indexed Genomic Data](../igd.md)) — a fast binary index over collections of BED files, used internally by gtars LOLA and by the BEDbase retrieval pipeline. The R API exposes two functions: `igd_create` (build an index from BED files on disk) and `igd_search` (query an existing index with a BED file). + +If you're running LOLA enrichment, you usually don't need to touch IGD directly — `loadRegionDB()` in [`r/lola.md`](lola.md) builds an IGD index internally from the region database. Use this page if you want to index a custom collection of BED files and query it directly. + +## Creating an IGD database + +```r +library(gtars) + +igd_create( + output_path = "path/to/output/", + filelist = "path/to/bed/files/", + db_name = "igd_database", # default +) +``` + +**Arguments:** + +- `output_path` — directory where the IGD database files will be written. Must exist. +- `filelist` — one of: + - a path to a text file listing BED file paths (one per line), + - a path to a directory containing BED files (all `.bed` / `.bed.gz` files will be indexed), + - `"-"` or `"stdin"` to read paths from standard input. +- `db_name` — prefix for the output filenames (default `"igd_database"`). The function produces two files: `.igd` (the index) and `_index.tsv` (file metadata). + +Returns `NULL` invisibly on success. Errors are raised via `stop()` on invalid input. + +### Example + +```r +# From a directory of BED files +igd_create( + output_path = "./igd_out", + filelist = "./my_bed_files/", + db_name = "my_peaks", +) + +# Produces: +# ./igd_out/my_peaks.igd +# ./igd_out/my_peaks_index.tsv + +# From an explicit file list +writeLines( + c("./rep1.bed", "./rep2.bed", "./rep3.bed"), + "bed_list.txt", +) +igd_create( + output_path = "./igd_out", + filelist = "bed_list.txt", + db_name = "replicates", +) +``` + +## Searching an IGD database + +```r +hits <- igd_search( + database_path = "path/to/my_peaks.igd", + query_path = "path/to/query.bed", +) +``` + +**Arguments:** + +- `database_path` — path to an existing `.igd` file (produced by `igd_create` or any other IGD-compatible tool). +- `query_path` — path to a BED file containing the query regions. + +**Returns** a `data.frame` of overlap hits. The exact columns depend on the query result schema — typically `file_id`, `filename`, `overlap_count`, etc. Use `colnames(hits)` to discover the structure for your version. + +### Example + +```r +hits <- igd_search( + database_path = "./igd_out/my_peaks.igd", + query_path = "query.bed", +) + +# Inspect the schema +colnames(hits) +head(hits) + +# Typical analysis: count overlaps per database file +aggregate(overlap_count ~ filename, data = hits, sum) +``` + +## Relationship to other gtars R functions + +- **`loadRegionDB()` in `lola.R`** — builds an IGD index internally for a LOLA region database. The IGD is wrapped inside a `RegionDB` pointer and used by `runLOLA()`. Users running enrichment typically don't need to call `igd_create` / `igd_search` separately. +- **`countOverlaps()` / `findOverlaps()` on `RegionSet`** — ad-hoc overlap queries between two `RegionSet` objects use the AIList index from [`gtars-overlaprs`](../overlaprs.md), not IGD. Use those methods when you have both the query and subject in memory. +- **IGD is preferred when** you have a large static collection of BED files that you'll query many times against different user sets, and you want the index persisted on disk. + +!!! note "Naming convention" + Unlike the rest of the R gtars API, which uses camelCase (`runLOLA`, `calcPartitions`, etc.), the IGD functions use snake_case (`igd_create`, `igd_search`). This is a legacy from the direct CLI mapping and may be aligned in a future release. + +## See also + +- **[gtars-igd](../igd.md)** — the underlying Rust crate, with the binary format reference and performance characteristics. +- **[R LOLA interface](lola.md)** — `loadRegionDB()` and `runLOLA()`, which use IGD internally. +- **[gtars CLI](../cli.md)** — `gtars igd create` and `gtars igd search` commands, equivalent to these R functions. diff --git a/docs/gtars/r/lola.md b/docs/gtars/r/lola.md new file mode 100644 index 00000000..7fdab4a0 --- /dev/null +++ b/docs/gtars/r/lola.md @@ -0,0 +1,228 @@ +# LOLA enrichment in R (gtars) + +The `gtars` R package ships a **drop-in replacement** for the [LOLA](https://www.bioconductor.org/packages/release/bioc/html/LOLA.html) R/Bioconductor package, powered by the Rust [`gtars-lola`](../lola.md) crate. Function names and signatures match LOLA closely enough that existing scripts port over with one-line changes. + +For the underlying statistical reference see [gtars-lola](../lola.md) — Fisher's exact test on a hypergeometric survival function, CMLE odds ratio via Brent's method, per-user-set Benjamini-Hochberg FDR correction, and 1-based TSV output compatible with R LOLA's `writeCombinedEnrichment`. + +## Loading the database + +`loadRegionDB()` reads a standard LOLA-format folder (one directory per collection, each with `collection.txt`, `index.txt`, and a `regions/` subdirectory of BED files): + +```r +library(gtars) + +# Load the full database +regionDB <- loadRegionDB("LOLACore/hg38") + +# Load only specific collections +regionDB <- loadRegionDB("LOLACore/hg38", collections = c("encode_tfbs", "roadmap_epigenomics")) + +# Per-collection file cap (useful for smoke tests) +regionDB <- loadRegionDB("LOLACore/hg38", limit = 10L) +``` + +### From a list of BED files + +If you don't have the full LOLA folder layout, `loadRegionDBFromBeds()` builds a minimal region database from an arbitrary list of BED paths: + +```r +regionDB <- loadRegionDBFromBeds( + bedFiles = c("file1.bed", "file2.bed", "file3.bed"), + filenames = c("Condition A", "Condition B", "Condition C"), # optional display names +) +``` + +### Accessing annotations + +```r +# Per-file annotation table (matches R LOLA's regionDB$regionAnno) +regionAnno <- regionDBAnno(regionDB) +# data.table with columns: filename, cellType, description, tissue, +# dataSource, antibody, treatment, collection + +# Collection-level annotation +collectionAnno <- regionDBCollectionAnno(regionDB) +# data.table with columns: collectionname, collector, date, source, description + +# List region set filenames +listRegionSets(regionDB) +listRegionSets(regionDB, collections = "encode_tfbs") # filtered +``` + +### Extracting region sets + +```r +# Get all region sets as a RegionSetList +all_sets <- getRegionSets(regionDB) + +# Or a subset by 1-based indices +subset <- getRegionSets(regionDB, indices = c(1, 5, 12)) + +# RegionSetList is the same S4 class described in the RegionSet page +length(subset) +names(subset) +rs1 <- subset[[1]] # individual RegionSet +``` + +## Running enrichment + +```r +runLOLA( + userSets, # RegionSet, or list of RegionSets (or GRanges / file paths — polymorphic) + userUniverse, # RegionSet or GRanges representing the background + regionDB, # from loadRegionDB + minOverlap = 1, # minimum bp overlap to count as overlapping + cores = 1, # reserved for future parallelism — currently ignored + redefineUserSets = FALSE, # automatically rewrite user sets against universe + direction = "enrichment", # "enrichment" or "depletion" +) +``` + +**Returns a `data.table`** with one row per `(user_set, db_set)` pair, matching the R LOLA schema: `userSet`, `dbSet`, `collection`, `pValueLog`, `oddsRatio`, `support`, `rnkPV`, `rnkOR`, `rnkSup`, `maxRnk`, `meanRnk`, `b`, `c`, `d`, `description`, `cellType`, `tissue`, `antibody`, `treatment`, `dataSource`, `filename`, `qValue`, `size`. + +- `userSet` and `dbSet` are **1-based** in the output data.table (R convention), even though they are 0-based internally. +- `pValueLog` is `-log10(p)` from Fisher's exact test, capped at ~322. +- `oddsRatio` is the CMLE odds ratio — matches `fisher.test()$estimate`, not the simple `(a·d)/(b·c)` point estimate. +- `qValue` is Benjamini-Hochberg adjusted, computed **per user set independently**. +- Rows are sorted by descending `pValueLog`, then ascending `meanRnk` — identical to R LOLA output order. + +### Basic example + +```r +library(gtars) + +regionDB <- loadRegionDB("LOLACore/hg38") + +# Polymorphic inputs — RegionSet, GRanges, file path all accepted +userSets <- list( + RegionSet("condition_a.bed"), + RegionSet("condition_b.bed"), +) +universe <- "universe.bed" # file path works directly + +results <- runLOLA(userSets, universe, regionDB, + minOverlap = 1, + direction = "enrichment") + +# Top 20 hits for user set 1 +head(results[userSet == 1], 20) +``` + +### Two-condition comparison + +```r +# Differential enrichment — use the union of both conditions as the universe +restricted <- buildRestrictedUniverse(list(peaks_a, peaks_b)) + +results <- runLOLA(list(peaks_a, peaks_b), restricted, regionDB) + +# Top hits unique to condition A +head(results[userSet == 1 & qValue < 0.05][order(-pValueLog)], 20) +``` + +## Universe diagnostics + +LOLA results are only meaningful if your universe **contains** your user sets: every user region should overlap at least one universe region. + +### `checkUniverseAppropriateness` + +```r +report <- checkUniverseAppropriateness(userSets, userUniverse) +# data.frame with columns: +# user_set, total_regions, regions_in_universe, coverage, many_to_many +# Warnings are emitted via warning() for low coverage or many-to-many mappings. +``` + +Warnings fire when coverage is below 50% (severe) or 90% (moderate), or when any user region overlaps more than one universe region. + +### `redefineUserSets` + +Rewrite each user set in terms of universe regions — eliminates many-to-many mapping artifacts. + +```r +redefined <- redefineUserSets(userSets, userUniverse) +# Returns a list of RegionSet objects. + +# Use directly +results <- runLOLA(redefined, userUniverse, regionDB) + +# Or pass redefineUserSets = TRUE to runLOLA to do this automatically +results <- runLOLA(userSets, userUniverse, regionDB, redefineUserSets = TRUE) +``` + +### `buildRestrictedUniverse` + +For differential enrichment: build a universe that is exactly the union of all user sets, disjoined into non-overlapping pieces (R LOLA's `disjoin(unlist(userSets))`): + +```r +restricted <- buildRestrictedUniverse(list(peaks_a, peaks_b, peaks_c)) +# Returns a RegionSet +``` + +## Porting from R LOLA + +Most LOLA scripts port by changing one line — replacing `library(LOLA)` with `library(gtars)`. The core API is compatible: + +| R LOLA | gtars R | notes | +|---|---|---| +| `loadRegionDB()` | `loadRegionDB()` | same signature | +| `runLOLA()` | `runLOLA()` | same signature; returns `data.table` | +| `checkUniverseAppropriateness()` | `checkUniverseAppropriateness()` | same signature | +| `redefineUserSets()` | `redefineUserSets()` | same signature; returns list of `RegionSet` | +| `writeCombinedEnrichment()` | *(use `data.table::fwrite`)* | output table is already in the right format | +| `extractEnrichmentOverlaps()` | *(not implemented)* | file an issue if needed | +| `getRegionFile()` | `getRegionSets(regionDB, index)` | returns `RegionSet`, not `GRanges` — call `as_granges()` to convert | + +The notable differences: + +- **Return type is `data.table`, not `data.frame`.** R LOLA returns a plain data.frame; gtars returns a `data.table` by default. Convert with `as.data.frame()` if you prefer. +- **Faster p-value and odds ratio.** The p-value uses the survival function (no cancellation at small tails), and the odds ratio is the CMLE (not the point estimate). Expect tail significance to be reported more precisely. + +## End-to-end example + +```r +library(gtars) +library(data.table) + +# 1. Load the database once +regionDB <- loadRegionDB("LOLACore/hg38") +cat(sprintf("Loaded %d region sets\n", length(listRegionSets(regionDB)))) + +# 2. Load user sets and universe — polymorphic inputs +peaks_a <- "peaks_condition_a.bed" +peaks_b <- "peaks_condition_b.bed" +universe <- "universe.bed" + +# 3. Universe sanity check +diag <- checkUniverseAppropriateness(list(peaks_a, peaks_b), universe) +print(diag) + +# 4. Run enrichment with auto-redefinition +results <- runLOLA( + list(peaks_a, peaks_b), universe, regionDB, + minOverlap = 1, + redefineUserSets = TRUE, + direction = "enrichment", +) + +# 5. Top hits per user set +for (us in unique(results$userSet)) { + cat(sprintf("\n=== User set %d — top 10 ===\n", us)) + top <- head( + results[userSet == us][order(-pValueLog)], + 10, + ) + print(top[, .(filename, cellType, pValueLog, oddsRatio, support, qValue)]) +} + +# 6. Write to TSV matching R LOLA's writeCombinedEnrichment format +fwrite(results, "lola_results.tsv", sep = "\t") +``` + +## See also + +- **[RegionSet & RegionSetList (R)](regionset.md)** — the S4 classes used for user sets and extracted database sets. +- **[R GenomicDistributions wrappers](genomicdist.md)** — statistics and partition analysis. +- **[R IGD interface](igd.md)** — IGD is the overlap index that backs a `RegionDB`. +- **[gtars-lola](../lola.md)** — Rust reference with the full statistical detail. +- **[R LOLA package](https://www.bioconductor.org/packages/release/bioc/html/LOLA.html)** — the original Bioconductor package this is a port of. diff --git a/docs/gtars/r/regionset.md b/docs/gtars/r/regionset.md new file mode 100644 index 00000000..0cc62d40 --- /dev/null +++ b/docs/gtars/r/regionset.md @@ -0,0 +1,204 @@ +# RegionSet & RegionSetList (R) + +The `gtars` R package exposes S4 classes `RegionSet` and `RegionSetList` — the R-idiomatic equivalents of Bioconductor's `GRanges` and `GRangesList`, backed by a high-performance Rust implementation via `externalptr`. + +Unlike most gtars bindings, the R API is **polymorphic on input type**: nearly every function that takes a query will accept a `RegionSet`, a `GRanges` object, a file path, or a `data.frame` interchangeably. This lets you mix existing Bioconductor workflows with gtars without converting explicitly. + +For the underlying semantics see [gtars-core](../core.md) (core types) and [gtars-genomicdist](../genomicdist.md) (statistics, set algebra). + +## Loading the package + +```r +library(gtars) +``` + +## `RegionSet` + +An S4 class representing a set of genomic regions. Internally wraps a Rust `RegionSet` pointer plus an R-side strand vector. + +### Construction + +`RegionSet()` accepts any of these inputs: + +```r +# 1. File path (BED, bed.gz, narrowPeak, etc.) +rs <- RegionSet("peaks.bed") + +# 2. GRanges object (requires GenomicRanges) +library(GenomicRanges) +gr <- GRanges(seqnames = c("chr1", "chr1"), + ranges = IRanges(start = c(100, 500), end = c(200, 600)), + strand = c("+", "-")) +rs <- RegionSet(gr) # coords converted from 1-based closed to 0-based half-open + +# 3. data.frame with chr/start/end (and optional strand) +df <- data.frame(chr = c("chr1", "chr1"), + start = c(100, 500), + end = c(200, 600), + strand = c("+", "-")) +rs <- RegionSet(df) + +# 4. An existing RegionSet (returned as-is) +rs2 <- RegionSet(rs) +``` + +`as_regionset()` is an alias for `RegionSet()`. + +!!! warning "0-based half-open, BED convention" + R gtars uses **0-based half-open** coordinates internally, matching BED convention (not GRanges' 1-based closed convention). Constructor/accessor functions convert automatically when bridging to `GRanges`; if you're building a RegionSet from a data.frame directly, the `start` and `end` columns are expected in 0-based half-open form. + +### Basic S4 methods + +```r +length(rs) # number of regions +show(rs) # pretty-print summary (first 5 regions) +rs # same — calls show() + +as.data.frame(rs) # convert to a data.frame with chr/start/end/strand +rs[1:10] # subset by index (numeric, logical, or character) +rs[rs$strand == "+"] # Note: use as.data.frame(rs)$strand for filtering +``` + +### Converting to `GRanges` + +```r +gr <- as_granges(rs) +# Coordinates converted from 0-based half-open to 1-based closed. +# Strand information is preserved. +``` + +Requires the `GenomicRanges` package to be installed. + +### Statistics methods + +All of these accept `RegionSet`, `GRanges`, file path, or data.frame as input: + +```r +widths(rs) # numeric vector of region widths (end - start) +neighborDistances(rs) # signed gaps between consecutive regions per chromosome +nearestNeighbors(rs) # per-region min neighbor distance +chromosomeStatistics(rs) # named list of per-chromosome stats + +distribution(rs, nBins = 250) # bin counts across the genome +distribution(rs, nBins = 250, chromSizes = hg38) # with reference chrom sizes + +clusterRegions(rs, maxGap = 5000L) # cluster id per region +``` + +!!! warning "Output length for `neighborDistances` / `nearestNeighbors`" + Both skip chromosomes with only one region (matching R GenomicDistributions). The returned vector is **not aligned 1:1** with the input — it's shorter than `length(rs)` whenever any chromosome has a single peak. + +!!! warning "`nBins` is a target, not a total" + In `distribution(chromSizes = ...)`, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. + +### Interval set algebra + +R gtars overrides the standard Bioconductor generics (`union`, `intersect`, `setdiff`, `reduce`, `promoters`, `shift`, `flank`, `resize`, `narrow`, `disjoin`, `gaps`, `findOverlaps`, `countOverlaps`) plus adds gtars-specific ones (`trim`, `pintersect`, `concat`, `jaccard`). + +```r +merged <- reduce(rs) +trimmed <- trim(rs, chromSizes = hg38) +proms <- promoters(rs, upstream = 2000L, downstream = 200L) +shifted <- shift(rs, shift = 100L) +resized <- resize(rs, width = 500L, fix = "center") +disjoint <- disjoin(rs) +gapped <- gaps(rs, chrom_sizes = hg38) + +# Binary operations +u <- union(rs1, rs2) +i <- intersect(rs1, rs2) +d <- setdiff(rs1, rs2) +pi <- pintersect(rs1, rs2) # pairwise (by index) +c <- concat(rs1, rs2) + +j <- jaccard(rs1, rs2) # scalar Jaccard similarity + +# Overlap queries +hits <- findOverlaps(rs1, rs2) +counts <- countOverlaps(rs1, rs2) +``` + +Methods dispatch on `("RegionSet", "RegionSet")`, `("RegionSet", "ANY")`, and `("ANY", "RegionSet")` so you can mix in `GRanges`, file paths, or data.frames on either side: + +```r +# Works — second argument is a file path +j <- jaccard(rs, "other_peaks.bed") + +# Works — first argument is a GRanges +merged <- union(gr, rs) +``` + +### Consensus regions + +```r +cons <- consensus(list(rep1_rs, rep2_rs, rep3_rs)) +# Returns a data.frame with chr, start, end, count columns. +# 'count' is the number of input sets overlapping each union region. + +# Keep regions present in ≥ 2/3 replicates +robust <- cons[cons$count >= 2, ] +``` + +## `RegionSetList` + +An S4 class for collections of `RegionSet`s — the gtars equivalent of `GRangesList`. Provides the single most efficient path for operating on many region sets at once, since pointers are passed between R and Rust without copying. + +### Construction + +```r +# Variadic: any mix of RegionSet / file path / GRanges / data.frame +rsl <- RegionSetList( + RegionSet("rep1.bed"), + RegionSet("rep2.bed"), + RegionSet("rep3.bed"), +) + +# Equivalent: single list argument +rsl <- RegionSetList(list(rs1, rs2, rs3)) + +# Or directly from file paths — each is auto-wrapped via RegionSet() +rsl <- RegionSetList("rep1.bed", "rep2.bed", "rep3.bed") + +# Empty list +empty <- RegionSetList() +``` + +### S4 methods + +```r +length(rsl) # number of region sets +show(rsl) # pretty-print with sizes +rsl[[1]] # extract a single RegionSet by index (1-based) +names(rsl) # character vector of names, or NULL +``` + +### Operations + +```r +# Flatten into a single RegionSet (no merge/dedup) +flat <- concat(rsl) +# Apply reduce() on the result if you want the union +merged_all <- reduce(concat(rsl)) + +# Full N x N Jaccard similarity matrix +jac <- pairwise_jaccard(rsl) +# Returns a symmetric numeric matrix with 1.0 on the diagonal. +``` + +## Coordinate-system notes + +| system | representation | gtars handling | +|---|---|---| +| BED / gtars | 0-based half-open | **internal** — always used by `RegionSet` | +| GRanges / IRanges | 1-based closed | converted in both directions by `RegionSet(gr)` / `as_granges(rs)` | +| data.frame input | 0-based half-open | passed through as-is — make sure your `start`/`end` columns are BED-style | + +If you're working entirely in GRanges-land, use `as_granges()` to convert back before handing off to Bioconductor workflows. If you're reading BED files and writing results back to BED, stay in `RegionSet` to avoid double conversion. + +## See also + +- **[R GenomicDistributions wrappers](genomicdist.md)** — `calcWidth`, `calcGCContent`, `calcPartitions`, etc. — the drop-in API for users migrating from R GenomicDistributions. +- **[R LOLA interface](lola.md)** — `loadRegionDB`, `runLOLA`, `checkUniverseAppropriateness`, etc. +- **[R IGD interface](igd.md)** — `igd_create` / `igd_search` for low-level IGD access. +- **[gtars-core](../core.md)** — the underlying Rust types. +- **[Core models tour](../regionSet.md)** — Python + Rust walkthrough of the same concepts. diff --git a/docs/gtars/regionSet.md b/docs/gtars/regionSet.md index 3c392227..62af0ab6 100644 --- a/docs/gtars/regionSet.md +++ b/docs/gtars/regionSet.md @@ -135,3 +135,62 @@ rs.to_bigbed("path/to/save/bedfile.bb", chrom_sizes="path/to/chrom.sizes") !!! info - Detailed documentation for RegionSet is available in the [API reference](https://docs.rs/gtars-core/latest/gtars_core/models/region_set/). + +### 🟢 RegionSetList + +`RegionSetList` is the gtars equivalent of Bioconductor's `GRangesList` — an ordered collection of `RegionSet`s with optional names. It's the type downstream crates (genomicdist, lola) use to pass multiple region sets across FFI boundaries without paying N×clone costs. + +#### Example + +=== "Python" + ```python + from gtars.models import RegionSet, RegionSetList + + rs1 = RegionSet("rep1.bed") + rs2 = RegionSet("rep2.bed") + rs3 = RegionSet("rep3.bed") + + # Unnamed + rsl = RegionSetList([rs1, rs2, rs3]) + + # Or with names + rsl = RegionSetList([rs1, rs2, rs3], names=["rep1", "rep2", "rep3"]) + + print(len(rsl)) # number of sets + combined = rsl.concat() # flatten into a single RegionSet (no merge) + set_id = rsl.identifier() # stable order-independent identifier + ``` + +=== "Rust" + ```rust + use gtars_core::models::{RegionSet, RegionSetList}; + + let rs1 = RegionSet::try_from("rep1.bed")?; + let rs2 = RegionSet::try_from("rep2.bed")?; + let rs3 = RegionSet::try_from("rep3.bed")?; + + let rsl = RegionSetList::with_names( + vec![rs1, rs2, rs3], + vec!["rep1".into(), "rep2".into(), "rep3".into()], + ); + + // Iterate + for rs in &rsl { + println!("{} regions", rs.len()); + } + + // Flatten all regions into a single RegionSet (no merge/dedup) + let combined = rsl.concat(); + let id = rsl.identifier(); + # Ok::<(), gtars_core::errors::RegionSetError>(()) + ``` + +`RegionSetList::try_from` in Rust also accepts a **bedset manifest file** (text file listing one BED path per line) or a `Vec<&Path>` / `Vec<&str>` / `Vec` / `Vec`. + +`concat()` flattens without merging; if you need a reduced union, call `.reduce()` on the result — that method comes from the `IntervalRanges` trait in [gtars-genomicdist](genomicdist.md). + +## See also + +- **[gtars-core](core.md)** — the canonical Rust API reference for `Region`, `RegionSet`, `RegionSetList`, `Interval`, `Fragment`, `CoordinateMode`, and `RegionSetError`. +- **[gtars-genomicdist](genomicdist.md)** — the `IntervalRanges` and `GenomicIntervalSetStatistics` traits that extend `RegionSet` with set-algebra and summary stats. +- **[gtars-lola](lola.md)** — LOLA enrichment, which consumes `RegionSetList` for user-set and database-set inputs. diff --git a/docs/gtars/wasm.md b/docs/gtars/wasm.md index 26caaca9..b01e4b1c 100644 --- a/docs/gtars/wasm.md +++ b/docs/gtars/wasm.md @@ -1,61 +1,89 @@ # gtars Wasm/JS -We provide WebAssembly (Wasm) bindings for gtars functionality, enabling genomic interval analysis in JavaScript/TypeScript environments. This allows for client-side processing of genomic data without the need for server-side computation. +WebAssembly bindings for gtars, enabling genomic interval analysis directly in browsers and Node — no server round-trip required. The package is `@databio/gtars-js`; the underlying Rust crate is `gtars-wasm`. ## Features -- Zero-installation genomic tools -- Client-side processing (no server required) -- JavaScript/TypeScript interoperability +- Zero-installation genomic tools running client-side. +- Full `RegionSet` interval algebra and summary statistics. +- Genomic distribution analysis: partitions, signal matrices, consensus regions. +- LOLA enrichment testing with in-memory region databases. +- TypeScript-friendly — the published package ships `.d.ts` declarations generated from the Rust source. ## Installation -You can install the gtars Wasm package via npm: ```bash npm install @databio/gtars-js ``` -## Usage -We don't currently provide full feature parity bindings, but some functionality is available. +## Quick start -```typescript -import init, { Overlapper } from 'gtars-js'; +Every Wasm entry point needs the module to be initialized once before use. In ESM environments with top-level await, that's a single line: + +```ts +import init, { RegionSet, Overlapper } from '@databio/gtars-js'; await init(); -import { Overlapper } from '@databio/gtars-js'; +const peaks = new RegionSet([ + ['chr1', 100, 200, 'peak1'], + ['chr2', 150, 250, 'peak2'], + ['chr3', 300, 400, 'peak3'], +]); + +console.log(`${peaks.numberOfRegions} regions, mean width ${peaks.meanRegionWidth}`); +``` -const universe = [ - ['chr1', 100, 200], - ['chr1', 150, 250], - ['chr2', 300, 400], -] +## Submodule pages -const overlapper = new Overlapper(universe, 'ailist'); -console.log(`Using backend: ${overlapper.get_backend()}`); +| page | covers | +|---|---| +| [**Overlappers**](wasm/overlappers.md) | `Overlapper` — low-level interval overlap engine with AIList / BITS backends. | +| [**RegionSet & models**](wasm/regionset.md) | `RegionSet`, `RegionSetList`, `ConsensusBuilder`, `ChromosomeStatistics`, `RegionDistribution`, `BedClassificationOutput`, and all the per-`RegionSet` statistics and interval algebra methods. | +| [**Partitions**](wasm/partitions.md) | `GeneModel`, `PartitionList`, `calcPartitions`, `calcExpectedPartitions` — classify regions into promoter / UTR / exon / intron / intergenic. | +| [**Signal matrix**](wasm/signal.md) | `SignalMatrix` and `calcSummarySignal` — overlay region sets on region × condition signal matrices. | +| [**LOLA enrichment**](wasm/lola.md) | `LolaRegionDB`, `runLOLA`, `checkUniverseAppropriateness` — Fisher's exact test enrichment against a database of reference region sets. | -const query = ['chr1', 180, 220]; -const overlaps = overlapper.find(query); +## Limitations vs. the Rust/Python APIs -console.log(`Overlaps for ${query}:`, overlaps); -``` +Wasm is a constrained runtime — a few crate features are not exposed: -## Integration +- **No filesystem loaders.** `GeneModel::from_gtf`, `GeneModel::from_bed_files`, `SignalMatrix::from_tsv`, and `RegionDB::from_lola_folder` are not available. Everything is constructed from in-memory JS data or from packed binary buffers (e.g. `.sigm`, `.fab`, `.gda`) fetched over the network. +- **No `redefine_user_sets` / `build_restricted_universe`.** These LOLA universe helpers are Rust-only at the moment; use `checkUniverseAppropriateness` for diagnostics and replicate the set-algebra logic client-side with `RegionSetList` operations if you need the rewriting behavior. +- **Refget is exposed as a separate module** — see the refget binding docs (currently being written). -### An example React component -```jsx +## Integration example — React + +```tsx import { useEffect, useState } from 'react'; -import init, { TreeTokenizer } from 'gtars-js'; +import init, { RegionSet, ConsensusBuilder } from '@databio/gtars-js'; -function TokenizerComponent() { - const [tokenizer, setTokenizer] = useState(null); +function ConsensusComponent({ replicates }: { replicates: Array<[string, number, number, string][]> }) { + const [ready, setReady] = useState(false); + const [robust, setRobust] = useState(null); useEffect(() => { - init().then(() => { - setTokenizer(new TreeTokenizer()); - }); + init().then(() => setReady(true)); }, []); - // Use tokenizer... + useEffect(() => { + if (!ready) return; + const cb = new ConsensusBuilder(); + for (const rep of replicates) { + cb.add(new RegionSet(rep)); + } + const cons = cb.compute(); + setRobust(cons.filter((r) => r.count >= 2).length); + }, [ready, replicates]); + + if (!ready) return

Loading gtars-js…

; + return

{robust} regions present in ≥ 2 replicates

; } ``` + +## Performance notes + +- `RegionSet` construction sorts on load. Repeated operations don't re-sort. +- Overlap queries (`jaccard`, `setdiff`, `intersect`, etc.) route through the AIList index under the hood, giving O(log n) per-query lookups. +- Statistics methods are implemented in Rust and called via zero-copy boundary — they're substantially faster than pure-JS equivalents for typical peak-set sizes. +- For very large numeric outputs, the Wasm glue converts between Rust `Vec` and JS arrays — if you're hitting GC pressure, consider batching calls or keeping `RegionSet` handles around instead of returning raw arrays. diff --git a/docs/gtars/wasm/lola.md b/docs/gtars/wasm/lola.md new file mode 100644 index 00000000..88c87b30 --- /dev/null +++ b/docs/gtars/wasm/lola.md @@ -0,0 +1,190 @@ +# LOLA enrichment (Wasm) + +Wasm bindings for [gtars-lola](../lola.md) — LOLA (Locus Overlap Analysis) enrichment testing, runnable entirely in the browser. Given a user set of regions, a universe, and a region database (built in memory from API-served BED data), computes Fisher's exact test enrichment of the user set against every database region set. + +See the [Rust gtars-lola reference](../lola.md) for the full statistical details — Fisher's exact test via hypergeometric survival function, CMLE odds ratio, contingency table semantics, and compatibility notes with R LOLA. This page is a Wasm-focused binding reference. + +!!! note "In-memory only — no folder loading" + The Wasm binding's `LolaRegionDB` is built entirely from in-memory data (`{regions, name}` entries passed in at construction). The Rust `RegionDB::from_lola_folder` loader is not available in Wasm — it requires a filesystem. In practice you'd fetch the region database from an API and pass the regions directly into the constructor, or build it from a pre-indexed IGD on the server side. + +## Import + +```ts +import init, { + LolaRegionDB, + runLOLA, + checkUniverseAppropriateness, +} from '@databio/gtars-js'; + +await init(); +``` + +## `LolaRegionDB` + +### Construction + +Construct from an array of `{ regions, name }` objects, where `regions` is an array of `[chr, start, end]` tuples and `name` is the filename-like identifier: + +```ts +const dbEntries = [ + { + name: 'encode_k562_ctcf.bed', + regions: [ + ['chr1', 100, 200], + ['chr1', 400, 500], + ['chr2', 1000, 1100], + ], + }, + { + name: 'encode_hela_ctcf.bed', + regions: [ + ['chr1', 150, 250], + ['chr2', 1050, 1150], + ], + }, +]; + +const db = new LolaRegionDB(dbEntries); +``` + +The constructor builds an IGD overlap index internally and wraps it with minimal annotation (filename only). To attach richer per-file metadata — cell type, tissue, antibody, etc. — build the annotated `RegionDB` on the Rust side and serve it through a different transport (e.g. a custom binary format), or expose a method that accepts `RegionSetAnno` objects alongside the regions. + +### Accessors + +```ts +db.numRegionSets // number +db.listRegionSets() // string[] — filenames +db.collectionAnno // Array of { collectionname, collector, date, source, description } + +// Extract region sets by 0-based index as a RegionSetList +// (pass null for "all sets"; names are populated from filenames) +const rsl = db.getRegionSets(null); +const rsl2 = db.getRegionSets([0, 5, 12]); +``` + +`getRegionSets()` returns a [`RegionSetList`](regionset.md#regionsetlist) with `.names` populated from the database filenames — the one path in Wasm where a `RegionSetList` has non-null names. + +## `runLOLA` + +Runs Fisher's exact test for every `(user_set, db_set)` pair and returns a column-oriented object suitable for dropping directly into a DataFrame or table. + +```ts +runLOLA( + userSets: Array>, + universe: Array<[string, number, number]>, + regionDb: LolaRegionDB, + minOverlap?: number, // default 1 + direction?: string, // "enrichment" (default) | "depletion" +): LolaResultColumnar +``` + +**Parameters:** +- `userSets` — **array of user sets**, where each user set is an array of `[chr, start, end]` tuples. Pass a single-element outer array `[peaks]` for a one-user-set analysis. +- `universe` — a single array of `[chr, start, end]` tuples representing the background. +- `regionDb` — a `LolaRegionDB`. +- `minOverlap` — minimum bp overlap to count as overlapping (default 1). +- `direction` — `"enrichment"` (default, P(X ≥ a), alternative "greater") or `"depletion"` (P(X ≤ a), alternative "less"). The strings `"greater"` / `"less"` are accepted as aliases on the Python side but Wasm only recognizes `"enrichment"` / `"depletion"` explicitly; anything else falls back to enrichment. + +**Returns** an object with parallel arrays (one entry per `(user_set, db_set)` pair) using camelCase keys: + +| key | type | meaning | +|---|---|---| +| `userSet` | `number[]` | 0-based user-set index | +| `dbSet` | `number[]` | 0-based db-set index | +| `collection` | `(string \| null)[]` | per-file annotation, usually null in Wasm | +| `pValueLog` | `number[]` | `-log10(p)` from Fisher's exact test, capped at ~322 | +| `oddsRatio` | `number[]` | CMLE odds ratio (matches R `fisher.test()$estimate`) | +| `support` | `number[]` | overlap count between user set and db set | +| `rnkPv`, `rnkOr`, `rnkSup` | `number[]` | 1-based per-metric ranks within the user set | +| `maxRnk` | `number[]` | max of the three ranks | +| `meanRnk` | `number[]` | mean of the three ranks | +| `b`, `c`, `d` | `number[]` | signed contingency values (can be negative) | +| `qValue` | `(number \| null)[]` | BH-adjusted p-value (applied automatically inside `runLOLA`) | +| `description`, `cellType`, `tissue`, `antibody`, `treatment`, `dataSource` | `(string \| null)[]` | per-file metadata | +| `filename` | `string[]` | db set file name | +| `size` | `number[]` | number of regions in the db set | + +Like the Python binding, Wasm `runLOLA` **auto-applies** `annotate_results` + `apply_fdr_correction` internally — you don't need to call them separately. Rows come back sorted by descending `pValueLog`, then ascending `meanRnk` (matching R LOLA output order). + +### Usage example + +```ts +import init, { + LolaRegionDB, + runLOLA, + RegionSet, +} from '@databio/gtars-js'; + +await init(); + +// Helper to convert a RegionSet to the tuple form LOLA expects +function toTuples(rs: RegionSet): Array<[string, number, number]> { + // Iterate rs and pull chr/start/end — or keep the original BedEntries + // if you still have them in scope from constructing the RegionSet. + // ... +} + +// 1. Build the region database from API-fetched data +const dbEntries = await fetchDatabaseRegions(); // [{ name, regions }, ...] +const db = new LolaRegionDB(dbEntries); + +// 2. Prepare user sets and universe as tuple arrays +const userSets = [userPeakTuples]; +const universe = universeTuples; + +// 3. Run enrichment +const results = runLOLA(userSets, universe, db, 1, 'enrichment'); + +// 4. Pivot to row-oriented view for rendering +const n = results.pValueLog.length; +for (let i = 0; i < n; i++) { + console.log( + `${results.filename[i]} ` + + `p=${results.pValueLog[i].toFixed(2)} ` + + `OR=${results.oddsRatio[i].toFixed(2)} ` + + `support=${results.support[i]}` + ); +} +``` + +!!! note "Negative contingency values" + If your user set contains regions *outside* the universe, the contingency table produces negative `b`/`c`/`d`. The binding matches R LOLA's behavior — it logs a warning, stores the signed values, and returns `pValueLog = 0.0`, `oddsRatio = NaN` for that row. Pre-process with `checkUniverseAppropriateness` (below) to detect this before running enrichment. + +## `checkUniverseAppropriateness` + +Diagnostic check — for each user set, reports what fraction of user-set regions overlap the universe and whether there are many-to-many mappings. + +```ts +const report = checkUniverseAppropriateness(userSets, universe); +// Returns a column-oriented object: +// { +// userSet: number[], +// totalRegions: number[], +// regionsInUniverse: number[], +// coverage: number[], // 0.0 to 1.0 per user set +// manyToMany: number[], // count of user regions overlapping >1 universe region +// warnings: string[] // free-form warning messages, pooled across user sets +// } + +const n = report.userSet.length; +for (let i = 0; i < n; i++) { + console.log( + `user set ${report.userSet[i]}: ` + + `${(report.coverage[i] * 100).toFixed(1)}% coverage, ` + + `${report.manyToMany[i]} many-to-many` + ); +} +for (const w of report.warnings) { + console.warn('⚠', w); +} +``` + +Warnings fire when coverage drops below 50% (severe) or 90% (moderate), or when any user region overlaps more than one universe region. + +!!! note "Rust-only helpers not exposed in Wasm" + The Rust universe module has two additional helpers — `redefine_user_sets` (rewrite user sets in terms of universe regions) and `build_restricted_universe` (disjoint union of all user sets, for differential analysis). Neither is currently exposed in the Wasm binding. If you need them, you can replicate the logic client-side using `RegionSetList` operations from [wasm/regionset](regionset.md), or request a binding. + +## See also + +- **[wasm/regionset](regionset.md)** — `RegionSet`, `RegionSetList`, and the `ConsensusBuilder` used upstream of LOLA analyses. +- **[gtars-lola](../lola.md)** — full Rust API reference, including the contingency table math, p-value computation, and CMLE odds ratio details. diff --git a/docs/gtars/wasm/partitions.md b/docs/gtars/wasm/partitions.md new file mode 100644 index 00000000..d86074aa --- /dev/null +++ b/docs/gtars/wasm/partitions.md @@ -0,0 +1,152 @@ +# Partitions (Wasm) + +Wasm bindings for the gtars-genomicdist partition system — classify query regions into promoter / UTR / exon / intron / intergenic buckets given a gene model. Everything is in-memory; there is no GTF loader on the JS side (because GTF loading requires filesystem access), so you either build a `GeneModel` from pre-parsed region arrays or lift one out of a `GenomicDistAnnotation` served from the API. + +See [gtars-genomicdist → Partitions](../genomicdist.md#partitions) for the full semantics — priority order, mutually-exclusive vs. bp-proportion modes, R GenomicDistributions divergences, and the chi-square p-value caveat. + +## Import + +```ts +import init, { + GeneModel, + PartitionList, + calcPartitions, + calcExpectedPartitions, +} from '@databio/gtars-js'; + +await init(); +``` + +## `GeneModel` + +Constructed from four JS arrays of region tuples: `genes`, `exons`, and optional `three_utr` / `five_utr`. Each region tuple is `[chr, start, end]` or `[chr, start, end, strand]`. When strand is provided, `+`/`-` become stranded; any other character becomes `Unstranded`. + +```ts +const genes = [ + ['chr1', 1000, 5000, '+'], + ['chr1', 8000, 12000, '-'], +]; +const exons = [ + ['chr1', 1000, 2000, '+'], + ['chr1', 4500, 5000, '+'], + ['chr1', 11500, 12000, '-'], +]; +const threeUtr = [['chr1', 4800, 5000, '+']]; +const fiveUtr = [['chr1', 1000, 1100, '+']]; + +// Any of three_utr / five_utr can be null if you don't have UTR annotations +const model = new GeneModel(genes, exons, threeUtr, fiveUtr); +``` + +!!! note "No GTF loader in Wasm" + The Rust `GeneModel::from_gtf` and `GeneModel::from_bed_files` methods are **not** available in Wasm — they require filesystem access. In the browser, typically: + + - Fetch a pre-built `.gda` (Genomic Dist Annotation) binary from the BEDbase API and decode it, **or** + - Parse regions yourself from JSON / TSV over the network and pass them into the `GeneModel` constructor. + +## `PartitionList` + +Build an ordered, priority-based partition list from a `GeneModel`. Partition priority order is hard-coded: `promoterCore` → `promoterProx` → `threeUTR` → `fiveUTR` → `exon` → `intron` (with `intergenic` added implicitly by `calcPartitions`). + +```ts +const chromSizes = { chr1: 248956422 }; + +const pl = PartitionList.fromGeneModel( + model, + 100, // core promoter size (bp upstream of gene start) + 2000, // proximal promoter size + chromSizes, // optional — pass null to skip boundary trimming +); +``` + +Pass `chromSizes: null` if you don't need promoter trimming at chromosome boundaries. + +## `calcPartitions` + +Classify query regions into partition categories. Two modes: + +- **Priority mode** (`bpProportion = false`): each query region is assigned to the first partition it overlaps; everything else goes to `intergenic`. Mutually exclusive, one region → one partition. +- **BP proportion mode** (`bpProportion = true`): for each partition, computes the total overlapping base pairs of the query set. **Not** mutually exclusive — a region that straddles two partitions contributes bp to each. + +```ts +const peaks: RegionSet = /* ... */; + +// Priority mode (region counts) +const result = calcPartitions(peaks, pl, false); +// { +// partitions: [{ name: 'promoterCore', count: 42 }, ...], +// total: 523 +// } + +for (const { name, count } of result.partitions) { + const pct = (count / result.total) * 100; + console.log(`${name.padEnd(15)} ${String(count).padStart(6)} (${pct.toFixed(1)}%)`); +} +``` + +## `calcExpectedPartitions` + +Observed vs. expected partition enrichment with a chi-square test. Requires chromosome sizes to compute the expected fraction based on each partition's share of the genome. + +```ts +const chromSizes = { chr1: 248956422, chr2: 242193529 }; + +const expected = calcExpectedPartitions(peaks, pl, chromSizes, false); +// Array of { partition, observed, expected, log10Oe, pvalue } + +for (const row of expected) { + console.log( + `${row.partition.padEnd(15)} ` + + `obs=${row.observed.toFixed(0).padStart(6)} ` + + `exp=${row.expected.toFixed(1).padStart(10)} ` + + `log10(O/E)=${row.log10Oe >= 0 ? '+' : ''}${row.log10Oe.toFixed(2)} ` + + `p=${row.pvalue.toExponential(2)}` + ); +} +``` + +Result fields use camelCase: `partition`, `observed`, `expected`, `log10Oe`, `pvalue`. + +!!! warning "p-values will differ slightly from R GenomicDistributions" + `calcExpectedPartitions` uses a goodness-of-fit `(O-E)²/E` formula. R's `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R output byte-for-byte. The `log10Oe` column is directly comparable. + +## End-to-end example + +```ts +import init, { + RegionSet, + GeneModel, + PartitionList, + calcExpectedPartitions, +} from '@databio/gtars-js'; + +await init(); + +// 1. Build query RegionSet from BED-like entries +const peaks = new RegionSet([ + ['chr1', 1500, 1800, ''], + ['chr1', 4600, 4900, ''], + ['chr1', 11700, 12000, ''], +]); + +// 2. Build GeneModel — in a real app you'd fetch annotations from an API +const model = new GeneModel( + [['chr1', 1000, 5000, '+'], ['chr1', 8000, 12000, '-']], + [['chr1', 1000, 2000, '+'], ['chr1', 4500, 5000, '+'], ['chr1', 11500, 12000, '-']], + null, + null, +); + +const chromSizes = { chr1: 248956422 }; +const pl = PartitionList.fromGeneModel(model, 100, 2000, chromSizes); + +// 3. Compute observed vs expected partition enrichment +const enriched = calcExpectedPartitions(peaks, pl, chromSizes, false); +console.table(enriched); +``` + +## See also + +- **[wasm/regionset](regionset.md)** — the `RegionSet` type used as the query input. +- **[wasm/signal](signal.md)** — signal matrix overlap. +- **[gtars-genomicdist → Partitions](../genomicdist.md#partitions)** — Rust reference with the priority-order and mutually-exclusive-vs-bp-proportion semantics. diff --git a/docs/gtars/wasm/regionset.md b/docs/gtars/wasm/regionset.md new file mode 100644 index 00000000..bef2a9e4 --- /dev/null +++ b/docs/gtars/wasm/regionset.md @@ -0,0 +1,255 @@ +# RegionSet & models (Wasm) + +The `@databio/gtars-js` Wasm bindings expose gtars' core interval-analysis types directly to JavaScript and TypeScript, with no server round-trip. This page covers everything in the `regionset` Wasm module — `RegionSet`, `RegionSetList`, `ConsensusBuilder`, and the result types for statistics and BED classification. + +See the Rust reference pages for the underlying semantics: + +- [gtars-core](../core.md) — `Region`, `RegionSet`, `RegionSetList` +- [gtars-genomicdist](../genomicdist.md) — the stats, interval algebra, and consensus/classifier methods that `RegionSet` gains in Wasm + +## Import + +```ts +import init, { + RegionSet, + RegionSetList, + ConsensusBuilder, +} from '@databio/gtars-js'; + +await init(); // initializes the wasm module; top-level await in ESM environments +``` + +## `RegionSet` + +A sorted collection of genomic regions constructed from an array of `[chr, start, end, rest]` tuples. Construction sorts by `(chr, start)`. + +### Construction + +```ts +// BED-like entries: [chr, start, end, rest] +// `rest` is any trailing BED metadata — pass "" if you have BED3 data +const entries: [string, number, number, string][] = [ + ['chr1', 100, 200, 'peak1'], + ['chr1', 300, 400, 'peak2'], + ['chr2', 500, 600, 'peak3'], +]; + +const rs = new RegionSet(entries); +``` + +### Properties + +```ts +rs.numberOfRegions // number +rs.meanRegionWidth // number +rs.nucleotidesLength // number (total bp) +rs.identifier // string — canonical MD5 id over first 3 columns +rs.firstRegion // string — debug repr of the first region +rs.classify // BedClassificationOutput (see below) +``` + +### Statistics + +```ts +// Per-chromosome summary stats — returns { [chr]: ChromosomeStatistics } +const stats = rs.chromosomeStatistics(); + +// Region widths (end - start) +const widths: number[] = rs.calcWidths(); + +// Signed gaps between consecutive regions (positive gaps only) +const gaps: number[] = rs.calcNeighborDistances(); + +// Per-region min-neighbor distance +const nn: number[] = rs.calcNearestNeighbors(); + +// Single-linkage clusters at a stitching radius; returns a cluster id per region +const clusterIds: number[] = rs.cluster(5_000); +``` + +!!! warning "Output length for `calcNeighborDistances` / `calcNearestNeighbors`" + Both methods **skip chromosomes with only one region** (matching R GenomicDistributions). The returned array is therefore **not aligned 1:1 with input regions** — it's shorter than `rs.numberOfRegions` whenever any chromosome has a single peak. No sentinel values are emitted. + +### Region distribution + +```ts +// Bin counts for plotting; chrom_sizes is optional +const chromSizes = { chr1: 248956422, chr2: 242193529 }; + +// With chrom_sizes → reference-aligned bins (comparable across files) +const dist = rs.regionDistribution(250, chromSizes); + +// Without chrom_sizes → bins derived from observed max end (NOT comparable) +const distLocal = rs.regionDistribution(250, null); +// Array of { chr, start, end, n, rid } +``` + +!!! warning "`n_bins` is a target, not a total" + When `chrom_sizes` is provided, `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total bin count across chromosomes is `sum(ceil(size / bin_width))`, which can substantially exceed `n_bins`. + +### Interval set algebra + +All operations return a new `RegionSet`. Metadata (`rest` fields) is dropped by operations that merge or synthesize new intervals. + +```ts +const chromSizes = { chr1: 248956422 }; + +const trimmed = rs.trim(chromSizes); +const merged = rs.reduce(); +const disjoint = rs.disjoin(); +const proms = rs.promoters(2000, 0); +const shifted = rs.shift(100); +const resized = rs.resize(500, 'center'); +const flanked = rs.flank(1000, true, false); +const narrowed = rs.narrow(100, 200, null); +const gapped = rs.gaps(chromSizes); + +const diff = rs.setdiff(other); +const pInter = rs.pintersect(other); +const inter = rs.intersect(other); +const combined = rs.concat(other); +const unioned = rs.union(other); + +const jac: number = rs.jaccard(other); +``` + +## `RegionSetList` + +A collection of named `RegionSet`s — the gtars equivalent of Bioconductor's `GRangesList`. Provides indexed pairwise operations without copying whole `RegionSet`s on every call. + +### Construction + +```ts +// Empty builder + add() +const rsl = new RegionSetList(); +rsl.add(rs1, 'rep1'); +rsl.add(rs2, 'rep2'); +rsl.add(rs3, 'rep3'); + +// Or build directly from BED entries for multiple sets at once +const rsl2 = RegionSetList.fromEntries( + [ + [['chr1', 100, 200, ''], ['chr1', 300, 400, '']], // set 1 + [['chr1', 150, 250, '']], // set 2 + [['chr2', 500, 600, '']], // set 3 + ], + ['rep1', 'rep2', 'rep3'], // names; pass null to leave unnamed +); +``` + +### Accessors + +```ts +rsl.length // number of sets +const rs = rsl.get(0); // RegionSet at index (throws on out-of-range) +rsl.names // string[] or null +const flat = rsl.concat(); // flatten into a single RegionSet (no merge) +``` + +### Indexed pair operations + +These let you compute operations on pairs by index without shuttling full `RegionSet`s across the JS/Wasm boundary — useful for building N×N analysis grids: + +```ts +const n = rsl.regionCount(0); // regions in set 0 +const pCount = rsl.pintersectCount(0, 1); // pairwise intersect count +const jac = rsl.jaccardAt(0, 1); // Jaccard between sets 0 and 1 +const un = rsl.unionAt(0, 1); // union as a new RegionSet +const diff = rsl.setdiffAt(0, 1); // set 0 minus set 1 + +const except = rsl.unionExcept(2); // union of everything but index 2 +``` + +### Bulk operations + +O(n) N-way operations via prefix/suffix arrays. + +```ts +// Union of all sets +const unionAll = rsl.unionAll(); + +// Intersection of all sets +const interAll = rsl.intersectAll(); + +// Prefix/suffix-based leave-one-out: all N "union except i" results in O(n) unions +const bulk = rsl.bulkUnionExcept(); +// { union_regions, union_nucleotides, except_unique: number[] } +// except_unique[i] = number of regions unique to set i vs. the union of all others +``` + +### Pairwise Jaccard matrix + +```ts +const result = rsl.pairwiseJaccard(); +// { matrix: number[][], names: string[] | null } +// +// matrix is symmetric with 1.0 on the diagonal — one row per set, one col per set. +``` + +## `ConsensusBuilder` + +Builder pattern for consensus region analysis — given N input region sets, compute the reduced union annotated with per-region replicate support. + +```ts +import { ConsensusBuilder } from '@databio/gtars-js'; + +const cb = new ConsensusBuilder(); +cb.add(rep1); +cb.add(rep2); +cb.add(rep3); + +const consensus = cb.compute(); +// Array of { chr, start, end, count } + +// Keep regions present in ≥ 2/3 replicates +const robust = consensus.filter(r => r.count >= 2); +console.log(`${robust.length} robust regions`); +``` + +`count` is the number of input **sets** (not regions) that overlap each union region. + +## Result types + +### `ChromosomeStatistics` + +Returned by `rs.chromosomeStatistics()` → `{ [chr: string]: ChromosomeStatistics }`. + +| field | type | +|---|---| +| `chromosome` | `string` | +| `number_of_regions` | `number` | +| `start_nucleotide_position` | `number` — leftmost start | +| `end_nucleotide_position` | `number` — rightmost end | +| `minimum_region_length`, `maximum_region_length` | `number` | +| `mean_region_length`, `median_region_length` | `number` | + +Fields are exposed as getters on a Wasm-bound class. + +### `RegionDistribution` + +Entries in the array returned by `rs.regionDistribution(n_bins, chrom_sizes)`. + +| field | type | +|---|---| +| `chr` | `string` | +| `start`, `end`, `n`, `rid` | `number` | + +`n` is the count of regions whose midpoint falls in the bin; `rid` is the bin's row index within its chromosome. + +### `BedClassificationOutput` + +Returned by the `rs.classify` getter. Identifies the BED/ENCODE subtype based on column analysis (uses the `bedclassifier` feature, enabled by default in the Wasm build). + +| field | type | +|---|---| +| `bed_compliance` | `string` | +| `data_format` | `string` — e.g. `"UcscBed"`, `"EncodeNarrowPeak"`, etc. | +| `compliant_columns`, `non_compliant_columns` | `number` | + +## See also + +- **[wasm/partitions](partitions.md)** — `GeneModel`, `PartitionList`, `calcPartitions` / `calcExpectedPartitions`. +- **[wasm/signal](signal.md)** — `SignalMatrix` and `calcSummarySignal`. +- **[wasm/lola](lola.md)** — `LolaRegionDB` and `runLOLA`. +- **[wasm/overlappers](overlappers.md)** — low-level overlap engine. +- **[gtars-core](../core.md)** and **[gtars-genomicdist](../genomicdist.md)** — Rust reference with algorithmic details and caveats that apply identically in Wasm. diff --git a/docs/gtars/wasm/signal.md b/docs/gtars/wasm/signal.md new file mode 100644 index 00000000..75a3f225 --- /dev/null +++ b/docs/gtars/wasm/signal.md @@ -0,0 +1,140 @@ +# Signal matrix (Wasm) + +Wasm bindings for the gtars-genomicdist signal matrix overlap — overlay query regions on a region × condition matrix of signal values (e.g. a peak × cell-type ChIP intensity matrix), aggregate by MAX per query region, and compute Tukey boxplot statistics per condition. + +See [gtars-genomicdist → Signal matrix overlap](../genomicdist.md#signal-matrix-overlap) for the full algorithmic detail. + +## Import + +```ts +import init, { + SignalMatrix, + calcSummarySignal, +} from '@databio/gtars-js'; + +await init(); +``` + +## `SignalMatrix` + +A region × condition matrix of signal values, loaded from either the packed binary format or built directly from JS data. + +### `SignalMatrix.fromBin` + +Load from a packed binary buffer (the `.sigm` format produced by the Rust `SignalMatrix::save_bin` or the CLI). This is the fast path — drop the bytes of a `.sigm` file served from your API into the constructor. + +```ts +const response = await fetch('/api/signal-matrix/my_matrix.sigm'); +const bytes = new Uint8Array(await response.arrayBuffer()); + +const sm = SignalMatrix.fromBin(bytes); +``` + +### From JS arrays + +For the rare case where you have the signal data in memory already (e.g. parsed from a TSV fetched separately), you can build a `SignalMatrix` directly via the `new SignalMatrix(...)` constructor: + +```ts +const regionIds = [ + 'chr1_100_200', + 'chr1_300_400', + 'chr2_500_600', +]; +const conditionNames = ['K562', 'HeLa', 'GM12878']; + +// Flat row-major array: row i, condition j = values[i * nConditions + j] +const values = new Float64Array([ + 1.2, 0.8, 2.3, // region 0 + 0.4, 1.5, 0.9, // region 1 + 3.1, 2.7, 1.8, // region 2 +]); + +const sm = new SignalMatrix( + regionIds, + conditionNames, + values, + 3, // nRegions + 3, // nConditions +); +``` + +!!! note "Region IDs must be `chr_start_end`" + The constructor parses each `regionIds` entry by splitting on `_` and expects exactly three parts: chromosome, start, end. IDs with more or fewer underscores error out. This matches the R GenomicDistributions convention used by `signal_matrix.tsv` files. + +## `calcSummarySignal` + +Overlap a query region set against a `SignalMatrix`, take the MAX signal per query region per condition, and compute Tukey boxplot statistics per condition. + +```ts +const peaks = new RegionSet([ + ['chr1', 150, 250, ''], + ['chr2', 550, 580, ''], +]); + +const result = calcSummarySignal(peaks, sm); + +console.log(result.conditionNames); +// ['K562', 'HeLa', 'GM12878'] + +console.log(result.signalMatrix); +// Array of { region, values } — per-query-region max signal vector +// [ +// { region: "chr1_150_250", values: [1.2, 0.8, 2.3] }, +// { region: "chr2_550_580", values: [3.1, 2.7, 1.8] }, +// ] + +console.log(result.matrixStats); +// Per-condition Tukey stats +// [ +// { condition: 'K562', lowerWhisker, lowerHinge, median, upperHinge, upperWhisker }, +// { condition: 'HeLa', ... }, +// { condition: 'GM12878', ... }, +// ] +``` + +### Result schema + +**Top level:** +- `signalMatrix: { region: string, values: number[] }[]` — one entry per query region that matched at least one signal row. `region` is the query region label in `chr_start_end` form; `values` are the per-condition max signals. +- `matrixStats: ConditionStats[]` — one entry per condition, in the order of `conditionNames`. +- `conditionNames: string[]` — column labels, same as the input `SignalMatrix.conditionNames`. + +**`ConditionStats`** (per condition, standard Tukey 5-number summary): + +| field | type | +|---|---| +| `condition` | `string` | +| `lowerWhisker` | `number` | +| `lowerHinge` | `number` — Q1 | +| `median` | `number` | +| `upperHinge` | `number` — Q3 | +| `upperWhisker` | `number` | + +## End-to-end example + +```ts +import init, { RegionSet, SignalMatrix, calcSummarySignal } from '@databio/gtars-js'; + +await init(); + +// 1. Fetch a packed binary signal matrix from the API +const response = await fetch('/api/signal/encode_k562_hela.sigm'); +const sm = SignalMatrix.fromBin(new Uint8Array(await response.arrayBuffer())); + +// 2. Build a query RegionSet from user-provided peaks +const peaks = new RegionSet(userPeakEntries); + +// 3. Compute summary signal +const result = calcSummarySignal(peaks, sm); + +// 4. Render per-condition boxplot +for (const stats of result.matrixStats) { + renderBoxplot(stats.condition, stats); +} +``` + +## See also + +- **[wasm/regionset](regionset.md)** — the `RegionSet` type used as the query input. +- **[wasm/partitions](partitions.md)** — partition classification. +- **[gtars-genomicdist → Signal matrix overlap](../genomicdist.md#signal-matrix-overlap)** — Rust reference for the underlying algorithm and the packed `.sigm` binary format. diff --git a/mkdocs.yml b/mkdocs.yml index 1711b382..ebb40d42 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -175,7 +175,7 @@ nav: - Overview: gtars/modules.md - gtars-core: gtars/core.md # - gtars-io: gtars/io.md - - gtars-models: gtars/regionSet.md + - Core models: gtars/regionSet.md - gtars-overlaprs: gtars/overlaprs.md - gtars-uniwig: gtars/uniwig.md - gtars-igd: gtars/igd.md @@ -183,11 +183,16 @@ nav: - gtars-fragsplit: gtars/fragsplit.md - gtars-tokenizers: gtars/tokenizers.md - gtars-scatrs: gtars/scatrs.md + - gtars-genomicdist: gtars/genomicdist.md + - gtars-lola: gtars/lola.md - gtars-refget: gtars/refget.md - gtars-bbcache: gtars/bbcache.md - Bindings: - Python: - Overview: gtars/python-overview.md + - Models: gtars/python/models.md + - GenomicDist: gtars/python/genomic_distributions.md + - LOLA: gtars/python/lola.md - Digests: gtars/python/digests.md - RefgetStore Tutorial: gtars/python/refgetstore.ipynb - RefgetStore Python Reference: gtars/python/refget-store.md @@ -195,9 +200,17 @@ nav: - Wasm: - Overview: gtars/wasm.md - Overlappers: gtars/wasm/overlappers.md + - RegionSet: gtars/wasm/regionset.md + - Partitions: gtars/wasm/partitions.md + - Signal: gtars/wasm/signal.md + - LOLA: gtars/wasm/lola.md - CLI: gtars/cli.md - R: - R: gtars/r.md + - RegionSet: gtars/r/regionset.md + - GenomicDist: gtars/r/genomicdist.md + - LOLA: gtars/r/lola.md + - IGD: gtars/r/igd.md - RefgetStore: gtars/r/refgetstore.md - Reference: - Changelog: gtars/changelog.md