Skip to content

Add experimental pileup column aggregation API#1

Open
sstadick wants to merge 1 commit into
mainfrom
feat/pileup-column-aggregation
Open

Add experimental pileup column aggregation API#1
sstadick wants to merge 1 commit into
mainfrom
feat/pileup-column-aggregation

Conversation

@sstadick
Copy link
Copy Markdown
Owner

@sstadick sstadick commented May 9, 2026

Summary 🤖

This is an experimental API branch in my seqair fork motivated by perbase's seqair integration. perbase currently materializes each PileupColumn and then walks every alignment again to compute command-specific row summaries: depth, base counts, insertion/deletion/refskip counts, failed-read counts, and eventually mate-aware summaries.

This PR explores making seqair more useful for those downstream pileup consumers by adding:

  • PileupSummary on PileupColumn, computed while the column is built.
  • BaseCounts and PileupOpCounts helpers.
  • PileupEngine::pileup_with(...), an opt-in custom accumulator path that observes the same post-max_depth alignments as pileups() but does not need to materialize a PileupColumn.
  • A PileupColumnContext passed to accumulator finish hooks.
  • Tracey spec rules and tests for summary/aggregation semantics.

The existing lending PileupEngine::pileups() API is preserved; internally it now uses a materializing accumulator so the materialized path and custom path share the same traversal/max-depth semantics.

Why

For perbase non-mate base-depth, the custom accumulator can compute per-row counts directly while seqair is already visiting active records. In the sibling perbase branch/PR, this avoids a downstream second pass over column.raw_alignments() and avoids allocating the per-column alignment Vec for simple counting output.

This also provides an extras-based way to model count/fail/drop style downstream semantics without adding a first-class seqair ReadDisposition API yet: dropped reads are rejected by CustomizeRecordStore, while kept reads carry caller-defined extras that the accumulator can inspect through RecordStore.

Notes / API choices

  • PileupColumn::summary() describes exactly the emitted alignments exposed by raw_alignments().
  • Summary counts are post-max_depth truncation.
  • Custom aggregation observes the same post-max_depth alignments that pileups() would expose.
  • Accumulator output is owned, keeping lifetimes straightforward.
  • This does not attempt mate-aware grouping in seqair; perbase can prototype that separately if the non-mate path proves useful.

Validation

Ran locally on macOS with SDK env for htslib-backed tests:

cargo fmt --check
cargo check -p seqair
SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo clippy -p seqair --all-targets -- -D warnings

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo test -p seqair --lib

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo test -p seqair --test pileup

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo test -p seqair --test compare_bam_with_htslib pileup

Results:

  • lib tests: 740 passed
  • pileup integration tests: 36 passed
  • htslib pileup comparison subset: 3 passed
  • clippy clean with -D warnings

Downstream experiment

perbase PR sstadick/perbase#108 uses PileupEngine::pileup_with for non-mate base-depth --seqair-pileup. On the HG00157 chr1 10 Mb BAM subset, the new accumulator path produced byte-for-byte identical output to htslib.

Latest perbase benchmark summary from that PR:

  • base-depth htslib: 5.147 ± 0.461 s
  • base-depth seqair accumulator: 5.104 ± 0.166 s
  • both outputs SHA-256: 150f61653f0cd9225787248fecbbd9d46bde1a9644632bcba8f1898ee780c572

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant