Add experimental pileup column aggregation API by sstadick · Pull Request #1 · sstadick/seqair

sstadick · 2026-05-09T11:11:31Z

Summary 🤖

This is an experimental API branch in my seqair fork motivated by perbase's seqair integration. perbase currently materializes each PileupColumn and then walks every alignment again to compute command-specific row summaries: depth, base counts, insertion/deletion/refskip counts, failed-read counts, and eventually mate-aware summaries.

This PR explores making seqair more useful for those downstream pileup consumers by adding:

PileupSummary on PileupColumn, computed while the column is built.
BaseCounts and PileupOpCounts helpers.
PileupEngine::pileup_with(...), an opt-in custom accumulator path that observes the same post-max_depth alignments as pileups() but does not need to materialize a PileupColumn.
A PileupColumnContext passed to accumulator finish hooks.
Tracey spec rules and tests for summary/aggregation semantics.

The existing lending PileupEngine::pileups() API is preserved; internally it now uses a materializing accumulator so the materialized path and custom path share the same traversal/max-depth semantics.

Why

For perbase non-mate base-depth, the custom accumulator can compute per-row counts directly while seqair is already visiting active records. In the sibling perbase branch/PR, this avoids a downstream second pass over column.raw_alignments() and avoids allocating the per-column alignment Vec for simple counting output.

This also provides an extras-based way to model count/fail/drop style downstream semantics without adding a first-class seqair ReadDisposition API yet: dropped reads are rejected by CustomizeRecordStore, while kept reads carry caller-defined extras that the accumulator can inspect through RecordStore.

Notes / API choices

PileupColumn::summary() describes exactly the emitted alignments exposed by raw_alignments().
Summary counts are post-max_depth truncation.
Custom aggregation observes the same post-max_depth alignments that pileups() would expose.
Accumulator output is owned, keeping lifetimes straightforward.
This does not attempt mate-aware grouping in seqair; perbase can prototype that separately if the non-mate path proves useful.

Validation

Ran locally on macOS with SDK env for htslib-backed tests:

cargo fmt --check
cargo check -p seqair
SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo clippy -p seqair --all-targets -- -D warnings

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo test -p seqair --lib

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo test -p seqair --test pileup

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo test -p seqair --test compare_bam_with_htslib pileup

Results:

lib tests: 740 passed
pileup integration tests: 36 passed
htslib pileup comparison subset: 3 passed
clippy clean with -D warnings

Downstream experiment

perbase PR sstadick/perbase#108 uses PileupEngine::pileup_with for non-mate base-depth --seqair-pileup. On the HG00157 chr1 10 Mb BAM subset, the new accumulator path produced byte-for-byte identical output to htslib.

Latest perbase benchmark summary from that PR:

base-depth htslib: 5.147 ± 0.461 s
base-depth seqair accumulator: 5.104 ± 0.166 s
both outputs SHA-256: 150f61653f0cd9225787248fecbbd9d46bde1a9644632bcba8f1898ee780c572

feat: add pileup column aggregation

6d9251c

sstadick mentioned this pull request May 9, 2026

Use seqair pileup accumulator for base-depth sstadick/perbase#108

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add experimental pileup column aggregation API#1

Add experimental pileup column aggregation API#1
sstadick wants to merge 1 commit into
mainfrom
feat/pileup-column-aggregation

sstadick commented May 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sstadick commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary 🤖

Why

Notes / API choices

Validation

Downstream experiment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sstadick commented May 9, 2026 •

edited

Loading