Use seqair pileup accumulator for base-depth #108

Open
sstadick wants to merge 2 commits into feat/seqair from feat/seqair-pileup-aggregation

sstadick commented May 9, 2026

Summary 🤖

This is a downstream experiment for the seqair pileup aggregation API in my seqair fork: sstadick/seqair#1.

The branch pins seqair / seqair-types to my seqair fork commit 6d9251c and changes only the non-mate base-depth --seqair-pileup path to use the new custom accumulator API:

  • PileupEngine::pileup_with(...)
  • SeqairPileupPositionAccumulator

The existing mate-aware base-depth -m --seqair-pileup path still uses materialized PileupColumns because mate fixing needs per-column grouping by QNAME.
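
To illustrate why the mate-aware path needs the whole column at once, here is a minimal sketch of grouping one column's observations by QNAME. The `Obs` struct and both functions are hypothetical stand-ins written for this example; they are not seqair's `PileupColumn` or perbase's actual mate-fixing code, which also has to decide which mate to keep.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one alignment observed at a pileup position.
#[derive(Debug, Clone, PartialEq)]
struct Obs {
    qname: String,
    base: u8,
    qual: u8,
}

/// Group the observations of a single column by read name (QNAME).
/// This needs every observation of the column to be available together,
/// which is why a streaming accumulator is not a drop-in fit here.
fn group_by_qname(column: &[Obs]) -> HashMap<&str, Vec<&Obs>> {
    let mut groups: HashMap<&str, Vec<&Obs>> = HashMap::new();
    for obs in column {
        groups.entry(obs.qname.as_str()).or_default().push(obs);
    }
    groups
}

/// Simplified mate-aware depth: overlapping mates share a QNAME and are
/// counted once. Choosing which mate's base to report is out of scope.
fn mate_aware_depth(column: &[Obs]) -> usize {
    group_by_qname(column).len()
}
```

With two overlapping mates of read `r1` and a single read `r2` in a column, `mate_aware_depth` would report 2 rather than 3.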

Why

Previously, non-mate base-depth --seqair-pileup materialized a PileupColumn and then perbase looped over column.raw_alignments() to compute depth, base counts, insertions, deletions, refskips, and fail counts.

With the accumulator API, perbase computes those row counts while seqair is already walking emitted pileup observations. For simple non-mate base-depth, this avoids materializing a public pileup column and avoids a downstream second pass over alignments.
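
The difference between the two paths can be sketched roughly as follows. Everything here is hypothetical: `Op`, `Counts`, and the `PositionAccumulator` trait are invented for illustration and do not reproduce the signatures of `PileupEngine::pileup_with(...)` or `SeqairPileupPositionAccumulator`. The depth convention (deletions count, reference skips do not) is likewise an assumption of this sketch.

```rust
/// One observation at a reference position (hypothetical, not seqair's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Base(u8),
    Deletion,
    RefSkip,
}

/// Per-position row counts, loosely modelled on a base-depth output row.
#[derive(Debug, Default, PartialEq)]
struct Counts {
    depth: u32,
    a: u32,
    c: u32,
    g: u32,
    t: u32,
    n: u32,
    del: u32,
    ref_skip: u32,
}

/// Hypothetical accumulator interface: the walker hands over each
/// observation once and asks for the finished row at the end.
trait PositionAccumulator {
    type Output;
    fn observe(&mut self, op: Op);
    fn finish(self) -> Self::Output;
}

impl PositionAccumulator for Counts {
    type Output = Counts;

    fn observe(&mut self, op: Op) {
        match op {
            Op::Base(b) => {
                self.depth += 1;
                match b.to_ascii_uppercase() {
                    b'A' => self.a += 1,
                    b'C' => self.c += 1,
                    b'G' => self.g += 1,
                    b'T' => self.t += 1,
                    _ => self.n += 1,
                }
            }
            // Assumption for this sketch: deletions count toward depth.
            Op::Deletion => {
                self.depth += 1;
                self.del += 1;
            }
            // Assumption for this sketch: reference skips do not.
            Op::RefSkip => self.ref_skip += 1,
        }
    }

    fn finish(self) -> Counts {
        self
    }
}

/// Materialized path: build the column first, then loop over it again.
fn count_materialized(ops: impl Iterator<Item = Op>) -> Counts {
    let column: Vec<Op> = ops.collect();
    let mut counts = Counts::default();
    for op in &column {
        counts.observe(*op);
    }
    counts
}

/// Accumulator path: count while walking, with no intermediate column.
fn count_streaming<A: PositionAccumulator>(
    ops: impl Iterator<Item = Op>,
    mut acc: A,
) -> A::Output {
    for op in ops {
        acc.observe(op);
    }
    acc.finish()
}
```

Both functions produce the same `Counts` for the same observations; the streaming variant simply skips the intermediate `Vec` and the second loop.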

Correctness checks

The existing htslib-vs-seqair process-region parity test now exercises the accumulator path for non-mate cases and the old materialized path for mate-aware cases.

The empty-SEQ regression test now compares all three paths:

  • htslib pileup
  • seqair materialized column path
  • seqair accumulator path
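
A three-way parity check of this kind could be structured roughly like the sketch below. The helper `first_mismatch` is hypothetical and is not the test code in this branch; it only shows the idea of treating one path as the reference and reporting the first row where another path disagrees.

```rust
/// Compare several row sets against a reference and return the name of the
/// first path that disagrees, together with the index of the first differing
/// row. A length mismatch is reported at the end of the shorter row set.
fn first_mismatch<T: PartialEq>(
    reference: &[T],
    others: &[(&str, &[T])],
) -> Option<(String, usize)> {
    for (name, rows) in others {
        // First look for a row that differs within the common prefix.
        if let Some(i) = reference
            .iter()
            .zip(rows.iter())
            .position(|(a, b)| a != b)
        {
            return Some((name.to_string(), i));
        }
        // Then make sure neither side has extra rows.
        if rows.len() != reference.len() {
            return Some((name.to_string(), reference.len().min(rows.len())));
        }
    }
    None
}
```

In a test, the htslib rows would be the reference and the two seqair paths the `others`, with the assertion being that the result is `None`.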

Local validation

cargo fmt --check

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo check

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo check --features seqair-pileup

SDKROOT="$(xcrun --show-sdk-path)" \
BINDGEN_EXTRA_CLANG_ARGS="-isysroot $(xcrun --show-sdk-path)" \
cargo test --features seqair-pileup

Results:

  • lib tests: 23 passed
  • bin tests: 117 passed
  • doctests: 1 passed

Benchmarking

Benchmark numbers are posted as a PR comment.

sstadick commented May 9, 2026

Benchmark update for this branch using the seqair accumulator API from sstadick/seqair#1 (sstadick/seqair@6d9251c).

Setup: release build with --features seqair-pileup, HG00157 chr1 10 Mb BAM subset (paper/data/HG00157.chr1_10mb.bam) and BED (paper/data/hg00157_chr1_10mb.bed), writing TSV output. Non-mate base-depth --seqair-pileup uses the new accumulator path; mate-aware base-depth -m --seqair-pileup and only-depth --seqair are unchanged paths.

| Mode | htslib | seqair | Notes |
| --- | --- | --- | --- |
| base-depth | 5.147 ± 0.461 s | 5.104 ± 0.166 s | 3 runs; seqair accumulator 1.01x faster; exact output parity, SHA-256 150f6165... |
| base-depth -m | 51.045 s | 33.402 s | 1 run; seqair 1.53x faster; same 9,892,897 row count; still the known 12 sparse default -F 0 mate-order diffs |
| base-depth -m -F 2304 | 51.072 s | 33.770 s | 1 run; seqair 1.51x faster; exact output parity, SHA-256 393a5787... |
| only-depth | 1.203 ± 0.157 s | 1.263 ± 0.125 s | 3 runs; exact output parity, SHA-256 0d98d1ab...; unchanged path |
| only-depth -x | 859.3 ± 15.6 ms | 1.200 ± 0.053 s | 3 runs; exact output parity, SHA-256 876b7692...; unchanged path |
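
The "Nx faster" figures in the Notes column are consistent with a plain ratio of the htslib time to the seqair time. As a small worked check (the `speedup` helper is just illustrative arithmetic, not part of perbase):

```rust
/// Speedup of `candidate` over `baseline`, both in the same time unit.
/// Values above 1.0 mean the candidate is faster.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// Round to two decimals, matching the precision used in the table.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For example, `round2(speedup(51.045, 33.402))` gives 1.53 and `round2(speedup(5.147, 5.104))` gives 1.01, matching the base-depth -m and base-depth rows.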

Output checks:

base-depth htslib              9,892,897 rows  150f61653f0cd9225787248fecbbd9d46bde1a9644632bcba8f1898ee780c572
base-depth seqair accumulator  9,892,897 rows  150f61653f0cd9225787248fecbbd9d46bde1a9644632bcba8f1898ee780c572

base-depth -m htslib           9,892,897 rows  07b50f9750b8760e49b997622d878d8f159276d0b39644bfc4e57bd4e88a04d7
base-depth -m seqair           9,892,897 rows  c73f7b661170eecfdb623c041d126e63cba638e2fed73eceeaeb58afb53e6b7f
base-depth -m diff lines: 12

base-depth -m -F2304 htslib    9,892,897 rows  393a578788a83f589de5e712aeb1abfc74bd4bbed464ab59ec0353a0ece81045
base-depth -m -F2304 seqair    9,892,897 rows  393a578788a83f589de5e712aeb1abfc74bd4bbed464ab59ec0353a0ece81045

only-depth htslib              3,282,664 rows  0d98d1abfe60c79d4bbb51bfa14fb00e5f5123311e216d28ba34cd0ea3c680b1
only-depth seqair              3,282,664 rows  0d98d1abfe60c79d4bbb51bfa14fb00e5f5123311e216d28ba34cd0ea3c680b1

only-depth -x htslib           3,285,180 rows  876b76926203980e3a9900f2d6e7e05cc7f77898fe4df413689ac39b60e9a030
only-depth -x seqair           3,285,180 rows  876b76926203980e3a9900f2d6e7e05cc7f77898fe4df413689ac39b60e9a030
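
For readers who want to reproduce a check like the row counts and the "diff lines" figure above, a minimal standard-library sketch follows. It is an assumption about how such a comparison could be done, not the command actually used for this benchmark: it compares lines position by position rather than running a true diff, and it leaves out the SHA-256 digest, which would need a third-party crate.

```rust
/// Return (rows in a, rows in b, number of differing lines).
/// Lines are compared position by position; any extra trailing lines in
/// the longer output are counted as differing.
fn compare_outputs(a: &str, b: &str) -> (usize, usize, usize) {
    let rows_a = a.lines().count();
    let rows_b = b.lines().count();
    let differing_in_common = a
        .lines()
        .zip(b.lines())
        .filter(|(x, y)| x != y)
        .count();
    (rows_a, rows_b, differing_in_common + rows_a.abs_diff(rows_b))
}
```

Reading both TSV files with `std::fs::read_to_string` and passing them to `compare_outputs` would give equal row counts and zero differing lines when the outputs match exactly.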

Compared with the previous seqair v0.1.0 materialized-column run on this same benchmark, non-mate base-depth --seqair-pileup moved from 5.323 ± 0.127 s to 5.104 ± 0.166 s (about 4% lower mean) while preserving exact parity. The main intended win here is avoiding the materialized column and the downstream second pass for simple non-mate counting.
