
Some features inspired by perbase #51

Open
killercup wants to merge 16 commits into main from feature/for-perbase

Conversation

@killercup (Member)

No description provided.

killercup added 16 commits May 9, 2026 18:03
…reads

All three format readers (BAM, SAM, CRAM) now let unmapped reads
flow through to the CustomizeRecordStore::filter_raw callback.
The pileup engine independently excludes unmapped reads from
columns per r[pileup.unmapped_excluded].

- BAM: removed keep_unmapped/keeps_unmapped builder, removed
  per-reader unmapped skip in fetch_into_customized
- SAM: removed hard-coded unmapped skip in parse-and-push loop
- CRAM: unmapped reads now pushed to store with end_pos=pos,
  empty cigar, and overlap/foreign-tid checks
- Specs: updated 2-bam-1-reader, 2-sam-1-reader, 2-cram-1-reader
All three format readers now let unmapped reads flow through to
filter_raw by default. Comparison tests against htslib/noodles
(which exclude unmapped reads) now use fetch_into_customized with
a RejectUnmapped customizer to replicate the old reader behavior.

Added RejectUnmapped customizer to tests/helpers.rs and inline
in files without mod helpers.
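The RejectUnmapped customizer described above can be sketched as a minimal implementation of the filter hook. The trait and field names here (`CustomizeRecordStore`, `FilterRawFields`, a raw `flags` field) are illustrative stand-ins inferred from the commit text, not the crate's exact signatures:

```rust
/// Minimal stand-in for the fields handed to `filter_raw`
/// (assumed shape; the real struct carries more fields).
struct FilterRawFields {
    flags: u16, // SAM FLAG field; 0x4 = segment unmapped
}

trait CustomizeRecordStore {
    /// Return false to drop the record before it enters the store.
    fn filter_raw(&mut self, fields: &FilterRawFields) -> bool;
}

/// Reject unmapped reads, replicating the old reader-level behavior.
struct RejectUnmapped;

impl CustomizeRecordStore for RejectUnmapped {
    fn filter_raw(&mut self, fields: &FilterRawFields) -> bool {
        fields.flags & 0x4 == 0 // keep only mapped reads
    }
}

fn main() {
    let mut c = RejectUnmapped;
    assert!(c.filter_raw(&FilterRawFields { flags: 0 })); // mapped: kept
    assert!(!c.filter_raw(&FilterRawFields { flags: 0x4 })); // unmapped: dropped
}
```

With unmapped reads flowing through by default, excluding them becomes an explicit opt-in at the call site rather than hard-coded reader behavior.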
…s parity

Add RejectUnmapped customizer to tests/helpers.rs and inline in
comparison test files. Tests that compare against htslib/noodles
now use fetch_into_customized with RejectUnmapped to match the
old behavior where unmapped reads were excluded at the reader level.

Updated: compare_records, sam_edge_cases, compare_bam_with_htslib,
compare_bam_with_noodles, compare_sam_with_htslib, htslib_edge_cases,
compare_cram_with_htslib, compare_cram_with_noodles
The unified_reader tests use both Readers and IndexedReader directly,
requiring careful fetch_into_customized conversion. They are left as
known failures for now: they validate BAM/CRAM parity, which is now
affected by CRAM pushing unmapped reads that BAM does not (same-region
edge case).
Readers::fetch_into was ignoring self.customize and always passing
&mut (). Now it delegates to fetch_into_customized with the stored
customize, so open_customized(..., RejectUnmapped) actually takes
effect without requiring callers to use fetch_into_customized.

Fixed all remaining comparison tests to use open_customized with
RejectUnmapped customizer. All 1201 tests pass.
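The delegation fix can be sketched with a simplified reader. The types here are assumptions for illustration (records reduced to bare FLAG values, the customizer reduced to a closure); the point is only the shape of the fix, namely that `fetch_into` forwards the stored customizer instead of a no-op:

```rust
struct Readers<C> {
    customize: C,
    records: Vec<u16>, // stand-in: raw FLAG values per record
}

impl<C: FnMut(u16) -> bool> Readers<C> {
    fn fetch_into_customized(out: &mut Vec<u16>, records: &[u16], customize: &mut C) {
        for &r in records {
            if customize(r) {
                out.push(r);
            }
        }
    }

    /// Before the fix this ignored `self.customize` and passed a no-op;
    /// now it delegates with the stored customizer.
    fn fetch_into(&mut self, out: &mut Vec<u16>) {
        Self::fetch_into_customized(out, &self.records, &mut self.customize);
    }
}

fn main() {
    let mut readers = Readers {
        customize: |flags: u16| flags & 0x4 == 0, // RejectUnmapped as a closure
        records: vec![0, 0x4, 0x10],
    };
    let mut out = Vec::new();
    readers.fetch_into(&mut out);
    assert_eq!(out, vec![0, 0x10]); // unmapped record (0x4) filtered out
}
```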
…ocument compute-before-filter tradeoff

FilterRawFields.end_pos is now Option<Pos0> — None in the BAM
push_raw path (CIGAR not yet decoded), Some(_) in SAM/CRAM
push_fields. Prevents silent misuse where customizers would get
pos instead of the true end_pos from BAM input.

Add pub struct RejectUnmapped (CustomizeRecordStore impl) as the
canonical way to get htslib-compatible behavior. Re-export
CustomizeRecordStore and FilterRawFields from bam module.

Document the compute-before-filter ordering tradeoff in the
CustomizeRecordStore trait, with a ReadGroupFilter example
showing how to push aux-based rejection into filter_raw.
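A ReadGroupFilter along the lines the docs describe might look like the sketch below: scan the raw BAM aux bytes inside `filter_raw` and reject before any downstream compute. The struct and trait shapes are assumptions; only `Z`-typed tags are handled, and skipping other aux value types is elided:

```rust
struct FilterRawFields<'a> {
    aux: &'a [u8], // raw BAM auxiliary data: (tag, type, value) triples
}

trait CustomizeRecordStore {
    fn filter_raw(&mut self, fields: &FilterRawFields) -> bool;
}

/// Keep only records whose RG:Z aux tag equals `wanted`.
struct ReadGroupFilter {
    wanted: Vec<u8>,
}

impl CustomizeRecordStore for ReadGroupFilter {
    fn filter_raw(&mut self, fields: &FilterRawFields) -> bool {
        let mut aux = fields.aux;
        while aux.len() >= 3 {
            let (tag, ty) = (&aux[..2], aux[2]);
            aux = &aux[3..];
            match ty {
                b'Z' => {
                    // Z values are NUL-terminated strings.
                    let end = aux.iter().position(|&b| b == 0).unwrap_or(aux.len());
                    if tag == b"RG".as_slice() {
                        return &aux[..end] == self.wanted.as_slice();
                    }
                    aux = &aux[(end + 1).min(aux.len())..];
                }
                // Skipping other aux value types is elided in this sketch.
                _ => return false,
            }
        }
        false
    }
}

fn main() {
    let mut f = ReadGroupFilter { wanted: b"grp1".to_vec() };
    assert!(f.filter_raw(&FilterRawFields { aux: b"RGZgrp1\0" }));
    assert!(!f.filter_raw(&FilterRawFields { aux: b"RGZgrp2\0" }));
}
```

The compute-before-filter tradeoff shows up here: the fields passed to `filter_raw` are already decoded far enough to read aux bytes, so rejection saves the later per-record work but not the raw decode itself.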
…mentView slab accessors

Replace four Option fields on FilterRawFields with enum CigarSlice
{Bytes/Ops} and enum Sequence {Packed/Bases}. The type system now
prevents path-dependent field misuse — customizers match on the
enum instead of checking which Option is Some.
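The enum shape can be sketched as follows (variant contents are assumed from the commit text; this enum was later replaced by a plain byte slice in a follow-up commit). The point is that a customizer must match on which representation the input path produced, instead of probing two `Option` fields:

```rust
/// A CIGAR operation: BAM packs a 4-bit op code with a 28-bit length.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CigarOp(u32);

enum CigarSlice<'a> {
    /// Raw BAM bytes, CIGAR not yet decoded (BAM push_raw path).
    Bytes(&'a [u8]),
    /// Typed ops (SAM/CRAM push_fields path).
    Ops(&'a [CigarOp]),
}

fn op_count(cigar: &CigarSlice) -> usize {
    match cigar {
        CigarSlice::Bytes(b) => b.len() / 4, // 4 bytes per packed op
        CigarSlice::Ops(ops) => ops.len(),
    }
}

fn main() {
    let raw = [0u8; 8]; // two packed ops
    let ops = [CigarOp(0), CigarOp(0), CigarOp(0)];
    assert_eq!(op_count(&CigarSlice::Bytes(&raw)), 2);
    assert_eq!(op_count(&CigarSlice::Ops(&ops)), 3);
}
```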

Add seq() and qualities() to AlignmentView for full-slab sequence
and quality access. Renamed from qual() to avoid shadowing
PileupAlignment::qual() via Deref.

Unify pileup iteration to a single API: PileupEngine owns an internal
Vec<PileupAlignment> buffer (zero per-column allocation), PileupColumn
holds &[PileupAlignment] instead of Vec, and PileupColumnPinned +
pileups_into() are removed. All callers use while let Some(col) =
engine.pileups() — no API choice to make.
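The single lending-style API can be sketched with stand-in types: the engine owns one reusable buffer, and each `pileups()` call re-borrows it, so columns never allocate. The column-filling logic here is a placeholder; only the ownership and loop shape mirror the commit:

```rust
#[derive(Debug, PartialEq)]
struct PileupAlignment {
    pos: u32,
}

struct PileupColumn<'a> {
    pos: u32,
    alignments: &'a [PileupAlignment],
}

struct PileupEngine {
    buf: Vec<PileupAlignment>, // reused across columns, zero per-column allocation
    next_pos: u32,
    last_pos: u32,
}

impl PileupEngine {
    fn new(last_pos: u32) -> Self {
        Self { buf: Vec::new(), next_pos: 0, last_pos }
    }

    /// Lending iterator: the returned column borrows the internal buffer,
    /// so callers drive it with `while let Some(col) = engine.pileups()`.
    fn pileups(&mut self) -> Option<PileupColumn<'_>> {
        if self.next_pos >= self.last_pos {
            return None;
        }
        let pos = self.next_pos;
        self.next_pos += 1;
        self.buf.clear();
        self.buf.push(PileupAlignment { pos }); // stand-in column fill
        Some(PileupColumn { pos, alignments: &self.buf })
    }
}

fn main() {
    let mut engine = PileupEngine::new(3);
    let mut seen = Vec::new();
    while let Some(col) = engine.pileups() {
        seen.push((col.pos, col.alignments.len()));
    }
    assert_eq!(seen, vec![(0, 1), (1, 1), (2, 1)]);
}
```

Because the column borrows the engine mutably, the borrow checker enforces that each column is dropped before the next call, which is what makes buffer reuse safe without a pinned-column variant.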

Simplify take_store docs: guides users to PileupGuard recovery
instead of long footgun warnings.
…t transmute

Replace the CigarSlice enum (Bytes/Ops variants) with a plain &[CigarOp]
field on FilterRawFields. Add CigarOp::slice_from_bam_bytes() which
zero-cost transmutes raw BAM CIGAR bytes to &[CigarOp] on aligned
little-endian input. For the unaligned case (samtools/pysam routinely
produce unpadded l_read_name), it falls back to a local copy before
filter_raw.

On big-endian or oddly-sized input, returns None — the caller rejects
the record with DecodeError::CigarByteLength.

Proptests verify round-trip, misalignment behavior, and edge cases.
htslib investigation revealed that on-disk BAM files do NOT pad
l_read_name to 4 bytes — the CIGAR pointer is genuinely unaligned
for ~75% of records. htslib pads in memory after reading; we cannot
safely transmute raw bytes to &[CigarOp] in filter_raw.

FilterRawFields.cigar is now &[u8] (raw BAM CIGAR bytes). Customizers
call CigarOp::slice_from_bam_bytes() for typed zero-cost access on
aligned input, or CigarOp::extend_from_bam_bytes() for a safe copy.

slice_from_bam_bytes remains as a utility — it panics on unaligned
input in debug mode (from_raw_parts assertion) and documents this.
push_fields uses CigarOp::ops_as_bytes() to encode typed ops back
to raw bytes for the unified &[u8] field.
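The checked zero-cost view can be sketched as below: transmute raw little-endian BAM CIGAR bytes to `&[CigarOp]` only when length and alignment allow, otherwise signal the caller to fall back to a copy. The names mirror the commit, but this implementation is an illustrative assumption, not the crate's code:

```rust
/// BAM packs a 4-bit op code with a 28-bit length into a little-endian u32.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(transparent)]
struct CigarOp(u32);

impl CigarOp {
    /// Returns None when the bytes are misaligned, oddly sized, or the
    /// target is big-endian; callers then copy via extend_from_bam_bytes.
    fn slice_from_bam_bytes(bytes: &[u8]) -> Option<&[CigarOp]> {
        if cfg!(target_endian = "big")
            || bytes.len() % 4 != 0
            || bytes.as_ptr().align_offset(std::mem::align_of::<CigarOp>()) != 0
        {
            return None;
        }
        // SAFETY: length is a multiple of 4, the pointer is 4-byte aligned,
        // and CigarOp is repr(transparent) over u32.
        Some(unsafe {
            std::slice::from_raw_parts(bytes.as_ptr() as *const CigarOp, bytes.len() / 4)
        })
    }

    /// Safe fallback: decode each packed op with an explicit copy.
    fn extend_from_bam_bytes(out: &mut Vec<CigarOp>, bytes: &[u8]) {
        for chunk in bytes.chunks_exact(4) {
            out.push(CigarOp(u32::from_le_bytes(chunk.try_into().unwrap())));
        }
    }
}

fn main() {
    // u32 storage guarantees 4-byte alignment for the zero-cost path.
    let aligned: [u32; 2] = [0x10, 0x24];
    let bytes: Vec<u8> = aligned.iter().flat_map(|v| v.to_le_bytes()).collect();
    // Vec<u8> does not guarantee 4-byte alignment, hence the `if let`.
    if let Some(ops) = CigarOp::slice_from_bam_bytes(&bytes) {
        assert_eq!(ops, &[CigarOp(0x10), CigarOp(0x24)]);
    }
    let mut copied = Vec::new();
    CigarOp::extend_from_bam_bytes(&mut copied, &bytes);
    assert_eq!(copied, vec![CigarOp(0x10), CigarOp(0x24)]);
}
```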
swap_remove scrambled the active set order during eviction by swapping
the last element into the removed slot. This broke deterministic
alignment ordering within pileup columns — two runs on the same BAM
could produce different per-column alignment orders, and max_depth
truncation dropped arbitrary reads instead of the last N in insertion
order.

Replace with a single-pass stable retain that compacts survivors to
the front, preserving RecordStore (BAM file) insertion order. For
typical depths (30-200), the O(d) linear scan per column is
unmeasurable — it's dwarfed by the O(d) column building that follows.

Adds two tests:
- eviction_preserves_insertion_order: verifies 5 staggered-span reads
  remain in order through successive evictions
- max_depth_truncates_in_insertion_order: verifies max_depth keeps
  the first N reads in insertion order, not arbitrary ones
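The ordering difference can be demonstrated in isolation: `swap_remove` moves the last element into the hole, while a stable `retain` compacts survivors in place and preserves insertion order, which is what the fix relies on. Elements here stand in for per-read end positions; the eviction predicate is an assumption for illustration:

```rust
/// Evict reads ending at or before `expired` by swapping from the back.
/// This is the buggy pattern: swap_remove scrambles survivor order.
fn evict_swap_remove(active: &mut Vec<u32>, expired: u32) {
    let mut i = 0;
    while i < active.len() {
        if active[i] <= expired {
            active.swap_remove(i); // last element swapped into slot i
        } else {
            i += 1;
        }
    }
}

/// Single-pass stable compaction: survivors keep insertion order.
fn evict_retain(active: &mut Vec<u32>, expired: u32) {
    active.retain(|&end| end > expired);
}

fn main() {
    let mut scrambled = vec![1, 9, 2, 8, 3, 7];
    evict_swap_remove(&mut scrambled, 3);
    let mut stable = vec![1, 9, 2, 8, 3, 7];
    evict_retain(&mut stable, 3);
    assert_eq!(stable, vec![9, 8, 7]); // insertion order preserved
    // Both keep the same set of survivors; only the order differs.
    assert_eq!(
        scrambled.iter().copied().collect::<std::collections::BTreeSet<_>>(),
        stable.iter().copied().collect()
    );
}
```

`Vec::retain` is documented to preserve the order of retained elements and to visit each element exactly once, so it matches the O(d) single-pass cost claimed above.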
