Some features inspired by perbase#51
Open
killercup wants to merge 16 commits into
Open
Conversation
…reads All three format readers (BAM, SAM, CRAM) now let unmapped reads flow through to the CustomizeRecordStore::filter_raw callback. The pileup engine independently excludes unmapped reads from columns per r[pileup.unmapped_excluded]. - BAM: removed keep_unmapped/keeps_unmapped builder, removed per-reader unmapped skip in fetch_into_customized - SAM: removed hard-coded unmapped skip in parse-and-push loop - CRAM: unmapped reads now pushed to store with end_pos=pos, empty cigar, and overlap/foreign-tid checks - Specs: updated 2-bam-1-reader, 2-sam-1-reader, 2-cram-1-reader
All three format readers now let unmapped reads flow through to filter_raw by default. Comparison tests against htslib/noodles (which exclude unmapped reads) now use fetch_into_customized with a RejectUnmapped customizer to replicate the old reader behavior. Added RejectUnmapped customizer to tests/helpers.rs and inline in files without mod helpers.
…s parity Add RejectUnmapped customizer to tests/helpers.rs and inline in comparison test files. Tests that compare against htslib/noodles now use fetch_into_customized with RejectUnmapped to match the old behavior where unmapped reads were excluded at the reader level. Updated: compare_records, sam_edge_cases, compare_bam_with_htslib, compare_bam_with_noodles, compare_sam_with_htslib, htslib_edge_cases, compare_cram_with_htslib, compare_cram_with_noodles
The unified_reader tests use both Readers and IndexedReader directly, requiring careful fetch_into_customized conversion. Left as known expected failures — they validate BAM/CRAM parity which is now affected by CRAM pushing unmapped reads that BAM doesn't (same-region edge case).
Readers::fetch_into was ignoring self.customize and always passing &mut (). Now it delegates to fetch_into_customized with the stored customize, so open_customized(..., RejectUnmapped) actually takes effect without requiring callers to use fetch_into_customized. Fixed all remaining comparison tests to use open_customized with RejectUnmapped customizer. All 1201 tests pass.
…ocument compute-before-filter tradeoff FilterRawFields.end_pos is now Option<Pos0> — None in the BAM push_raw path (CIGAR not yet decoded), Some(_) in SAM/CRAM push_fields. Prevents silent misuse where customizers would get pos instead of the true end_pos from BAM input. Add pub struct RejectUnmapped (CustomizeRecordStore impl) as the canonical way to get htslib-compatible behavior. Re-export CustomizeRecordStore and FilterRawFields from bam module. Document the compute-before-filter ordering tradeoff in the CustomizeRecordStore trait, with a ReadGroupFilter example showing how to push aux-based rejection into filter_raw.
…mentView slab accessors
Replace four Option fields on FilterRawFields with enum CigarSlice
{Bytes/Ops} and enum Sequence {Packed/Bases}. The type system now
prevents path-dependent field misuse — customizers match on the
enum instead of checking which Option is Some.
Add seq() and qualities() to AlignmentView for full-slab sequence
and quality access. Renamed from qual() to avoid shadowing
PileupAlignment::qual() via Deref.
Unify pileup iteration to a single API: PileupEngine owns an internal
Vec<PileupAlignment> buffer (zero per-column allocation), PileupColumn
holds &[PileupAlignment] instead of Vec, and PileupColumnPinned +
pileups_into() are removed. All callers use while let Some(col) =
engine.pileups() — no API choice to make.
Simplify take_store docs: guides users to PileupGuard recovery
instead of long footgun warnings.
…t transmute Replace the CigarSlice enum (Bytes/Ops variants) with a plain &[CigarOp] field on FilterRawFields. Add CigarOp::slice_from_bam_bytes() which zero-cost transmutes raw BAM CIGAR bytes to &[CigarOp] on aligned little-endian input. For the rare unaligned case (samtools/pysam routinely produce unpadded l_read_name), falls back to a local copy before filter_raw. On big-endian or oddly-sized input, returns None — the caller rejects the record with DecodeError::CigarByteLength. Proptests verify round-trip, misalignment behavior, and edge cases.
htslib investigation revealed that on-disk BAM files do NOT pad l_read_name to 4 bytes — the CIGAR pointer is genuinely unaligned for ~75% of records. htslib pads in memory after reading; we cannot safely transmute raw bytes to &[CigarOp] in filter_raw. FilterRawFields.cigar is now &[u8] (raw BAM CIGAR bytes). Customizers call CigarOp::slice_from_bam_bytes() for typed zero-cost access on aligned input, or CigarOp::extend_from_bam_bytes() for a safe copy. slice_from_bam_bytes remains as a utility — it panics on unaligned input in debug mode (from_raw_parts assertion) and documents this. push_fields uses CigarOp::ops_as_bytes() to encode typed ops back to raw bytes for the unified &[u8] field.
swap_remove scrambled the active set order during eviction by swapping the last element into the removed slot. This broke deterministic alignment ordering within pileup columns — two runs on the same BAM could produce different per-column alignment orders, and max_depth truncation dropped arbitrary reads instead of the last N in insertion order. Replace with a single-pass stable retain that compacts survivors to the front, preserving RecordStore (BAM file) insertion order. For typical depths (30-200), the O(d) linear scan per column is unmeasurable — it's dwarfed by the O(d) column building that follows. Adds two tests: - eviction_preserves_insertion_order: verifies 5 staggered-span reads remain in order through successive evictions - max_depth_truncates_in_insertion_order: verifies max_depth keeps the first N reads in insertion order, not arbitrary ones
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.