Add gtars-node npm publish pipeline#255
Open
nsheff wants to merge 20 commits into
Open
Conversation
- Rename directory gtars-napi/ → gtars-node/ to match npm package name - Add build-node and publish-node jobs to rust-publish.yml - Add build-node-bindings.yml CI workflow - Set version to 0.0.1 for initial npm publish - Configure napi triples for linux-x64-gnu and darwin-arm64 only - Add optionalDependencies, repository, license, files to package.json - Bump all actions/checkout to v6 - Remove unnecessary npm install -g npm@latest from wasm job
Cargo requires gtars-node/ to be either in workspace.members or workspace.exclude since it lives inside the workspace root. Adding to members since it uses workspace dependencies (anyhow).
- Add compare(digestA, digestB) method returning comparison as JSON string - Fix list_collections to use new paginated API (page, page_size, filters) - Fix stats() to match current StoreStats (was stats_extended) - Remove total_disk_size from StoreStatsJs (no longer in StoreStats) - Add serde_json dependency for comparison serialization - Clean up unused mut warnings
napi prepublish generates index.js/index.d.ts but doesn't always copy the .node binaries into the npm/ platform subdirectories. Copy them explicitly after prepublish.
- Add load_sequence/load_collection calls for remote store support - Add pagination params to list_collections (page, page_size) - Bump version to 0.0.2 - Add README
OIDC provenance gets 404 on npm publish for this org. Use the NPM_TOKEN secret instead. Add --ignore-scripts to main package publish to prevent prepublishOnly from re-publishing platform packages.
Adapts the streaming sequence decoder (originally on feat/streaming-decoder) to the split store module structure on feat/gtars-node-publish. The stream_sequence method is added to ReadonlyRefgetStore (and inherited by RefgetStore via Deref), producing refget sequence bytes incrementally without materializing the full sequence in memory. Needed as the Rust backing for the upcoming gtars-node streamSequence binding.
Exposes RefgetStore::stream_sequence through napi as a push-based ThreadsafeFunction pipeline (streamSequenceInternal), wrapped on the JS side by a thin wrapper.js that returns a node:stream Readable. Clients can now .pipe(res) into an HTTP response without buffering the full sequence, and backpressure is propagated to the Rust worker thread via a blocking threadsafe-function call. Also teaches stream_sequence to stream from the in-memory encoded byte buffer when a Full SequenceRecord is present, so in-memory stores (which lack a local_path/remote_source) still support streaming. Adds streaming.test.mjs covering full/substring/error/backpressure paths, and updates the README with an Express /sequence/:digest example.
Previously, `open_remote_range` would silently accept any HTTP response from the remote, including 200 OK with the full file body when the server ignores the Range header (misbehaving CDNs, proxies, or object stores). The StreamingDecoder would then wrap the wrong bytes with a `leading_skip_bits` computed for the requested range, emitting silently corrupted bases with no error surfaced to the caller. Fix: - Reject any response whose status is not 206 Partial Content with a clear error message. - Add connect (30s) and read (60s) timeouts on the remote GET so hung remotes cannot stall streaming indefinitely. Also adds a regression test that spins up a local HTTP server which deliberately responds 200 OK with the full body, and asserts that `open_remote_range` now returns an error instead of quietly returning the wrong bytes.
The streaming decoder is documented as O(1) in sequence length, but two code paths violated that: 1. Full-record fast path cloned the whole encoded byte range before streaming (readonly.rs). For an in-memory chromosome-sized record, peak memory doubled on every stream_sequence call. 2. The napi binding pre-called store.load_sequence before streaming, pulling the entire sequence into RAM even when the record was backed by a file or a remote HTTP range. For a remote-backed Node server streaming a chromosome, peak memory grew to ~2x the encoded sequence, completely defeating the streaming design. Fixes: - SequenceRecord::Full now holds Arc<Vec<u8>> instead of Vec<u8>, and stream_sequence hands out an ArcSliceReader that shares ownership of the underlying buffer. No byte range is copied. Added sequence_arc() accessor; sequence() still returns &[u8] via Deref. - gtars-node stream_sequence_internal drops the load_sequence call and relies on ReadonlyRefgetStore::stream_sequence's existing file/remote fallback for Stub records. Tests (test-driven): - test_stream_sequence_bounded_memory_full_record uses peak_alloc to measure peak allocation delta around a 1 MB stream. Bounded at SEQ_LEN/4; the pre-fix code clones the full buffer (measured 1.26 MB delta), the post-fix code allocates only reader frames. - test_stream_sequence_no_preload_for_stub_record builds an on-disk store, forces the record to Stub, and asserts stream_sequence succeeds without populating the in-memory cache. Guards the napi fix structurally.
Proves the production path (store opened via open_local/open_remote, records left as Stub, stream_sequence falling back to the on-disk file) has O(1) memory use regardless of sequence size. Measured peak is ~8 KiB (BufReader default) at both 1 MB and 16 MB sequence sizes. The test drains into a fixed-size scratch buffer instead of read_to_end to avoid the geometric Vec growth's transient realloc peaks masking the actual streaming footprint.
Mammalian genomes >4 Gb (e.g. some plant assemblies) would silently truncate when callers passed u32 bounds. The internal arithmetic was already u64; this widens the public signature to match. Adds test_stream_sequence_u64_bounds_compile_and_match to lock the new signature.
Step 3 (#3): stream_sequence_internal, getSubstring, and SequenceMetadata.length now use napi BigInt (napi6 feature). A bigint_to_u64 helper validates at the boundary. Step 4 (#5): worker thread now checks an AtomicBool abort flag each iteration. AbortOnDrop guards are captured in each threadsafe function closure so that stream.destroy() on the JS side flips the flag when the tsfn is finalized. tsfn_chunk.call() non-Ok status also breaks the loop, avoiding deadlock if the queue is closing.
Replace the monkey-patch pattern in wrapper.js with a proper RefgetStore subclass. Static factories (openLocal, openRemote, inMemory) rewrap via Object.setPrototypeOf. getSubstring coerces number -> BigInt at the JS boundary. wrapper.d.ts now uses export * from './index.js' instead of duplicating every type.
BufReader(64 KiB) now wraps the HTTP response in open_remote_range so ureq doesn't issue one syscall per byte on the remote path. StreamingDecoder::refill reads up to 6 bytes at a time when the bit buffer has room, instead of always reading 1 byte. A TrickleReader test confirms partial-read correctness.
Three new streaming tests: - BigInt bounds (1n, 7n) produce same output as getSubstring - stream.destroy() mid-stream doesn't hang the process - factory instances pass instanceof RefgetStore check
Adds streaming sequence API (streamSequence), BigInt bounds, thread-safe abort on stream.destroy(), subclass wrapper, and buffered remote reads.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
gtars-napi/→gtars-node/to match npm package name@databio/gtars-nodebuild-nodeandpublish-nodejobs torust-publish.ymlbuild-node-bindings.ymlCI workflowgtars-nodeto workspace membersoptionalDependencies,repository,license,filesfields to package.jsonactions/checkoutto v6 across all jobscomparemethod and lazyload_sequence/load_collectionfor remote store supportlist_collectionsgetSequence,getCollectionMetadata, andcomparePublished as
@databio/gtars-node@0.0.2on npm.Test plan