29 commits
fef314e
Added streaming of bed file
khoroshevskyi Feb 11, 2026
e99f4af
start bm25 implementation for genomic interval data
nleroy917 Feb 28, 2026
2732051
initial pass
nleroy917 Mar 1, 2026
5658dc9
filter out unk token ids
nleroy917 Mar 1, 2026
ea0267c
enable sending tokenizer directly
nleroy917 Mar 1, 2026
346472b
type stubs
nleroy917 Mar 1, 2026
25fd660
switch to camel case
nleroy917 Mar 1, 2026
6cbb5db
update .gitignore
nleroy917 Mar 1, 2026
acbdf7c
make importing from collections us copy instead of memory
nsheff Mar 18, 2026
eb7ddc6
Merge branch 'master' into dev
nsheff Mar 19, 2026
0e32da1
Add RegionSetList constructor and add() method to wasm bindings
sanghoonio Mar 19, 2026
446f607
Bump gtars-js wasm version to 0.8.1
sanghoonio Mar 19, 2026
69b3881
Add indexed operations on RegionSetList and fromEntries constructor
sanghoonio Mar 19, 2026
eb8ef5a
Factor LOLA result conversion into core, remove dead code
sanghoonio Mar 19, 2026
db0fecf
Add bulk union/intersect operations on RegionSetList
sanghoonio Mar 19, 2026
fa48fa0
Fix intersect_all to use range-level intersect, add tests for RegionS…
sanghoonio Mar 19, 2026
99a2b51
make importing from collections us copy instead of memory
nsheff Mar 18, 2026
c0e60cc
Merge branch 'dev' of github.com:databio/gtars into dev
nsheff Mar 20, 2026
87bb1de
Fix names misalignment in add(), validate union_except bounds, optimi…
sanghoonio Mar 21, 2026
c700590
Restore empty_to_na helper removed during LOLA refactor
sanghoonio Mar 21, 2026
8c7f19c
Use Option<String> for annotation fields in RegionSetAnno and LolaResult
sanghoonio Mar 21, 2026
a1a6349
Merge pull request #247 from databio/bindings-updates
sanghoonio Mar 21, 2026
fafdc04
Bump versions: genomicdist 0.7.0, lola 0.2.0, python 0.8.1, R 0.8.1
sanghoonio Mar 21, 2026
bc110d3
Allow WASM npm publish to run independently of crates.io publish
sanghoonio Mar 21, 2026
35e1dfd
Update gtars-wasm/src/bed_stream.rs
khoroshevskyi Mar 23, 2026
a032ba0
Merge branch 'dev' into streaming_bed
khoroshevskyi Mar 23, 2026
e6f39c3
Merge pull request #235 from databio/streaming_bed
khoroshevskyi Mar 23, 2026
8af8b89
Merge dev into bm25: resolve conflicts for feature-gated Python bindings
sanghoonio Apr 1, 2026
c779e27
Add BM25 enrichment demo scripts comparing sparse embeddings to LOLA
sanghoonio Apr 1, 2026
3 changes: 1 addition & 2 deletions .github/workflows/rust-publish.yml
Original file line number Diff line number Diff line change
@@ -157,8 +157,7 @@ jobs:
     secrets: inherit

   publish-wasm:
-    if: ${{ always() && inputs.wasm != false && needs.publish-all-crates.result == 'success' }}
-    needs: publish-all-crates
+    if: ${{ always() && inputs.wasm != false }}
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
2 changes: 2 additions & 0 deletions .gitignore
@@ -30,6 +30,8 @@ bin/
tests/data/out/region_scoring_count.csv.gz
/gtars-refget/tests/store_test/rgstore.json
/gtars-refget/tests/store_test/sequences.rgsi
libgtars.dylib.dSYM/
gtars-bm25/tests/demo_cache/

# Large benchmark data and validation files
tests/data/interval_ranges_benchmark/
5 changes: 3 additions & 2 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ members = [
"gtars-python",
"gtars-genomicdist",
"gtars-wasm",
"gtars-r/src/rust",
# "gtars-r/src/rust",
"gtars-core",
"gtars-refget",
"gtars-uniwig",
Expand All @@ -17,7 +17,8 @@ members = [
"gtars-lola",
"gtars-fragsplit",
"gtars-scoring",
"gtars",
"gtars-bm25",
"gtars",
]

[workspace.dependencies]
12 changes: 12 additions & 0 deletions gtars-bm25/Cargo.toml
@@ -0,0 +1,12 @@
[package]
name = "gtars-bm25"
version = "0.1.0"
edition = "2024"
description = "BM25 sparse embedding implementation for genomic intervals and information retrieval"

[dependencies]
gtars-core = { path = "../gtars-core", version="0.5.2" }
gtars-tokenizers = { path = "../gtars-tokenizers" }

[dev-dependencies]
rstest = "0.26.1"
120 changes: 120 additions & 0 deletions gtars-bm25/README.md
@@ -0,0 +1,120 @@
# gtars-bm25
This crate implements a BM25 sparse embedding for genomic intervals, motivated by Qdrant's own [BM25 implementation](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py) within fastembed. The implementation is BM25-_like_ in that it assumes a constant average document length that is known in advance. This lets us compute BM25 scores for a query interval without knowing the distribution of document lengths in the corpus. Moreover, sparse BM25 embeddings for documents need not be recomputed as the corpus grows.
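
To make the fixed-average-length idea concrete, here is a minimal Python sketch of the standard BM25 term-frequency saturation applied to a list of token ids. It is an illustration only, not this crate's code: the function name `bm25_tf_weights` is hypothetical, and the sketch assumes the IDF component is left to the vector store (as with Qdrant's IDF modifier) rather than computed here.

```python
from collections import Counter


def bm25_tf_weights(token_ids, k=1.5, b=0.75, avg_doc_length=1000.0):
    """Hypothetical helper: BM25 term-frequency weights per token id.

    The average document length is a fixed constant, so no corpus
    statistics are needed to weight a single document.
    """
    counts = Counter(token_ids)
    doc_length = len(token_ids)
    # Length normalization term shared by every token in this document.
    norm = k * (1.0 - b + b * doc_length / avg_doc_length)
    return {tid: tf * (k + 1.0) / (tf + norm) for tid, tf in counts.items()}


# A document of four tokens, with token 5 appearing twice.
weights = bm25_tf_weights([1, 5, 5, 10], avg_doc_length=4.0)
indices = sorted(weights)
values = [weights[i] for i in indices]
```

Because `avg_doc_length` never changes, the weights for a document depend only on that document, which is why existing embeddings stay valid as more documents are added.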

This method is designed to be used alongside one of our dense embedding models, such as Atacformer or Region2Vec, to enable hybrid search. The sparse BM25 embedding can be used to perform "keyword" search (look for specific regions), while the dense embedding can be used to perform "semantic" search (look for similar biology). By combining the two with a fusion strategy, we can achieve better recall and precision than with either method alone.

## Example usage
Here is an example usage of the BM25 embedding:

```python
from gtars.bm25 import Bm25
from gtars.models import RegionSet

model = Bm25(
    tokenizer="/path/to/vocab.bed",
    k=1.5,
    b=0.75,
    avg_doc_length=1_000,
)

query = RegionSet("path/to/query.bed")
embedding = model.embed(query)

print(embedding.indices) # [1, 5, 10]
print(embedding.values) # [0.5, 1.0, 0.75]
```
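
Sparse embeddings of this shape are typically compared with a dot product over the indices two vectors share. The sketch below shows that comparison in plain Python. The function name `sparse_dot` is hypothetical and not part of gtars, and the query vector is made up for illustration.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Hypothetical helper: dot product of two sparse vectors.

    Only indices present in both vectors contribute to the score.
    """
    lookup = dict(zip(indices_a, values_a))
    return sum(lookup[i] * v for i, v in zip(indices_b, values_b) if i in lookup)


# Document embedding from the example above, and an illustrative query
# that shares regions 5 and 10 with it.
score = sparse_dot([1, 5, 10], [0.5, 1.0, 0.75], [5, 10, 42], [1.0, 1.0, 1.0])
```

A query that shares no vocabulary regions with a document scores zero, which is what gives the sparse side its exact-match behavior.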

## Use with Atacformer and Qdrant
BM25 can be used with dense embedding models like Atacformer to perform hybrid search in Qdrant.


First, we need to create a Qdrant collection with both dense and sparse vector configurations:
```python
from geniml.atacformer import AtacformerForCellClustering
from gtars.bm25 import Bm25
from gtars.models import RegionSet
from gtars.tokenizers import Tokenizer

from qdrant_client import models as qdrant_models
from qdrant_client import QdrantClient

# instantiate the qdrant collection
client = QdrantClient("http://localhost:6333")
client.recreate_collection(
    collection_name="bedbase",
    # atacformer embeddings
    vectors_config={
        "dense": qdrant_models.VectorParams(
            size=384,
            distance=qdrant_models.Distance.COSINE,
        ),
    },
    # bm25 sparse embeddings; the IDF modifier has Qdrant apply IDF weighting
    sparse_vectors_config={
        "sparse": qdrant_models.SparseVectorParams(
            modifier=qdrant_models.Modifier.IDF,
        ),
    },
)
```

Then we can instantiate our Atacformer and BM25 models, and insert some data into the collection:
```python
# instantiate the models
tokenizer = Tokenizer.from_pretrained("databio/atacformer-ctft-hg38")
atacformer = AtacformerForCellClustering.from_pretrained("databio/atacformer-ctft-hg38")
bm25 = Bm25(
    tokenizer=tokenizer,
    k=1.5,
    b=0.75,
    avg_doc_length=1_000,  # bed files are usually very large
)

documents = [
    RegionSet("path/to/document1.bed"),
    RegionSet("path/to/document2.bed"),
    RegionSet("path/to/document3.bed"),
    RegionSet("path/to/document4.bed"),
    RegionSet("path/to/document5.bed"),
]

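for i, document in enumerate(documents):
    dense_embedding = atacformer.embed(document)
    sparse_embedding = bm25.embed(document)

    client.upsert(
        collection_name="bedbase",
        points=[
            qdrant_models.PointStruct(
                id=i,
                # keys must match the vector names defined on the collection
                vector={
                    "dense": dense_embedding,
                    "sparse": qdrant_models.SparseVector(
                        indices=sparse_embedding.indices,
                        values=sparse_embedding.values,
                    ),
                },
            )
        ],
    )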
```

Finally, we can perform a hybrid search using both the dense and sparse embeddings:
```python
query = RegionSet("path/to/query.bed")
dense_query_embedding = atacformer.embed(query)
sparse_query_embedding = bm25.embed(query)

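response = client.query_points(
    collection_name="bedbase",
    prefetch=[
        # keyword-style candidates from the sparse BM25 vectors
        qdrant_models.Prefetch(
            query=qdrant_models.SparseVector(
                indices=sparse_query_embedding.indices,
                values=sparse_query_embedding.values,
            ),
            using="sparse",
            limit=3,
        ),
        # semantic candidates from the dense Atacformer vectors
        qdrant_models.Prefetch(
            query=dense_query_embedding,
            using="dense",
            limit=3,
        ),
    ],
    # merge both candidate lists with reciprocal rank fusion
    query=qdrant_models.FusionQuery(fusion=qdrant_models.Fusion.RRF),
    limit=3,
)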
```
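
For intuition about the fusion step, reciprocal rank fusion (RRF) can be sketched in a few lines of plain Python. This is a textbook version, not Qdrant's internal code: the function name `reciprocal_rank_fusion` is hypothetical, `k=60` is the commonly cited constant and may differ from what Qdrant uses, and the id lists are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Hypothetical helper: fuse ranked id lists with RRF.

    Each list contributes 1 / (k + rank) to every id it contains,
    so ids ranked well by several lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Illustrative ids: one list from the sparse prefetch, one from the dense.
sparse_hits = [3, 1, 4]
dense_hits = [1, 2, 3]
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

RRF uses only rank positions, so the sparse and dense scores do not need to be on the same scale before they are combined.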