A high-performance DuckDB extension that provides GPU-accelerated Bitcoin Silent Payments (BIP-352) scanning using NVIDIA CUDA. This extension enables efficient scanning of large transaction datasets by leveraging GPU parallel processing for elliptic curve cryptography operations.
- GPU Acceleration: Utilizes NVIDIA CUDA for parallel elliptic curve multiplication
- Multi-GPU Support: Automatically distributes workload across multiple GPUs
- High Throughput: Processes roughly 2 to 2.6 million transactions per second on dual RTX 5090 GPUs
- Optimized Batching: Configurable batch sizes for optimal GPU utilization
- Thread-Safe: Concurrent multi-user access supported
- Memory Efficient: Handles databases with 100M+ rows
Requirements:
- CMake 3.18 or higher
- C++ compiler with C++17 support
- NVIDIA GPU with compute capability 8.0+ (Ampere, Ada Lovelace, or Hopper)
- CUDA Toolkit 12.8 or 13.0
- Python 3 (for gECC constant generation)
- Git
Supported GPUs:
- NVIDIA A100 (compute capability 80)
- NVIDIA RTX 30xx series (compute capability 86)
- NVIDIA RTX 40xx/50xx series (compute capability 89)
- NVIDIA H100/H200 (compute capability 90)
Building:
- Clone the repository:
git clone --recursive https://github.com/sparrowwallet/duckdb-cudasp-extension.git
cd duckdb-cudasp-extension
- Set CUDA environment variables (if necessary):
export CUDA_HOME=/usr/local/cuda
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Build the extension:
make clean
make
- Run tests:
make test
The compiled extension will be available at build/release/extension/cudasp/cudasp.duckdb_extension.
Running the compiled DuckDB binary at build/release/duckdb will run DuckDB with the extension already loaded.
To load the extension manually:
LOAD 'path/to/cudasp.duckdb_extension';
cudasp_scan scans a table of Bitcoin transactions for Silent Payments (BIP-352) matches using GPU acceleration. The function implements the complete Silent Payments scanning algorithm with optimized elliptic curve operations.
Parameters:
- input_table (TABLE): Input table with columns:
  - txid (BLOB): 32-byte transaction ID
  - height (INTEGER): Block height
  - tweak_key (BLOB): 64-byte uncompressed EC point (32-byte x || 32-byte y, little-endian)
  - outputs (BIGINT[]): Array of output values (first 8 bytes of x-coordinates as big-endian integers)
- scan_private_key (BLOB): 32-byte scan private key (little-endian)
- spend_public_key (BLOB): 64-byte uncompressed spend public key (32-byte x || 32-byte y, little-endian)
- label_keys (LIST[BLOB]): Array of 64-byte uncompressed label public keys (can be empty)
- batch_size (INTEGER, optional): Number of rows to process per GPU batch (default: 300000)
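The byte layouts above are easy to get wrong, so here is a minimal Python sketch of helpers that produce them. The function names are hypothetical and not part of the extension. Two points are assumptions rather than documented behavior: that each 32-byte coordinate is byte-reversed individually for the little-endian encoding, and that the 8-byte output prefix is interpreted as a signed two's-complement value so it fits DuckDB's signed 64-bit BIGINT.

```python
def scalar_to_blob(k: int) -> bytes:
    """Encode a private scalar as a 32-byte little-endian BLOB."""
    return k.to_bytes(32, "little")


def point_to_blob(x: int, y: int) -> bytes:
    """Encode an uncompressed point as 64 bytes: x || y.

    Assumption: each coordinate is serialized as 32 little-endian bytes.
    """
    return x.to_bytes(32, "little") + y.to_bytes(32, "little")


def output_to_bigint(x_only_pubkey: bytes) -> int:
    """Take the first 8 bytes of a 32-byte x-coordinate as a big-endian integer.

    Assumption: the value is read as signed so that it fits a BIGINT column.
    """
    if len(x_only_pubkey) != 32:
        raise ValueError("expected a 32-byte x-coordinate")
    return int.from_bytes(x_only_pubkey[:8], "big", signed=True)
```

The results can be bound as query parameters from any DuckDB client; verify the signedness and byte order against a known test vector before relying on them.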
Returns: TABLE with columns:
- txid (BLOB): Transaction ID of matching transaction
- height (INTEGER): Block height of matching transaction
- tweak_key (BLOB): Tweak key that produced the match
Algorithm:
- Batch Processing: Groups input rows into batches for efficient GPU processing
- EC Multiplication: Computes tweak_key × scan_private_key for each row
- Shared Secret: Hashes the result using BIP-352 tagged hash (SHA256)
- Fixed-Point Multiplication: Computes shared_secret × G using GPU-optimized fixed-point multiplication
- Point Addition: Adds spend public key to create candidate output keys
- Label Checking: Tests both base output and label-tweaked variants
- Match Detection: Compares x-coordinates against output list
- Result Aggregation: Returns all matching transactions
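The steps above can be sketched for a single row in pure Python. This is an illustrative CPU reference, not the extension's CUDA code. The helper names (point_add, point_mul, row_matches) are hypothetical, points are plain integer pairs rather than BLOBs, only output index k = 0 is checked, labels are applied by adding each label public key to the base candidate, and outputs are compared as unsigned 64-bit x-coordinate prefixes. The tag string and serialization follow BIP-352.

```python
import hashlib

# secp256k1 field prime and generator point
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)


def point_add(a, b):
    """Add two affine points; None represents the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    if a[0] == b[0] and (a[1] + b[1]) % P == 0:
        return None
    if a == b:
        lam = 3 * a[0] * a[0] * pow(2 * a[1], -1, P) % P
    else:
        lam = (b[1] - a[1]) * pow((b[0] - a[0]) % P, -1, P) % P
    x = (lam * lam - a[0] - b[0]) % P
    return (x, (lam * (a[0] - x) - a[1]) % P)


def point_mul(k, pt):
    """Double-and-add scalar multiplication."""
    result = None
    while k:
        if k & 1:
            result = point_add(result, pt)
        pt = point_add(pt, pt)
        k >>= 1
    return result


def tagged_hash(tag, msg):
    """BIP-340 style tagged hash: SHA256(SHA256(tag) || SHA256(tag) || msg)."""
    t = hashlib.sha256(tag.encode()).digest()
    return hashlib.sha256(t + t + msg).digest()


def ser_compressed(pt):
    """33-byte compressed point serialization."""
    return bytes([2 + (pt[1] & 1)]) + pt[0].to_bytes(32, "big")


def candidate_x_coords(tweak_point, scan_priv, spend_pub, label_pubs=()):
    # EC multiplication: shared point = scan_private_key * tweak_key
    shared = point_mul(scan_priv, tweak_point)
    # Shared secret: tagged hash over the shared point and output index k = 0
    msg = ser_compressed(shared) + (0).to_bytes(4, "big")
    t_k = int.from_bytes(tagged_hash("BIP0352/SharedSecret", msg), "big")
    # Base candidate: spend_public_key + t_k * G
    base = point_add(spend_pub, point_mul(t_k, G))
    # Label variants: base candidate plus each label public key
    candidates = [base] + [point_add(base, lbl) for lbl in label_pubs]
    return [c[0] for c in candidates]


def row_matches(tweak_point, outputs, scan_priv, spend_pub, label_pubs=()):
    """True if any candidate's top 8 x-coordinate bytes appear in outputs."""
    prefixes = set(outputs)
    xs = candidate_x_coords(tweak_point, scan_priv, spend_pub, label_pubs)
    return any((x >> 192) in prefixes for x in xs)
```

A sketch like this is mainly useful for cross-checking a handful of rows against the GPU results; it is far too slow for bulk scanning.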
Example:
-- Create a table of transactions to scan
CREATE TABLE tweak AS
SELECT
txid,
height,
tweak_key,
outputs
FROM read_parquet('bitcoin_transactions.parquet');
-- Scan for silent payments
SELECT hex(txid), height
FROM cudasp_scan(
(SELECT txid, height, tweak_key, outputs FROM tweak),
from_hex('0f694e068028a717f8af6b9411f9a133dd3565258714cc226594b34db90c1f2c'), -- scan_private_key
from_hex('36cf8fcd4d4890ab6c1083aeb5b50c260c20acda7839120e3575836f6d85c95ce0d705e31ff9fdcce67a8f3598871c6dfbe6bcde8a51cb7b48b0f95be0ea94de'), -- spend_public_key
[from_hex('cd63f9212a2deebde8a71e9ea23f6f958c47c41d2ed74b9617fe6fb554d1524e292fabddbdcbb643eafc328875c46d75a1d697b2b31c42d38aa93f85eab34bc1')], -- label_keys
batch_size := 300000
);
Measured on dual RTX 5090 GPUs with batch_size = 300000:
| Dataset Size | Processing Time | Throughput (tx/sec) |
|---|---|---|
| 1 week (1M rows) | 575ms | 1,989,401 |
| 2 weeks (2.3M rows) | 1.04s | 2,265,266 |
| 4 weeks (5M rows) | 2.28s | 2,198,706 |
| 8 weeks (9.4M rows) | 3.64s | 2,596,475 |
| 32 weeks (32.7M rows) | 12.5s | 2,622,216 |
- Single GPU: ~7.2 seconds for 1M rows
- Dual GPU: ~6.1 seconds for 1M rows (~1.17× speedup)
- Speedup limited by serial table scan overhead
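One way to reason about the limited speedup is Amdahl's law. The sketch below assumes that model applies here, which the measurements above do not confirm: it treats the run as a serial part (such as the table scan) plus a part that divides evenly across GPUs, and solves for the parallel fraction from the two timings quoted above. The function names are hypothetical.

```python
def parallel_fraction(t_single: float, t_multi: float, n_gpus: int) -> float:
    """Solve Amdahl's law, 1/S = (1 - p) + p/n, for the parallel fraction p."""
    speedup = t_single / t_multi
    return (1 - 1 / speedup) / (1 - 1 / n_gpus)


def max_speedup(p: float) -> float:
    """Upper bound on speedup as the GPU count grows without limit."""
    return 1 / (1 - p)


# Timings taken from the single-GPU and dual-GPU figures above
p = parallel_fraction(7.2, 6.1, 2)
print(f"parallel fraction: {p:.3f}")
print(f"speedup ceiling:   {max_speedup(p):.2f}x")
```

Under this model, adding GPUs beyond two would yield diminishing returns unless the serial portion is reduced.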
The extension automatically detects and utilizes multiple GPUs:
SELECT * FROM cudasp_scan(...);
-- Both GPUs will process batches concurrently
GPU Assignment:
- Round-robin thread assignment to GPUs
- Independent CUDA streams per thread
- Thread-safe per-device initialization
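The round-robin assignment can be illustrated with a small Python sketch. This is a conceptual model only: the class name is hypothetical, and the extension's actual C++/CUDA implementation may differ. Each worker thread asks for a device index once, and a lock keeps the counter consistent when several threads ask at the same time.

```python
import itertools
import threading


class RoundRobinDeviceAssigner:
    """Hand out GPU indices 0, 1, ..., n-1, 0, 1, ... in a thread-safe way."""

    def __init__(self, num_devices: int):
        if num_devices < 1:
            raise ValueError("at least one device is required")
        self._num_devices = num_devices
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def assign(self) -> int:
        # The lock makes the read-and-advance step atomic across threads.
        with self._lock:
            return next(self._counter) % self._num_devices


assigner = RoundRobinDeviceAssigner(num_devices=2)
print([assigner.assign() for _ in range(5)])
```

With two devices, consecutive threads alternate between GPU 0 and GPU 1, so batches spread evenly as long as threads do comparable amounts of work.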
# Real-time GPU monitoring (recommended)
nvtop
# Or use nvidia-smi
nvidia-smi -l 0.5
Dependencies:
- gECC: Fork of GPU elliptic curve cryptography library
- NVIDIA CUDA Runtime (statically linked)
- DuckDB 1.4.1
Optimizations:
- Column-major memory layout: Optimized for coalesced GPU memory access
- Fixed-point multiplication: Precomputed base point multiples
- Batch inversion: Efficient modular inverse using Montgomery's trick
- Persistent L2 cache: Pinned frequently accessed data
- Concurrent kernel execution: Multiple batches processed simultaneously on multi-GPU
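Of the techniques listed, batch inversion is the easiest to show in a few lines. Montgomery's trick replaces n modular inversions with one inversion plus about 3n multiplications, which matters because inversion is far more expensive than multiplication. The following is a minimal Python sketch of the general technique, not the extension's GPU kernel; the function name is hypothetical and all inputs are assumed to be nonzero modulo p.

```python
def batch_inverse(values, p):
    """Invert every element of values modulo prime p using one modular inversion."""
    n = len(values)
    # prefix[i] holds the product of values[0..i-1] modulo p
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p

    # Single inversion of the full product
    inv = pow(prefix[n], -1, p)

    # Walk backwards, peeling one factor off the running inverse each step
    result = [0] * n
    for i in range(n - 1, -1, -1):
        result[i] = prefix[i] * inv % p  # inverse of values[i]
        inv = inv * values[i] % p        # now the inverse of prefix[i]
    return result


# secp256k1 field prime, used here only as an example modulus
P = 2**256 - 2**32 - 977
xs = [3, 5, 7, 11]
invs = batch_inverse(xs, P)
print(all(x * ix % P == 1 for x, ix in zip(xs, invs)))
```

In elliptic curve code this is typically applied when converting a batch of points from projective to affine coordinates, where each point needs one field inversion.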
The function handles errors gracefully:
- Returns empty result set if no matches found
- Throws exception for invalid input formats
- Validates BLOB sizes (32 bytes for scalars, 64 bytes for points)
- Reports CUDA errors with detailed messages
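Since invalid BLOB sizes raise an exception, it can be convenient to check inputs on the client side before issuing the query. The sketch below mirrors the size rules listed above (32 bytes for scalars, 64 bytes for points). The function name and error messages are hypothetical and are not the extension's own.

```python
def validate_scan_inputs(scan_private_key: bytes,
                         spend_public_key: bytes,
                         label_keys=()) -> None:
    """Raise ValueError if any key has the wrong length for cudasp_scan."""
    if len(scan_private_key) != 32:
        raise ValueError(
            f"scan_private_key must be 32 bytes, got {len(scan_private_key)}")
    if len(spend_public_key) != 64:
        raise ValueError(
            f"spend_public_key must be 64 bytes, got {len(spend_public_key)}")
    for i, key in enumerate(label_keys):
        if len(key) != 64:
            raise ValueError(
                f"label_keys[{i}] must be 64 bytes, got {len(key)}")


# Passes silently: correct sizes and an empty label list
validate_scan_inputs(bytes(32), bytes(64), [])
```

This only checks lengths; it does not verify that a 64-byte value is a valid curve point or that the scalar is in range.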
This project is licensed under the MIT License - see the LICENSE file for details.