Description
Hey there! I've been trying to use the DeepVariant fast pipeline based on the case study, but I'm having problems running it. Thanks for taking the time to help!
Issue
I've followed the NVIDIA CUDA Installation Guide for Linux and the NVIDIA Container Toolkit Installation Guide before running the pipeline.
Running nvidia-smi in the container works properly:
singularity exec --oci --nv ./singularity/deepvariant-gpu.oci.sif nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5060 Ti On | 00000000:01:00.0 Off | N/A |
| 0% 29C P8 3W / 180W | 2MiB / 16311MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
However, when I try running the pipeline itself, this is the error I get after a few seconds:
File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1116, in <module>
app.run(main)
File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/absl_py/absl/app.py", line 312, in run
_run_main(main, args)
File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/absl_py/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1092, in main
call_variants(
File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 960, in call_variants
for distributed_inputs in dist_dataset:
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/input_lib.py", line 264, in __next__
return self.get_next()
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/input_lib.py", line 349, in get_next
num_replicas_with_values > 0, _value_or_dummy, _eof, strict=True)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/tensor_math_operator_overrides.py", line 150, in wrapper
return fn(x, y, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4278, in greater
_ops.raise_from_not_ok_status(e, name)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 5983, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: {{function_node __wrapped__Greater_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Greater] name:
I0000 00:00:1770812529.947133 40 fast_pipeline.cc:211] postprocess_variants_bin: "/opt/deepvariant/bin/postprocess_variants"
I0000 00:00:1770812529.947147 40 fast_pipeline.cc:212] Spawning postprocess_variants process
I0000 00:00:1770812529.948255 40 fast_pipeline.cc:216] postprocess_variants process stared
2026-02-11 12:22:10.497911: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-11 12:22:10.520597: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-11 12:22:10.899907: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2026-02-11 12:22:11.593359: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2026-02-11 12:22:11.593377: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: pop-os
2026-02-11 12:22:11.593381: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: pop-os
2026-02-11 12:22:11.593441: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: INVALID_ARGUMENT: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
2026-02-11 12:22:11.593453: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 590.48.01 Release Build (dvs-builder@U22-I3-AE18-23-3) Mon Dec 8 13:05:00 UTC 2025
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
"
Traceback (most recent call last):
File "tmp/Bazel.runfiles_3kht2prg/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 2348, in <module>
app.run(main)
File "tmp/Bazel.runfiles_3kht2prg/runfiles/absl_py/absl/app.py", line 312, in run
_run_main(main, args)
File "tmp/Bazel.runfiles_3kht2prg/runfiles/absl_py/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "tmp/Bazel.runfiles_3kht2prg/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 2259, in main
cvo_paths = get_cvo_paths(_INFILE.value)
File "tmp/Bazel.runfiles_3kht2prg/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1633, in get_cvo_paths
raise ValueError(
ValueError: ('Found multiple file patterns in input filename space: ', './output/HG002.cvo.tfrecord.gz')
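For context on the `@N` filespecs that appear throughout the configs below (e.g. `/tmp/examples.tfrecords@14.gz`): as I understand it, DeepVariant expands a spec like `name@N.suffix` into N numbered shard files, which is presumably what `get_cvo_paths` is pattern-matching on when it raises the error above. A minimal sketch of that naming convention (my own re-implementation for illustration, not DeepVariant's actual sharded-file utility):

```python
import re

def generate_shard_names(spec: str) -> list[str]:
    """Expand a DeepVariant-style sharded filespec.

    'examples.tfrecords@14.gz' -> ['examples.tfrecords-00000-of-00014.gz', ...]
    Illustrative re-implementation of the name@N[.suffix] convention;
    the real logic lives in nucleus/DeepVariant, not here.
    """
    m = re.match(r'^(.*)@(\d+)(\.[^@]*)?$', spec)
    if not m:
        raise ValueError(f'not a sharded spec: {spec}')
    base, n, suffix = m.group(1), int(m.group(2)), m.group(3) or ''
    return [f'{base}-{i:05d}-of-{n:05d}{suffix}' for i in range(n)]

if __name__ == '__main__':
    for name in generate_shard_names('/tmp/examples.tfrecords@14.gz')[:2]:
        print(name)
```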
Furthermore, and interestingly, when I run the pipeline using run_deepvariant with the same GPU image, make_examples runs for 7m 30s at 100% CPU. Then, starting with call_variants, VRAM fills to 100% but GPU utilization stays at 0%. After 5 minutes I get the following error message:
raise Empty
_queue.Empty
Process ForkProcess-15:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
item = output_queue.get(timeout=300)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty
Process ForkProcess-13:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
item = output_queue.get(timeout=300)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty
Process ForkProcess-14:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
item = output_queue.get(timeout=300)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty
Process ForkProcess-16:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
item = output_queue.get(timeout=300)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty
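The repeated `_queue.Empty` tracebacks look consistent with the post-processing worker processes timing out on `output_queue.get(timeout=300)` while the stalled GPU inference stage never produces anything. A minimal stdlib sketch of that failure mode (hypothetical names, not DeepVariant's code):

```python
import multiprocessing as mp
import queue

def post_processing(output_queue, timeout: float) -> str:
    """Consumer loop body: if the producer (here, the GPU inference stage)
    never enqueues anything, get() raises queue.Empty after `timeout`."""
    try:
        output_queue.get(timeout=timeout)
        return 'got item'
    except queue.Empty:
        return 'queue.Empty after timeout'

if __name__ == '__main__':
    q = mp.Queue()  # nothing is ever put on it, mimicking a stalled producer
    print(post_processing(q, timeout=0.2))
```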
Parabricks DeepVariant runs without problems.
Setup
- System:
- OS: Pop!_OS 24.04
- CPU: AMD Ryzen 9 7900 12-Core (24 threads) AVX512
- GPU: NVIDIA GeForce RTX 5060 Ti 16GB (driver 590.48.01)
- CUDA: 13.1
- DeepVariant version: r1.10.0-beta
- Installation method: Singularity image built from docker://google/deepvariant:1.10.0-beta-gpu
- Type of data: WES HG002 from SRR2962669, mapped to GCA_000001405.15_GRCh38_no_alt_analysis_set with Parabricks fq2bam
Steps to reproduce:
fast_pipeline command, run via Nextflow 25.10.3 (with containerOptions '--nv'):
def regions_arg = target_bed == [] ? "" : "--regions ${target_bed}"
def haploid_arg = params.gender == "male" ? "--haploid_contigs=chrX,chrY" : ""
def par_regions_arg = params.gender == "male" ? "--par_regions_bed=PAR.bed" : ""
def model = params.mode.toLowerCase() // Model is 'wes' here
"""
mkdir -p config
FILE=config/make_examples.ini
# Create the make_examples config.
cat <<EOM >\$FILE
--examples=/tmp/examples.tfrecords@14.gz
--gvcf=/tmp/examples.gvcf.tfrecord@14.gz
--mode=calling
--reads=${bam_input}
--ref=${fasta}
--alt_aligned_pileup=diff_channels
--max_reads_per_partition=600
--min_mapping_quality=1
--parse_sam_aux_fields
--partition_size=25000
--phase_reads
--pileup_image_width=147
--norealign_reads
--sort_by_haplotypes
--track_ref_reads
--vsc_min_fraction_indels=0.12
--trim_reads_for_pileup
--trained_small_model_path=/opt/smallmodels/wgs
--small_model_snp_gq_threshold=25
--small_model_indel_gq_threshold=30
--small_model_vaf_context_window_size=51
--output_phase_info
--checkpoint=/opt/models/${model}
${regions_arg}
EOM
FILE=config/call_variants.ini
cat <<EOM >\$FILE
--outfile=./output/${meta.id}.cvo.tfrecord.gz
--checkpoint=/opt/models/${model}
--batch_size=1024
--writer_threads=1
EOM
FILE=config/postprocess_variants.ini
cat <<EOM >\$FILE
--ref=${fasta}
--infile=./output/${meta.id}.cvo.tfrecord.gz
--nonvariant_site_tfrecord_path=/tmp/examples.gvcf.tfrecord@14.gz
--outfile=./${meta.id}.vcf.gz
--gvcf_outfile=./${meta.id}.g.vcf.gz
--small_model_cvo_records=/tmp/examples_call_variant_outputs.tfrecords@14.gz
--cpus=${task.cpus}
${haploid_arg}
${par_regions_arg}
EOM
# Create PAR.bed if male
if [ "${params.gender}" = "male" ]; then
cat > PAR.bed <<EOF
chrX 10000 44821
chrX 94821 133871
chrX 222346 226276
chrX 226351 1949345
chrX 2132994 2137388
chrX 2137488 2781479
chrX 155701383 156030895
chrY 10000 44821
chrY 94821 133871
chrY 222346 226276
chrY 226351 1949345
chrY 2132994 2137388
chrY 2137488 2781479
chrY 56887902 57217415
EOF
fi
# Bazel uses TMPDIR and writes files there,
# so redirect it to a writable folder (the current workdir).
export TMPDIR=\$PWD/tmp
mkdir tmp
/opt/deepvariant/bin/fast_pipeline \
--make_example_flags ./config/make_examples.ini \
--call_variants_flags ./config/call_variants.ini \
--postprocess_variants_flags ./config/postprocess_variants.ini \
--shm_prefix dv \
--num_shards ${task.cpus} \
--buffer_size 10485760
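Regarding the TMPDIR redirection above: the Bazel runfiles wrappers extract into the temp directory, so pointing TMPDIR at a writable path under the container workdir avoids problems with a read-only /tmp. A quick stdlib check that TMPDIR is honored by the usual tempdir lookup (illustrative only; I'm assuming the DeepVariant binaries resolve it the same standard way):

```python
import os
import tempfile

def effective_tmpdir(path: str) -> str:
    """Point TMPDIR at `path` and report the temp dir the stdlib resolves.
    tempfile caches its choice, so the cache is reset before re-resolving."""
    os.makedirs(path, exist_ok=True)
    os.environ['TMPDIR'] = path
    tempfile.tempdir = None  # drop the cached value
    return tempfile.gettempdir()

if __name__ == '__main__':
    print(effective_tmpdir(os.path.join(os.getcwd(), 'tmp')))
```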
run_deepvariant command:
def regions_arg = target_bed == [] ? "" : "--regions ${target_bed}"
def haploid_arg = params.gender == "male" ? "--haploid_contigs chrX,chrY" : ""
def par_regions_arg = params.gender == "male" ? "--par_regions_bed PAR.bed" : ""
"""
# Create PAR.bed if male
if [ "${params.gender}" = "male" ]; then
cat > PAR.bed <<EOF
chrX 10000 44821
chrX 94821 133871
chrX 222346 226276
chrX 226351 1949345
chrX 2132994 2137388
chrX 2137488 2781479
chrX 155701383 156030895
chrY 10000 44821
chrY 94821 133871
chrY 222346 226276
chrY 226351 1949345
chrY 2132994 2137388
chrY 2137488 2781479
chrY 56887902 57217415
EOF
fi
# Bazel uses TMPDIR and writes files there,
# so redirect it to a writable folder (the current workdir).
export TMPDIR=\$PWD/tmp
mkdir tmp
/opt/deepvariant/bin/run_deepvariant \
--model_type ${params.mode} \
--ref ${fasta} \
--reads ${bam_input} \
${regions_arg} \
--output_vcf ./${meta.id}.output.vcf.gz \
--output_gvcf ./${meta.id}.output.g.vcf.gz \
--num_shards ${task.cpus} \
${haploid_arg} \
${par_regions_arg}