Crash when running DeepVariant on GPU #1054

@gergo-hollo

Description

Hey there! I've been trying to use the DeepVariant fast pipeline based on the case study, but I'm having problems running it. Thanks for taking the time to help!

Issue
I've followed the NVIDIA CUDA Installation Guide for Linux and the NVIDIA Container Toolkit Installation Guide before running the pipeline.

Running nvidia-smi in the container works properly:

singularity exec --oci --nv ./singularity/deepvariant-gpu.oci.sif nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
|  0%   29C    P8              3W /  180W |       2MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

However, when I try running the pipeline, this is the error I get after a few seconds:

File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1116, in <module>
      app.run(main)
    File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/absl_py/absl/app.py", line 312, in run
      _run_main(main, args)
    File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/absl_py/absl/app.py", line 258, in _run_main
      sys.exit(main(argv))
    File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1092, in main
      call_variants(
    File "tmp/Bazel.runfiles_zp5s9wsq/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 960, in call_variants
      for distributed_inputs in dist_dataset:
    File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/input_lib.py", line 264, in __next__
      return self.get_next()
    File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/input_lib.py", line 349, in get_next
      num_replicas_with_values > 0, _value_or_dummy, _eof, strict=True)
    File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/tensor_math_operator_overrides.py", line 150, in wrapper
      return fn(x, y, *args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4278, in greater
      _ops.raise_from_not_ok_status(e, name)
    File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 5983, in raise_from_not_ok_status
      raise core._status_to_exception(e) from None  # pylint: disable=protected-access
  tensorflow.python.framework.errors_impl.InternalError: {{function_node __wrapped__Greater_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Greater] name: 
  I0000 00:00:1770812529.947133      40 fast_pipeline.cc:211] postprocess_variants_bin: "/opt/deepvariant/bin/postprocess_variants"
  I0000 00:00:1770812529.947147      40 fast_pipeline.cc:212] Spawning postprocess_variants process
  I0000 00:00:1770812529.948255      40 fast_pipeline.cc:216] postprocess_variants process stared
  2026-02-11 12:22:10.497911: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
  2026-02-11 12:22:10.520597: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
  To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  2026-02-11 12:22:10.899907: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
  2026-02-11 12:22:11.593359: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
  2026-02-11 12:22:11.593377: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: pop-os
  2026-02-11 12:22:11.593381: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: pop-os
  2026-02-11 12:22:11.593441: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: INVALID_ARGUMENT: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
  2026-02-11 12:22:11.593453: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  590.48.01  Release Build  (dvs-builder@U22-I3-AE18-23-3)  Mon Dec  8 13:05:00 UTC 2025
  GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04) 
  "
  Traceback (most recent call last):
    File "tmp/Bazel.runfiles_3kht2prg/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 2348, in <module>
      app.run(main)
    File "tmp/Bazel.runfiles_3kht2prg/runfiles/absl_py/absl/app.py", line 312, in run
      _run_main(main, args)
    File "tmp/Bazel.runfiles_3kht2prg/runfiles/absl_py/absl/app.py", line 258, in _run_main
      sys.exit(main(argv))
    File "tmp/Bazel.runfiles_3kht2prg/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 2259, in main
      cvo_paths = get_cvo_paths(_INFILE.value)
    File "tmp/Bazel.runfiles_3kht2prg/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1633, in get_cvo_paths
      raise ValueError(
  ValueError: ('Found multiple file patterns in input filename space: ', './output/HG002.cvo.tfrecord.gz')
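For context on the path conventions involved here: the `--examples` and `--gvcf` flags in the configs below use an `@N` sharding shorthand, while `--outfile`/`--infile` are plain paths. A rough sketch of how that shorthand is usually expanded; the `-NNNNN-of-NNNNN` naming is assumed from the common convention (DeepVariant's actual expansion lives in its sharded-file utilities and is not reproduced here):

```python
# Hedged sketch of the "@N" sharding shorthand, e.g.
# --examples=/tmp/examples.tfrecords@14.gz -> 14 shard files.
# Not DeepVariant's implementation; naming convention is an assumption.
def expand_sharded(spec: str) -> list[str]:
    base, _, rest = spec.partition("@")     # "/tmp/examples.tfrecords", "14.gz"
    num, dot, ext = rest.partition(".")     # "14", ".", "gz"
    n = int(num)
    suffix = f".{ext}" if dot else ""
    return [f"{base}-{i:05d}-of-{n:05d}{suffix}" for i in range(n)]

shards = expand_sharded("/tmp/examples.tfrecords@14.gz")
print(shards[0])    # /tmp/examples.tfrecords-00000-of-00014.gz
print(len(shards))  # 14
```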

Interestingly, when I run the pipeline using run_deepvariant with the same GPU image, make_examples runs for 7m 30s at 100% CPU. Then, once call_variants starts, VRAM usage goes to 100% but GPU utilization stays at 0%. After 5 minutes I get the following error message:

     raise Empty
  _queue.Empty
  Process ForkProcess-15:
  Traceback (most recent call last):
    File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
      self.run()
    File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
      self._target(*self._args, **self._kwargs)
    File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
      item = output_queue.get(timeout=300)
    File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
      raise Empty
  _queue.Empty
  Process ForkProcess-13:
  Traceback (most recent call last):
    File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
      self.run()
    File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
      self._target(*self._args, **self._kwargs)
    File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
      item = output_queue.get(timeout=300)
    File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
      raise Empty
  _queue.Empty
  Process ForkProcess-14:
  Traceback (most recent call last):
    File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
      self.run()
    File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
      self._target(*self._args, **self._kwargs)
    File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
      item = output_queue.get(timeout=300)
    File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
      raise Empty
  _queue.Empty
  Process ForkProcess-16:
  Traceback (most recent call last):
    File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
      self.run()
    File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
      self._target(*self._args, **self._kwargs)
    File "tmp/Bazel.runfiles_ey3g62wy/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 561, in post_processing
      item = output_queue.get(timeout=300)
    File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
      raise Empty
  _queue.Empty
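The repeated ForkProcess tracebacks are call_variants' post-processing workers timing out on their output queue: `output_queue.get(timeout=300)` raises `queue.Empty` when inference never produces a result within 5 minutes. A minimal stand-alone sketch of that failure mode (the worker logic here is simplified, not DeepVariant's actual code):

```python
import multiprocessing
import queue

def post_processing(output_queue, timeout):
    # Simplified stand-in for the worker loop in call_variants.py:
    # block on the queue; if nothing arrives in time, give up.
    try:
        return output_queue.get(timeout=timeout)
    except queue.Empty:
        return None  # corresponds to the _queue.Empty tracebacks above

if __name__ == "__main__":
    q = multiprocessing.Queue()
    # No producer ever puts anything, so get() times out -- the same
    # situation as a stalled GPU inference loop that should feed the queue.
    assert post_processing(q, timeout=0.2) is None
    print("worker timed out as expected")
```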

Parabricks DeepVariant runs without problems.

Setup

  • System:
    • OS: Pop!_OS 24.04
    • CPU: AMD Ryzen 9 7900 12-Core (24 threads) AVX512
    • GPU: NVIDIA GeForce RTX 5060 Ti 16GB (driver 590.48.01)
    • CUDA: 13.1
  • DeepVariant version: r1.10.0-beta
  • Installation method: Singularity image built from docker://google/deepvariant:1.10.0-beta-gpu
  • Type of data: WES HG002 from SRR2962669, mapped to GCA_000001405.15_GRCh38_no_alt_analysis_set with Parabricks fq2bam

Steps to reproduce:
Command in Nextflow 25.10.3 (with containerOptions: '--nv'):

    def regions_arg = target_bed == [] ? "" : "--regions ${target_bed}"
    def haploid_arg = params.gender == "male" ? "--haploid_contigs=chrX,chrY" : ""
    def par_regions_arg = params.gender == "male" ? "--par_regions_bed=PAR.bed" : ""
    def model = params.mode.toLowerCase() // Model is 'wes' here
    """
    mkdir -p config
    FILE=config/make_examples.ini

    # Create config 1.
    cat <<EOM >\$FILE
--examples=/tmp/examples.tfrecords@14.gz
--gvcf=/tmp/examples.gvcf.tfrecord@14.gz
--mode=calling
--reads=${bam_input}
--ref=${fasta}
--alt_aligned_pileup=diff_channels
--max_reads_per_partition=600
--min_mapping_quality=1
--parse_sam_aux_fields
--partition_size=25000
--phase_reads
--pileup_image_width=147
--norealign_reads
--sort_by_haplotypes
--track_ref_reads
--vsc_min_fraction_indels=0.12
--trim_reads_for_pileup
--trained_small_model_path=/opt/smallmodels/wgs
--small_model_snp_gq_threshold=25
--small_model_indel_gq_threshold=30
--small_model_vaf_context_window_size=51
--output_phase_info
--checkpoint=/opt/models/${model}
${regions_arg}
EOM

    FILE=config/call_variants.ini

    cat <<EOM >\$FILE
--outfile=./output/${meta.id}.cvo.tfrecord.gz
--checkpoint=/opt/models/${model}
--batch_size=1024
--writer_threads=1
EOM

    FILE=config/postprocess_variants.ini

    cat <<EOM >\$FILE
--ref=${fasta}
--infile=./output/${meta.id}.cvo.tfrecord.gz
--nonvariant_site_tfrecord_path=/tmp/examples.gvcf.tfrecord@14.gz
--outfile=./${meta.id}.vcf.gz
--gvcf_outfile=./${meta.id}.g.vcf.gz
--small_model_cvo_records=/tmp/examples_call_variant_outputs.tfrecords@14.gz
--cpus=${task.cpus}
${haploid_arg}
${par_regions_arg}
EOM

    # Create PAR.bed if male
    if [ "${params.gender}" = "male" ]; then
        cat > PAR.bed <<EOF
chrX	10000	44821
chrX	94821	133871
chrX	222346	226276
chrX	226351	1949345
chrX	2132994	2137388
chrX	2137488	2781479
chrX	155701383	156030895
chrY	10000	44821
chrY	94821	133871
chrY	222346	226276
chrY	226351	1949345
chrY	2132994	2137388
chrY	2137488	2781479
chrY	56887902	57217415
EOF
    fi

    # Bazel uses TMPDIR and writes files there,
    # so redirect it to a writable folder (the current workdir).
    export TMPDIR=\$PWD/tmp
    mkdir tmp

    /opt/deepvariant/bin/fast_pipeline \
    --make_example_flags ./config/make_examples.ini \
    --call_variants_flags ./config/call_variants.ini \
    --postprocess_variants_flags ./config/postprocess_variants.ini \
    --shm_prefix dv \
    --num_shards ${task.cpus} \
    --buffer_size 10485760 

run_deepvariant command:

    def regions_arg = target_bed == [] ? "" : "--regions ${target_bed}"
    def haploid_arg = params.gender == "male" ? "--haploid_contigs chrX,chrY" : ""
    def par_regions_arg = params.gender == "male" ? "--par_regions_bed PAR.bed" : ""
    """
    # Create PAR.bed if male
    if [ "${params.gender}" = "male" ]; then
        cat > PAR.bed <<EOF
chrX	10000	44821
chrX	94821	133871
chrX	222346	226276
chrX	226351	1949345
chrX	2132994	2137388
chrX	2137488	2781479
chrX	155701383	156030895
chrY	10000	44821
chrY	94821	133871
chrY	222346	226276
chrY	226351	1949345
chrY	2132994	2137388
chrY	2137488	2781479
chrY	56887902	57217415
EOF
    fi

    # Bazel uses TMPDIR and writes files there,
    # so redirect it to a writable folder (the current workdir).
    export TMPDIR=\$PWD/tmp
    mkdir tmp

    /opt/deepvariant/bin/run_deepvariant \
    --model_type ${params.mode} \
    --ref ${fasta} \
    --reads ${bam_input} \
    ${regions_arg} \
    --output_vcf ./${meta.id}.output.vcf.gz \
    --output_gvcf ./${meta.id}.output.g.vcf.gz \
    --num_shards ${task.cpus} \
    ${haploid_arg} \
    ${par_regions_arg}
