RETURNN starting up, version 1.20260227.073346+git.9e93588a, date/time 2026-03-03-11-05-14 (UTC+0100), pid 79310, cwd /rwthfs/rz/cluster/hpcwork/p0023565/mwk22690/setups/2025-11-10-start/work/i6_core/returnn/training/ReturnnTrainingJob.OuK06vxsrJJq/work, Python /home/az668407/work/py-envs/py3.12-torch2.7/bin/python3
RETURNN command line options: ['/rwthfs/rz/cluster/home/mwk22690/setups/2025-11-10-start/work/i6_core/returnn/training/ReturnnTrainingJob.OuK06vxsrJJq/output/returnn.config']
Hostname: n23g0010.hpc.itc.rwth-aachen.de
...
PyTorch: 2.7.1+cu126 (e2d141dbde55c2a4370fac5165b0561b6af4798b) (<site-package> in /home/az668407/work/py-envs/py3.12-torch2.7/lib/python3.12/site-packages/torch)
CUDA_VISIBLE_DEVICES=0
MKL_EXAMPLES=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/share/doc/mkl/examples
CUDA_PATH=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/CUDA/12.6.3
CUDA_ROOT=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/CUDA/12.6.3
OMP_NUM_THREADS=24
CUDA_HOME=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/CUDA/12.6.3
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
MKL_NUM_THREADS=24
CUDA_VISIBLE_DEVICES is set to '0'.
Num NVML devices: 1
Available CUDA devices:
1/1: cuda:0
name: NVIDIA H100
total_memory: 93.1GB
free_memory: 92.6GB (99%)
capability: 9.0
device_index: 0
uuid: c67fb24c-9d87-f003-79a7-5aed55c54f76
nvml_device_index: 0
RETURNN global startup callback.
/w0/tmp disk usage: total 695.8GB, used 54.0GB, free 641.9GB
Total freed space: 0B
...
ep 18 train, step 52, no_collapse_ctc 1.580, no_collapse_ctc_err 0.363, ctc_4 1.580, ctc_err_4 0.363, ctc_10 1.339, ctc_err_10 0.287, ctc_16 1.176, ctc_err_16 0.252, ce 0.866, fer 0.160, num_seqs 28, max_size:time 266552, max_size:out-spatial 69, mem_usage:cuda 52.4GB, 0.305 sec/step, elapsed 0:00:28, exp. remaining 2:57:08, complete 0.27%
ep 18 train, step 53, txt_ctc_10 1.032, txt_ctc_err_10 0.157, txt_ctc_16 0.944, txt_ctc_err_16 0.149, txt_ce 0.916, txt_fer 0.135, grad_norm:p2 11.455, num_seqs 200, max_size:time 0, max_size:out-spatial 32, mem_usage:cuda 52.4GB, 0.280 sec/step, elapsed 0:00:29, exp. remaining 2:58:40, complete 0.27%
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...
File "/rwthfs/rz/cluster/home/mwk22690/setups/2025-11-10-start/tools/returnn/returnn/torch/frontend/_backend.py", line 755, in TorchBackend.ctc_loss
line: loss_raw = torch.nn.functional.ctc_loss(
log_probs=log_probs,
targets=targets_raw,
input_lengths=input_lengths,
target_lengths=targets_lengths,
blank=blank_index,
zero_infinity=True,
reduction="none",
)
locals:
log_probs = <local> <torch.Tensor: repr-error RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_D...
targets = <local> Tensor{'text', [B,T|'out-spatial'[B]], dtype='int32', sparse_dim=Dim{F'vocab'(10240)}}
targets_raw = <local> <torch.Tensor: repr-error RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_D...
input_lengths = <local> tensor[28] i32 x∈[228, 280] μ=254.179 σ=23.777
targets_lengths = <local> tensor[28] i32 x∈[35, 66] μ=50.429 σ=9.183
blank_index = <local> 10240
File "/home/az668407/work/py-envs/py3.12-torch2.7/lib/python3.12/site-packages/torch/nn/functional.py", line 3079, in ctc_loss
line: return torch.ctc_loss(
log_probs,
targets,
input_lengths,
target_lengths,
blank,
_Reduction.get_enum(reduction),
zero_infinity,
)
locals:
torch.ctc_loss = <global> <built-in method ctc_loss of type object at 0x14b283689fa0>
log_probs = <local> <torch.Tensor: repr-error RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_D...
targets = <local> <torch.Tensor: repr-error RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_D...
input_lengths = <local> tensor[28] i32 x∈[228, 280] μ=254.179 σ=23.777
target_lengths = <local> tensor[28] i32 x∈[35, 66] μ=50.429 σ=9.183
blank = <local> 10240
reduction = <local> 'none'
zero_infinity = <local> True
RuntimeError: CUDA error: an illegal memory access was encountered
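The failing call above can be narrowed down in isolation on CPU, where out-of-range indices in `ctc_loss` surface as immediate Python exceptions instead of asynchronous illegal-memory-access errors. This is a minimal sketch with hypothetical toy shapes (T=50, N=4, C=11 classes including blank), not the real tensors from the log; in the log the class count would be 10241 with `blank=10240` as the last class, and `blank` must always satisfy `0 <= blank < C`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for the failing call, same argument layout as in the traceback.
T, N, C = 50, 4, 11  # frames, batch, classes incl. blank (hypothetical sizes)
log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)  # (T, N, C)
targets = torch.randint(0, C - 1, (N, 20))               # labels exclude blank
input_lengths = torch.full((N,), T, dtype=torch.int64)
target_lengths = torch.full((N,), 20, dtype=torch.int64)

loss = F.ctc_loss(
    log_probs,
    targets,
    input_lengths,
    target_lengths,
    blank=C - 1,        # blank must be a valid class index: 0 <= blank < C
    zero_infinity=True,
    reduction="none",   # per-sequence losses, as in the RETURNN call
)
print(loss.shape)  # torch.Size([4])
```

On CPU, a `blank` or target index >= C raises a clear `RuntimeError` at the call site; on GPU the same mistake can manifest only later as an illegal memory access, which is why `CUDA_LAUNCH_BLOCKING=1` (as the log suggests) is the other way to localize the faulting kernel.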