Improve CUDA performance for multi-dimensional arrays #48

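The baseline performance of the multi-dimensional array kernels is poor. In our CUDA benchmarks, the unfused and fused kernels on a 5-D array of shape (50, 5, 5, 6, 5400) reach only about 4% to 11% of device bandwidth, while the same kernels on a flat vector with the same number of elements (40,500,000) reach about 25% to 62%: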

```
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌───────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                             │ time per call                     │ bw %    │ achieved bw │ problem size        │
├───────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_unfused! │ 35 milliseconds, 32 microseconds  │ 4.70676 │ 34.4535     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_fused!   │ 37 milliseconds, 933 microseconds │ 4.34687 │ 31.8191     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_unfused! │ 6 milliseconds, 464 microseconds  │ 25.5066 │ 186.708     │ (40500000,)         │
│ perf_kernel_shared_reads_fused!   │ 3 milliseconds, 57 microseconds   │ 53.9236 │ 394.721     │ (40500000,)         │
└───────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 98.213180 seconds (41.61 M allocations: 5.144 GiB, 10.82% gc time, 36.05% compilation time: 1% of which was recompilation)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌──────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                                    │ time per call                     │ bw %    │ achieved bw │ problem size        │
├──────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_writes_unfused! │ 24 milliseconds, 180 microseconds │ 6.81902 │ 49.9152     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_fused!   │ 15 milliseconds, 248 microseconds │ 10.8137 │ 79.156      │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_unfused! │ 3 milliseconds, 895 microseconds  │ 42.325  │ 309.819     │ (40500000,)         │
│ perf_kernel_shared_reads_writes_fused!   │ 2 milliseconds, 673 microseconds  │ 61.6844 │ 451.53      │ (40500000,)         │
└──────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 56.390611 seconds (3.79 M allocations: 2.642 GiB, 15.67% gc time, 6.64% compilation time)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌─────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                   │ time per call                     │ bw %    │ achieved bw │ problem size        │
├─────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_unfused!    │ 34 milliseconds, 948 microseconds │ 4.71814 │ 34.5368     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_fused!      │ 37 milliseconds, 955 microseconds │ 4.34428 │ 31.8001     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_hard_coded! │ 3 milliseconds, 291 microseconds  │ 50.1021 │ 366.747     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_unfused!    │ 6 milliseconds, 463 microseconds  │ 25.5097 │ 186.731     │ (40500000,)         │
│ perf_kernel_fused!      │ 3 milliseconds, 58 microseconds   │ 53.9181 │ 394.68      │ (40500000,)         │
│ perf_kernel_hard_coded! │ 3 milliseconds, 339 microseconds  │ 49.3735 │ 361.414     │ (40500000,)         │
└─────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 81.495402 seconds (3.83 M allocations: 2.629 GiB, 15.54% gc time, 5.22% compilation time)
```
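
For reference, the `achieved bw` and `bw %` columns appear to follow from the reported times as (bytes moved) / (time), expressed in GiB/s and then divided by `Device_bandwidth_GBs`. The sketch below is inferred from the numbers in the tables, not taken from the benchmark script; the helper names `bytes_moved` and `achieved_bw` are hypothetical.

```julia
# Sketch: reproduce the "achieved bw" and "bw %" columns from a reported time.
# Assumptions (inferred from the tables, not from the benchmark source):
#   - each call moves n_reads_writes * length(array) * sizeof(T) bytes
#   - bandwidth is reported in units of 2^30 bytes per second

# Total bytes moved by one kernel call (hypothetical helper).
function bytes_moved(n_reads_writes, problem_size, ::Type{T}) where {T}
    return n_reads_writes * prod(problem_size) * sizeof(T)
end

# Achieved bandwidth for a call that took `time_s` seconds (hypothetical helper).
function achieved_bw(n_reads_writes, problem_size, ::Type{T}, time_s) where {T}
    return bytes_moved(n_reads_writes, problem_size, T) / 2^30 / time_s
end

device_bw = 732  # Device_bandwidth_GBs from the benchmark header

# perf_kernel_fused! on the flat vector: 3 milliseconds, 58 microseconds
bw_vec = achieved_bw(8, (40_500_000,), Float32, 3.058e-3)

# perf_kernel_fused! on the 5-D array: 37 milliseconds, 955 microseconds
bw_nd = achieved_bw(8, (50, 5, 5, 6, 5400), Float32, 37.955e-3)

println("vector: ", round(bw_vec; digits=1), " achieved, ",
        round(100 * bw_vec / device_bw; digits=1), " % of device bandwidth")
println("5-D:    ", round(bw_nd; digits=1), " achieved, ",
        round(100 * bw_nd / device_bw; digits=1), " % of device bandwidth")
```

Both shapes hold the same number of elements, so both calls move the same number of bytes; the difference in achieved bandwidth comes entirely from the time per call. The results differ from the table in the last digits because the reported times are rounded to whole microseconds.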

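Unlike our vector kernels, where fusion gives a clear speedup in every benchmark, fusing the multi-dimensional array kernels does not reliably improve on the unfused versions: the fused kernel is slightly slower in two of the three benchmarks above. The hard-coded kernel reaches about 50% of device bandwidth on the same 5-D shape, so the shape alone does not seem to be the limit. Could this be related to https://github.com/JuliaGPU/Metal.jl/issues/101?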
