Improve CUDA performance for multi-dimensional arrays #48

@charleskawczynski

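The baseline performance of the multidimensional array kernels is poor. On the (50, 5, 5, 6, 5400) problem, the fused and unfused kernels reach only about 4 to 11% of device bandwidth, compared with roughly 25 to 62% for the same 40,500,000 elements stored as a vector. From our CUDA benchmarks: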

N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌───────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                             │ time per call                     │ bw %    │ achieved bw │ problem size        │
├───────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_unfused! │ 35 milliseconds, 32 microseconds  │ 4.70676 │ 34.4535     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_fused!   │ 37 milliseconds, 933 microseconds │ 4.34687 │ 31.8191     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_unfused! │ 6 milliseconds, 464 microseconds  │ 25.5066 │ 186.708     │ (40500000,)         │
│ perf_kernel_shared_reads_fused!   │ 3 milliseconds, 57 microseconds   │ 53.9236 │ 394.721     │ (40500000,)         │
└───────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 98.213180 seconds (41.61 M allocations: 5.144 GiB, 10.82% gc time, 36.05% compilation time: 1% of which was recompilation)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌──────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                                    │ time per call                     │ bw %    │ achieved bw │ problem size        │
├──────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_writes_unfused! │ 24 milliseconds, 180 microseconds │ 6.81902 │ 49.9152     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_fused!   │ 15 milliseconds, 248 microseconds │ 10.8137 │ 79.156      │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_unfused! │ 3 milliseconds, 895 microseconds  │ 42.325  │ 309.819     │ (40500000,)         │
│ perf_kernel_shared_reads_writes_fused!   │ 2 milliseconds, 673 microseconds  │ 61.6844 │ 451.53      │ (40500000,)         │
└──────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 56.390611 seconds (3.79 M allocations: 2.642 GiB, 15.67% gc time, 6.64% compilation time)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌─────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                   │ time per call                     │ bw %    │ achieved bw │ problem size        │
├─────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_unfused!    │ 34 milliseconds, 948 microseconds │ 4.71814 │ 34.5368     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_fused!      │ 37 milliseconds, 955 microseconds │ 4.34428 │ 31.8001     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_hard_coded! │ 3 milliseconds, 291 microseconds  │ 50.1021 │ 366.747     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_unfused!    │ 6 milliseconds, 463 microseconds  │ 25.5097 │ 186.731     │ (40500000,)         │
│ perf_kernel_fused!      │ 3 milliseconds, 58 microseconds   │ 53.9181 │ 394.68      │ (40500000,)         │
│ perf_kernel_hard_coded! │ 3 milliseconds, 339 microseconds  │ 49.3735 │ 361.414     │ (40500000,)         │
└─────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 81.495402 seconds (3.83 M allocations: 2.629 GiB, 15.54% gc time, 5.22% compilation time)

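Fusing our multi-dimensional array kernels does not improve on the unfused kernels the way fusing our vector kernels does. For `perf_kernel_shared_reads_*` and `perf_kernel_*`, the fused multi-dimensional kernel is slightly slower than the unfused one (about 38 ms vs. 35 ms), whereas the fused vector kernel is roughly twice as fast (about 3.1 ms vs. 6.5 ms). Only `perf_kernel_shared_reads_writes_*` shows a gain from fusion on the multi-dimensional shape (about 15.2 ms vs. 24.2 ms). `perf_kernel_hard_coded!` reaches about 50% of device bandwidth on the same multi-dimensional shape, so much better performance is attainable there. Could this be related to JuliaGPU/Metal.jl#101?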
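
For anyone cross-checking the tables, here is a minimal sketch of how the `achieved bw` and `bw %` columns appear to follow from `time per call`. The formula is inferred from the reported numbers rather than taken from the benchmark source, so treat it as an assumption. In particular, the figures only match if bytes are counted in GiB (2^30), even though the header says `Device_bandwidth_GBs`. The helper names are hypothetical.

```python
import math

GIB = 2**30  # assumption: the reported figures only match with GiB, not 10^9 bytes


def achieved_bandwidth(problem_size, n_reads_writes, bytes_per_element, seconds):
    """Bytes moved per second, in GiB/s, assuming every element is touched
    n_reads_writes times (hypothetical helper, inferred from the tables)."""
    n_elements = math.prod(problem_size)
    bytes_moved = n_reads_writes * bytes_per_element * n_elements
    return bytes_moved / GIB / seconds


def bandwidth_percent(achieved, device_bandwidth):
    """Achieved bandwidth as a percentage of the device peak."""
    return 100 * achieved / device_bandwidth


# First row of the first table: perf_kernel_shared_reads_unfused!
shape = (50, 5, 5, 6, 5400)      # 40,500,000 elements, same count as (40500000,)
time_per_call = 35e-3 + 32e-6    # 35 milliseconds, 32 microseconds
bw = achieved_bandwidth(shape, n_reads_writes=8, bytes_per_element=4,
                        seconds=time_per_call)
pct = bandwidth_percent(bw, device_bandwidth=732)
print(f"achieved bw = {bw:.3f}, bw % = {pct:.3f}")
```

Under these assumptions the first row reproduces to within rounding of the printed time (about 34.45 and 4.71%). Since both problem sizes hold the same number of elements, the differences in `bw %` between shapes come entirely from the time per call.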
