Description
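The baseline performance of the multidimensional array kernels is poor. On the 5-D problem size (50, 5, 5, 6, 5400), the fused and unfused kernels reach only about 4-11% of device bandwidth, versus about 25-62% for the same 40500000 elements laid out as a flat vector. Our CUDA benchmarks: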
N reads-writes: 8, N-reps: 1, Float_type = Float32, Device_bandwidth_GBs=732
┌───────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs │ time per call │ bw % │ achieved bw │ problem size │
├───────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_unfused! │ 35 milliseconds, 32 microseconds │ 4.70676 │ 34.4535 │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_fused! │ 37 milliseconds, 933 microseconds │ 4.34687 │ 31.8191 │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_unfused! │ 6 milliseconds, 464 microseconds │ 25.5066 │ 186.708 │ (40500000,) │
│ perf_kernel_shared_reads_fused! │ 3 milliseconds, 57 microseconds │ 53.9236 │ 394.721 │ (40500000,) │
└───────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
98.213180 seconds (41.61 M allocations: 5.144 GiB, 10.82% gc time, 36.05% compilation time: 1% of which was recompilation)
N reads-writes: 8, N-reps: 1, Float_type = Float32, Device_bandwidth_GBs=732
┌──────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs │ time per call │ bw % │ achieved bw │ problem size │
├──────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_writes_unfused! │ 24 milliseconds, 180 microseconds │ 6.81902 │ 49.9152 │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_fused! │ 15 milliseconds, 248 microseconds │ 10.8137 │ 79.156 │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_unfused! │ 3 milliseconds, 895 microseconds │ 42.325 │ 309.819 │ (40500000,) │
│ perf_kernel_shared_reads_writes_fused! │ 2 milliseconds, 673 microseconds │ 61.6844 │ 451.53 │ (40500000,) │
└──────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
56.390611 seconds (3.79 M allocations: 2.642 GiB, 15.67% gc time, 6.64% compilation time)
N reads-writes: 8, N-reps: 1, Float_type = Float32, Device_bandwidth_GBs=732
┌─────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs │ time per call │ bw % │ achieved bw │ problem size │
├─────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_unfused! │ 34 milliseconds, 948 microseconds │ 4.71814 │ 34.5368 │ (50, 5, 5, 6, 5400) │
│ perf_kernel_fused! │ 37 milliseconds, 955 microseconds │ 4.34428 │ 31.8001 │ (50, 5, 5, 6, 5400) │
│ perf_kernel_hard_coded! │ 3 milliseconds, 291 microseconds │ 50.1021 │ 366.747 │ (50, 5, 5, 6, 5400) │
│ perf_kernel_unfused! │ 6 milliseconds, 463 microseconds │ 25.5097 │ 186.731 │ (40500000,) │
│ perf_kernel_fused! │ 3 milliseconds, 58 microseconds │ 53.9181 │ 394.68 │ (40500000,) │
│ perf_kernel_hard_coded! │ 3 milliseconds, 339 microseconds │ 49.3735 │ 361.414 │ (40500000,) │
└─────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
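81.495402 seconds (3.83 M allocations: 2.629 GiB, 15.54% gc time, 5.22% compilation time)
Our multi-dimensional array fusion is not improving over the unfused kernels the way our vector fusion does. In the first and third tables the fused 5-D kernel is slightly slower than the unfused one (about 37.9 ms versus 35.0 ms), whereas the fused vector kernel is roughly 2x faster (about 3.06 ms versus 6.46 ms). perf_kernel_hard_coded! reaches about 50% of device bandwidth on the same 5-D shape, which suggests the shape itself is not the limiting factor. Could this be related to JuliaGPU/Metal.jl#101?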
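For anyone checking the tables by hand, here is a minimal Python sketch of how the "achieved bw" and "bw %" columns appear to be derived. It assumes bytes moved = number of elements × 4 bytes (Float32) × 8 reads-writes, divided by the time per call. The division by 2^30 is an assumption inferred from the printed numbers, not taken from the benchmark source, and the function names are hypothetical.

```python
GIB = 2**30  # assumption: the printed numbers fit a 2^30 divisor, not 10^9

def achieved_bandwidth(shape, n_reads_writes, bytes_per_element, seconds):
    """Bytes moved per second, in units of 2^30 bytes/s."""
    n_elements = 1
    for extent in shape:
        n_elements *= extent
    bytes_moved = n_elements * bytes_per_element * n_reads_writes
    return bytes_moved / seconds / GIB

def percent_of_peak(achieved, device_bandwidth=732.0):
    """Achieved bandwidth as a percentage of Device_bandwidth_GBs."""
    return 100.0 * achieved / device_bandwidth

# Times per call copied from the first table (converted to seconds).
rows = {
    "5-D unfused": ((50, 5, 5, 6, 5400), 35.032e-3),
    "5-D fused":   ((50, 5, 5, 6, 5400), 37.933e-3),
    "1-D unfused": ((40500000,), 6.464e-3),
    "1-D fused":   ((40500000,), 3.057e-3),
}

for name, (shape, seconds) in rows.items():
    bw = achieved_bandwidth(shape, n_reads_writes=8, bytes_per_element=4, seconds=seconds)
    print(f"{name}: achieved bw {bw:.3f}, bw % {percent_of_peak(bw):.3f}")

# Speedup of fused over unfused (values above 1 mean fusion helps).
print("5-D fused speedup:", rows["5-D unfused"][1] / rows["5-D fused"][1])
print("1-D fused speedup:", rows["1-D unfused"][1] / rows["1-D fused"][1])
```

Under these assumptions the computed values agree closely with the table, with small differences because the printed times are rounded to whole microseconds. The speedup ratios make the gap explicit: fusion roughly doubles throughput for the vector case but gives a ratio below 1 for the 5-D case.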