[Benchmark] Add all gather matmul benchmark #400

joydddd · 2025-07-30T06:10:39Z

Stacked PRs:

[Benchmark] Add all gather matmul benchmark

stack-info: PR: #400, branch: joydddd/stack/22

joydddd · 2025-07-30T06:13:40Z

shape	dtype	nccl	torch_symm_mem	helion	kraken	Best Backend
(256, 256, 256)	torch.bfloat16	41.156	431.319	119.549	111.359	nccl
(384, 384, 384)	torch.bfloat16	42.737	424.313	137.685	105.681	nccl
(512, 512, 512)	torch.bfloat16	51.284	952.367	147.106	180.390	nccl
(640, 640, 640)	torch.bfloat16	54.416	946.200	124.764	nan	nccl
(768, 768, 768)	torch.bfloat16	2373.010	467.570	127.340	nan	helion
(896, 896, 896)	torch.bfloat16	79.772	450.538	230.704	nan	nccl
(1024, 1024, 1024)	torch.bfloat16	100.879	628.553	144.646	161.596	nccl
(1152, 1152, 1152)	torch.bfloat16	122.012	628.340	164.184	nan	nccl
(1280, 1280, 1280)	torch.bfloat16	159.020	433.261	183.333	nan	nccl
(1408, 1408, 1408)	torch.bfloat16	194.298	433.295	196.417	nan	nccl
(1536, 1536, 1536)	torch.bfloat16	211.553	431.485	206.027	297.015	helion
(1664, 1664, 1664)	torch.bfloat16	251.526	427.265	581.858	nan	nccl
(1792, 1792, 1792)	torch.bfloat16	286.406	678.410	246.517	nan	helion
(1920, 1920, 1920)	torch.bfloat16	341.697	974.870	264.127	nan	helion
(2048, 2048, 2048)	torch.bfloat16	380.024	446.875	287.984	481.179	helion
(2176, 2176, 2176)	torch.bfloat16	445.310	477.962	333.809	nan	helion
(2304, 2304, 2304)	torch.bfloat16	496.317	457.464	363.813	nan	helion
(2432, 2432, 2432)	torch.bfloat16	597.861	460.951	397.363	nan	helion
(2560, 2560, 2560)	torch.bfloat16	655.093	489.344	430.963	804.186	helion
(2688, 2688, 2688)	torch.bfloat16	775.004	1021.574	1146.624	nan	nccl
(2816, 2816, 2816)	torch.bfloat16	839.021	562.788	691.636	nan	torch_symm_mem
(2944, 2944, 2944)	torch.bfloat16	973.901	625.908	649.289	nan	torch_symm_mem
(3072, 3072, 3072)	torch.bfloat16	1012.294	680.064	737.418	722.340	torch_symm_mem
(3200, 3200, 3200)	torch.bfloat16	1458.743	1001.001	927.776	nan	helion
(3328, 3328, 3328)	torch.bfloat16	1340.722	961.890	2509.657	nan	torch_symm_mem
(3456, 3456, 3456)	torch.bfloat16	2240.075	2239.796	1171.304	nan	helion
(3584, 3584, 3584)	torch.bfloat16	2181.338	2044.006	1729.832	1456.677	kraken
(3712, 3712, 3712)	torch.bfloat16	4645.679	3967.047	2498.389	nan	helion
(3840, 3840, 3840)	torch.bfloat16	1950.385	1464.106	1650.844	nan	torch_symm_mem
(3968, 3968, 3968)	torch.bfloat16	2937.086	2874.260	1747.495	nan	helion
(4096, 4096, 4096)	torch.bfloat16	2819.741	6313.106	1725.565	1867.402	helion
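
The Best Backend column is consistent with picking the smallest non-nan value in each row, which suggests the numbers are latencies (the output does not state units). A minimal sketch of that selection follows; `best_backend` is a hypothetical helper written for illustration, not the benchmark's own code.

```python
import math

BACKENDS = ("nccl", "torch_symm_mem", "helion", "kraken")

def best_backend(timings):
    """Return the backend with the smallest non-NaN timing."""
    valid = {name: t for name, t in timings.items() if not math.isnan(t)}
    return min(valid, key=valid.get)

# Two rows copied from the table above.
row_768 = dict(zip(BACKENDS, (2373.010, 467.570, 127.340, float("nan"))))
row_3584 = dict(zip(BACKENDS, (2181.338, 2044.006, 1729.832, 1456.677)))

print(best_backend(row_768))   # helion
print(best_backend(row_3584))  # kraken
```

Rows where kraken reports nan are simply compared across the remaining three backends.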

joydddd · 2025-07-30T22:00:04Z

Optimization implemented in Kraken but not supported in Helion:

(a, out) = ag_matmul(a_shared, b), where a = all_gather(a_shared) and out = a @ b.
For a tile of a that originates from the local a_shared, time can potentially be saved by reading it directly from a_shared and skipping the wait for cudaMemcpy.
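
To make the semantics concrete, here is a single-process sketch in plain Python. It is only a stand-in under stated assumptions: the real op runs across devices and each rank passes its own a_shared, whereas this version takes every rank's shard as a list; `all_gather`, `matmul`, and `M_PER_RANK` here are illustrative helpers, not the Kraken or Helion API.

```python
def all_gather(shards):
    """Concatenate per-rank row blocks in rank order (single-process stand-in)."""
    return [row for shard in shards for row in shard]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def ag_matmul(shards, b):
    """Return (a, out) with a = all_gather(shards) and out = a @ b."""
    a = all_gather(shards)
    return a, matmul(a, b)

# Two ranks, each owning M_PER_RANK = 2 rows of a (K = 2).
M_PER_RANK = 2
shards = [
    [[1, 2], [3, 4]],  # a_shared on rank 0
    [[5, 6], [7, 8]],  # a_shared on rank 1
]
b = [[1, 0], [0, 1]]   # identity, so out equals a

a, out = ag_matmul(shards, b)

# The rows of a owned by a rank are exactly that rank's a_shared,
# which is why they could be read locally without waiting for the copy.
rank = 1
local_rows = a[rank * M_PER_RANK:(rank + 1) * M_PER_RANK]
```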

Helion does not support conditionally computing the tile offset and conditionally selecting a different tensor_descriptor for tensor_descriptor.load, i.e.:

if xx:
    a_load_desc = a_shared_desc
    a_load_offset = a_shared_stride_0 * ....
else:
    a_load_desc = a_desc
    a_load_offset = a_stride_0 * ....
a_load_desc.load(a_load_offset)

The same access pattern can be implemented in Helion as:

if xx: 
    a_tile = a_shared[tile.index - RANK * M_per_RANK]
else: 
    a_tile = a[tile]

But this generates a separate tensor_descriptor load in each branch (two in total), which breaks Triton data prefetching.
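
The index arithmetic behind the branch can be checked in plain Python. This is a sketch under assumptions, not Helion or Triton code: `BLOCK_M`, `WORLD_SIZE`, and `load_tile` are hypothetical names, rows are modelled as list entries, and the predicate stands in for xx.

```python
M_PER_RANK = 4
WORLD_SIZE = 2
BLOCK_M = 2  # hypothetical tile height; divides M_PER_RANK in this sketch

def load_tile(a, a_shared, rank, tile_start):
    """Load BLOCK_M rows starting at global row tile_start."""
    is_local = rank * M_PER_RANK <= tile_start < (rank + 1) * M_PER_RANK
    if is_local:
        # Rebase the global row index into the local shard.
        start = tile_start - rank * M_PER_RANK
        return "a_shared", a_shared[start:start + BLOCK_M]
    return "a", a[tile_start:tile_start + BLOCK_M]

# Gathered matrix with one recognizable value per row, viewed from rank 1.
a = [[i] for i in range(M_PER_RANK * WORLD_SIZE)]
rank = 1
a_shared = a[rank * M_PER_RANK:(rank + 1) * M_PER_RANK]

sources = []
for tile_start in range(0, len(a), BLOCK_M):
    source, tile = load_tile(a, a_shared, rank, tile_start)
    # Both paths must yield the same rows as the gathered matrix.
    assert tile == a[tile_start:tile_start + BLOCK_M]
    sources.append(source)

print(sources)  # ['a', 'a', 'a_shared', 'a_shared']
```

The source switches partway through the loop, so the choice depends on the iteration rather than on anything fixed at compile time.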

jansel · 2025-08-06T04:56:21Z

If xx is a constexpr would that fix it too? Can it be in this case?

joydddd · 2025-08-08T16:08:56Z

> If xx is a constexpr would that fix it too? Can it be in this case?

Yep. If xx is a constexpr, the software pipelining should be fine.
In this case, however, xx depends on the loop iterator, and in most cases the loop is too large to unroll (via tl.static_range) :(

joydddd · 2025-08-17T23:55:08Z

@yf225 Another distributed benchmark PR.

joydddd added a commit that referenced this pull request Jul 30, 2025

[Benchmark] Add all gather matmul benchmark

d87a64a

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from 0513a58 to d87a64a Compare July 30, 2025 06:10

This was referenced Jul 30, 2025

Add stacked tensor #346

Merged

[BC breaking] Add StackTensor support to hl.signal & hl.wait (as_ptrs) #261

Merged

[Example] One shot all reduce #245

Merged

meta-cla bot added the CLA Signed label Jul 30, 2025

This was referenced Jul 30, 2025

[Pending multi device CI] Add symmetric memory sync test. #375

Draft

[Benchmark] Add all reduce benchmark #393

Open

joydddd changed the base branch from joydddd/stack/21 to main July 30, 2025 20:33

joydddd added a commit that referenced this pull request Jul 30, 2025

[Benchmark] Add all gather matmul benchmark

a9da45b

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from d87a64a to a9da45b Compare July 30, 2025 20:33

joydddd changed the base branch from main to joydddd/stack/21 July 30, 2025 20:33

joydddd changed the base branch from joydddd/stack/21 to main July 30, 2025 21:38

joydddd added a commit that referenced this pull request Jul 30, 2025

[Benchmark] Add all gather matmul benchmark

e482622

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from a9da45b to e482622 Compare July 30, 2025 21:38

joydddd changed the base branch from main to joydddd/stack/21 July 30, 2025 21:39

joydddd marked this pull request as ready for review August 4, 2025 17:27

joydddd requested review from drisspg, jansel, oulgen and yf225 August 4, 2025 17:28

joydddd changed the base branch from joydddd/stack/21 to main August 4, 2025 21:22

joydddd added a commit that referenced this pull request Aug 4, 2025

[Benchmark] Add all gather matmul benchmark

2e5a80e

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from e482622 to 2e5a80e Compare August 4, 2025 21:22

joydddd changed the base branch from main to joydddd/stack/21 August 4, 2025 21:23

joydddd changed the base branch from joydddd/stack/21 to main August 4, 2025 21:44

joydddd force-pushed the joydddd/stack/22 branch from 2e5a80e to 1fa69aa Compare August 4, 2025 21:44

joydddd added a commit that referenced this pull request Aug 4, 2025

[Benchmark] Add all gather matmul benchmark

1fa69aa

stack-info: PR: #400, branch: joydddd/stack/22

joydddd changed the base branch from main to joydddd/stack/21 August 5, 2025 20:45

joydddd force-pushed the joydddd/stack/21 branch from 4d1ff3b to 80dd2ea Compare August 5, 2025 20:47

joydddd added a commit that referenced this pull request Aug 5, 2025

[Benchmark] Add all gather matmul benchmark

cc373e2

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from 96aa4a7 to cc373e2 Compare August 5, 2025 20:47

joydddd changed the base branch from joydddd/stack/21 to main August 5, 2025 22:28

joydddd added a commit that referenced this pull request Aug 5, 2025

[Benchmark] Add all gather matmul benchmark

55cd2d8

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from cc373e2 to 55cd2d8 Compare August 5, 2025 22:28

joydddd changed the base branch from main to joydddd/stack/21 August 5, 2025 22:28

joydddd changed the base branch from joydddd/stack/21 to main August 5, 2025 22:36

joydddd added a commit that referenced this pull request Aug 5, 2025

[Benchmark] Add all gather matmul benchmark

5171d4b

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from 55cd2d8 to 5171d4b Compare August 5, 2025 22:36

joydddd changed the base branch from main to joydddd/stack/21 August 5, 2025 22:36

joydddd changed the base branch from joydddd/stack/21 to main August 7, 2025 23:06

joydddd force-pushed the joydddd/stack/22 branch from 5171d4b to 22858ee Compare August 7, 2025 23:06

joydddd added a commit that referenced this pull request Aug 7, 2025

[Benchmark] Add all gather matmul benchmark

22858ee

stack-info: PR: #400, branch: joydddd/stack/22

joydddd changed the base branch from main to joydddd/stack/21 August 7, 2025 23:06

joydddd force-pushed the joydddd/stack/21 branch from ec22ee1 to 644b641 Compare August 8, 2025 15:10

joydddd added a commit that referenced this pull request Aug 8, 2025

[Benchmark] Add all gather matmul benchmark

e0ab2e4

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from 22858ee to e0ab2e4 Compare August 8, 2025 15:10

joydddd changed the base branch from joydddd/stack/21 to main August 8, 2025 15:11

joydddd added a commit that referenced this pull request Aug 8, 2025

[Benchmark] Add all gather matmul benchmark

dfcd4ad

stack-info: PR: #400, branch: joydddd/stack/22

joydddd force-pushed the joydddd/stack/22 branch from e0ab2e4 to dfcd4ad Compare August 8, 2025 15:11

joydddd changed the base branch from main to joydddd/stack/21 August 8, 2025 15:11

[Benchmark] Add all gather matmul benchmark

ae9927f

stack-info: PR: #400, branch: joydddd/stack/22

joydddd changed the base branch from joydddd/stack/21 to main August 8, 2025 15:52

joydddd force-pushed the joydddd/stack/22 branch from dfcd4ad to ae9927f Compare August 8, 2025 15:52

joydddd changed the base branch from main to joydddd/stack/21 August 8, 2025 15:52
