Port necessary FBGEMM MoE kernels #30

Draft
Clydingus wants to merge 3 commits into wp-1.5 from yoink-fbgemm-kernels

Conversation

@Clydingus

Fixes #24

Ports the two kernels we need from FBGEMM, index_shuffling and scatter_add_dense_tokens, as Triton kernels. These should work on Windows even when fbgemm isn't installed, bringing Biome WP1.5 MoE models up to par.
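For reviewers unfamiliar with these ops, here's a minimal pure-Python sketch of the semantics the two kernels implement (the real versions are Triton kernels over GPU tensors; function signatures, argument names, and shapes here are illustrative assumptions, not the actual kernel interfaces):

```python
def index_shuffling(expert_ids, num_experts):
    """Group token indices by their assigned expert (counting-sort style).

    expert_ids: expert_ids[t] is the expert chosen for token t.
    Returns (token_order, expert_offsets): token_order lists token indices
    sorted by expert, and expert_offsets[e] is where expert e's tokens
    start in token_order (with a final entry equal to the token count).
    """
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets = [0] * (num_experts + 1)
    for e in range(num_experts):
        offsets[e + 1] = offsets[e] + counts[e]
    cursor = offsets[:-1].copy()
    token_order = [0] * len(expert_ids)
    for t, e in enumerate(expert_ids):
        token_order[cursor[e]] = t
        cursor[e] += 1
    return token_order, offsets


def scatter_add_dense_tokens(out, expert_out, token_order):
    """Accumulate per-expert outputs back into dense token order.

    expert_out[i] is the expert output for the token at token_order[i];
    it is added (not assigned) into out[token_order[i]], so top-k > 1
    contributions for the same token accumulate correctly.
    """
    for i, t in enumerate(token_order):
        out[t] += expert_out[i]
    return out
```

The point of doing both steps in dedicated kernels is that tokens routed to the same expert end up contiguous (so the expert GEMMs run on dense blocks), and the scatter-add undoes that permutation on the way out.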

Benchmarks

4090 benchmarks, stock FBGEMM:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name (time in ms)                                                                                                                                                                                                                           Median                    Max                   Mean              StdDev            max_vram_alloc  max_vram_reserved
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_img_decoder_only                                                                                                                                                                                                                       3.2922 (1.0)           3.3813 (1.0)           3.2957 (1.0)        0.0129 (1.0)                 271                  1
test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32,moe=True,moe_n_experts=8,moe_top_k=8,shared_frame_experts=False-256-True] | params=1,236,166,808 | active=1,236,166,808     11,353.9741 (>1000.0)  11,421.3518 (>1000.0)  11,358.1943 (>1000.0)   35.1215 (>1000.0)           11.21              12.72
test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32-256-True] | params=1,235,773,592 | active=1,235,773,592                                                                     18,521.6008 (>1000.0)  18,641.9367 (>1000.0)  18,502.2185 (>1000.0)  124.9938 (>1000.0)           11.21              12.72
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
========================================================================= short test summary info =========================================================================
FAILED tests/benchmark_moe.py::test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32,moe=True,moe_n_experts=16,moe_top_k=1,shared_frame_experts=False-256-True] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 23.52 GiB of which 3.69 MiB is free. Including non-PyTorch memory, this process has 23.42 GiB memory in use. Of the allocated memory 22.82 GiB is allocated by PyTorch, with 732.00 MiB allocated in private pools (e.g., CUDA Graphs), and 59.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
FAILED tests/benchmark_moe.py::test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32,moe=True,moe_n_experts=32,moe_top_k=4,shared_frame_experts=False-256-True] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.52 GiB of which 43.69 MiB is free. Including non-PyTorch memory, this process has 23.38 GiB memory in use. Of the allocated memory 22.76 GiB is allocated by PyTorch, with 732.00 MiB allocated in private pools (e.g., CUDA Graphs), and 80.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
========================================================== 2 failed, 3 passed, 13 warnings in 398.28s (0:06:38) ===========================================================

Triton-ported kernels:
