I've tried running the CogVideoX experiment on an A100 (80 GB). Using the methods from this repo, I observed no obvious improvement.
The results are as follows:
Will try to load from local cache.
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.32it/s]
Loading pipeline components...: 100%|██████████| 5/5 [00:04<00:00, 1.10it/s]
Prompt: A bright yellow water taxi glides smoothly across the choppy waters, creating gentle ripples in its wake. The iconic Brooklyn Bridge looms majestically in the background, its intricate web of cables and towering stone arches standing out against the city skyline. The boat, bustling with passengers, offers a lively contrast to the serene, expansive sky dotted with fluffy clouds. As it cruises forward, the vibrant cityscape of New York unfolds, with towering skyscrapers and historic buildings lining the waterfront, capturing the dynamic essence of urban life.
Image Is Ready. Seed is 0
20%|██ | 10/50 [02:40<10:43, 16.09s/it]AUTOTUNE flex_attention(2x48x45106x64, 2x48x45106x64, 2x48x45106x64, 2x48x45106, 1x96x353, 1x96x353x353, 1x96x353, 1x96x353x353)
triton_flex_attention_0 35.7949 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=False, OUTPUT_LOGSUMEXP=False, PRESCALE_QK=False, QK_HEAD_DIM=64, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, num_stages=3, num_warps=4
triton_flex_attention_1 36.1697 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=False, OUTPUT_LOGSUMEXP=False, PRESCALE_QK=False, QK_HEAD_DIM=64, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, num_stages=3, num_warps=4
triton_flex_attention_4 39.2038 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=False, OUTPUT_LOGSUMEXP=False, PRESCALE_QK=False, QK_HEAD_DIM=64, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, num_stages=3, num_warps=4
triton_flex_attention_2 41.6369 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=False, OUTPUT_LOGSUMEXP=False, PRESCALE_QK=False, QK_HEAD_DIM=64, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, num_stages=3, num_warps=4
triton_flex_attention_3 44.2808 ms 80.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=False, OUTPUT_LOGSUMEXP=False, PRESCALE_QK=False, QK_HEAD_DIM=64, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, num_stages=3, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 1.9273 seconds and 4.9549 seconds precompiling for 5 choices
100%|██████████| 50/50 [08:49<00:00, 10.58s/it]
Will try to load from local cache.
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.83it/s]
Loading pipeline components...: 100%|██████████| 5/5 [00:01<00:00, 2.53it/s]
Prompt: A bright yellow water taxi glides smoothly across the choppy waters, creating gentle ripples in its wake. The iconic Brooklyn Bridge looms majestically in the background, its intricate web of cables and towering stone arches standing out against the city skyline. The boat, bustling with passengers, offers a lively contrast to the serene, expansive sky dotted with fluffy clouds. As it cruises forward, the vibrant cityscape of New York unfolds, with towering skyscrapers and historic buildings lining the waterfront, capturing the dynamic essence of urban life.
Image Is Ready. Seed is 0
100%|██████████| 50/50 [08:40<00:00, 10.41s/it]
In fact, the inference time increased by 9 seconds (from 8:40 to 8:49).
Do you have any suggestions on how to reproduce the paper's experimental results on an A100?
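One thing I noticed in the log above is that the FlexAttention AUTOTUNE step (roughly 1.9 s of benchmarking plus 5 s of precompiling) fires partway through the timed 50-step loop, so it may account for most of the 9-second gap. In case it helps others debug, here is a minimal timing sketch I used to separate one-time compilation/autotune cost from steady-state inference; `timed_run` is just a hypothetical helper (for a real GPU run you would also call `torch.cuda.synchronize()` before and after the timed region):

```python
import time

def timed_run(fn, *args, warmup=1, **kwargs):
    """Time fn, excluding warm-up calls that trigger torch.compile
    compilation and FlexAttention autotuning from the measurement."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # first call pays the compile/autotune cost
    # On GPU, insert torch.cuda.synchronize() here before starting the clock.
    start = time.perf_counter()
    result = fn(*args, **kwargs)  # measured run only
    # ...and synchronize again here before stopping the clock.
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With a warm-up pass like this, the 8:49 run should drop the one-time autotune overhead and give a cleaner comparison against the 8:40 baseline.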