[Type] Mat: better cache locality for operator*(Mat) #5921

Open
fredroy wants to merge 3 commits into sofa-framework:master from fredroy:optim_mat_operator_mult
fredroy commented Feb 2, 2026

Changes the memory access pattern in operator*(Mat) for better cache locality (suggested by AI).

TL;DR:
  • the Mat<3,3> timings do not change, because that size has its own optimized specialization.
  • the bigger the matrices, the bigger the gain (Mat24x24 with float is about 4x faster on Ubuntu/gcc12: 2108 us before, 522 us after for the /512 case).
  • macOS has a weird quirk for Mat6x6 with double, which gets roughly 40-45% slower (11.4 us before, 16.3 us after for the /512 case) 🤔 possibly due to a failed vectorization.
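
The diff itself is not reproduced in this description. As a rough illustration of the kind of access-pattern change meant here, the sketch below contrasts the classic i-j-k loop order with an i-k-j order (loop interchange). It assumes row-major storage; `SimpleMat`, `multiplyIJK` and `multiplyIKJ` are hypothetical names made up for this example, not SOFA's `Mat` API, and the actual change in this PR may differ.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a fixed-size, row-major matrix (not SOFA's type::Mat).
template <std::size_t L, std::size_t C, typename Real>
struct SimpleMat
{
    std::array<std::array<Real, C>, L> rows{}; // all elements start at zero

    Real& operator()(std::size_t i, std::size_t j) { return rows[i][j]; }
    const Real& operator()(std::size_t i, std::size_t j) const { return rows[i][j]; }
};

// i-j-k order: the inner loop walks down a column of b, so consecutive
// reads of b are a full row (C elements) apart in memory.
template <std::size_t L, std::size_t K, std::size_t C, typename Real>
SimpleMat<L, C, Real> multiplyIJK(const SimpleMat<L, K, Real>& a,
                                  const SimpleMat<K, C, Real>& b)
{
    SimpleMat<L, C, Real> r{};
    for (std::size_t i = 0; i < L; ++i)
    {
        for (std::size_t j = 0; j < C; ++j)
        {
            Real sum = Real(0);
            for (std::size_t k = 0; k < K; ++k)
                sum += a(i, k) * b(k, j);
            r(i, j) = sum;
        }
    }
    return r;
}

// i-k-j order: the inner loop walks along a row of b and a row of r,
// so both are read and written contiguously.
template <std::size_t L, std::size_t K, std::size_t C, typename Real>
SimpleMat<L, C, Real> multiplyIKJ(const SimpleMat<L, K, Real>& a,
                                  const SimpleMat<K, C, Real>& b)
{
    SimpleMat<L, C, Real> r{}; // must start at zero: the loop below accumulates with +=
    for (std::size_t i = 0; i < L; ++i)
    {
        for (std::size_t k = 0; k < K; ++k)
        {
            const Real aik = a(i, k); // reused for the whole inner loop
            for (std::size_t j = 0; j < C; ++j)
                r(i, j) += aik * b(k, j);
        }
    }
    return r;
}
```

Both functions compute the same product; only the traversal order differs. The contiguous inner loop of the second version is also typically easier for a compiler to vectorize, although whether that happens depends on the compiler, the flags and the matrix size.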

Timings:
Ubuntu 22.04, gcc12, lto, O3

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.53 us         1.53 us       457258
BM_Matrix_typemat_matmult<float, 3>/1024          3.07 us         3.07 us       227524
BM_Matrix_typemat_matmult<float, 3>/2048          6.16 us         6.16 us       112806
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       402135
BM_Matrix_typemat_matmult<double, 3>/1024         3.49 us         3.48 us       201140
BM_Matrix_typemat_matmult<double, 3>/2048         6.99 us         6.99 us        99944
BM_Matrix_typemat_matmult<float, 6>/512           23.8 us         23.8 us        29239
BM_Matrix_typemat_matmult<float, 6>/1024          47.7 us         47.7 us        14642
BM_Matrix_typemat_matmult<float, 6>/2048          95.8 us         95.8 us         7241
BM_Matrix_typemat_matmult<double, 6>/512          24.4 us         24.4 us        28460
BM_Matrix_typemat_matmult<double, 6>/1024         49.0 us         49.0 us        14222
BM_Matrix_typemat_matmult<double, 6>/2048         98.3 us         98.3 us         7058
BM_Matrix_typemat_matmult<float, 24>/512          2108 us         2108 us          331
BM_Matrix_typemat_matmult<float, 24>/1024         4234 us         4234 us          165
BM_Matrix_typemat_matmult<float, 24>/2048         8458 us         8457 us           80
BM_Matrix_typemat_matmult<double, 24>/512         1878 us         1878 us          372
BM_Matrix_typemat_matmult<double, 24>/1024        3773 us         3773 us          185
BM_Matrix_typemat_matmult<double, 24>/2048        7741 us         7741 us           89

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.54 us         1.54 us       453879
BM_Matrix_typemat_matmult<float, 3>/1024          3.09 us         3.09 us       226329
BM_Matrix_typemat_matmult<float, 3>/2048          6.17 us         6.16 us       113432
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       403088
BM_Matrix_typemat_matmult<double, 3>/1024         3.46 us         3.46 us       202741
BM_Matrix_typemat_matmult<double, 3>/2048         6.91 us         6.91 us       100423
BM_Matrix_typemat_matmult<float, 6>/512           22.4 us         22.4 us        31211
BM_Matrix_typemat_matmult<float, 6>/1024          44.4 us         44.4 us        15589
BM_Matrix_typemat_matmult<float, 6>/2048          89.2 us         89.2 us         7770
BM_Matrix_typemat_matmult<double, 6>/512          22.7 us         22.7 us        30714
BM_Matrix_typemat_matmult<double, 6>/1024         45.6 us         45.6 us        15286
BM_Matrix_typemat_matmult<double, 6>/2048         91.9 us         91.9 us         7593
BM_Matrix_typemat_matmult<float, 24>/512           522 us          522 us         1338
BM_Matrix_typemat_matmult<float, 24>/1024         1039 us         1039 us          672
BM_Matrix_typemat_matmult<float, 24>/2048         2090 us         2090 us          334
BM_Matrix_typemat_matmult<double, 24>/512          963 us          963 us          725
BM_Matrix_typemat_matmult<double, 24>/1024        1925 us         1925 us          362
BM_Matrix_typemat_matmult<double, 24>/2048        3929 us         3929 us          179

after (revised)
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.54 us         1.54 us       456346
BM_Matrix_typemat_matmult<float, 3>/1024          3.08 us         3.08 us       227839
BM_Matrix_typemat_matmult<float, 3>/2048          6.17 us         6.17 us       112654
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       399904
BM_Matrix_typemat_matmult<double, 3>/1024         3.46 us         3.46 us       201315
BM_Matrix_typemat_matmult<double, 3>/2048         6.92 us         6.92 us       100507
BM_Matrix_typemat_matmult<float, 6>/512           22.4 us         22.3 us        31397
BM_Matrix_typemat_matmult<float, 6>/1024          44.5 us         44.5 us        15630
BM_Matrix_typemat_matmult<float, 6>/2048          89.1 us         89.1 us         7768
BM_Matrix_typemat_matmult<double, 6>/512          22.8 us         22.8 us        30601
BM_Matrix_typemat_matmult<double, 6>/1024         45.7 us         45.7 us        15298
BM_Matrix_typemat_matmult<double, 6>/2048         91.9 us         91.8 us         7563
BM_Matrix_typemat_matmult<float, 24>/512           519 us          519 us         1356
BM_Matrix_typemat_matmult<float, 24>/1024         1045 us         1045 us          667
BM_Matrix_typemat_matmult<float, 24>/2048         2083 us         2082 us          334
BM_Matrix_typemat_matmult<double, 24>/512          959 us          958 us          723
BM_Matrix_typemat_matmult<double, 24>/1024        1938 us         1936 us          361
BM_Matrix_typemat_matmult<double, 24>/2048        3931 us         3927 us          178

Windows VS2026, release, lto

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           22.7 us         22.0 us        29867
BM_Matrix_typemat_matmult<float, 3>/1024          44.6 us         45.5 us        15448
BM_Matrix_typemat_matmult<float, 3>/2048          88.6 us         90.0 us         7467
BM_Matrix_typemat_matmult<double, 3>/512          18.0 us         18.0 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024         36.2 us         36.8 us        18667
BM_Matrix_typemat_matmult<double, 3>/2048         72.3 us         71.5 us         8960
BM_Matrix_typemat_matmult<float, 6>/512            457 us          450 us         1493
BM_Matrix_typemat_matmult<float, 6>/1024           922 us          920 us          747
BM_Matrix_typemat_matmult<float, 6>/2048          1825 us         1843 us          407
BM_Matrix_typemat_matmult<double, 6>/512           415 us          414 us         1659
BM_Matrix_typemat_matmult<double, 6>/1024          822 us          816 us          747
BM_Matrix_typemat_matmult<double, 6>/2048         1664 us         1651 us          407
BM_Matrix_typemat_matmult<float, 24>/512          3469 us         3446 us          195
BM_Matrix_typemat_matmult<float, 24>/1024         7058 us         7115 us          112
BM_Matrix_typemat_matmult<float, 24>/2048        14486 us        14375 us           50
BM_Matrix_typemat_matmult<double, 24>/512         3543 us         3526 us          195
BM_Matrix_typemat_matmult<double, 24>/1024        7035 us         6836 us          112
BM_Matrix_typemat_matmult<double, 24>/2048       14557 us        14375 us           50

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           21.9 us         22.0 us        32000
BM_Matrix_typemat_matmult<float, 3>/1024          45.2 us         44.9 us        16000
BM_Matrix_typemat_matmult<float, 3>/2048          87.5 us         87.9 us         7467
BM_Matrix_typemat_matmult<double, 3>/512          18.1 us         18.0 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024         36.9 us         36.9 us        19478
BM_Matrix_typemat_matmult<double, 3>/2048         72.7 us         71.5 us         8960
BM_Matrix_typemat_matmult<float, 6>/512            319 us          321 us         2240
BM_Matrix_typemat_matmult<float, 6>/1024           635 us          628 us         1120
BM_Matrix_typemat_matmult<float, 6>/2048          1303 us         1311 us          560
BM_Matrix_typemat_matmult<double, 6>/512           322 us          321 us         2240
BM_Matrix_typemat_matmult<double, 6>/1024          645 us          642 us         1120
BM_Matrix_typemat_matmult<double, 6>/2048         1286 us         1283 us          560
BM_Matrix_typemat_matmult<float, 24>/512          1715 us         1728 us          407
BM_Matrix_typemat_matmult<float, 24>/1024         3351 us         3294 us          204
BM_Matrix_typemat_matmult<float, 24>/2048         6725 us         6771 us           90
BM_Matrix_typemat_matmult<double, 24>/512         1766 us         1766 us          407
BM_Matrix_typemat_matmult<double, 24>/1024        3460 us         3446 us          195
BM_Matrix_typemat_matmult<double, 24>/2048        7244 us         7292 us           90

after (revised)
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512          22.5 us         22.5 us        32000
BM_Matrix_typemat_matmult<float, 3>/1024         44.3 us         44.9 us        16000
BM_Matrix_typemat_matmult<float, 3>/2048         87.9 us         87.9 us         7467
BM_Matrix_typemat_matmult<double, 3>/512         18.3 us         18.4 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024        36.3 us         36.0 us        18667
BM_Matrix_typemat_matmult<double, 3>/2048        73.1 us         73.9 us        11200
BM_Matrix_typemat_matmult<float, 6>/512           322 us          322 us         2133
BM_Matrix_typemat_matmult<float, 6>/1024          645 us          645 us          896
BM_Matrix_typemat_matmult<float, 6>/2048         1304 us         1311 us          560
BM_Matrix_typemat_matmult<double, 6>/512          306 us          300 us         2240
BM_Matrix_typemat_matmult<double, 6>/1024         620 us          628 us         1120
BM_Matrix_typemat_matmult<double, 6>/2048        1247 us         1228 us          560
BM_Matrix_typemat_matmult<float, 24>/512         1674 us         1689 us          407
BM_Matrix_typemat_matmult<float, 24>/1024        3341 us         3374 us          213
BM_Matrix_typemat_matmult<float, 24>/2048        6723 us         6771 us           90
BM_Matrix_typemat_matmult<double, 24>/512        1752 us         1766 us          407
BM_Matrix_typemat_matmult<double, 24>/1024       3557 us         3526 us          195
BM_Matrix_typemat_matmult<double, 24>/2048       7238 us         7254 us          112

macOS, Xcode 26, lto

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.06 us         1.06 us       652973
BM_Matrix_typemat_matmult<float, 3>/1024          2.10 us         2.10 us       335371
BM_Matrix_typemat_matmult<float, 3>/2048          4.20 us         4.20 us       164335
BM_Matrix_typemat_matmult<double, 3>/512          1.14 us         1.14 us       615249
BM_Matrix_typemat_matmult<double, 3>/1024         2.30 us         2.29 us       312962
BM_Matrix_typemat_matmult<double, 3>/2048         4.54 us         4.54 us       151194
BM_Matrix_typemat_matmult<float, 6>/512           6.41 us         6.41 us       109319
BM_Matrix_typemat_matmult<float, 6>/1024          12.8 us         12.8 us        54908
BM_Matrix_typemat_matmult<float, 6>/2048          25.2 us         25.1 us        27832
BM_Matrix_typemat_matmult<double, 6>/512          11.4 us         11.4 us        60546
BM_Matrix_typemat_matmult<double, 6>/1024         22.6 us         22.6 us        30222
BM_Matrix_typemat_matmult<double, 6>/2048         44.5 us         44.5 us        15488
BM_Matrix_typemat_matmult<float, 24>/512           294 us          294 us         2388
BM_Matrix_typemat_matmult<float, 24>/1024          588 us          588 us         1185
BM_Matrix_typemat_matmult<float, 24>/2048         1177 us         1177 us          598
BM_Matrix_typemat_matmult<double, 24>/512          604 us          604 us         1167
BM_Matrix_typemat_matmult<double, 24>/1024        1201 us         1201 us          582
BM_Matrix_typemat_matmult<double, 24>/2048        2416 us         2416 us          291

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.06 us         1.06 us       657339
BM_Matrix_typemat_matmult<float, 3>/1024          2.14 us         2.14 us       332844
BM_Matrix_typemat_matmult<float, 3>/2048          4.27 us         4.27 us       164750
BM_Matrix_typemat_matmult<double, 3>/512          1.13 us         1.13 us       610176
BM_Matrix_typemat_matmult<double, 3>/1024         2.30 us         2.30 us       311717
BM_Matrix_typemat_matmult<double, 3>/2048         4.50 us         4.50 us       157442
BM_Matrix_typemat_matmult<float, 6>/512           5.94 us         5.94 us       119149
BM_Matrix_typemat_matmult<float, 6>/1024          11.7 us         11.7 us        58265
BM_Matrix_typemat_matmult<float, 6>/2048          23.6 us         23.6 us        29901
BM_Matrix_typemat_matmult<double, 6>/512          16.3 us         16.3 us        42924
BM_Matrix_typemat_matmult<double, 6>/1024         32.5 us         32.5 us        21619
BM_Matrix_typemat_matmult<double, 6>/2048         64.5 us         64.5 us        10772
BM_Matrix_typemat_matmult<float, 24>/512           215 us          215 us         3213
BM_Matrix_typemat_matmult<float, 24>/1024          433 us          433 us         1616
BM_Matrix_typemat_matmult<float, 24>/2048          865 us          865 us          808
BM_Matrix_typemat_matmult<double, 24>/512          400 us          400 us         1753
BM_Matrix_typemat_matmult<double, 24>/1024         799 us          799 us          871
BM_Matrix_typemat_matmult<double, 24>/2048        1596 us         1596 us          438

after (revised)
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512          1.04 us         1.04 us       676786
BM_Matrix_typemat_matmult<float, 3>/1024         2.08 us         2.08 us       334509
BM_Matrix_typemat_matmult<float, 3>/2048         4.19 us         4.19 us       166882
BM_Matrix_typemat_matmult<double, 3>/512         1.11 us         1.11 us       625134
BM_Matrix_typemat_matmult<double, 3>/1024        2.28 us         2.28 us       307440
BM_Matrix_typemat_matmult<double, 3>/2048        4.43 us         4.43 us       157761
BM_Matrix_typemat_matmult<float, 6>/512          5.81 us         5.81 us       119244
BM_Matrix_typemat_matmult<float, 6>/1024         11.6 us         11.6 us        60613
BM_Matrix_typemat_matmult<float, 6>/2048         23.1 us         23.1 us        30321
BM_Matrix_typemat_matmult<double, 6>/512         16.0 us         16.0 us        43933
BM_Matrix_typemat_matmult<double, 6>/1024        31.7 us         31.7 us        22104
BM_Matrix_typemat_matmult<double, 6>/2048        63.9 us         63.8 us        11088
BM_Matrix_typemat_matmult<float, 24>/512          215 us          215 us         3266
BM_Matrix_typemat_matmult<float, 24>/1024         431 us          431 us         1624
BM_Matrix_typemat_matmult<float, 24>/2048         863 us          863 us          809
BM_Matrix_typemat_matmult<double, 24>/512         400 us          400 us         1743
BM_Matrix_typemat_matmult<double, 24>/1024        800 us          799 us          843
BM_Matrix_typemat_matmult<double, 24>/2048       1594 us         1594 us          429
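
To read the tables above as ratios rather than raw timings, a small helper like the following can be used. This is only a convenience sketch: `speedup`, `timeChangePercent` and `printSummary` are hypothetical names, and the numbers are copied from the /512 rows of the "before" and "after" tables.

```cpp
#include <cstdio>

// Hypothetical helpers, only meant for reading the benchmark tables.

// How many times faster the "after" run is (values above 1 mean faster).
constexpr double speedup(double beforeUs, double afterUs)
{
    return beforeUs / afterUs;
}

// Relative change of the run time in percent (positive means slower).
constexpr double timeChangePercent(double beforeUs, double afterUs)
{
    return (afterUs - beforeUs) / beforeUs * 100.0;
}

// Prints a few ratios using timings taken from the /512 rows above.
void printSummary()
{
    std::printf("Ubuntu  float  24: x%.2f\n", speedup(2108.0, 522.0));
    std::printf("Windows float  24: x%.2f\n", speedup(3469.0, 1715.0));
    std::printf("macOS   float  24: x%.2f\n", speedup(294.0, 215.0));
    std::printf("macOS   double  6: %+.1f%% time\n", timeChangePercent(11.4, 16.3));
}
```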

By submitting this pull request, I acknowledge that
I have read, understood, and agree to the SOFA Developer Certificate of Origin (DCO).


Reviewers will merge this pull-request only if

  • it builds with SUCCESS for all platforms on the CI.
  • it does not generate new warnings.
  • it does not generate new unit test failures.
  • it does not generate new scene test failures.
  • it does not break API compatibility.
  • it is more than 1 week old (or has fast-merge label).

fredroy added the "pr: enhancement" (About a possible enhancement) and "pr: status to review" (To notify reviewers to review this pull-request) labels Feb 2, 2026
alxbilger added the "pr: ai-generated" label (Label notifying the reviewers that part or all of the PR has been generated with the help of an AI) Feb 3, 2026

alxbilger left a comment

You must initialize the result before calling the operator +=.

@fredroy fredroy force-pushed the optim_mat_operator_mult branch from 0ca315f to eb57d55 Compare February 4, 2026 01:11
@fredroy fredroy requested a review from alxbilger February 4, 2026 02:24
fredroy commented Feb 4, 2026

> You must initialize the result before calling the operator +=.

Done, and I re-ran the benchmarks (no change).
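
For context on the review remark, here is a minimal sketch of why the result has to be zeroed before accumulating. It is not the code from this PR: `Square` and `multiplyAccumulate` are hypothetical names, and stale values in `r` stand in for uninitialized memory (reading truly uninitialized values would be undefined behavior).

```cpp
#include <array>
#include <cstddef>

// Hypothetical N x N row-major matrix type used only for this illustration.
template <std::size_t N, typename Real>
using Square = std::array<std::array<Real, N>, N>;

// Accumulates a * b into r with operator +=.
// If clearFirst is false, whatever r already contains leaks into the result.
template <std::size_t N, typename Real>
void multiplyAccumulate(Square<N, Real>& r,
                        const Square<N, Real>& a,
                        const Square<N, Real>& b,
                        bool clearFirst)
{
    if (clearFirst)
    {
        for (auto& row : r)
            row.fill(Real(0)); // the initialization the review asks for
    }
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k)
            for (std::size_t j = 0; j < N; ++j)
                r[i][j] += a[i][k] * b[k][j];
}
```

With `clearFirst` set to false and `r` pre-filled with some value, every entry of the product is offset by that value, which is the kind of silent error a missing initialization would cause.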
