[RFC] TMS32025: Attempt to optimize the CPU execution #14479
The reason why you're not seeing significant gains is that the '025 is not the bottleneck on System 22, by a factor of over 25:1. Even being as favorable as possible to the numbers by running with [...]

Quite simply, you're optimizing the wrong thing if it's more performance you want on System 22, but I seem to recall your previous attempt at optimizing System 22 by focusing on its renderer code didn't go in because it changed various functional semantics and had many oversights.

On System 21, the playfield is a bit more even, with [...]

All numbers above are from AMD uProf, Time-Based Sampling, timer interval of 1 millisecond. Prop Cycle was used for namcos22, Starblade was used for namcos21 (as namcos21 only includes the Winning Run games; there's also namcos21_c67 and namcos21_de). All runs used [...]

I'm a bit shocked that you were even able to eke out a nearly 5% speed increase on System 21 with what's in this PR. Keep in mind that your "Net average" values are misleading, because it's not a speed increase of a hair over 12%, it's a speed increase of just under 5%: 771.92 / 735.68 - 1 is 0.04926 or so.

These aren't optimizations, but incorrect assumptions, more or less. They're predicated on the assumption that function prologues and epilogues are costly. They can be, that's true, which is why the compiler already inlines many of these functions. Modern compilers have pretty decent heuristics as to when to inline and when not to. The result of this sort of "inline all the things" attempt is that you're robbing Peter to pay Paul: you might now have fewer instructions associated with stack-frame manipulation, but only at the cost of instruction-cache pressure. And on modern systems, caching is key. If a CPU is having to punt out past L1, L2, or L3 all the way to system RAM, more cycles are wasted by a couple orders of magnitude than are gained by not having to run the instructions for the function prologue or function epilogue.
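The percentage arithmetic in the comment above can be checked with a minimal sketch. It assumes both readings are "higher is faster" figures from two benchmark runs; the function name `relative_speedup` is hypothetical and not part of MAME.

```cpp
// Relative speed-up between two benchmark readings where a higher number
// means a faster run. Returns a fraction: 0.05 means "5% faster".
// Hypothetical helper for illustration only; not part of MAME.
constexpr double relative_speedup(double before, double after)
{
    return after / before - 1.0;
}

// The figures quoted in the review: 771.92 / 735.68 - 1 is roughly 0.04926,
// i.e. just under 5%, not 12%.
static_assert(relative_speedup(735.68, 771.92) > 0.0492, "expected just under 5%");
static_assert(relative_speedup(735.68, 771.92) < 0.0493, "expected just under 5%");
```

Dividing the two totals, rather than averaging per-game percentages, keeps a single outlier from inflating the headline number.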
You're also clearly trying to flatten branches, but some of the branches that you're removing aren't going to be branches on ARM systems, like your phone, anyway. The removal of this bit: [...] probably felt like something worth doing, but on ARM, that's going to get compiled down to three branchless instructions, CMP, MOVEQ, MOVNE. Since all instructions on ARM can be conditional, these sorts of simple branches are inherently flattened to begin with.

I hate to say it, but the simple answer is that you're just plain not going down the right path here. You're focusing on the wrong thing to optimize, and for the most part, your optimizations are pessimizations that occasionally get lucky.
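To illustrate the kind of branch being discussed, here is a minimal sketch of a two-way assignment. The function `select_state` is a hypothetical stand-in, not the code the PR removed. Compilers commonly lower a pattern like this to a compare plus conditional move or select rather than a jump, but that depends on the target and optimization level and is not guaranteed.

```cpp
#include <cstdint>

// Hypothetical stand-in for a simple "pick one of two values" assignment;
// this is NOT the code removed by the PR.
constexpr std::uint8_t select_state(int line_level)
{
    // A compare feeding a choice between two constants. Optimizing compilers
    // often emit conditional moves/selects for this instead of a branch,
    // though the exact instructions vary by target and flags.
    return (line_level == 0) ? std::uint8_t{0} : std::uint8_t{1};
}
```

Inspecting the generated assembly for the actual target (for example with the compiler's `-S` option) is the only reliable way to confirm whether a given branch really exists in the binary.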
Hi @MooglyGuy, thanks for taking the time to put up a lengthy review. Yeah, you're right that "inlining as many functions as possible" is a bad approach and can cause pessimizations due to instruction-cache pressure, as you just outlined. However, the functions I'm inlining are just very small helper functions for common instruction operations (fetching data, setting overflow/carry flags...). In the majority of CPU devices they are written as macros, and when inlined they don't affect the size of the instruction functions much. If you look at the diff carefully, you will notice I just added the "force inline" compiler attribute (to force them to behave like their macro counterparts in other CPU devices). The "inline" keyword was already there in the original code; in fact, the original follows the very bad approach you just outlined above, inlining as much as possible (I think I'll remove the inline keyword for
I'm aware that they get compiled down to the very nice conditional moves. The problem is that another bottleneck still persists: you need to interact with the very slow external memory (on the order of hundreds of cycles) to write the variable, hence I took that write out. Besides, that variable was rarely used, since neither System 21 nor System 22 uses the hold pin.
I don't think it's fair to call my changes "pessimizations that occasionally get lucky", since I get a consistent >6% speed improvement with the namcos21 Winning Run games (what luck!). And yes, the speed improvement is almost negligible (1-2%) with namcos22 since the bottleneck isn't there, but I would take any improvement, albeit small, over none. HOWEVER, I won't take it when a single game (Victory Lap) seems to regress by over 10% in speed on my benchmarks for no plausible reason at all, hence I requested input about it. I'm very confident my changes wouldn't cause it to regress that much, especially when they improve the speed of the namcos21 games, so I need to resolve this paradoxical situation. Would it be better to test every other namcos22 game to see whether any other shows such a big performance regression?
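The "force inline on a tiny helper" idea described in this reply can be sketched as follows. Everything here is an assumption for illustration: the macro `SKETCH_FORCE_INLINE`, the `add_result` struct, and `add_with_overflow` are hypothetical names and do not reproduce the TMS32025 core or MAME's own attribute macros.

```cpp
#include <cstdint>

// Hypothetical portability macro (not MAME's): request inlining regardless of
// the compiler's usual heuristics where the toolchain supports it.
#if defined(_MSC_VER)
#define SKETCH_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
#define SKETCH_FORCE_INLINE inline
#endif

// Hypothetical result pair; the real core keeps its flags in device state.
struct add_result
{
    std::int32_t value;
    bool overflow;
};

// Signed 32-bit add that also reports overflow: the sort of few-instruction
// helper that other cores express as a macro.
SKETCH_FORCE_INLINE add_result add_with_overflow(std::int32_t a, std::int32_t b)
{
    // Add in unsigned arithmetic so wraparound is well defined.
    const std::uint32_t sum = static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b);
    // Conversion back to signed wraps on mainstream compilers (guaranteed since C++20).
    const std::int32_t value = static_cast<std::int32_t>(sum);
    // Overflow occurred if both operands differ in sign from the result.
    const bool overflow = ((a ^ value) & (b ^ value)) < 0;
    return add_result{value, overflow};
}
```

Whether forcing the inline actually helps is an empirical question: the attribute removes call overhead for the helper but grows every caller, so it is worth measuring both variants on the target hardware.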
This is an attempt of mine to optimize the tms3202x DSP emulation used by systems like namcos21, namcos22, etc., because these systems run like crap on my Android phone, and out of all the CPU devices used by these systems, this one frequently has the highest clock rate.
There is no structural change to the code besides touching the relevant hot spots that [seemingly] take up the largest chunk of time.
Benchmarks
All the benchmark numbers were obtained using the command-line option -bench 300 with the CI gcc-x64 build of MAME, before and after the changes were applied.
PS: these numbers might not be 100% accurate as I don't have a dedicated rig for testing (I'm poor :-( ). They were all taken on my home computer and may have been influenced by constant thermal throttling while the benchmarks were running.
Namcos21:
Namcos22:
Compared to namcos21, the namcos22 improvements were not as impressive, reaching 2% at most.
In fact, one particular game (Victory Lap) took a big performance regression of an astounding 10%. I really don't know what in my code would cause such a big regression for this particular game, which is why I'm tagging this PR as [RFC]: I need some guidance from you guys.