@FlykeSpice
Contributor

@FlykeSpice FlykeSpice commented Nov 4, 2025

This is my attempt to optimize the tms3202x DSP emulation used by systems like namcos21, namcos22, etc., because these systems run like crap on my Android phone, and out of all the CPU devices these systems use, this one is frequently the one clocked highest.

There are no structural changes to the code; I only touched the relevant hot spots that [seemingly] take up the biggest chunk of time.

Benchmarks

All the benchmark numbers were obtained with the -bench 300 option, using the CI gcc-x64 build of MAME before and after applying the changes.

PS: these numbers might not be 100% accurate, as I don't have a dedicated rig for testing (I'm poor :-( ). They were all taken on my home computer and may have been skewed by constant thermal throttling while the benchmarks were running.

Namcos21:

Benchmarking winrun
239.150000% -> 252.420000% (13.270000%)
246.060000% -> 259.980000% (13.920000%)
250.470000% -> 259.520000% (9.050000%)
Net average: 12.079999999999993

Benchmarking winrungp
244.400000% -> 251.000000% (6.600000%)
244.890000% -> 251.140000% (6.250000%)
244.670000% -> 252.770000% (8.100000%)
Net average: 6.983333333333339

Benchmarking winrun91
258.590000% -> 264.190000% (5.600000%)
255.600000% -> 268.820000% (13.220000%)
255.970000% -> 263.610000% (7.640000%)
Net average: 8.820000000000013

Namcos22:

Benchmarking ridgeracb
168.840000% -> 170.550000% (1.710000%)
174.290000% -> 176.420000% (2.130000%)
175.980000% -> 176.400000% (0.420000%)
Net average: 1.4200000000000064

Benchmarking acedrive
137.660000% -> 137.990000% (0.330000%)
140.030000% -> 138.250000% (-1.780000%)
139.540000% -> 138.920000% (-0.620000%)
Net average: -0.6899999999999977

Benchmarking alpinerd
104.150000% -> 107.260000% (3.110000%)
104.750000% -> 105.740000% (0.990000%)
104.160000% -> 107.870000% (3.710000%)
Net average: 2.603333333333334

Benchmarking raverace
139.190000% -> 139.850000% (0.660000%)
139.090000% -> 139.870000% (0.780000%)
137.790000% -> 139.410000% (1.620000%)
Net average: 1.0200000000000007

Benchmarking cybrcomm
153.310000% -> 156.470000% (3.160000%)
156.580000% -> 157.070000% (0.490000%)
156.760000% -> 157.400000% (0.640000%)
Net average: 1.4299999999999973

Benchmarking victlapa
258.650000% -> 247.350000% (-11.300000%)
255.440000% -> 246.320000% (-9.120000%)
258.250000% -> 248.060000% (-10.190000%)
Net average: -10.203333333333328

Compared to namcos21, the namcos22 improvements were far less pronounced, reaching only around 2% at best.

In fact, one particular game (Victory Lap) took a big performance hit, an astounding 10%. I really don't know what in my code would cause such a large regression for this particular game, which is why I'm tagging this PR as [RFC]: I need some guidance from you guys.

@MooglyGuy
Contributor

MooglyGuy commented Nov 5, 2025

The reason why you're not seeing significant gains is that the '025 is not the bottleneck on System 22 by a factor of over 25:1. Even when being as favorable as possible to the numbers by running with -numprocessors 1 to hobble the multithreaded renderer, that ratio only drops to 17:1.

Quite simply, you're optimizing the wrong thing if it's more performance you want on System 22, but I seem to recall your previous attempt at optimizing System 22 by focusing on its renderer code didn't go in because it changed various functional semantics and had many oversights.

On System 21, the playfield is a bit more even, with tms3202x_device::execute_run occupying 10.41 seconds of a 57.1-second profiling run. Summing up all of the '025-related functions that show up in the timing metrics, the number I get is 25.67, not even amounting to half of the total execution time. Heck, 6.42 seconds of the profile run was spent in namco_c355spr_device::copybitmap.
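To put those System 21 profile figures in perspective, here is a minimal sketch that turns them into shares of the run. It also applies Amdahl's law, which is brought in purely as an illustration and is not part of the profiler output. The function names are hypothetical; the only inputs are the 10.41, 25.67 and 57.1 second figures quoted above.

```cpp
// Fraction of the profiled run spent in one component.
constexpr double share(double component_seconds, double total_seconds)
{
	return component_seconds / total_seconds;
}

// Amdahl's law: overall speedup when a fraction `p` of the run is made
// `factor` times faster and the rest is left untouched.
constexpr double amdahl_speedup(double p, double factor)
{
	return 1.0 / ((1.0 - p) + p / factor);
}

// Figures from the Starblade profiling run described above.
constexpr double total_seconds = 57.1;
constexpr double execute_run_seconds = 10.41; // tms3202x_device::execute_run
constexpr double all_025_seconds = 25.67;     // all '025-related functions

// Roughly 18% and 45% of the run, respectively.
constexpr double execute_run_share = share(execute_run_seconds, total_seconds);
constexpr double all_025_share = share(all_025_seconds, total_seconds);

// Hypothetical scenario: a '025 core that is twice as fast still leaves
// the other ~55% of the run unchanged, so the overall gain is about 1.29x.
constexpr double overall_if_025_doubled = amdahl_speedup(all_025_share, 2.0);
```

Under these assumptions, even an unrealistically large improvement to the DSP core is capped by everything else in the frame, which is the point being made about where the time actually goes.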

All numbers above are from AMD uProf, Time-Based Sampling, timer interval of 1 millisecond. Prop Cycle was used for namcos22, Starblade was used for namcos21 (as namcos21 only includes the Winning Run games; there's also namcos21_c67 and namcos21_de). All runs used -window -nodebug -nolog -sound none -video d3d -nohlsl -noafs -nothrottle -str 90 for application options, and were running on an AMD Ryzen 9 3950X at 4.1GHz.

I'm a bit shocked that you were even able to eke out a nearly 5% speed increase on System 21 with what's in this PR. Keep in mind that your "Net average" values are misleading, because it's not a speed increase of a hair over 12%, it's a speed increase of just under 5%: 771.92 / 735.68 - 1 is 0.04926 or so.
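The difference between the two figures can be sketched as follows, using the three winrun runs from the PR description. The type and function names are hypothetical; the point is only that a mean difference in percentage points is not the same thing as a relative speedup.

```cpp
#include <array>
#include <cstddef>

// Emulation speed (percent of real time) reported by each -bench run.
using runs = std::array<double, 3>;

constexpr double sum(const runs &r)
{
	double total = 0.0;
	for (std::size_t i = 0; i < r.size(); i++)
		total += r[i];
	return total;
}

// Mean difference in percentage points: what "Net average" reports.
constexpr double point_gain(const runs &before, const runs &after)
{
	return (sum(after) - sum(before)) / double(before.size());
}

// Relative speedup: how much faster the emulation actually runs.
constexpr double relative_gain(const runs &before, const runs &after)
{
	return sum(after) / sum(before) - 1.0;
}

// winrun figures from the benchmark list in the PR description.
constexpr runs winrun_before{239.15, 246.06, 250.47};
constexpr runs winrun_after{252.42, 259.98, 259.52};

// About 12.08 percentage points, but only about 4.9% faster.
constexpr double winrun_points = point_gain(winrun_before, winrun_after);
constexpr double winrun_relative = relative_gain(winrun_before, winrun_after);
```

By the same measure, the victlapa runs work out to roughly a 4% slowdown rather than 10%.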

These aren't optimizations, but incorrect assumptions, more or less.

They're predicated on the assumption that function prologues and epilogues are costly. They can be, that's true - which is why the compiler already inlines many of these functions. Modern compilers have pretty decent heuristics as to when to inline and when not to.

The result of this sort of "inline all the things" attempt is that you're robbing Peter to pay Paul: You might now have fewer instructions associated with stack-frame manipulation, but only at the cost of instruction-cache pressure. And on modern systems, caching is key. If a CPU is having to punt out past L1, L2, or L3 all the way to system RAM, more cycles are wasted by a couple orders of magnitude than are gained by not having to run the instructions for the function prologue or function epilogue.

You're also clearly trying to flatten branches, but some of the branches that you're removing aren't going to be branches on ARM systems - like your phone - anyway. The removal of this bit:

	{
		m_external_mem_access = 1;  /* Pause if hold pin is active */
	}
	else
	{
		m_external_mem_access = 0;
	}

probably felt like something worth doing, but on ARM, that's going to get compiled down to three branchless instructions, CMP, MOVEQ, MOVNE. Since all instructions on ARM can be conditional, these sorts of simple branches are inherently flattened to begin with.
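A minimal sketch of why this kind of if/else is cheap: it is just the comparison result stored as a value. The function names are hypothetical, and `hold_active` is a stand-in for the condition, which is not shown in the excerpt above.

```cpp
#include <cstdint>

// Branchy form, mirroring the shape of the code quoted above.
// `hold_active` is a hypothetical stand-in for the elided condition.
constexpr std::uint8_t external_access_branchy(bool hold_active)
{
	if (hold_active)
		return 1; // pause if hold pin is active
	else
		return 0;
}

// Flattened form: the comparison result is the value itself.
constexpr std::uint8_t external_access_flat(bool hold_active)
{
	return hold_active ? 1 : 0;
}
```

An optimizing compiler will usually lower both forms to the same compare-and-select sequence, so rewriting one into the other by hand is unlikely to change the generated code. Comparing the disassembly of the two is the reliable way to confirm that on a given target.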

I hate to say it, but the simple answer is that you're just plain not going down the right path here. You're focusing on the wrong thing to optimize, and for the most part, your optimizations are pessimizations that occasionally get lucky.

@FlykeSpice
Contributor Author

Hi @MooglyGuy, thanks for taking the time to put up a lengthy review.

Yeah, you're right that "inlining as many functions as possible" is a bad approach and can cause pessimizations due to instruction-cache pressure, as you just outlined. However, the functions I'm inlining are just very small helper functions for common instruction operations (fetching data, setting overflow/carry flags...). In the majority of CPU devices they are written as macros, and when inlined they barely affect the size of the instruction functions.

If you look at the diff carefully, you will notice I only added the "force inline" compiler attribute (to force them to behave like their macro counterparts in other CPU devices). The "inline" keyword was already there in the original code; in fact, the original code follows the very approach you just criticized, inlining as much as possible. (I think I'll remove the inline keyword from process_timer and process_IRQs.)
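For reference, the kind of change being described looks roughly like the sketch below. This is not taken from the diff: the `FORCE_INLINE` macro name and the `add_overflows` helper are hypothetical, and only the attribute spellings (`always_inline` on GCC/Clang, `__forceinline` on MSVC) are the real compiler ones.

```cpp
#include <cstdint>

// Hypothetical macro name; the attribute spellings are the compilers' own.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Hypothetical flag helper, small enough that a call would cost more
// than the body itself, which is the argument for forcing the inline.
FORCE_INLINE bool add_overflows(std::int32_t a, std::int32_t b)
{
	// Signed overflow happened if both operands share a sign that
	// differs from the sign of the wrapped 32-bit sum.
	const std::uint32_t ua = std::uint32_t(a);
	const std::uint32_t ub = std::uint32_t(b);
	const std::uint32_t sum = ua + ub;
	return ((~(ua ^ ub)) & (ua ^ sum) & 0x80000000u) != 0;
}
```

The attribute only guarantees that the body is expanded at each call site; whether that is faster still has to be measured, since every expansion adds to the size of the calling function.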

> You're also clearly trying to flatten branches, but some of the branches that you're removing aren't going to be branches on ARM systems - like your phone - anyway. The removal of this bit:
>
> 	{
> 		m_external_mem_access = 1;  /* Pause if hold pin is active */
> 	}
> 	else
> 	{
> 		m_external_mem_access = 0;
> 	}
>
> probably felt like something worth doing, but on ARM, that's going to get compiled down to three branchless instructions, CMP, MOVEQ, MOVNE. Since all instructions on ARM can be conditional, these sorts of simple branches are inherently flattened to begin with.

I'm aware that they get compiled down to those very nice conditional moves. The problem is that another bottleneck still persists: you need to go out to very slow external memory (on the order of hundreds of cycles) to write the variable, hence I took that out. Besides, that variable was rarely used, since neither System 21 nor System 22 uses the hold pin.

> I hate to say it, but the simple answer is that you're just plain not going down the right path here. You're focusing on the wrong thing to optimize, and for the most part, your optimizations are pessimizations that occasionally get lucky.

I don't think it's fair to call my changes "pessimizations that occasionally get lucky", since I get a consistent >6% speed improvement with the namcos21 Winning Run games (what luck!). And yes, the speed improvement is almost negligible (1-2%) with namcos22 since the bottleneck isn't there, but I would take any improvement, albeit small, over none.

HOWEVER, I can't accept a single game (Victory Lap) regressing by over 10% in speed on my benchmarks for no plausible reason, hence my request for input. I'm very confident my changes wouldn't cause it to regress that much, especially when they improve the speed of the namcos21 games. I need to resolve this paradox.

Would it be better to test all the other namcos22 games to see whether any of them show a similarly large performance regression?

@FlykeSpice FlykeSpice marked this pull request as draft November 6, 2025 03:25