runtime: add runtime.Yield() #8
base: cockroach-go1.23.12
Conversation
I'm good with experimenting with this Go runtime enhancement. I definitely want to see experimental evidence of the benefit.
PS: Should probably update the print in schedtrace to include sched.bgqsize.
src/runtime/proc.go (outdated):

    if gp.lockedm != 0 || gp.m.lockedg != 0 || gp.m.locks > 0 {
        return
    }
    if sched.runqsize > 0 || (gp.m.p != 0 && !runqempty(gp.m.p.ptr())) {
Don't you need to hold sched.lock when checking sched.runqsize and sched.bgqsize? Hmm, it looks like runqsize is sometimes checked without the lock and sometimes with, though it is always modified with the lock held. Seems like the runtime authors just assume this is understood. Worth dropping a comment here about the safety.
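Something along these lines is the flavor of comment being suggested here (wording is illustrative, not from the patch):

    // Racy read: sched.runqsize and sched.bgqsize are only written with
    // sched.lock held, but, as in other scheduler fast paths, they are read
    // here without it. A stale value merely delays or skips a yield; any
    // decision that must be correct is re-checked under sched.lock.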
Yeah, I had a similar question when I was reading findRunnable and noticed reads of runqsize without holding the lock and without atomics, and I specifically asked ChatGPT to look for all such cases and outline what's going on. My take-away was that these reads are done where the compiler isn't going to hoist the load entirely out of a loop or something, so it will actually be a load that's just at the mercy of cache-coherence delays, and that slightly stale read is good enough for the sort of "first pass" decisions it is used for, with critical decisions then falling back to a locked read like the one at the bottom of findRunnable if the unlocked reads didn't find anything. Yielding seems like just such a case, where it is more important that we be cheap than perfect, as we'll just yield a few checks later once our cache updates.
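To make that pattern concrete, here is a toy analogue (a stand-in workQueue type, not scheduler code): a cheap, possibly stale check skips the lock in the common case and is backstopped by an authoritative re-check under the lock. The runtime does the fast check with plain loads of runqsize; an atomic is used here only so the sketch stays race-detector-clean.

    package sketch

    import (
        "sync"
        "sync/atomic"
    )

    type workQueue struct {
        mu    sync.Mutex
        size  atomic.Int32 // written with mu held; read without it in the fast path
        items []func()
    }

    func (q *workQueue) tryTake() (func(), bool) {
        if q.size.Load() == 0 { // fast check: may be stale, never authoritative
            return nil, false
        }
        q.mu.Lock()
        defer q.mu.Unlock()
        if q.size.Load() == 0 { // re-check under the lock before committing
            return nil, false
        }
        item := q.items[0]
        q.items = q.items[1:]
        q.size.Add(-1)
        return item, true
    }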
(force-pushed 84fef0d to ec86954)
src/runtime/runtime2.go (outdated):

    runqsize int32
    // Global background-yield queue: goroutines that voluntarily yielded
    // while the scheduler was busy. Does NOT contribute to runqsize.
    bgq gQueue
(just clarifying) So there is no per-P run queue for these (like p.runq) because:
- We expect the number of such background goroutines to be few?
- Even if they are not few, they are only run when the foreground goroutines are not runnable, which should be rarer (when P utilization is high; and if it is low, this goroutine wouldn't have had to yield), so grabbing them from the global queue is ok wrt concurrent performance?
Yeah, that was my thinking: the important thing is getting out of the way under saturation, and we are willing to take a small hit of going to/from a global queue to do so (since we also don't expect to remain in saturation for long, particularly once the majority of some spike in runnable work has been parked on bgq to be slowly retrieved as capacity allows).
That said, this comment got me wondering if I could do better: I really don't want to make new global and new local shadow queues and duplicate everything, reimplement stealing, etc. That feels like an unpleasant patch to keep on our fork if upstream won't take this (not optimistic). But maybe we don't need to go park on bgq after all if our local runq has work: getting behind that work is just as good as going to the global bgq insofar as it gets out of the way of some work. I revised the patch to do just that: jump to the back of the local run queue (no bgq involved) if we see there is work in the local runq, and only go park on bgq if we see an overload signal and don't see local work to let by.
To avoid bg -> bg thrashing I conditioned the local hop on having run for 100us; otherwise I go to the bgq.
This (or the change in saturation signal) appears to have given us a bit of a perf boost; previously we were close to, but slightly below, the end-to-end throughput of CRDB master on concurrent, heavily CPU-contended IMPORT jobs, while getting p99 sched latency of <1ms vs 200ms+. With the changes I made here -- local yields when possible + a dedicated saturation signal -- we're still seeing <1ms p99 sched latency (~0.3ms), but end-to-end throughput of the IMPORTs actually went up by 10%+, presumably because parking on bgq avoids the runtime's 10ms preemptions or something (though that seems like an outsized effect).
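To summarize the revised decision in code, an illustrative sketch only (stand-in names like overloaded, localRunqHasWork, and runningFor; not the actual patch): prefer a cheap hop behind local work, fall back to the global bgq when overloaded, and avoid bg -> bg thrash by only hopping locally after having run for a while.

    package sketch

    import "time"

    const minRunBeforeLocalHop = 100 * time.Microsecond // threshold discussed above

    type yieldAction int

    const (
        keepRunning  yieldAction = iota // nothing waiting: yield is a no-op
        hopLocalRunq                    // go to the back of this P's local runq
        parkOnBgq                       // park on the global background queue
    )

    func decideYield(overloaded, localRunqHasWork bool, runningFor time.Duration) yieldAction {
        if !overloaded && !localRunqHasWork {
            return keepRunning
        }
        if localRunqHasWork && runningFor >= minRunBeforeLocalHop {
            return hopLocalRunq
        }
        return parkOnBgq
    }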
src/runtime/proc.go (outdated):

    // Fast path: tiny, inlineable checks only.

    // Check if we need to yield at all and early exit fast if not.
    if sched.npidle.Load() > 0 {
Given that we already have a global atomic for npidle, should we consider adding one for npWithNonEmptyRunQ? That would eliminate the unfairness from my previous comment.
What do you mean here?
We are deciding to add this goroutine to bgq when there is no idle P and:
- The global runq is non-empty, or
- This P's runq is non-empty

Both of the above could be false, but some other P could have a non-empty runq (which, if this P became idle, it would steal from). This is not desirable. We can fix this by having a schedt.npWithNonEmptyRunQ atomic: each P, when it transitions from runq empty to non-empty, would increment this atomic, and decrement on the reverse transition. The second bullet above would change to npWithNonEmptyRunQ > 0.
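A toy stand-in for that idea (not runtime code): a queue wrapper that bumps a shared counter only on the empty <-> non-empty transitions, so readers get a cheap atomic signal that some queue has work. In the runtime each runq is owned by a single P, so only the counter itself would be shared.

    package sketch

    import "sync/atomic"

    var npWithNonEmptyRunQ atomic.Int64 // analogous to the proposed schedt field

    type countedQueue struct {
        items []int // placeholder for the real queue contents
    }

    func (q *countedQueue) push(v int) {
        if len(q.items) == 0 {
            npWithNonEmptyRunQ.Add(1) // empty -> non-empty
        }
        q.items = append(q.items, v)
    }

    func (q *countedQueue) pop() (int, bool) {
        if len(q.items) == 0 {
            return 0, false
        }
        v := q.items[0]
        q.items = q.items[1:]
        if len(q.items) == 0 {
            npWithNonEmptyRunQ.Add(-1) // non-empty -> empty
        }
        return v, true
    }

    // someQueueHasWork is what the second bullet's check would become.
    func someQueueHasWork() bool { return npWithNonEmptyRunQ.Load() > 0 }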
Yeah, I figured if the global runq is empty and this G has nothing in its local runq, that's cheap to check. I was reluctant initially to add a new atomic or anything that needs to be maintained in any non-background paths, in case this is a patch we have to carry on our fork, so I tried to stick to just what we already have: npidle, the global queue, and the local queue, but not other Ps' queues.
This has me thinking: if we were willing to leave a little utilization on the table, we could just say npidle < 1 is our signal. Then we can infer that all the runqs, global and local, are empty (or could be if they wanted to be) since we're leaving a whole P idle, and that possibly has even better latency characteristics than waiting for a runq to be non-empty to jump out of the way at the last minute. But again, at the cost of leaving a whole P on the table. Maybe that's ok (still better utilization than today for maxprocs>=4).
I guess another option is to make a num runnable atomic (if we have to keep it on a fork it isn’t that hard to grep for the casStatus(runnable) calls) and stop looking at either npidle or runqsize?
I added a new ngqueued atomic that is bumped when an attempt to schedule fails to find an idle P (e.g. in wakep) and is reset to 0 when findRunnable doesn't find anything. Then, after I did that, I told myself a story about how, if we have nothing in our local runq or the global runq, maybe we shouldn't park on bgq anyway just to free up a P to steal from, since we'd prefer something like a yield at the head of the busy Ps instead of having bg work hop between Ps, and then compromised with a heuristic based on time spent running.
The heuristic performs pretty well in tests.
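Restated as a toy sketch (stand-in names, not the patch itself), the saturation hint works roughly like this:

    package sketch

    import "sync/atomic"

    // queuedWithoutIdleP plays the role of the ngqueued atomic described above.
    var queuedWithoutIdleP atomic.Int32

    // Called when a wakeup can't find an idle P to hand new work to (wakep path).
    func onWakeupFoundNoIdleP() { queuedWithoutIdleP.Add(1) }

    // Called when the scheduler finds nothing left to run (findRunnable path).
    func onNothingRunnable() { queuedWithoutIdleP.Store(0) }

    // schedulerLooksSaturated is the kind of overload signal a yield would consult.
    func schedulerLooksSaturated() bool { return queuedWithoutIdleP.Load() > 0 }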
This has me thinking: if we were willing to leave a little utilization on the table, we could just say npidle < 1 is our signal.
That seems ok to me.
I haven't read your revisions yet.
src/runtime/proc.go (outdated):

    // Yielded goroutines were runnable but voluntarily deprioritized themselves
    // to waiting instead by calling BackgroundYield. If we have nothing runnable
    // we can bring a yielded goroutine back to runnable and run it.
    if sched.bgqsize != 0 {
Does the Go compiler know something special about certain fields and make their unsynchronized reads "less stale"?
Not that I can find. The runtime, and in particular the scheduler, uses these a-bit-stale reads liberally (though in important places like findRunnable there are locked reads backstopping them after the fast paths are exhausted). I was, uh, curious about this too. As far as I can tell -- and what ChatGPT tells me as well -- they're just good old unlocked, un-synced reads. They're uint32, so there's no worry about tearing a write (on 32-bit); staleness is the only concern. They aren't in loops out of which the load might be lifted by the compiler, so it'll be an actual load, just at the mercy of MESI or whatever, I guess. Though I did my initial testing with a cruder version of this patch, before I cleaned it up for a PR, and one of my cleanups was to split it up to ensure the cheap checks could be inlined; I wonder if that was a mistake, since if it is inlined into a for loop -- the place we expect this to be called -- maybe the load does get lifted and then never sees new values? I guess I should re-test with the optimized version.
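For what it's worth, the inlining question can be checked directly. A sketch of the fast/slow split being described (names made up, not the patch's): build with -gcflags=-m to see the compiler's inlining decisions, and -gcflags=-S to inspect whether the cheap load actually survives inside a caller's loop.

    package sketch

    var yieldNeeded uint32 // stand-in for a sched-global hint like npidle/bgqsize

    // yieldFast is meant to be small enough to inline into callers' loops
    // (verify with: go build -gcflags=-m ./...).
    func yieldFast() {
        if yieldNeeded == 0 { // deliberately plain load, mirroring runtime style
            return // common case: scheduler not busy, do nothing
        }
        yieldSlow()
    }

    //go:noinline
    func yieldSlow() {
        // the expensive part: locking, requeueing, parking, etc.
    }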
src/runtime/backgroundyield_test.go (outdated):

    defer runtime.GOMAXPROCS(orig)

    runtime.GOMAXPROCS(target)
    runtime.Gosched()
what's this for?
src/runtime/export_test.go (outdated):

        }
    }

    // RunBackgroundYieldQueueCheck exercises the background queue enqueue/dequeue
I'm confused by this method. What does it exercise? What does success mean? Why does it return skipped==true if there's something in the bgq? Why is this only used in a test that doesn't seem to do anything?
Can't use testing.T or testing at all in the runtime package, only in the external runtime_test test package. So the common pattern, if you want to test anything non-exported, seems to be a RunX func (in a _test.go file that is in package runtime, not the _test package) that exercises the internal code / returns an exported version, and then a usually very thin TestX in the external runtime_test package that calls it.
That's the reason for the overall setup, but I'll go back and review the actual logic in this one: I added these last couple of tests while just chasing coverage % on the train, so there might be some room for cleanup/commenting here.
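For readers unfamiliar with that convention, a minimal sketch of the pattern (the RunYieldQueueCheck / TestYieldQueue names are made up for illustration, not the PR's actual code):

    // In a _test.go file inside package runtime (the export_test.go pattern):
    package runtime

    // RunYieldQueueCheck pokes unexported internals and reports a plain result
    // that the external test can assert on.
    func RunYieldQueueCheck() bool {
        // ... exercise internal state here ...
        return true
    }

    // In a separate file in the external test package:
    package runtime_test

    import (
        "runtime"
        "testing"
    )

    func TestYieldQueue(t *testing.T) {
        if !runtime.RunYieldQueueCheck() {
            t.Fatal("yield queue check failed")
        }
    }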
(force-pushed 610fd17 to e673937)
    // yield_put is the gopark unlock function for Yield. It enqueues the goroutine
    // onto the global yield queue. Returning true keeps the G parked until another
    // part of the scheduler makes it runnable again. The G remains in _Gwaiting
    // after this returns.
I'm confused about this. Is this saying that G's on the yieldq appear as _Gwaiting and not _Grunnable? If yes, why do we need to do this?
    // Yield is intended to be very low overhead, particularly in the no-op case
    // where the scheduler is not at capacity, to ensure it can be called often
    // enough in tasks wishing to yield promptly to waiting work when needed.
    // needed. When the scheduler is busy, the yielded goroutine can be parked in a
nit: superfluous "needed".
    // If there is work on the local runq, the cheapest option is to just hop behind
    // it in the local runq to let it run and then pick back up. However this will
    // end up thrashing if the work we yield to also then yields right back. We
    // don't mark goroutines in any way when they yield so we cannot directly detech
nit: detect
    //
    // At the same time, we *do* want to yield -- that's why we are here -- if there
    // is work waiting for a chance to run. We can balance our preference to give
    // the other P a chance to just run it vs not making it wait too longwith a
nit: long with
    unlock(&sched.lock)
    if bg != nil {
        // Transition from _Gwaiting (yield) to _Grunnable.
        trace := traceAcquire()
Is making it look like _Gwaiting meant to avoid polluting our runnable stats?
        return gp, false, false
    }
    if sched.yieldqsize != 0 {
        unlock(&sched.lock)
is this because we did an unsynchronized read earlier? A code comment would be helpful.
Given that we are getting closer to merging this, I think we need a plan for how we will maintain it, specifically the couple of lines of code scattered over various parts of proc.go. E.g. we need a list of which cases we need to integrate with the changes, and how to go about finding those cases in the scheduler code.