runtime: add runtime.Yield() #8
base: cockroach-go1.23.12
Conversation
I'm good with experimenting with this Go runtime enhancement. I definitely want to see experimental evidence of the benefit.
PS: Should probably update the print in schedtrace to include sched.bgqsize.
src/runtime/proc.go (outdated):

    if gp.lockedm != 0 || gp.m.lockedg != 0 || gp.m.locks > 0 {
        return
    }
    if sched.runqsize > 0 || (gp.m.p != 0 && !runqempty(gp.m.p.ptr())) {
Don't you need to hold sched.lock when checking sched.runqsize and sched.bgqsize? Hmm, it looks like runqsize is sometimes checked without the lock and sometimes with, though it is always modified with the lock held. Seems like the runtime authors just assume this is understood. Worth dropping a comment here about the safety.
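Something along these lines is the flavor of comment being suggested here (wording is illustrative, not from the patch):

    // Racy read: sched.runqsize and sched.bgqsize are only written with
    // sched.lock held, but, as in other scheduler fast paths, they are read
    // here without it. A stale value merely delays or skips a yield; any
    // decision that must be correct is re-checked under sched.lock.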
Yeah, I had a similar question when I was reading findRunnable and noticed reads of runqsize without holding the lock and without atomics, and I specifically asked ChatGPT to look for all such cases and outline what's going on. My take-away was that these reads are done where the compiler isn't going to hoist the load entirely out of a loop or something, so it will actually be a load that's just at the mercy of cache-coherence delays, and that slightly stale read is good enough for the sort of "first pass" decisions it is used for, with critical decisions then falling back to a locked read like the one at the bottom of findRunnable if the unlocked reads didn't find anything. Yielding seems like just such a case, where it is more important that we be cheap than perfect, as we'll just yield a few checks later once our cache updates.
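To make that pattern concrete, here is a toy analogue (a stand-in workQueue type, not scheduler code): a cheap, possibly stale check skips the lock in the common case and is backstopped by an authoritative re-check under the lock. The runtime does the fast check with plain loads of runqsize; an atomic is used here only so the sketch stays race-detector-clean.

    package sketch

    import (
        "sync"
        "sync/atomic"
    )

    type workQueue struct {
        mu    sync.Mutex
        size  atomic.Int32 // written with mu held; read without it in the fast path
        items []func()
    }

    func (q *workQueue) tryTake() (func(), bool) {
        if q.size.Load() == 0 { // fast check: may be stale, never authoritative
            return nil, false
        }
        q.mu.Lock()
        defer q.mu.Unlock()
        if q.size.Load() == 0 { // re-check under the lock before committing
            return nil, false
        }
        item := q.items[0]
        q.items = q.items[1:]
        q.size.Add(-1)
        return item, true
    }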
(force-pushed 84fef0d to ec86954)
src/runtime/runtime2.go (outdated):

    runqsize int32
    // Global background-yield queue: goroutines that voluntarily yielded
    // while the scheduler was busy. Does NOT contribute to runqsize.
    bgq gQueue
(just clarifying) So there is no per-P run queue for these (like p.runq) because:
- We expect the number of such background goroutines to be few?
- Even if they are not few, they are only run when the foreground goroutines are not runnable, which should be rarer (when P utilization is high; and if it is low, this goroutine wouldn't have had to yield), so grabbing them from the global queue is ok wrt concurrent performance?
Yeah, that was my thinking: the important thing is getting out of the way under saturation, and we are willing to take a small hit of going to/from a global queue to do so (since we also don't expect to remain in saturation for long, particularly once the majority of some spike in runnable work has been parked on bgq to be slowly retrieved as capacity allows).
That said, this comment got me wondering if I could do better: I really don't want to make new global and new local shadow queues and duplicate everything, reimplement stealing, etc. That feels like an unpleasant patch to keep on our fork if upstream won't take this (not optimistic). But maybe we don't need to go park on bgq after all if our local runq has work: getting behind that work is just as good as going to the global bgq insofar as it gets out of the way of some work. I revised the patch to do just that: jump to the back of the local run queue (no bgq involved) if we see there is work in the local runq, and only go park on bgq if we see an overload signal and don't see local work to let by.
To avoid bg -> bg thrashing I conditioned the local hop on having run for 100us; otherwise I go to the bgq.
This (or the change in saturation signal) appears to have given us a bit of a perf boost; previously we were close to, but slightly below, the end-to-end throughput of CRDB master on concurrent, heavily CPU-contended IMPORT jobs, while getting p99 sched latency of <1ms vs 200ms+. With the changes I made here -- local yields when possible + a dedicated saturation signal -- we're still seeing <1ms p99 sched latency (~0.3ms), but end-to-end throughput of the IMPORTs actually went up by 10%+, presumably because parking on bgq avoids the runtime's 10ms preemptions or something (though that seems like an outsized effect).
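To summarize the revised decision in code, an illustrative sketch only (stand-in names like overloaded, localRunqHasWork, and runningFor; not the actual patch): prefer a cheap hop behind local work, fall back to the global bgq when overloaded, and avoid bg -> bg thrash by only hopping locally after having run for a while.

    package sketch

    import "time"

    const minRunBeforeLocalHop = 100 * time.Microsecond // threshold discussed above

    type yieldAction int

    const (
        keepRunning  yieldAction = iota // nothing waiting: yield is a no-op
        hopLocalRunq                    // go to the back of this P's local runq
        parkOnBgq                       // park on the global background queue
    )

    func decideYield(overloaded, localRunqHasWork bool, runningFor time.Duration) yieldAction {
        if !overloaded && !localRunqHasWork {
            return keepRunning
        }
        if localRunqHasWork && runningFor >= minRunBeforeLocalHop {
            return hopLocalRunq
        }
        return parkOnBgq
    }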
src/runtime/proc.go (outdated):

    // Fast path: tiny, inlineable checks only.

    // Check if we need to yield at all and early exit fast if not.
    if sched.npidle.Load() > 0 {
Given that we already have a global atomic for npidle, should we consider adding one for npWithNonEmptyRunQ? That would eliminate the unfairness from my previous comment.
What do you mean here?
We are deciding to add this goroutine to bgq when there is no idle P and:
- The global runq is non-empty, or
- This P's runq is non-empty

Both of the above could be false, but some other P could have a non-empty runq (which, if this P became idle, it would steal from). This is not desirable. We can fix this by having a schedt.npWithNonEmptyRunQ atomic: each P, when it transitions from runq empty to non-empty, would increment this atomic, and decrement on the reverse transition. The second bullet above would change to npWithNonEmptyRunQ > 0.
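A toy stand-in for that idea (not runtime code): a queue wrapper that bumps a shared counter only on the empty <-> non-empty transitions, so readers get a cheap atomic signal that some queue has work. In the runtime each runq is owned by a single P, so only the counter itself would be shared.

    package sketch

    import "sync/atomic"

    var npWithNonEmptyRunQ atomic.Int64 // analogous to the proposed schedt field

    type countedQueue struct {
        items []int // placeholder for the real queue contents
    }

    func (q *countedQueue) push(v int) {
        if len(q.items) == 0 {
            npWithNonEmptyRunQ.Add(1) // empty -> non-empty
        }
        q.items = append(q.items, v)
    }

    func (q *countedQueue) pop() (int, bool) {
        if len(q.items) == 0 {
            return 0, false
        }
        v := q.items[0]
        q.items = q.items[1:]
        if len(q.items) == 0 {
            npWithNonEmptyRunQ.Add(-1) // non-empty -> empty
        }
        return v, true
    }

    // someQueueHasWork is what the second bullet's check would become.
    func someQueueHasWork() bool { return npWithNonEmptyRunQ.Load() > 0 }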
Yeah, I figured if the global runq is empty and this G has nothing in its local runq, that's cheap to check. I was reluctant initially to add a new atomic or anything that needs to be maintained in any non-background paths, in case this is a patch we have to carry on our fork, so I tried to stick to just what we already have: npidle, the global queue, and the local queue, but not other Ps' queues.
This has me thinking: if we were willing to leave a little utilization on the table, we could just say npidle < 1 is our signal. Then we can infer that all the runqs, global and local, are empty (or could be if they wanted to be) since we're leaving a whole P idle, and that possibly has even better latency characteristics than waiting for a runq to be non-empty to jump out of the way at the last minute. But again, at the cost of leaving a whole P on the table. Maybe that's ok (still better utilization than today for maxprocs>=4).
I guess another option is to make a num runnable atomic (if we have to keep it on a fork it isn’t that hard to grep for the casStatus(runnable) calls) and stop looking at either npidle or runqsize?
I added a new ngqueued atomic that is bumped when an attempt to schedule fails to find an idle P (e.g. in wakep) and is reset to 0 when findRunnable doesn't find anything. Then, after I did that, I told myself a story about how, if we have nothing in our local runq or the global runq, maybe we shouldn't park on bgq anyway just to free up a P to steal from, since we'd prefer something like a yield at the head of the busy Ps instead of having bg work hop between Ps, and then compromised with a heuristic based on time spent running.
The heuristic performs pretty well in tests.
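Restated as a toy sketch (stand-in names, not the patch itself), the saturation hint works roughly like this:

    package sketch

    import "sync/atomic"

    // queuedWithoutIdleP plays the role of the ngqueued atomic described above.
    var queuedWithoutIdleP atomic.Int32

    // Called when a wakeup can't find an idle P to hand new work to (wakep path).
    func onWakeupFoundNoIdleP() { queuedWithoutIdleP.Add(1) }

    // Called when the scheduler finds nothing left to run (findRunnable path).
    func onNothingRunnable() { queuedWithoutIdleP.Store(0) }

    // schedulerLooksSaturated is the kind of overload signal a yield would consult.
    func schedulerLooksSaturated() bool { return queuedWithoutIdleP.Load() > 0 }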
This has me thinking: if we were willing to leave a little utilization on the table, we could just say npidle < 1 is our signal.
That seems ok to me.
I haven't read your revisions yet.
src/runtime/proc.go (outdated):

    // Yielded goroutines were runnable but voluntarily deprioritized themselves
    // to waiting instead by calling BackgroundYield. If we have nothing runnable
    // we can bring a yielded goroutine back to runnable and run it.
    if sched.bgqsize != 0 {
Does the Go compiler know something special about certain fields and make their unsynchronized reads "less stale"?
Not that I can find. The runtime, and in particular the scheduler, uses these a-bit-stale reads liberally (though in important places like findRunnable there are locked reads backstopping them after the fast paths are exhausted). I was, uh, curious about this too. As far as I can tell -- and what ChatGPT tells me as well -- they're just good old unlocked, un-synced reads. They're uint32, so there's no worry about tearing a write (on 32-bit); staleness is the only concern. They aren't in loops out of which the load might be lifted by the compiler, so it'll be an actual load, just at the mercy of MESI or whatever, I guess. Though I did my initial testing with a cruder version of this patch, before I cleaned it up for a PR, and one of my cleanups was to split it up to ensure the cheap checks could be inlined; I wonder if that was a mistake, since if it is inlined into a for loop -- the place we expect this to be called -- maybe the load does get lifted and then never sees new values? I guess I should re-test with the optimized version.
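For what it's worth, the inlining question can be checked directly. A sketch of the fast/slow split being described (names made up, not the patch's): build with -gcflags=-m to see the compiler's inlining decisions, and -gcflags=-S to inspect whether the cheap load actually survives inside a caller's loop.

    package sketch

    var yieldNeeded uint32 // stand-in for a sched-global hint like npidle/bgqsize

    // yieldFast is meant to be small enough to inline into callers' loops
    // (verify with: go build -gcflags=-m ./...).
    func yieldFast() {
        if yieldNeeded == 0 { // deliberately plain load, mirroring runtime style
            return // common case: scheduler not busy, do nothing
        }
        yieldSlow()
    }

    //go:noinline
    func yieldSlow() {
        // the expensive part: locking, requeueing, parking, etc.
    }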
src/runtime/backgroundyield_test.go (outdated):

    defer runtime.GOMAXPROCS(orig)

    runtime.GOMAXPROCS(target)
    runtime.Gosched()
what's this for?
src/runtime/export_test.go (outdated):

        }
    }

    // RunBackgroundYieldQueueCheck exercises the background queue enqueue/dequeue
I'm confused by this method. What does it exercise? What does success mean? Why does it return skipped==true if there's something in the bgq? Why is this only used in a test that doesn't seem to do anything?
Can't use testing.T or testing at all in the runtime package, only in the external runtime_test test package. So the common pattern, if you want to test anything non-exported, seems to be a RunX func (in a _test.go file that is in package runtime, not the _test package) that exercises the internal code / returns an exported version, and then a usually very thin TestX in the external runtime_test package that calls it.
That's the reason for the overall setup, but I'll go back and review the actual logic in this one: I added these last couple of tests while just chasing coverage % on the train, so there might be some room for cleanup/commenting here.
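For readers unfamiliar with that convention, a minimal sketch of the pattern (the RunYieldQueueCheck / TestYieldQueue names are made up for illustration, not the PR's actual code):

    // In a _test.go file inside package runtime (the export_test.go pattern):
    package runtime

    // RunYieldQueueCheck pokes unexported internals and reports a plain result
    // that the external test can assert on.
    func RunYieldQueueCheck() bool {
        // ... exercise internal state here ...
        return true
    }

    // In a separate file in the external test package:
    package runtime_test

    import (
        "runtime"
        "testing"
    )

    func TestYieldQueue(t *testing.T) {
        if !runtime.RunYieldQueueCheck() {
            t.Fatal("yield queue check failed")
        }
    }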
(force-pushed 610fd17 to e673937)
    // yield_put is the gopark unlock function for Yield. It enqueues the goroutine
    // onto the global yield queue. Returning true keeps the G parked until another
    // part of the scheduler makes it runnable again. The G remains in _Gwaiting
    // after this returns.
I'm confused about this. Is this saying that G's on the yieldq appear as _Gwaiting and not _Grunnable? If yes, why do we need to do this?
    // Yield is intended to be very low overhead, particularly in the no-op case
    // where the scheduler is not at capacity, to ensure it can be called often
    // enough in tasks wishing to yield promptly to waiting work when needed.
    // needed. When the scheduler is busy, the yielded goroutine can be parked in a
nit: superfluous "needed".
    // If there is work on the local runq, the cheapest option is to just hop behind
    // it in the local runq to let it run and then pick back up. However this will
    // end up thrashing if the work we yield to also then yields right back. We
    // don't mark goroutines in any way when they yield so we cannot directly detech
nit: detect
    //
    // At the same time, we *do* want to yield -- that's why we are here -- if there
    // is work waiting for a chance to run. We can balance our preference to give
    // the other P a chance to just run it vs not making it wait too longwith a
nit: long with
    unlock(&sched.lock)
    if bg != nil {
        // Transition from _Gwaiting (yield) to _Grunnable.
        trace := traceAcquire()
Is making it look like _Gwaiting meant to avoid polluting our runnable stats?
        return gp, false, false
    }
    if sched.yieldqsize != 0 {
        unlock(&sched.lock)
is this because we did an unsynchronized read earlier? A code comment would be helpful.
Given that we are getting closer to merging this, I think we need a plan for how we will maintain it, specifically the couple of lines of code scattered over various parts of proc.go. E.g. we need a list of which cases we need to integrate with the changes, and how to go about finding those cases in the scheduler code.