InternPool: Rework the way strings are represented #25384
Conversation
Should address the issues caused by having too much data stored as strings when using a high thread count. Most importantly, InternPool.String now increments linearly, allowing InternPool to always store at least 4GiB of strings (including all NUL terminators), which is greater than or equal to the previous maximum. Unfortunately, performance does take roughly a 2% hit when building small executables (I tested an empty main() and a hello world).
instead of `indexOfSentinel` when computing length of `NullTerminatedString`
Should pretty much always result in better codegen to treat the bool vector as a bitmask. Significantly reduces the size of these functions on x86-64 (~30 instructions -> ~3)
I'll write up a separate PR for changes to `std.simd`, as I've noticed a few potential optimizations scattered throughout
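For illustration, here is a hedged, language-agnostic Python sketch of the bitmask trick described above (the actual change is in Zig; the helper name and the idea of a fixed-size chunk are assumptions for this example, not the real code). Reinterpreting the per-byte comparison results as an integer bitmask lets a single trailing-zero count replace a per-element scan:

```python
def small_length(chunk: bytes) -> int:
    """Index of the first NUL in a fixed-size chunk, via a bitmask.

    Hypothetical sketch: bit i of `mask` is set iff chunk[i] == 0,
    mirroring a vector compare whose bool-vector result is bitcast to
    an integer. The answer is then the count of trailing zeros, which
    maps to a single instruction (tzcnt/ctz) on most targets.
    """
    mask = 0
    for i, b in enumerate(chunk):
        mask |= (b == 0) << i
    assert mask != 0, "chunk must contain a NUL terminator"
    # (mask & -mask) isolates the lowest set bit; bit_length - 1 is
    # its index, i.e. the trailing-zero count.
    return (mask & -mask).bit_length() - 1
```

For example, `small_length(b"hi\0\0")` returns 2.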
- Update doc comment for `getMutableLargeStrings`
- Add explicit variant values to `String.SizeClass`
- Split 190-column line into a few smaller ones
- LLVM now inlines the `smallLength` call in `NullTerminatedString.length` on its own, so no need to force it to be inlined
- Fix an assert (not quite sure what I was thinking when I wrote it)
Force-pushed from 02f1cd6 to 138c479
This is pretty much exactly the idea I was considering for solving this.
```zig
    else => @compileError("unsupported host"),
};
const Strings = List(struct { u8 });
const LargeStrings = List(struct { offset: u32, len: u32 });
```
You shouldn't need to store a length; it can be derived by storing an extra entry in an offset list and subtracting two adjacent offsets. (The extra entry is not strictly necessary, but it saves an extra branch in the hot path.)
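A minimal Python sketch of this suggestion (the class and method names are hypothetical, not the actual InternPool code): keeping one extra trailing entry in the offset list means every length, including the last string's, comes from subtracting two adjacent offsets with no branch.

```python
class LargeStrings:
    """Hypothetical sketch: offsets only, lengths derived on demand."""

    def __init__(self) -> None:
        self.bytes = bytearray()
        self.offsets = [0]  # extra entry: the end of the last string

    def add(self, s: bytes) -> int:
        index = len(self.offsets) - 1
        self.bytes += s
        self.offsets.append(len(self.bytes))  # maintain the extra entry
        return index

    def length(self, i: int) -> int:
        # No stored length, and no "is this the last string?" branch:
        # the extra trailing entry makes offsets[i + 1] always valid.
        return self.offsets[i + 1] - self.offsets[i]
```

Compared to storing `(offset, len)` pairs, this halves the per-string metadata at the cost of one extra offset entry total.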
Without this change, this branch is 0.3-0.4% slower than master. With this change, it is 0.3-0.4% faster than master, and the difference between using small strings and only using large strings is <0.1%. Therefore I have decided that the extra complexity is not worth such a small speedup, and we will just use the large string strategy for all strings.
Did you get the RSS numbers? I ask because the size of the InternPool will directly mirror the size of serialized compiler state in the future. (I'm not really worried, though; I highly doubt they changed much at all.)
All versions are an extra 1-3MB over master, where master already has a ±2MB variance.
```zig
small = 0,
large = 1,

pub fn detect(len: u32, tid: Zcu.PerThread.Id, ip: *InternPool) SizeClass {
```
This name seems weird; maybe something like `fromLength` would be better?
Even just `classify` or `classifyLength` would be better.
```zig
const local = ip.getLocal(tid);

return @enumFromInt(@intFromBool(local.mutate.small_string_bytes.len >= ip.getIndexMask(u31)));
```
I might be missing something, but this doesn't seem to take into account the current string that we are attempting to add?
Hmm, I might get it: you are saying that it's OK for the length to exceed this bound as long as the start offset can be safely stored, and the previous check ensures that we aren't overflowing by an unreasonable amount anyway.
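To make that invariant concrete, here is a hedged Python sketch (the constant name and limit value are stand-ins, not the real `ip.getIndexMask(u31)` plumbing): classification looks only at the current buffer length, i.e. the start offset the next string would get, rather than at the incoming string's length.

```python
# Stand-in for the u31 index mask; hypothetical value for illustration.
SMALL_OFFSET_LIMIT = (1 << 31) - 1

def classify(small_string_bytes_len: int) -> str:
    """Pick a size class from the current small-string buffer length.

    Only the *start* offset of the string about to be added needs to be
    representable; the string itself may extend past the limit, since a
    separate length check already bounds how far past it can go.
    """
    return "large" if small_string_bytes_len >= SMALL_OFFSET_LIMIT else "small"
```

Once the buffer reaches the limit, every subsequent string goes to the large-string list, so offsets in the small buffer never overflow.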
```zig
empty = 0,
_,

pub const max_small_string_len = std.simd.suggestVectorLength(u8) orelse std.atomic.cache_line;
```
Can you also justify this logic? I would expect much larger strings than this to still be reasonable, and I'm not convinced it should vary by target.
Put another way, the distribution of string lengths found in source files is not dependent on the target that the compiler was compiled for.
If I'm recalling correctly, the original intent was to:
- Prefer small strings that can be stored in a vector register if possible, such that one vector compare can find the index of the null terminator. I originally wrote some code that guaranteed that exactly one vector compare was done in `smallLength`, but LLVM was reluctant to inline it in places where it mattered, so I reverted it.
- Fall back to a length in the ballpark around which access of nearby small strings won't cause excessive cache thrashing, for certain values of "nearby." Deciding to use `cache_line` for this was admittedly pretty arbitrary.

I did look at the distribution of string lengths, including when (but not where) .toSlice() and .length() were called. The vast majority of strings that were accessed had quite small lengths (<= 16 bytes, if I'm remembering correctly), hence the rather small length limit in all cases. I'll try to find those results later, analyze them a bit, and do some guesswork to see what works best on my machine (a slightly above-average Zen 3 laptop). I'm 90% sure that tuning this on a per-target basis will give non-negligible performance improvements by pushing large, infrequently accessed strings out of cache, although I didn't do too much testing with this because compiling a new ReleaseFast compiler takes about 20 minutes on this laptop.
> although I didn't do too much testing with this because compiling a new ReleaseFast compiler takes about 20 minutes on this laptop.
In that case, I will take over this PR so that I can take some benchmarks.
For reference, the optimal length ended up being either 16 or 32, definitely not the 64 or 128 produced by this expression on my computer.
Also note that this can always be a follow-up enhancement; the extra functionality is worth the minor slowdown regardless.

Closing in favor of #25464.
Marked as a draft due to a shower thought I had about a potential way to reduce the performance impact of these changes. I still need to look into that idea, but I'm opening this PR regardless so people know that I'm working on it. I'll write a more in-depth description once this PR is ready for review.
Anyway, the goal of this PR is to make some changes to the internal representation of InternPool.String to make it harder to crash the compiler when dealing with large amounts of string data, particularly when building with a large number of parallel jobs.

Current perf (stage3 is master, stage4 is this branch):
Regardless of what ends up going forward...
Fixes #22867
Fixes #25297
Fixes #25339