
Conversation

@gabeuehlein
Contributor

Marked as a draft due to a shower thought I had about a potential way to reduce the performance impact of these changes. I still need to look into that idea, but I'm opening this PR regardless so people know that I'm working on it. I'll write a more in-depth description once this PR is ready for review.

Anyway, the goal of this PR is to make some changes to the internal representation of InternPool.String to make it harder to crash the compiler when dealing with large amounts of string data, particularly when building with a large number of parallel jobs.

Current perf (stage3 is master, stage4 is this branch):

gabeu@gu /t/state> sudo nice -n-20 sudo -u gabeu -- poop './stage3 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib' './stage4 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib' -d 30000
Benchmark 1 (70 runs): ./stage3 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           429ms ± 4.40ms     423ms …  449ms          4 ( 6%)        0%
  peak_rss            112MB ±  163KB     112MB …  113MB          0 ( 0%)        0%
  cpu_cycles          655M  ± 4.85M      649M  …  673M          11 (16%)        0%
  instructions        934M  ± 57.8K      934M  …  934M           0 ( 0%)        0%
  cache_references   93.0M  ±  282K     92.4M  … 94.0M           1 ( 1%)        0%
  cache_misses       13.7M  ±  178K     13.4M  … 14.4M           3 ( 4%)        0%
  branch_misses      3.03M  ± 11.1K     3.01M  … 3.07M           4 ( 6%)        0%
Benchmark 2 (70 runs): ./stage4 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           433ms ± 4.94ms     424ms …  445ms          0 ( 0%)          +  1.0% ±  0.4%
  peak_rss            114MB ±  173KB     113MB …  114MB          0 ( 0%)        💩+  1.1% ±  0.0%
  cpu_cycles          662M  ± 7.23M      653M  …  679M           0 ( 0%)          +  1.0% ±  0.3%
  instructions        934M  ± 59.4K      934M  …  934M           0 ( 0%)          -  0.0% ±  0.0%
  cache_references   91.5M  ±  382K     90.8M  … 92.8M           3 ( 4%)        ⚡-  1.6% ±  0.1%
  cache_misses       13.9M  ±  254K     13.5M  … 14.7M           1 ( 1%)        💩+  1.8% ±  0.5%
  branch_misses      3.04M  ± 15.8K     3.02M  … 3.08M           2 ( 3%)          +  0.5% ±  0.1%

Regardless of what ends up going forward...

Fixes #22867
Fixes #25297
Fixes #25339


Should address the issues caused by having too much data stored as
strings when using a high thread count. Most importantly, InternPool.String
now increments linearly, allowing InternPool to always store at least 4GiB of strings
(including all NUL terminators), which is always greater than or equal to the previous maximum
(a rough sketch of one possible encoding follows this commit list).
Unfortunately, performance does take roughly a 2% hit when building small executables (I tested
an empty main() and a hello world).
instead of indexOfSentinel when computing length of NullTerminatedString
whoops
@alexrp alexrp requested a review from mlugg September 29, 2025 09:38
Treating the bool vector as a bitmask should pretty much always result in better codegen. Significantly reduces the size of these functions on x86-64 (~30 instructions -> ~3)
I'll write up a separate PR for changes to `std.simd`, as I've noticed a few potential optimizations scattered throughout
Update doc comment for getMutableLargeStrings
Add explicit variant values to String.SizeClass
Split 190-column line into a few smaller ones
LLVM now inlines the smallLength call in NullTerminatedString.length on its own, so no need to force it to be inlined
Fix an assert (not quite sure what I was thinking when I wrote it)
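
To make the commit description above concrete, here is a rough sketch of the kind of handle encoding being discussed. The field names and layout are guesses for illustration only, not the PR's actual code, and how the stated >= 4GiB total capacity is reached (e.g. via per-thread buffers) is omitted:

```zig
const SizeClass = enum(u1) { small = 0, large = 1 };

/// Hypothetical sketch, not the PR's actual definition: a 32-bit handle
/// that spends one bit on the size class, leaving a 31-bit index that
/// simply increments as string data is appended.
const String = packed struct(u32) {
    /// small: byte offset into the append-only small-string buffer;
    /// large: index into a side table of (offset, len) entries.
    index: u31,
    size_class: SizeClass,
};
```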
Member

@jacobly0 jacobly0 left a comment


This is pretty much exactly the idea I was considering for solving this.

else => @compileError("unsupported host"),
};
const Strings = List(struct { u8 });
const LargeStrings = List(struct { offset: u32, len: u32 });
Member

@jacobly0 jacobly0 Oct 3, 2025


You shouldn't need to store a length, it can be derived by storing an extra entry in an offset list and subtracting two adjacent offsets (The extra entry being not strictly necessary, but it saves an extra branch in the hot path).
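
A minimal sketch of this suggestion, with hypothetical names (the real lists are the `Strings`/`LargeStrings` shown above): store only offsets plus one trailing entry equal to the total byte count, so a length is the difference of two adjacent offsets and even the last string needs no special case.

```zig
const std = @import("std");

const LargeStringTable = struct {
    bytes: std.ArrayListUnmanaged(u8) = .empty,
    /// Holds string_count + 1 entries; the final entry is always
    /// bytes.items.len — the "extra entry" that avoids a branch.
    offsets: std.ArrayListUnmanaged(u32) = .empty,

    fn append(table: *LargeStringTable, gpa: std.mem.Allocator, s: []const u8) !u32 {
        if (table.offsets.items.len == 0)
            try table.offsets.append(gpa, 0); // seed the offset chain
        try table.bytes.appendSlice(gpa, s);
        try table.offsets.append(gpa, @intCast(table.bytes.items.len));
        return @intCast(table.offsets.items.len - 2); // index of the new string
    }

    fn get(table: *const LargeStringTable, index: u32) []const u8 {
        const start = table.offsets.items[index];
        const end = table.offsets.items[index + 1]; // extra entry covers the last string
        return table.bytes.items[start..end];
    }
};
```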

Member


Without this change, this branch is 0.3-0.4% slower than master; with this change, it is 0.3-0.4% faster than master. The difference between using small strings and using only large strings is <0.1%, so I have decided that the extra complexity is not worth such a small speedup, and we will just use this large-string strategy for all strings.

Member


Did you get the RSS numbers? I ask because the size of the InternPool will directly mirror the size of serialized compiler state in the future. (I'm not really worried, though; I highly doubt they changed much at all.)

Member

@jacobly0 jacobly0 Oct 4, 2025


All versions use an extra 1-3MB over master, and master itself already has a ±2MB run-to-run variance.

small = 0,
large = 1,

pub fn detect(len: u32, tid: Zcu.PerThread.Id, ip: *InternPool) SizeClass {
Member


This name seems weird; maybe something like fromLength would be better?

Member


Even just classify or classifyLength would be better.


const local = ip.getLocal(tid);

return @enumFromInt(@intFromBool(local.mutate.small_string_bytes.len >= ip.getIndexMask(u31)));
Member


I might be missing something, but this doesn't seem to take into account the current string that we are attempting to add?

Member


Hmm, I think I get it: you are saying that it's ok for the length to exceed this bound as long as the start offset can be safely stored, and the previous check ensures that we aren't overflowing by an unreasonable amount anyway.
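
A sketch of that reading, with made-up names and a placeholder bound (not the PR's code): only the start offset of the would-be small string has to fit in the 31-bit index, and any overshoot past it is bounded by the small-string length limit, because longer strings were already sent to the large class.

```zig
const std = @import("std");

const SizeClass = enum(u1) { small, large };
const max_small_string_len = 32; // placeholder bound for this sketch

fn classify(small_bytes_len: usize, len: u32) SizeClass {
    // Over-long strings never enter the small buffer, so any spill past
    // the index mask below is at most max_small_string_len bytes.
    if (len > max_small_string_len) return .large;
    // Only the start offset (the current buffer length) must be
    // representable as a 31-bit index; start + len may slightly exceed it.
    if (small_bytes_len >= std.math.maxInt(u31)) return .large;
    return .small;
}
```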

empty = 0,
_,

pub const max_small_string_len = std.simd.suggestVectorLength(u8) orelse std.atomic.cache_line;
Member

@jacobly0 jacobly0 Oct 3, 2025


Can you also justify this logic? I would expect much larger strings than this to still be reasonable, and I'm not convinced it should vary by target.

Member


Put another way, the distribution of string lengths found in source files is not dependent on the target that the compiler was compiled for.

Contributor Author


If I'm recalling correctly the original intent was to:

  1. Prefer small strings that can be stored in a vector register if possible such that one vector compare can find the index of the null terminator. I originally wrote some code that guaranteed that exactly one vector compare was done in smallLength, but LLVM was reluctant to inline it in places where it mattered, so I reverted it.
  2. Fall back to a length in the ballpark around which access of nearby small Strings won't cause excessive cache thrashing, for certain values of "nearby." Deciding to use cache_line for this was admittedly pretty arbitrary.

I did look at the distribution of string lengths, including when (but not where) .toSlice() and .length() were called. The vast majority of strings that were accessed had quite small lengths (<= 16 bytes if I'm remembering correctly), hence the rather small length limit in all cases. I'll try to find those results later, analyze them a bit, and do some guesswork to see what works best on my machine (a slightly above-average Zen 3 laptop). I'm 90% sure that tuning this on a per-target basis will give non-negligible performance improvements by pushing large, infrequently accessed strings out of cache, although I didn't do much testing with this because compiling a new ReleaseFast compiler takes about 20 minutes on this laptop.
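
For illustration, here is a rough sketch of the first point, assuming a fixed 16-byte slot (this is not the PR's actual smallLength): one vector compare against zero, a bitcast of the resulting bool vector to a bitmask, and @ctz yields the terminator index, i.e. the length.

```zig
const std = @import("std");

fn smallLengthSketch(slot: *const [16]u8) usize {
    const v: @Vector(16, u8) = slot.*;
    // One vector compare finds every zero byte at once; treating the bool
    // vector as a bitmask (as in one of the commits above) lets @ctz pick
    // out the first NUL in a couple of instructions on x86-64.
    const mask: u16 = @bitCast(v == @as(@Vector(16, u8), @splat(0)));
    std.debug.assert(mask != 0); // slots are always NUL-terminated
    return @ctz(mask);
}
```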

Member

@jacobly0 jacobly0 Oct 4, 2025


although I didn't do too much testing with this because compiling a new ReleaseFast compiler takes about 20 minutes on this laptop.

In that case, I will take over this PR so that I can run some benchmarks.

Member


For reference, the optimal length ended up being either 16 or 32, definitely not the 64 or 128 produced by this expression on my computer.

@jacobly0
Member

jacobly0 commented Oct 3, 2025

Marked as a draft due to a shower thought I had about a potential way to reduce the performance impact of these changes.

Also note that this can always be a followup enhancement; the extra functionality is worth the minor slowdown regardless.

@gabeuehlein
Contributor Author

Closing in favor of #25464.

@gabeuehlein gabeuehlein closed this Oct 4, 2025


Development

Successfully merging this pull request may close these issues:

- Possible memory corruption in the compiler
- zig segfaults when building ghostty on 32 core threadripper
- @embedFile corrupts compiler memory
