
Conversation

@gabeuehlein
Contributor

Marked as a draft due to a shower thought I had about a potential way to reduce the performance impact of these changes. I still need to look into that idea, but I'm opening this PR regardless so people know that I'm working on it. I'll write a more in-depth description once this PR is ready for review.

Anyway, the goal of this PR is to make some changes to the internal representation of InternPool.String to make it harder to crash the compiler when dealing with large amounts of string data, particularly when building with a large number of parallel jobs.

Current perf (stage3 is master, stage4 is this branch):

gabeu@gu /t/state> sudo nice -n-20 sudo -u gabeu -- poop './stage3 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib' './stage4 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib' -d 30000
Benchmark 1 (70 runs): ./stage3 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           429ms ± 4.40ms     423ms …  449ms          4 ( 6%)        0%
  peak_rss            112MB ±  163KB     112MB …  113MB          0 ( 0%)        0%
  cpu_cycles          655M  ± 4.85M      649M  …  673M          11 (16%)        0%
  instructions        934M  ± 57.8K      934M  …  934M           0 ( 0%)        0%
  cache_references   93.0M  ±  282K     92.4M  … 94.0M           1 ( 1%)        0%
  cache_misses       13.7M  ±  178K     13.4M  … 14.4M           3 ( 4%)        0%
  branch_misses      3.03M  ± 11.1K     3.01M  … 3.07M           4 ( 6%)        0%
Benchmark 2 (70 runs): ./stage4 build-exe hello-world.zig -fno-emit-bin --zig-lib-dir /home/gabeu/devel/zig/lib
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           433ms ± 4.94ms     424ms …  445ms          0 ( 0%)          +  1.0% ±  0.4%
  peak_rss            114MB ±  173KB     113MB …  114MB          0 ( 0%)        💩+  1.1% ±  0.0%
  cpu_cycles          662M  ± 7.23M      653M  …  679M           0 ( 0%)          +  1.0% ±  0.3%
  instructions        934M  ± 59.4K      934M  …  934M           0 ( 0%)          -  0.0% ±  0.0%
  cache_references   91.5M  ±  382K     90.8M  … 92.8M           3 ( 4%)        ⚡-  1.6% ±  0.1%
  cache_misses       13.9M  ±  254K     13.5M  … 14.7M           1 ( 1%)        💩+  1.8% ±  0.5%
  branch_misses      3.04M  ± 15.8K     3.02M  … 3.08M           2 ( 3%)          +  0.5% ±  0.1%

Regardless of what ends up going forward...

Fixes #22867
Fixes #25297
Fixes #25339


Should address the issues caused by having too much data stored as
strings when using a high thread count. Most importantly, InternPool.String
now increments linearly, allowing InternPool to always store at least 4GiB of strings
(including all NUL terminators), which is always greater than or equal to the previous maximum
(a rough sketch of one possible encoding follows this commit list).
Unfortunately, performance does take roughly a 2% hit when building small executables (I tested
an empty main() and a hello world).
instead of indexOfSentinel when computing length of NullTerminatedString
whoops
@alexrp alexrp requested a review from mlugg September 29, 2025 09:38
Treating the bool vector as a bitmask should pretty much always result in better codegen. Significantly reduces the size of these functions on x86-64 (~30 instructions -> ~3)
I'll write up a separate PR for changes to `std.simd`, as I've noticed a few potential optimizations scattered throughout
Update doc comment for getMutableLargeStrings
Add explicit variant values to String.SizeClass
Split 190-column line into a few smaller ones
LLVM now inlines the smallLength call in NullTerminatedString.length on its own, so no need to force it to be inlined
Fix an assert (not quite sure what I was thinking when I wrote it)
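
To make the commit description above concrete, here is a rough sketch of the kind of handle encoding being discussed. The field names and layout are guesses for illustration only, not the PR's actual code, and how the stated >= 4GiB total capacity is reached (e.g. via per-thread buffers) is omitted:

```zig
const SizeClass = enum(u1) { small = 0, large = 1 };

/// Hypothetical sketch, not the PR's actual definition: a 32-bit handle
/// that spends one bit on the size class, leaving a 31-bit index that
/// simply increments as string data is appended.
const String = packed struct(u32) {
    /// small: byte offset into the append-only small-string buffer;
    /// large: index into a side table of (offset, len) entries.
    index: u31,
    size_class: SizeClass,
};
```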
Member

@jacobly0 jacobly0 left a comment


This is pretty much exactly the idea I was considering for solving this.

else => @compileError("unsupported host"),
};
const Strings = List(struct { u8 });
const LargeStrings = List(struct { offset: u32, len: u32 });
Member

@jacobly0 jacobly0 Oct 3, 2025


You shouldn't need to store a length, it can be derived by storing an extra entry in an offset list and subtracting two adjacent offsets (The extra entry being not strictly necessary, but it saves an extra branch in the hot path).
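
A minimal sketch of this suggestion, with hypothetical names (the real lists are the `Strings`/`LargeStrings` shown above): store only offsets plus one trailing entry equal to the total byte count, so a length is the difference of two adjacent offsets and even the last string needs no special case.

```zig
const std = @import("std");

const LargeStringTable = struct {
    bytes: std.ArrayListUnmanaged(u8) = .empty,
    /// Holds string_count + 1 entries; the final entry is always
    /// bytes.items.len — the "extra entry" that avoids a branch.
    offsets: std.ArrayListUnmanaged(u32) = .empty,

    fn append(table: *LargeStringTable, gpa: std.mem.Allocator, s: []const u8) !u32 {
        if (table.offsets.items.len == 0)
            try table.offsets.append(gpa, 0); // seed the offset chain
        try table.bytes.appendSlice(gpa, s);
        try table.offsets.append(gpa, @intCast(table.bytes.items.len));
        return @intCast(table.offsets.items.len - 2); // index of the new string
    }

    fn get(table: *const LargeStringTable, index: u32) []const u8 {
        const start = table.offsets.items[index];
        const end = table.offsets.items[index + 1]; // extra entry covers the last string
        return table.bytes.items[start..end];
    }
};
```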

Member


Without this change, this branch is 0.3-0.4% slower than master; with this change, it is 0.3-0.4% faster than master. The difference between using small strings and using only large strings is <0.1%, so I have decided that the extra complexity is not worth such a small speedup, and we will just use this large-string strategy for all strings.

Member


Did you get the RSS numbers? I ask because the size of the InternPool will directly mirror the size of serialized compiler state in the future. (I'm not really worried, though; I highly doubt they changed much at all.)

Member

@jacobly0 jacobly0 Oct 4, 2025


All versions use an extra 1-3MB over master, and master itself already has a ±2MB run-to-run variance.

small = 0,
large = 1,

pub fn detect(len: u32, tid: Zcu.PerThread.Id, ip: *InternPool) SizeClass {
Member


This name seems weird; maybe something like fromLength would be better?

Member


Even just classify or classifyLength would be better.


const local = ip.getLocal(tid);

return @enumFromInt(@intFromBool(local.mutate.small_string_bytes.len >= ip.getIndexMask(u31)));
Member


I might be missing something, but this doesn't seem to take into account the current string that we are attempting to add?

Member


Hmm, I think I get it: you are saying that it's ok for the length to exceed this bound as long as the start offset can be safely stored, and the previous check ensures that we aren't overflowing by an unreasonable amount anyway.
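
A sketch of that reading, with made-up names and a placeholder bound (not the PR's code): only the start offset of the would-be small string has to fit in the 31-bit index, and any overshoot past it is bounded by the small-string length limit, because longer strings were already sent to the large class.

```zig
const std = @import("std");

const SizeClass = enum(u1) { small, large };
const max_small_string_len = 32; // placeholder bound for this sketch

fn classify(small_bytes_len: usize, len: u32) SizeClass {
    // Over-long strings never enter the small buffer, so any spill past
    // the index mask below is at most max_small_string_len bytes.
    if (len > max_small_string_len) return .large;
    // Only the start offset (the current buffer length) must be
    // representable as a 31-bit index; start + len may slightly exceed it.
    if (small_bytes_len >= std.math.maxInt(u31)) return .large;
    return .small;
}
```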

empty = 0,
_,

pub const max_small_string_len = std.simd.suggestVectorLength(u8) orelse std.atomic.cache_line;
Member

@jacobly0 jacobly0 Oct 3, 2025


Can you also justify this logic? I would expect much larger strings than this to still be reasonable, and I'm not convinced it should vary by target.

Member


Put another way, the distribution of string lengths found in source files is not dependent on the target that the compiler was compiled for.

Contributor Author


If I'm recalling correctly the original intent was to:

  1. Prefer small strings that can be stored in a vector register if possible such that one vector compare can find the index of the null terminator. I originally wrote some code that guaranteed that exactly one vector compare was done in smallLength, but LLVM was reluctant to inline it in places where it mattered, so I reverted it.
  2. Fall back to a length in the ballpark around which access of nearby small Strings won't cause excessive cache thrashing, for certain values of "nearby." Deciding to use cache_line for this was admittedly pretty arbitrary.

I did look at the distribution of string lengths, including when (but not where) .toSlice() and .length() were called. The vast majority of strings that were accessed had quite small lengths (<= 16 bytes if I'm remembering correctly), hence the rather small length limit in all cases. I'll try to find those results later, analyze them a bit, and do some guesswork to see what works best on my machine (a slightly above-average Zen 3 laptop). I'm 90% sure that tuning this on a per-target basis will give non-negligible performance improvements by pushing large, infrequently accessed strings out of cache, although I didn't do much testing with this because compiling a new ReleaseFast compiler takes about 20 minutes on this laptop.
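
For illustration, here is a rough sketch of the first point, assuming a fixed 16-byte slot (this is not the PR's actual smallLength): one vector compare against zero, a bitcast of the resulting bool vector to a bitmask, and @ctz yields the terminator index, i.e. the length.

```zig
const std = @import("std");

fn smallLengthSketch(slot: *const [16]u8) usize {
    const v: @Vector(16, u8) = slot.*;
    // One vector compare finds every zero byte at once; treating the bool
    // vector as a bitmask (as in one of the commits above) lets @ctz pick
    // out the first NUL in a couple of instructions on x86-64.
    const mask: u16 = @bitCast(v == @as(@Vector(16, u8), @splat(0)));
    std.debug.assert(mask != 0); // slots are always NUL-terminated
    return @ctz(mask);
}
```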

Member

@jacobly0 jacobly0 Oct 4, 2025


although I didn't do too much testing with this because compiling a new ReleaseFast compiler takes about 20 minutes on this laptop.

In that case, I will take over this PR so that I can run some benchmarks.

Member


For reference, the optimal length ended up being either 16 or 32, definitely not the 64 or 128 produced by this expression on my computer.

@jacobly0
Member

jacobly0 commented Oct 3, 2025

Marked as a draft due to a shower thought I had about a potential way to reduce the performance impact of these changes.

Also note that this can always be a followup enhancement; the extra functionality is worth the minor slowdown regardless.

@gabeuehlein
Contributor Author

Closing in favor of #25464.

@gabeuehlein gabeuehlein closed this Oct 4, 2025


Development

Successfully merging this pull request may close these issues:

- Possible memory corruption in the compiler
- zig segfaults when building ghostty on 32 core threadripper
- @embedFile corrupts compiler memory
