Optimize serialization format for 2 bytes ints #20120

ilevkivskyi · 2025-10-26T00:46:24Z

This is important for serialized ASTs (think line numbers for every node). Also in this PR:

Reuse the same integer logic for str/bytes lengths (it is slightly less optimal, but code reuse is good).
Remove an unused field from the Buffer type.
Make the format the same on 32-bit and 64-bit platforms (we still assume a little-endian platform).

JukkaL

Looks good overall, just some minor comments. This will help AST serialization a lot, in particular.

JukkaL · 2025-10-30T15:42:45Z

mypyc/lib-rt/librt_internal.c

+    if (likely(first != LONG_INT_TRAILER)) {
+        return _read_short_int(data, first);
    }
    // People who have literal ints not fitting in size_t should be punished :-)


This comment is out of date, since now anything not fitting in 29 bits triggers this path.

JukkaL · 2025-10-30T15:44:40Z

mypyc/lib-rt/librt_internal.c

+    one byte: last bit 0, 7 bits used
+    two bytes: last two bits 01, 14 bits used
+    four bytes: last three bits 011, 29 bits used
+    everything else: 00000111 followed by serialized string representation


What about using ...01111 (four 1 bits) for the final option, as this would leave ...0111 available for a possible 60-bit format in the future. (I don't have a strong opinion.)
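
To make the layout in the quoted comment concrete, here is a minimal Python sketch of an encoder and decoder. It is not the librt implementation. It assumes that "last bit" means the least significant bit of the first byte, that multi-byte values are stored little-endian, that each width is biased by the MIN_/MAX_ constants quoted elsewhere in this review (`MIN_ONE_BYTE_INT` and `MAX_FOUR_BYTES_INT` are inferred by analogy), and that the fallback stores a length-prefixed decimal string. The function names are illustrative only.

```python
# Illustrative sketch; only the tag layout and the quoted constants come from the PR.
MIN_ONE_BYTE_INT = -10                      # assumed from the "- 10" in the quoted comment
MAX_ONE_BYTE_INT = 117                      # 2 ** 7 - 1 - 10
MIN_TWO_BYTES_INT = -100
MAX_TWO_BYTES_INT = 16283                   # 2 ** (8 + 6) - 1 - 100
MIN_FOUR_BYTES_INT = -10000
MAX_FOUR_BYTES_INT = 2 ** 29 - 1 - 10000    # assumed by analogy with the lines above
LONG_INT_TRAILER = 0b00000111


def write_int(value: int) -> bytes:
    """Encode value using the shortest format whose range contains it."""
    if MIN_ONE_BYTE_INT <= value <= MAX_ONE_BYTE_INT:
        # Tag bit 0 in the lowest bit, 7 payload bits above it.
        return bytes([(value - MIN_ONE_BYTE_INT) << 1])
    if MIN_TWO_BYTES_INT <= value <= MAX_TWO_BYTES_INT:
        # Tag bits 01, 14 payload bits.
        return (((value - MIN_TWO_BYTES_INT) << 2) | 0b01).to_bytes(2, "little")
    if MIN_FOUR_BYTES_INT <= value <= MAX_FOUR_BYTES_INT:
        # Tag bits 011, 29 payload bits.
        return (((value - MIN_FOUR_BYTES_INT) << 3) | 0b011).to_bytes(4, "little")
    # Fallback (assumed shape): trailer byte, digit count, decimal digits.
    digits = str(value).encode("ascii")
    return bytes([LONG_INT_TRAILER]) + write_int(len(digits)) + digits


def read_int(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one integer at pos; return (value, position after it)."""
    first = data[pos]
    if first & 0b1 == 0:
        return (first >> 1) + MIN_ONE_BYTE_INT, pos + 1
    if first & 0b11 == 0b01:
        raw = int.from_bytes(data[pos:pos + 2], "little")
        return (raw >> 2) + MIN_TWO_BYTES_INT, pos + 2
    if first & 0b111 == 0b011:
        raw = int.from_bytes(data[pos:pos + 4], "little")
        return (raw >> 3) + MIN_FOUR_BYTES_INT, pos + 4
    if first != LONG_INT_TRAILER:
        raise ValueError(f"unexpected tag byte {first:#04x}")
    length, pos = read_int(data, pos + 1)
    return int(data[pos:pos + length].decode("ascii")), pos + length


if __name__ == "__main__":
    for v in (-10, 117, 118, 16283, 16284, 2 ** 29):
        encoded = write_int(v)
        print(v, len(encoded), read_int(encoded)[0] == v)
```

Because the tag lives in the low bits of the first byte, the reader can pick the width from a single byte before touching the rest of the buffer.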

JukkaL · 2025-10-30T15:47:08Z

mypyc/lib-rt/librt_internal.c

    return read_str_internal(data);
 }

+static inline char


Add comment explaining that this assumes that real_value is within allowed range (29 bits).

JukkaL · 2025-10-30T15:49:46Z

mypyc/lib-rt/librt_internal.c

+    _CHECK_READ(data, 3, CPY_INT_TAG)
+    // TODO: check if compilers emit optimal code for these two reads, and tweak if needed.
+    second = _READ(data, uint8_t)
+    two_more = _READ(data, uint16_t)


Since this code path will be quite rare, we could also read one byte at a time without any real performance impact. This would make this work the same on little and big endian systems.
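
As a rough illustration of the byte-at-a-time idea, the sketch below uses Python to stand in for the C reader. It assembles the four-byte value from individual bytes with shifts, so the result depends only on the order of bytes in the buffer and not on the host's native byte order. The helper name, the 3-bit tag shift, and the bias of 10000 are assumptions taken from the format description and constants quoted in this review.

```python
def read_u32_bytewise(data: bytes, pos: int) -> int:
    """Combine four consecutive bytes, least significant byte first."""
    return (
        data[pos]
        | (data[pos + 1] << 8)
        | (data[pos + 2] << 16)
        | (data[pos + 3] << 24)
    )


if __name__ == "__main__":
    # Encode 123456 in the assumed four-byte form: biased payload above the 011 tag.
    encoded = (((123456 + 10000) << 3) | 0b011).to_bytes(4, "little")
    raw = read_u32_bytewise(encoded, 0)
    print((raw >> 3) - 10000)  # prints 123456
```

In C the same shape would be four single-byte reads combined with shifts, which is what makes it independent of endianness.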

JukkaL · 2025-10-30T15:52:22Z

mypyc/lib-rt/librt_internal.c

+#define MAX_ONE_BYTE_INT 117  // 2 ** 7 - 1 - 10
+#define MIN_TWO_BYTES_INT -100
+#define MAX_TWO_BYTES_INT 16283  // 2 ** (8 + 6) - 1 - 100
+#define MIN_FOUR_BYTES_INT -10000


I think we could extend the negative range further here (e.g. make it symmetric with the positive range), since whether the upper bound is ~512M or ~256M probably won't make much difference in practice.
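
The trade-off can be checked with a little arithmetic. The sketch below assumes the four-byte format stores `value + bias` in 29 unsigned payload bits, which is inferred from the quoted constants rather than confirmed by the source; `biased_range` is a hypothetical helper.

```python
PAYLOAD_BITS = 29  # payload width of the four-byte format


def biased_range(bias: int, bits: int = PAYLOAD_BITS) -> tuple[int, int]:
    """Return (min, max) representable when value + bias is stored unsigned."""
    return -bias, 2 ** bits - 1 - bias


if __name__ == "__main__":
    print(biased_range(10_000))    # prints (-10000, 536860911)
    print(biased_range(2 ** 28))   # prints (-268435456, 268435455)
```

With a bias of 2 ** 28 the range becomes symmetric, and the upper bound drops from roughly 512M to roughly 256M.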

JukkaL · 2025-10-30T15:54:34Z

mypyc/test-data/run-classes.test

        b = Buffer(b.getvalue())
        assert read_int(b) == i
-    for i in (-12345, -12344, -11, 118, 12344, 12345):
+    for i in (-100, -11, 118, 12344, 16283):


Also test each power of two up to, say, 200, just in case (positive and negative)? Maybe test both 2**n (lower bits all zero) and 2**n - 1 (lower bits all ones).
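
One possible way to generate such inputs is sketched below. It reads "up to 200" as exponents 0 through 200, which is an interpretation, and `power_of_two_cases` is a hypothetical helper, not part of the test suite. The values it returns could be fed to the existing round-trip loop in place of the hand-written tuple.

```python
def power_of_two_cases(max_exponent: int = 200) -> list[int]:
    """Return 2**n and 2**n - 1 for n in 0..max_exponent, positive and negative."""
    values: set[int] = set()
    for n in range(max_exponent + 1):
        for v in (2 ** n, 2 ** n - 1):
            values.add(v)    # lower bits all zero, or all ones
            values.add(-v)   # mirrored negative case
    return sorted(values)


if __name__ == "__main__":
    cases = power_of_two_cases()
    print(len(cases), cases[0] == -(2 ** 200), cases[-1] == 2 ** 200)
```

Using a set removes the duplicates that arise for small exponents (for example, 1 is both 2 ** 0 and 2 ** 1 - 1).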

ilevkivskyi added 2 commits October 26, 2025 01:35

Optimize serialization format for 2 bytes ints

e0e945c

Remove extra whitespace in tests

d0d5291

ilevkivskyi requested a review from JukkaL October 26, 2025 00:46

ilevkivskyi mentioned this pull request Oct 30, 2025

Planned work related to fixed-format serialization #20072

Open

5 tasks

JukkaL reviewed Oct 30, 2025

2 participants