@ilevkivskyi
Member

This is important for serialized ASTs (think line numbers for every node). Also in this PR:

  • Reuse the same integer logic for str/bytes length (it is slightly less optimal, but code reuse is good).
  • Remove an unused field from the Buffer type.
  • Make the format the same on 32-bit and 64-bit platforms (we still assume a little-endian platform).

Collaborator

@JukkaL JukkaL left a comment

Looks good overall, just some minor comments. This will help AST serialization a lot, in particular.

if (likely(first != LONG_INT_TRAILER)) {
return _read_short_int(data, first);
}
// People who have literal ints not fitting in size_t should be punished :-)

This comment is out of date, since now anything not fitting in 29 bits triggers this path.

one byte: last bit 0, 7 bits used
two bytes: last two bits 01, 14 bits used
four bytes: last three bits 011, 29 bits used
everything else: 00000111 followed by serialized string representation

What about using ...01111 (four 1 bits) for the final option? This would leave ...0111 available for a possible 60-bit format in the future. (I don't have a strong opinion.)
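For readers following along, here is a minimal Python sketch of the tag layout quoted above. It is not the PR's C implementation. The bias constants mirror the `#define`s quoted later in this review; `MIN_ONE_BYTE_INT`, `MAX_FOUR_BYTES_INT`, the helper names `encode_int`/`decode_int`, and the length-prefixed decimal string used on the long path are assumptions made for illustration.

```python
# Illustrative sketch only: constants follow the #defines quoted in this
# review where available; everything marked "assumed" is a guess.
MIN_ONE_BYTE_INT, MAX_ONE_BYTE_INT = -10, 117        # lower bound assumed
MIN_TWO_BYTES_INT, MAX_TWO_BYTES_INT = -100, 16283
MIN_FOUR_BYTES_INT = -10000
MAX_FOUR_BYTES_INT = 2**29 - 1 - 10000               # assumed by analogy
LONG_INT_TRAILER = 0b00000111


def encode_int(value: int) -> bytes:
    """Encode value with a 1-, 2-, or 3-bit tag in the lowest bits."""
    if MIN_ONE_BYTE_INT <= value <= MAX_ONE_BYTE_INT:
        return ((value - MIN_ONE_BYTE_INT) << 1).to_bytes(1, "little")
    if MIN_TWO_BYTES_INT <= value <= MAX_TWO_BYTES_INT:
        return (((value - MIN_TWO_BYTES_INT) << 2) | 0b01).to_bytes(2, "little")
    if MIN_FOUR_BYTES_INT <= value <= MAX_FOUR_BYTES_INT:
        return (((value - MIN_FOUR_BYTES_INT) << 3) | 0b011).to_bytes(4, "little")
    # Long path (assumed layout): trailer byte, digit count, decimal digits.
    digits = str(value).encode("ascii")
    return bytes([LONG_INT_TRAILER]) + encode_int(len(digits)) + digits


def decode_int(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position after the encoded value)."""
    first = data[pos]
    if first & 0b1 == 0:
        return (first >> 1) + MIN_ONE_BYTE_INT, pos + 1
    if first & 0b11 == 0b01:
        raw = int.from_bytes(data[pos:pos + 2], "little")
        return (raw >> 2) + MIN_TWO_BYTES_INT, pos + 2
    if first & 0b111 == 0b011:
        raw = int.from_bytes(data[pos:pos + 4], "little")
        return (raw >> 3) + MIN_FOUR_BYTES_INT, pos + 4
    if first != LONG_INT_TRAILER:
        raise ValueError(f"unknown int tag: {first:#010b}")
    length, pos = decode_int(data, pos + 1)
    return int(data[pos:pos + length].decode("ascii")), pos + length
```

In this sketch the four-byte form checks only the low three bits, so switching the long-int marker to a four-bit tag would free the remaining three-bit pattern for a wider fixed-size form, which is the trade-off raised in the comment above.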

return read_str_internal(data);
}

static inline char

Add a comment explaining that this assumes real_value is within the allowed range (29 bits).

_CHECK_READ(data, 3, CPY_INT_TAG)
// TODO: check if compilers emit optimal code for these two reads, and tweak if needed.
second = _READ(data, uint8_t)
two_more = _READ(data, uint16_t)

Since this code path will be quite rare, we could also read one byte at a time without any real performance impact. This would make this work the same on little and big endian systems.
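As a rough illustration of the byte-at-a-time idea, the Python sketch below assembles a 32-bit little-endian value from individual bytes and contrasts it with a host-order read. The function names are hypothetical, and the real code is C using the `_READ` macro, so this only demonstrates the principle.

```python
import struct


def read_u32_bytewise(buf: bytes, pos: int = 0) -> int:
    """Assemble a little-endian 32-bit value one byte at a time.

    The shifts fix the byte order explicitly, so the result does not
    depend on the endianness of the machine running the code.
    """
    return (
        buf[pos]
        | (buf[pos + 1] << 8)
        | (buf[pos + 2] << 16)
        | (buf[pos + 3] << 24)
    )


def read_u32_native(buf: bytes, pos: int = 0) -> int:
    """Read 32 bits in host byte order ("=" means native order in struct).

    This is the analogue of casting a pointer and loading a wider type:
    the result differs between little- and big-endian hosts.
    """
    return struct.unpack_from("=I", buf, pos)[0]


data = bytes([0x13, 0x57, 0x9B, 0xDF])
value = read_u32_bytewise(data)
```

On a little-endian host both functions agree. On a big-endian host only the bytewise version would still return the value the writer intended.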

#define MAX_ONE_BYTE_INT 117 // 2 ** 7 - 1 - 10
#define MIN_TWO_BYTES_INT -100
#define MAX_TWO_BYTES_INT 16283 // 2 ** (8 + 6) - 1 - 100
#define MIN_FOUR_BYTES_INT -10000

I think we could extend the negative range further here (e.g. make it symmetric with the positive range), since whether the upper bound is ~512M or ~256M probably won't make much difference in practice.
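The arithmetic behind the ~512M versus ~256M figures can be checked with a short sketch. It assumes the four-byte form stores `value - lower_bound` in a 29-bit payload, and `four_byte_range` is a hypothetical helper, not something from the PR.

```python
PAYLOAD_BITS = 29  # bits left in four bytes after the 3-bit tag


def four_byte_range(bias: int) -> tuple[int, int]:
    """Return (min, max) representable when the stored payload is value + bias."""
    return -bias, (1 << PAYLOAD_BITS) - 1 - bias


current = four_byte_range(10000)               # bias from MIN_FOUR_BYTES_INT
symmetric = four_byte_range(1 << (PAYLOAD_BITS - 1))

print(current)    # (-10000, 536860911)
print(symmetric)  # (-268435456, 268435455)
```

Under that assumption, a symmetric split halves the positive range in exchange for keeping negative values down to about -256M in the compact form.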

b = Buffer(b.getvalue())
assert read_int(b) == i
for i in (-12345, -12344, -11, 118, 12344, 12345):
for i in (-100, -11, 118, 12344, 16283):

Also test each power of two with an exponent up to, say, 200, just in case (positive and negative)? Maybe test both 2**n (lower bits all zero) and 2**n - 1 (lower bits all ones).
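One way to generate the values suggested above is sketched below. `power_of_two_cases` and `check_round_trip` are hypothetical helpers, and the codec here is a plain decimal-string stand-in so the snippet runs on its own. In the real test the pair would be replaced by the library's write and read functions around a `Buffer`.

```python
from typing import Callable, Iterable


def power_of_two_cases(max_exponent: int = 200) -> list[int]:
    """Return 2**n and 2**n - 1, each with both signs, for n in 0..max_exponent."""
    cases = []
    for n in range(max_exponent + 1):
        for value in (2**n, 2**n - 1):  # low bits all zero / all ones
            cases.append(value)
            cases.append(-value)
    return cases


def check_round_trip(
    values: Iterable[int],
    encode: Callable[[int], bytes],
    decode: Callable[[bytes], int],
) -> None:
    """Assert that every value survives encode followed by decode."""
    for value in values:
        assert decode(encode(value)) == value, value


# Stand-in codec (assumption): swap in the real serializer under test.
def _encode(value: int) -> bytes:
    return str(value).encode("ascii")


def _decode(data: bytes) -> int:
    return int(data.decode("ascii"))


check_round_trip(power_of_two_cases(), _encode, _decode)
```

Values on both sides of every format boundary (7, 14, and 29 bits) appear in this list, along with values far beyond the compact range.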
