Optimize toByteString and toASCIIBytes #80
@phadej This PR is waiting for a review. |
phadej
left a comment
I don't believe that anything can be faster than toASCIIBytes uuid = BI.unsafeCreate 36 (pokeASCII uuid). It's literally just allocating exactly 36 bytes and poking stuff at the right place. Maybe there's something GHC doesn't see (missing bang somewhere)? But the current code is literally as little work as possible already.
I don't see a point in complicating it.
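For readers following along, the "allocate exactly 36 bytes and poke" approach described in this comment can be sketched roughly as below. This is not the library's actual `pokeASCII`; the helpers `hexDigit`, `pokeHex`, and `asciiUUID` are hypothetical names made up for illustration, and the input is assumed to be the four `Word32` values that `toWords` returns.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8, Word32)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeByteOff)

-- Hypothetical helper: lowercase ASCII hex digit for a nibble in 0..15.
hexDigit :: Word32 -> Word8
hexDigit n
  | n < 10    = 48 + fromIntegral n   -- '0'..'9'
  | otherwise = 87 + fromIntegral n   -- 'a'..'f' (97 - 10)

-- Hypothetical helper: write the lowest n hex digits of w,
-- most significant digit first, starting at byte offset off.
pokeHex :: Ptr Word8 -> Int -> Int -> Word32 -> IO ()
pokeHex p off n w = mapM_ step [0 .. n - 1]
  where
    step i =
      pokeByteOff p (off + i) (hexDigit ((w `shiftR` (4 * (n - 1 - i))) .&. 0xf))

-- Allocate exactly 36 bytes and fill the 8-4-4-4-12 layout in place.
asciiUUID :: (Word32, Word32, Word32, Word32) -> B.ByteString
asciiUUID (w0, w1, w2, w3) = BI.unsafeCreate 36 $ \p -> do
  pokeHex p 0 8 w0
  dash p 8
  pokeHex p 9 4 (w1 `shiftR` 16)    -- high 16 bits of w1
  dash p 13
  pokeHex p 14 4 w1                 -- low 16 bits of w1
  dash p 18
  pokeHex p 19 4 (w2 `shiftR` 16)   -- high 16 bits of w2
  dash p 23
  pokeHex p 24 4 w2                 -- low 16 bits of w2
  pokeHex p 28 8 w3
  where
    dash p off = pokeByteOff p off (45 :: Word8)  -- ASCII '-'
```

The sketch makes the trade-off in the discussion concrete: every byte offset is computed by hand, so nothing is wasted, but the buffer size and the offsets must be kept in sync manually.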
|
It is hard to tell what makes the high-level implementation faster. The Yet, the benchmarks show that a high-level implementation is faster. It might be that it unlocks some GHC optimizations or that bytestring has some other non-trivial optimizations. If we compare only the functions that do not have any extra overhead ( If There is also more room for speedup: |
|
If I had to guess, the reason the |
|
@clyring clarified that the module |
|
With the recent update this branch has @phadej Would you give this another look? This is a significant improvement and it reduces the amount of code to maintain in |
@phadej I appreciate that you'd like to keep the code uncomplicated, but benchmarking seems to indicate a solid improvement, so I don't understand the resistance to the improvement. |
|
@iand675 the original patch was more complicated. And my review comment made lykahb improve on it. Your comment is not fair. |
|
It's true that @lykahb made a further improvement, but I don't think it's unfair to state that he provided benchmarks prior to you saying that you didn't want to "complicate the code" that were a solid performance increase. |
where
(w0, w1, w2, w3) = toWords uuid
wordFixedPrim :: BBP.FixedPrim (Word32, (Word16, (Word16, (Word16, (Word16, Word32)))))
wordFixedPrim = BBP.word32HexFixed BBP.>*<
Note to self, the word<N>HexFixed calls into a C function in bytestring:
char* _hs_bytestring_uint_hex (unsigned int x, char* buf) {
// write hex representation in reverse order
char c, *ptr = buf, *next_free;
do {
*ptr++ = digits[x & 0xf];
x >>= 4;
} while ( x );
// invert written digits
next_free = ptr--;
while(buf < ptr) {
c = *ptr;
*ptr-- = *buf;
*buf++ = c;
}
return next_free;
};
Fascinating that the loop is faster (?) than the unrolled version. Maybe GCC does magic. For "random" UUIDs, there shouldn't be a win in short-circuiting the loop.
No, that's wrong, the Fixed variants just split the number into halves:
-- | Encode a 'Word8' using 2 nibbles (hexadecimal digits).
{-# INLINE word8HexFixed #-}
word8HexFixed :: FixedPrim Word8
word8HexFixed = fixedPrim 2 $ \x op -> do
enc <- encode8_as_16h lowerTable x
unalignedWriteU16 enc op
-- | Encode a 'Word16' using 4 nibbles.
{-# INLINE word16HexFixed #-}
word16HexFixed :: FixedPrim Word16
word16HexFixed =
(\x -> (fromIntegral $ x `shiftR` 8, fromIntegral x))
>$< pairF word8HexFixed word8HexFixed
-- | Encode a 'Word32' using 8 nibbles.
{-# INLINE word32HexFixed #-}
word32HexFixed :: FixedPrim Word32
word32HexFixed =
(\x -> (fromIntegral $ x `shiftR` 16, fromIntegral x))
>$< pairF word16HexFixed word16HexFixed
-- | Encode a 'Word64' using 16 nibbles.
{-# INLINE word64HexFixed #-}
word64HexFixed :: FixedPrim Word64
word64HexFixed =
(\x -> (fromIntegral $ x `shiftR` 32, fromIntegral x))
>$< pairF word32HexFixed word32HexFixed
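To illustrate how the fixed-width primitives quoted above compose, here is a minimal sketch that renders the 8-4-4-4-12 UUID layout with `>*<` and `>$<`. It is not the code from this PR; `withDash`, `uuidPrim`, and `toASCII` are hypothetical names, and the input is assumed to be the four `Word32` values from `toWords`. Only functions exported from `Data.ByteString.Builder.Prim` are used.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder.Prim as BBP
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (toLazyByteString)
import Data.ByteString.Builder.Prim ((>$<), (>*<))
import Data.Bits (shiftR)
import Data.Word (Word16, Word32)

-- Hypothetical helper: prefix a fixed-width encoding with a '-' byte.
withDash :: BBP.FixedPrim a -> BBP.FixedPrim a
withDash p = (\x -> ((), x)) >$< (dash >*< p)
  where
    dash = const 45 >$< BBP.word8   -- 45 is ASCII '-'

-- Every field has a fixed width, so the total size (36 bytes)
-- is known before anything is written.
uuidPrim :: BBP.FixedPrim (Word32, Word32, Word32, Word32)
uuidPrim = split >$<
      BBP.word32HexFixed              -- 8 digits
  >*< withDash BBP.word16HexFixed     -- '-' + 4 digits
  >*< withDash BBP.word16HexFixed     -- '-' + 4 digits
  >*< withDash BBP.word16HexFixed     -- '-' + 4 digits
  >*< withDash BBP.word16HexFixed     -- '-' + first 4 of the last 12 digits
  >*< BBP.word32HexFixed              -- remaining 8 digits
  where
    split (w0, w1, w2, w3) =
      (w0, (hi w1, (lo w1, (hi w2, (lo w2, w3)))))
    hi w = fromIntegral (w `shiftR` 16) :: Word16
    lo w = fromIntegral w :: Word16

-- Run the primitive through a Builder and force a strict ByteString.
toASCII :: (Word32, Word32, Word32, Word32) -> B.ByteString
toASCII = BL.toStrict . toLazyByteString . BBP.primFixed uuidPrim
```

Compared with poking bytes by hand, the offsets and total length here fall out of the composition, which is the "more high-level and safe" property the PR description refers to. Whether it is also faster depends on how well GHC inlines the combinators, which is what the benchmarks in this thread are about.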
If we don't understand why code works as it does, does it work? In my opinion it doesn't in the long term. I'm the one who is maintaining this, so I'm making the judgment calls. This PR has a lot of good in it, but it's not perfect, I'll take it from here. Note to self: benchmark on x86_64
|
For the sake of clarity, is your objection about the use |
Is there a trick to reducing these variance values? On an i9-13900K: |
This PR leverages the bytestring fixed-length builders to simplify and speed up the conversions. The re-implementation of toASCIIBytes is now more high-level and safe.
I bumped the bytestring dependency lower bound to the version that introduces Data.ByteString.Builder.Prim. It was released in 2012, so that is plenty of backwards compatibility.
Here are benchmarks on MacBook M1 Max:
Before
After