-
Notifications
You must be signed in to change notification settings - Fork 602
Description
Description
In a nutshell, pp_reverse reverses strings using a loop that in each iteration swaps a byte from the front of the char buffer with a byte from the back.
while (down > up) {
const char tmp = *up;
*up++ = *down;
*down-- = tmp;
}
That seems like a common and portable string reversal technique. However, Fast array reversal with SIMD! is a lovely little explainer that shows how string reversal can be done much more quickly by reversing and swapping whole registers full of bytes at a time. Maximum speedup (see the SSSE3, AVX, AVX2 sections) might involve using intrinsics for which CPU support is far from guaranteed. (I imagine that the implementation complexity and maintenance overhead would therefore be too great at this time.)
However, it looks like it should be possible to use a generic and portable bit-shifting implementation (see the bswap section), which should be faster than pp_reverse's current method AND decent compilers can recognize this implementation and replace it with faster CPU instructions (bswap, rev, movbe) where possible.
The C++ code from that post:
inline std::uint64_t Swap64(std::uint64_t x)
{
return (
((x & 0x00000000000000FF) << 56) |
((x & 0x000000000000FF00) << 40) |
((x & 0x0000000000FF0000) << 24) |
((x & 0x00000000FF000000) << 8) |
((x & 0x000000FF00000000) >> 8) |
((x & 0x0000FF0000000000) >> 24) |
((x & 0x00FF000000000000) >> 40) |
((x & 0xFF00000000000000) >> 56)
);
}
inline std::uint32_t Swap32(std::uint32_t x)
{
return(
((x & 0x000000FF) << 24) |
((x & 0x0000FF00) << 8) |
((x & 0x00FF0000) >> 8) |
((x & 0xFF000000) >> 24)
);
}
inline std::uint16_t Swap16(std::uint16_t x)
{
// This tends to emit a 16-bit `rol` or `ror` instruction
return (
((x & 0x00FF) << 8) |
((x & 0xFF00) >> 8)
);
}
Note: I've had this on my ideas queue for ages and am just not likely to get around to it any time soon. So it's totally waiting for someone enthusiastic to take a look.
Perl configuration
pp_reverse's 2-bytes-at-a-time implementation goes way back.