Update 2024/12/26: After posting on Reddit, I got a lot of good feedback. Someone pointed out "The Byte Order Fallacy", an article by Rob Pike arguing that the endianness of the system does not matter, only the endianness of the underlying file/stream does, and that endianness should be taken care of only when crossing the boundary between system and file. There are other good blogs on endianness, and in light of these opinions the implementation below is less than useful. However, I’ll leave it here in case someone picks up a trick or two from it.
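As a rough sketch of what that boundary-only approach looks like: the bytes are assembled with shifts, so only the byte order of the data appears in the code and the host's order never does. The function names below are made up for illustration and are not taken from the article.

```cpp
#include <cassert>
#include <cstdint>

// Decode a 16-bit value from two bytes of a stream. Only the byte order of
// the *data* is spelled out here; the host's byte order never appears.
constexpr std::uint16_t load_le16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

constexpr std::uint16_t load_be16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Encode a 16-bit value as two little-endian bytes.
constexpr void store_le16(unsigned char* p, std::uint16_t v) {
    p[0] = static_cast<unsigned char>(v & 0xFF);
    p[1] = static_cast<unsigned char>(v >> 8);
}

// Hypothetical check: the same two bytes, read under both interpretations.
inline void demo_stream_order() {
    const unsigned char wire[2] = {0x34, 0x12};
    assert(load_le16(wire) == 0x1234);
    assert(load_be16(wire) == 0x3412);

    unsigned char out[2] = {};
    store_le16(out, 0x1234);
    assert(out[0] == 0x34 && out[1] == 0x12);
}
```

Calling demo_stream_order() from main should pass on both little- and big-endian hosts, since nothing in it depends on how the host lays out integers in memory.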
Someone also pointed out how they despised the font sizing in my source code. This was a chroma bug, and it turned up only on mobile. Trying to reproduce it in Chrome dev tools with a smaller screen size didn’t work. I ended up connecting my iPhone to my computer and debugging it remotely via Safari dev tools, but still couldn’t find out why it occurs. Some others have had similar issues (1, 2), and after finding no chroma fix, I’ve switched to highlight.js.
Motivation
Section 1.4 of the OASIS virtio spec defines the endian-specific types le16, le32, le64 to be little-endian unsigned integers. Likewise for be16, be32 and be64. I was perusing this because I wanted to write my own userspace networking drivers which I could run on a virtual machine. Practically every machine you could choose to run a VM on today is little endian, and you would have to go out of your way to test that your code runs on big endian machines. Ergo, it would have been good enough to write using le16 = uint16_t and move on.
A challenge is a challenge though. And I was doing this for fun, so let’s have some fun.
Requirements
We want to create a templated type which takes the endianness and size of the integer as its parameters. The main requirements of this type are:
- zero-overhead: the compiler should be capable of optimizing out the class and its methods as needed.
- equivalence: any C code interfacing with it should not be able to tell the difference between our class and an integer. This (among other things) implies they both take up the same space (2 bytes for the 16-bit type).
- transparency: I should be able to do arithmetic on the type as if it is an integer in C++, without worrying about casting and other issues. If it is operated on with native types, its value should be converted into the native system’s endianness transparently.
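The equivalence requirement in particular can be pinned down at compile time. Here is a minimal sketch, using a hypothetical stand-in struct (wrapped16 and read_raw are names I made up for illustration) rather than the real class:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for the wrapper: one public integer member, nothing else.
struct wrapped16 {
    std::uint16_t value;
};

// equivalence: same size and alignment as the raw integer, and safe to memcpy.
static_assert(sizeof(wrapped16) == sizeof(std::uint16_t));
static_assert(alignof(wrapped16) == alignof(std::uint16_t));
static_assert(std::is_standard_layout_v<wrapped16>);
static_assert(std::is_trivially_copyable_v<wrapped16>);

// A C-style consumer that only ever sees raw bytes.
inline std::uint16_t read_raw(const void* p) {
    std::uint16_t out;
    std::memcpy(&out, p, sizeof out);
    return out;
}

inline void demo_equivalence() {
    wrapped16 w{0xBEEF};
    // The consumer cannot tell the wrapper apart from a plain uint16_t.
    assert(read_raw(&w) == 0xBEEF);
}
```

The same static_asserts could be pointed at the real type once it exists; if any of them fails, C code on the other side of the interface would be able to tell the difference.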
Implementation
Here’s the 16-bit implementation. The 32 and 64-bit implementations are very similar but have a longer byteswap, so I’ve omitted them for clarity.
#include <cstdint>
#include <type_traits>
#include <bit>
using namespace std;

// Primary template: only declared, each supported width gets a specialization.
template <endian E, typename T,
          typename Enable = std::enable_if_t<std::is_integral_v<T>>> class endianint;

static_assert(endian::native == endian::little or endian::native == endian::big,
              "Mixed endian machines are not supported");

template <endian E>
class endianint<E, uint16_t> {
    // Convert a native value to endianness E (a no-op when E is native).
    static constexpr uint16_t to_E(uint16_t val) {
        if constexpr ((E == endian::big and endian::native == endian::little) or
                      (E == endian::little and endian::native == endian::big)) {
            return ((val & 0xFF00) >> 8u) |
                   ((val & 0x00FF) << 8u);
        }
        return val;
    }
    static constexpr uint16_t from_E(uint16_t val) {
        return to_E(val); // a byteswap is its own inverse
    }
public:
    uint16_t value; // always stored in endianness E
    endianint() : value(0) {}
    endianint(uint16_t val) : value(to_E(val)) {}
    endianint& operator=(uint16_t val) {
        value = to_E(val);
        return *this;
    }
    operator uint16_t() const {
        return from_E(value);
    }
};
Some things immediately stand out:
- <bit> is a C++20 feature: this header provides the endian enum, and with it the native endianness checks. My guess is it would be possible to package this up and ship it with the header for non-C++20 platforms, but I haven’t looked into it yet.
- constexpr everywhere for zero-overhead: we want the compiler to do the heavy lifting, and these operations should take at most one instruction in the generated code.
- a single member for equivalence: storing the endianness or any other parameters would take up more space than the underlying type.
- = and casting overloaded for transparency: overloading more operators would not make a difference, as the endianness of the underlying system determines the endianness you can do arithmetic operations in. For example, if we do the following on a little endian system:

      using le16 = endianint<endian::little, uint16_t>;
      using be16 = endianint<endian::big, uint16_t>;
      le16 n1(1);
      be16 n2(2);
      be16 n3;
      n3 = n2 + n1; // equivalent to n3 = be16(uint16_t(n2) + uint16_t(n1))

  we would need to do two byteswaps regardless, as we can’t add big-endian numbers on a little endian machine. The sensible thing to do is to convert to the system endianness whenever we perform any arithmetic operations. If there are any tricks to do big-endian addition, i.e. shifting carries to the preceding byte rather than the following one, I am not aware of them. Maybe some combination of SSE operations could do that, but I claim it would be slower than doing the byteswap, the operation and then swapping back.
- manual byteswap?: C++23 introduced std::byteswap, which would use a system-specific implementation (on x86 it’s the bswap instruction). Most compilers also have intrinsics for this, with __builtin_bswap16 working on both clang and gcc. However, the easiest method is to let the compiler optimize the swap itself. Most compilers are smart enough to recognize that it’s a byteswap, and will turn it into a bswap by themselves without any coaxing.
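To make the three byteswap options concrete, here is a sketch that picks whichever is available at compile time. The names manual_swap16 and portable_swap16 are mine, and the selection logic is an assumption about how one might wire this up, not something the library does:

```cpp
#include <bit>       // std::byteswap, when the library provides it (C++23)
#include <cassert>
#include <cstdint>
#include <version>   // feature-test macros such as __cpp_lib_byteswap

// The hand-written swap from the 16-bit implementation.
constexpr std::uint16_t manual_swap16(std::uint16_t val) {
    return static_cast<std::uint16_t>(((val & 0xFF00) >> 8u) |
                                      ((val & 0x00FF) << 8u));
}

// Prefer the standard function, then the gcc/clang builtin, then the manual swap.
constexpr std::uint16_t portable_swap16(std::uint16_t val) {
#if defined(__cpp_lib_byteswap)
    return std::byteswap(val);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap16(val);
#else
    return manual_swap16(val);
#endif
}

static_assert(manual_swap16(0x1234) == 0x3412);

inline void demo_swaps_agree() {
    // Whichever branch was compiled in, the result must match the manual swap.
    for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
        const auto x = static_cast<std::uint16_t>(v);
        assert(portable_swap16(x) == manual_swap16(x));
    }
}
```

All three branches compute the same function, so which one to use mostly comes down to which compilers and standard versions need to be supported.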
Results
Let’s first test out transparency: in a little-endian system, code generated using le16 should be identical to code generated when using uint16_t.
And it is, even with -O1! Let’s see if we can add two little-endian points (an ordered pair of (x, y) le16’s) together.
And we get identical assembly yet again (I’ve left it on -O2 because the SIMD instructions are more succinct, but they also generate the same code on -O1).
Let’s switch it up a bit, by comparing big-endian addition with little-endian addition.
Note the rol instructions, which rotate si and di by 8, effectively flipping the endianness. For 32/64 bit integers, this is replaced with bswap.
For the other side of the coin, let’s look at a big-endian native system such as MIPS. I’ve used the 32-bit version here to avoid the extra shift-right instructions the compiler inserts for widening 16-bit integers, which keeps the assembly focused on the moving parts. I’m also adding a little-endian number to a big-endian number, and returning a big-endian number.
Unfortunately, the MIPS target used here has no single byteswap instruction to reach for, and you can see the compiler breaking the swap down shift by shift: $6 stores b, and it is shifted by 24, -24 and -8 and put into $7, $3 and $2 respectively. We then and the results with the masks, noting that the ff0000 mask is too large to fit into the 16-bit immediate field of a 32-bit instruction and has to be loaded into $2 instead. Once the byteswap is done, we finally add them together.
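For reference, here is a 32-bit swap written out as four separate shift-and-mask terms, so that each term can be matched against a shift/and pair in the compiler output. This is a sketch of the general technique; the actual 32-bit implementation is not shown in this post and may order its masks and shifts differently.

```cpp
#include <cassert>
#include <cstdint>

// Four-term 32-bit byteswap, one named step per byte that moves.
constexpr std::uint32_t swap32(std::uint32_t b) {
    const std::uint32_t hi_to_lo = b >> 24;                 // byte 3 -> byte 0
    const std::uint32_t lo_to_hi = b << 24;                 // byte 0 -> byte 3
    const std::uint32_t mid_down = (b >> 8) & 0x0000FF00u;  // byte 2 -> byte 1
    const std::uint32_t mid_up   = (b << 8) & 0x00FF0000u;  // byte 1 -> byte 2
    return hi_to_lo | lo_to_hi | mid_down | mid_up;
}

static_assert(swap32(0x11223344u) == 0x44332211u);

inline void demo_swap32() {
    // Swapping twice must give back the original value.
    assert(swap32(swap32(0x01020304u)) == 0x01020304u);
}
```

The two outer terms need no mask because the shift by 24 already discards every other byte, which is why only two masks show up.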
This is a tricky example, so feel free to play around with it in godbolt. MIPS calling conventions are not very well documented online, but from what I can make out, $4 holds the address the returned object is written to, $5 and $6 are the arguments, and the sw after the jr return sits in the branch delay slot (it still executes before the jump takes effect), if I recall correctly from my uni architecture classes.
Conclusion
With a working library that achieves my requirements in place, two questions still remain:
- Have I looked at other endian compatibility libraries, such as boost::endian?: Yes, and they do very similar things. boost::endian splits its implementation across three main files:
  - conversion.hpp: includes endian_load.hpp and endian_store.hpp. These implement the byteswaps for varying sizes of integers (more than I have supported)
  - buffers.hpp: represents unaligned bytes, i.e. buffers for moving around ints in the library
  - arithmetic.hpp: overloads the operators to perform endian-specific arithmetic across types

  This is probably a better fit if you don’t have access to a C++20 compiler, or don’t want to roll your own endian library. However, it has too many moving parts for my taste, and plugging it into my project would take away the challenge and fun of crafting it myself.
- What about supporting signed integer types? This is a good question. However, then I would also need to manage underlying integer representations. Just as little endian machines are ubiquitous, two’s complement machines are even more so. Yet, because we are accounting for the edge case which is big-endianness here, I would also need to handle more esoteric representations such as one’s complement/sign bits in order to consider my implementation complete. It also doesn’t make sense to implement them for my use case, which would mostly use this for unsigned counters/raw data in virtio.
The complete implementation, ready for including, can be found here, and is released under the MIT license. Feel free to point out errors, tests or other things I should analyse.