Byte Ordered Streams
— 2020-01-22
- bytes and endianness
- reading bytes from a stream
- readbytes / writebytes
- future possibilities
- alternatives
- prior art
- conclusion
While on vacation in Japan I dug into streaming parsing, for fun. It's
something that piqued my interest ever since I saw Mafintosh's
csv-parser
package back in 2014.
This led to the creation of the omnom
library
1, and as part of it I came up with an ergonomic way to parse and
write numbers with a specified endianness to and from streams. An example:
use std::fs::File;
use omnom::prelude::*;
let mut file = File::open("foo.txt")?;
file.write_le(56_i32)?; // Write an i32 as little-endian to a byte stream.
let n: u16 = file.read_be()?; // Read a u16 as big-endian from a byte stream.
let n: u32 = file.read_be()?; // Read a u32 as big-endian from a byte stream.
The name omnom
is a pun on the excellent
nom
package, which you should probably
check out if you're looking to write a maintainable parser.
Bytes and Endianness
"Endianness" is a word that you might hear occasionally, but if you've never written network drivers or serialization formats you might not have had to learn about before. Basically what it means is "the order in which bytes form numbers".
A byte is the smallest representation of data in a computer, and in Rust it's
referred to as a u8
. A u16
is two bytes long, which means it's two u8
s.
let b1 = [0u8]; // length 1 byte, size equal to u8
let b2 = [0u8, 0u8]; // length 2 bytes, size equal to u16
However in the second case, do we interpret the bytes from left-to-right, or right-to-left? Endianness refers to the order we read bytes in. Little-endian means we interpret bytes right-to-left. And big-endian means we interpret them left-to-right. Different protocols will interpret bytes in different orders, but it will usually say so in the protocol. The most important thing to be aware of is that there's a difference between the two.
In Rust's stdlib there exists "little-endian", "big-endian", and "native-endian". The latter refers to the system's default representation.
Reading Bytes from a stream
Say we wanted to read a u16
encoded as little-endian from a stream. We
would write something like this:
use std::io::prelude::*;
use std::fs::File;
use std::mem;
let mut f = File::open("foo.txt")?;
let mut buf = [0; mem::size_of::<u16>()];
f.read_exact(&mut buf)?;
let num = u16::from_le_bytes(buf);
This uses mem::size_of
to ensure we get the right size for the type we're trying to read. Then uses
Read::read_exact
to read bytes from the stream, and
u16::from_le_bytes
to convert it to a u16 2.
If you haven't seen the bytes methods before: they're fairly new, and have been implemented as part of Rust 1.32. In the tracking issue there's a fair bit of talk about operating on streams. Which is what we're trying to improve today with this post (:
To write a u16 as little-endian to a stream, we can do:
use std::io::prelude::*;
use std::fs::File;
let mut file = File::create("foo.txt")?;
file.write_all(&99_u16.to_le_bytes())?; // Write a u16 to a stream.
This approach uses
u16::to_le_bytes
and
Write::write_all
to write bytes to a stream.
As you can see writing bytes to a stream is somewhat straight forward; combining two API calls to create the buffer and write it. But expressively reading bytes is much more involved: a combination of 3 APIs that must be used in the right order to read from a stream.
Having to write this for each number read from a stream would feel repetitive, making it harder to read and write. But it's also not very discoverable for people who want to use these APIs. In order to figure it out you have to learn about several different APIs, and the interaction between them 3.
For the sake of this post I put together these examples. I know this is anecdotal evidence, but it took me about 10 minutes to find the right APIs and make something that compiles. This is after having written a library that makes extensive use of both. The end-result of the std APIs might look pretty decent when presented as-is. But we shouldn't underestimate the time it takes to find the APIs and make them work together.
ReadBytes / WriteBytes
The
Iterator::collect
function is one of the functions that best embodies Rust's values of
expressiveness, performance, and correctness. There's entire
articles
written about just that one function. And rightfully so; it's because of the
type system and the borrow checker that we can have a single API that can
just as easily convert an Iterator
to a HashMap
, as to a Result<Vec>
--
all without any performance penalty. That's quite impressive for a language
that has such a strong focus on correctness as Rust.
It turns out we can write a similar API for number encoding and decoding
number types as well. In omnom
we've done this by implementing two traits for
each number:
ReadBytes
and
WriteBytes
4. The
APIs look like this:
I'm not particularly attached to the naming of these methods. It's
mostly that alternatives that I came up with, such as ByteOrderedRead
,
seemed really long. If this ever makes it to an RFC, the naming of traits is
something might need to be reconsidered.
use std:io::{Read, Write};
pub trait ReadBytes: Sized {
fn read_be<R: Read>(reader: &mut R) -> Result<Self>;
fn read_le<R: Read>(reader: &mut R) -> Result<Self>;
fn read_ne<R: Read>(reader: &mut R) -> Result<Self>;
}
pub trait WriteBytes {
fn write_be<W: Write>(&self, writer: &mut W) -> Result<usize>;
fn write_le<W: Write>(&self, writer: &mut W) -> Result<usize>;
fn write_ne<W: Write>(&self, writer: &mut W) -> Result<usize>;
}
We implement this for each individual number type so that it knows how to
encode and decode itself from streams. We then add various endian-based
read/write methods to the Read
and Write
traits that can take any T: ReadBytes
or T: WriteBytes
and write it to themselves:
pub trait ReadExt: Read + Sized {
fn read_be<B: ReadBytes>(&mut self) -> Result<B> { ... }
fn read_le<B: ReadBytes>(&mut self) -> Result<B> { ... }
fn read_ne<B: ReadBytes>(&mut self) -> Result<B> { ... }
}
pub trait WriteExt: Write + Sized {
fn write_be<B: WriteBytes>(&mut self, num: B) -> Result<usize> { ... }
fn write_le<B: WriteBytes>(&mut self, num: B) -> Result<usize> { ... }
fn write_ne<B: WriteBytes>(&mut self, num: B) -> Result<usize> { ... }
}
impl<T: Read> ReadExt for T {}
impl<T: Write> WriteExt for T {}
When put together, this allow reading and writing bytes from any number type
since they all implement ReadBytes
and WriteBytes
. This removes all
boilerplate we saw earlier, and replaces it with a few straight forward APIs.
use std::fs::File;
use omnom::prelude::*;
let mut file = File::open("foo.txt")?;
file.write_le(56_i32)?; // write an i32 as little-endian to a stream.
let n: u16 = file.read_be()?; // read a u16 as big-endian from a stream.
Future possibilities
RFC
Making people use the omnom
library for such a small API addition doesn't
make much sense. I think this would be a fantastic addition to the standard
library, making something commonplace both easier to author, review, and
learn about. Perhaps this would make for a good RFC?
Floating point numbers
Since Rust 1.41 endianness conversions are now available for floats as well.
This isn't part of omnom
yet, but would be a 2-line change to add.
Multi-value reads and writes
Something else that would be a lot of fun to add would be multi-value reads and writes:
// Proposed
let head: [u32; 4] = file.read_be()?;
// Current
let b1: u32 = file.read_be()?;
let b2: u32 = file.read_be()?;
let b3: u32 = file.read_be()?;
let b4: u32 = file.read_be()?;
let head = [b1, b2, b3, b4];
Support for async streams
Another interesting avenue to explore would be to implement these methods for
AsyncRead
and
AsyncWrite
.
Reading bytes asynchronously from streams is very common, which would create
benefits comparable to their synchronous counterparts.
use async_std::fs::File;
use omnom::prelude::*;
let mut file = File::open("foo.txt").await?;
file.write_le(56_i32).await?; // write an i32 as little-endian to a byte stream.
let n: u16 = file.read_be().await?; // read a u16 as big-endian from a byte stream.
In async-std
we've implemented
Stream::collect
using FromStream
and IntoStream
traits. I recently wrote about the
challenges of these
traits,
and given how similar they'd be to the traits we're proposing in this post,
they'd likely face the same challenges. In short: we can probably implement
them today, but not as efficient as theoretically possible until we support
async fn
in traits.
Alternatives
Back in November I wrote a rather long survey of byte-ordered reads and
writes,
and discussed a few alternatives. The design implemented in omnom
ended up
being the third option we discussed. But it's probably worthwhile discussing
the other two options we considered.
Alternative 1: Inherent Methods on Read/Write
Instead of implementing a new trait for each number type, we could add
number-specific methods on the Read
and Write
traits. If we followed
std's naming conventions of prefixing with be_
, le_
, and ne_
we would
end up implementing 96 new methods on Read
and Write
(2 traits * 3
endianness kinds * 16 number types).
use std::io::prelude::*;
let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16_be()?);
assert_eq!(768, reader.read_u16_be()?);
That's a lot of methods that would be shown in the documentation of Read
and Write
, making it generally harder to browse and explore. But it would
also prevent us from implementing multi-value reads / writes later on (see
Future Possibilities).
Alternative 2: ByteOrder inspired
Another option would be to follow
byteorder's approach of adding
32 methods on the Read and Write traits, and using Endianness
types as type
parameters to choose the endianness.
use std::io::prelude::*;
use std::io::BigEndian;
let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16::<BigEndian>()?);
assert_eq!(768, reader.read_u16::<BigEndian>()?);
Compared to the first option that's still a lot of methods (16 per trait),
and would also prevent us from implementing multi-value read/writes later on.
Additionally we would need to introduce at least 3 additional Endianness
types that are used to switch between endianness, which means additional
imports in order to use the traits 5.
pub mod io {
pub mod endian {
pub struct BigEndian;
pub struct LittleEndian;
pub struct NativeEndian;
}
}
Unfortunately Rust's current type system does not support constraining generic paramaters by enum members. This means that "endianness" needs to have 3 different structs instead of a single enum containing different kinds of endianness. When added to the stdlib it might make sense to create a submodule for this so that it still feels like we'd be contraining by an enum's member, even if we aren't actually doing that. Because familiarity and all.
Prior Art
byteorder
The byteorder
crate uses a
variation of the second alternative. byteorder
was one of the first crates
to provide convenience APIs for byte order aware operations, so the naming
predates the additions to the stdlib.
As of Rust 1.32 numeric types have started to provide built-in methods such
as to_le
and from_le
that partially cover the same functionality as
byteorder
. The purpose of this post is to share additional designs that
could replace the remaining uses of byteorder
.
tokio-byteorder
The
tokio-byteorder
crate is a port of byteorder
for tokio
's own
AsyncRead
and AsyncWrite
traits. The design mostly matches byteorder
,
with the exception that it operates on asynchronous byte streams instead of
synchronous ones.
use std::io::Cursor;
use tokio_byteorder::{BigEndian, AsyncReadBytesExt};
let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().await?);
assert_eq!(768, rdr.read_u16::<BigEndian>().await?);
tokio
Two months ago tokio
added support for
reading
and
writing
numbers as big-endian only:
use tokio::io::{self, AsyncReadExt};
use std::io::Cursor;
let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16().await?);
assert_eq!(768, reader.read_u16().await?);
This currently includes 20 new methods total for AsyncRead
and
AsyncWrite
. A design to add
support for little-endian writes has been proposed, which closely follows the
first alternative we
outlined
(e.g. write_u16_le
). If accepted it would introduce an additional 40
top-level methods for AsyncRead
and AsyncWrite
. At the time of writing
tokio
equivalents of stdlib features such as support for floating point
numbers or platform-native endianness do not seem to have been proposed.
nom
nom
is a parser-combinator library that operates on
byte buffers instead of streams. As such it isn't quite the same as the
others in this list. However it does have support for streaming operations,
and provides several
methods to read
bytes with a given endianness:
use nom::number::streaming::be_u16;
let parser = |s| {
be_u16::<(_, ErrorKind)>(s)
};
assert_eq!(parser(b"\x00\x01abcd"), Ok((&b"abcd"[..], 0x0001)));
assert_eq!(parser(b"\x01"), Err(Err::Incomplete(Needed::Size(2))));
This approach is different from any API we've discussed so far, though the naming scheme somewhat matches the first alternative.
Conclusion
In this post we've covered what endianness is, why streaming parsing is
useful, and showed how byte-ordered aware parsing can be performed today.
We've then gone on to introduce a novel approach for streaming byte-ordered
aware parsing, inspired by the stdlib's collect
function, and shown how to
use a reference implementation. We then covered alternative API designs we
and prior art.
I originally wanted to publish this post in early December after I came back from Japan. But the holidays happened, so I figured I'd make the effort now that we're approaching the end of the first month of the new year.
Overall I'm pretty excited about this API. It's very close to: "What if collect worked for numbers?"; being able to read and write numbers using 3 new methods on each trait. I think that's something that would be really nice.
let head: [u32; 4] = file.read_be()?;
file.write_le(56_i32)?;
Thanks to Joshua Gould (Technetos) and Florian Gilcher (Skade) for proof-reading this post.