Encoding Techniques in Verified Serializers
Written by Ethan Shea, edited by James Wilcox.
Serialization converts in-memory data to an external representation, typically a list or stream of bytes, which is then ready to be stored on disk or sent over the network.
This post describes Cheerios, a verified library for serialization in Coq. Cheerios uses typeclasses to make it easy to create new serializers by composing existing serializers, such that the correctness proofs also compose. We first give an overview of the core definitions of Cheerios and show how to build simple serializers for booleans, natural numbers, and pairs. Then, we describe two generic strategies for serializing recursive “container-like” types, such as lists and trees, and discuss the tradeoffs in proof effort between the strategies. A recurring theme is the challenge of expressing decoders via structural recursion.
This post is generated from a literate Coq file, which we encourage you to step through.
Defining Serialization
In order to define serialization, three things are needed, types for the serialization and deserialization functions, and a correctness specification. The correctness spec should roughly show that serialization and deserialization are inverses.This enables the proof that any object can be serialized then deserialized into the same object. We’ll start with serialization because it conceptually comes first in the process.
In order to serialize something, all of it’s information must be mapped into bits. It makes sense then to define a serializer for some type A
as A -> list bool
. Take the following type for example, representing olympic medals:
Inductive medal := Gold | Silver | Bronze.
A serialization function should map each case to a symbol of bits. There are many ways this could be done, each with different trade offs that will be explored later. For now, we just pick one.
Definition medal_serialize (m: medal) : list bool :=
match m with
| Gold => [true; true]
| Silver => [true; false]
| Bronze => [false]
end.
As it turns out, this first attempt at a type will be exactly what is needed.
Now a type for the deserializer can be determined. We want something that acts as an
inverse to the serialization function we picked. At first thought, list bool -> A
seems
like a good option. This would allow the correctness spec to be deserialize (serialize a) = a
.
However, this runs into problems pretty quickly.
Fail Definition medal_deserialize (bools: list bool) : medal :=
match bools with
| [true; true] => Gold
| [true; false] => Silver
| [false] => Bronze
end.
Coq catches the mistake and points out that the bools
is not exhaustively matched on. What if
it’s empty? Fundamentally, this problem is encountered because not every sequence of booleans
decodes into a medal
. Even non-empty sequences such as [false; true]
pose issues. Since
these sequences are not produced by the serializer, they can be considered erroneous.
In cheerios we handle this case by returning the option
constructor None
to indicate an
error.
This makes the spec become deserialize (serialize a) = Some a
. In English: deserialization
on any serialized stream always succeeds and returns the correct value.
Definition medal_deserialize1 (bools: list bool):option medal :=
match bools with
| [true; true] => Some Gold
| [true; false] => Some Silver
| [false] => Some Bronze
| _ => None
end.
This works for a single medal being encoded in the bitstream, but there are problems when the work from above is reused to a type which requires composition, like a pair of medals. Serialization works just fine, but deserialization is problematic.
Definition medal_serialize_pair (medals: medal * medal) :=
medal_serialize (fst medals) ++ medal_serialize (snd medals).
Fail Definition medal_deserialize_pair (bools: list bool)
: option (medal * medal) :=
(medal_deserialize1 bools, medal_deserialize1 hmmm).
When deserializing the first medal, the entire list is consumed. There is nothing to pass into the second call to medal_deserialize1
because it is not known how much of the list has been deserialized. The definition of deserialize needs a way to communicate how much of the stream is remaining back to the caller. In Cheerios, this is represented with the type medal * list bool
where
the deserialized medal and remaining portion of the stream are returned. This is wrapped in an option to allow the entire
deserialization operation to fail. Failure happens at this level because once an error is encountered, it is
impossible in general to resume serialization of the remaining content.
Definition medal_deserialize (bools: list bool)
: option (medal * list bool) :=
match bools with
| true :: true :: bools => Some (Gold, bools)
| true :: false :: bools => Some (Silver, bools)
| false :: bools => Some (Bronze, bools)
| _ => None
end.
As we will see shortly, this type is sufficient to support both composition and malformed inputs. Let’s take a moment to generalize these definitions before continuing so we can arrive at a definition for the spec.
Definition serializer (A: Type) := A -> list bool.
Definition deserializer (A: Type) :=
list bool -> option (A * list bool).
How does this alter the correctness specification? We can start by taking what we had last time and making it typecheck:
deser (ser a) = Some (a, [])
However this still doesn’t address the problem with the remaining bools. How do we reason about deserialization with any other input following? Another attempt leads us to something like this:
deser (ser a ++ ser b) = Some (a, ser b)
This works, but now exactly two objects must be encoded in the stream. We can’t easily reason about deserializing multiple objects or a single object this way. Generalizing again for what comes after gives:
deser (ser a ++ bools) = Some (a, bools)
Now the dependence on a second object is removed and as a side effect the spec becomes stronger, allowing any data to follow rather than just data produced by some serializer.
Note that the spec only needs to worry about encodings which the serializer produces. This eliminates our need to reason about the error cases that were necessary in the deserializer. However, in doing this, nothing is said about how malformed bitstrings are parsed, or that every deserialized value can be generated by exactly one bit string. These may be useful properties to prove, but cheerios does not handle deserialization from unknown and unverified sources since this minimum spec is enough.
Definition ser_deser_spec A
(ser : serializer A)
(deser : deserializer A) :=
forall (a : A) (bools: list bool),
(deser (ser a ++ bools)) = Some (a, bools).
Wrapping this up in a class gives us the following definition which includes the following three things: a serializer, a deserializer, and a proof of correctness.
Class Serializer (A : Type) : Type := {
serialize : A -> list bool;
deserialize : list bool -> option (A * list bool);
ser_deser_identity : ser_deser_spec A serialize deserialize
}.
In general, the correctness proofs tend to be straightforward and repetitive, but this first one is included here to show the structure. Concretely this becomes:
Theorem medal_ser_deser_identity :
ser_deser_spec medal medal_serialize medal_deserialize.
Proof.
unfold ser_deser_spec.
unfold medal_deserialize.
unfold medal_serialize.
intros m.
destruct m; reflexivity.
Qed.
Instance MedalSerializer : Serializer medal.
Proof.
exact {| serialize := medal_serialize;
deserialize := medal_deserialize;
ser_deser_identity := medal_ser_deser_identity;
|}.
Defined.
Generalizing this pair deserailizer for arbitrary types A
and B
comes
naturally now that there are better type signatures for serialization
and deserialization. Wrapping all three components in a section avoids some
boilerplate. Note that the type system requires a serializer for A
and B
in order
for the A * B
serializer to function.
Section PairSerializer.
Variable A : Type.
Variable B : Type.
Variable serA : Serializer A.
Variable serB : Serializer B.
Definition pair_serialize (p : A * B) : list bool :=
serialize (fst p) ++ serialize (snd p).
Definition pair_deserialize bools
: option ((A * B) * list bool) :=
match deserialize bools with
| Some (a, bools) =>
match deserialize bools with
| Some (b, bools) => Some ((a, b), bools)
| None => None
end
| None => None
end.
Theorem pair_ser_deser_identity :
ser_deser_spec (A * B) pair_serialize pair_deserialize.
Proof.
unfold ser_deser_spec.
intros.
unfold pair_serialize.
rewrite app_ass.
unfold pair_deserialize.
rewrite ser_deser_identity, ser_deser_identity.
rewrite <- surjective_pairing.
reflexivity.
Qed.
Instance PairSerializer : Serializer (A * B).
Proof.
exact {| serialize := pair_serialize;
deserialize := pair_deserialize;
ser_deser_identity := pair_ser_deser_identity;
|}.
Defined.
End PairSerializer.
Note that the variable bools
is shadowed several times in this definition. Normally this can complicate
code, but in this case it improves clarity because bools
always refers to “what’s left to parse”.
Now, we will build a simple (inefficient1) serializer/deserializer for a more useful datatype, nat
s.
The encoding is essentially the unary representation of the natural number.
Fixpoint nat_serialize (n : nat) : list bool :=
match n with
| O => [false]
| S n => [true] ++ (nat_serialize n)
end.
Fixpoint nat_deserialize bools : option (nat * list bool) :=
match bools with
| true :: bools =>
match nat_deserialize bools with
| None => None
| Some (n, bools) => Some (S n, bools)
end
| false :: bools => Some (O, bools)
| [] => None (* Deserializing an empty stream *)
end.
Theorem nat_ser_deser_identity :
ser_deser_spec nat nat_serialize nat_deserialize.
Proof.
unfold ser_deser_spec.
intros n; induction n; intros.
- simpl. reflexivity.
- simpl.
rewrite IHn.
reflexivity.
Qed.
Instance NatSerializer : Serializer nat.
Proof.
exact {| serialize := nat_serialize;
deserialize := nat_deserialize;
ser_deser_identity := nat_ser_deser_identity;
|}.
Defined.
Notice that the information about when to stop deserialization of each element must be encoded
into the stream itself. For example with the following definition of nat_serialize
, deserialization
of nat * nat
would become problematic.
Fixpoint nat_serialize_broken (n : nat) : list bool :=
match n with
| O => []
| S n => [true] ++ (nat_serialize n)
end.
Under this definition, it’s unclear what deserializing [true, true true]
as a pair of nat
s should
return. It could be (0,3)
, (1,2)
, (2,1)
or (3,0)
. To remove this ambiguity, the information about when to stop must be
encoded in the stream itself in one form or another rather than implicitly by using the end of the stream as a token.
Consider the serialized pair of nat
s [true, false, true, true, false]
, serialized using the not-broken serializer.
It is unambiguously (1, 2)
. When deserializing it is known precisely when each nat
finishes (when false
is reached), and when the pair finishes (when the second nat
finishes).
This information about the structure of the encoded
data plays a crucial part in showing ser_deser_identity
.
List Serialization
When serializing lists (or any variable sized collection) there must be some information about the structure in the serialized stream. Imagine this is not done, and a pair of lists is serialized into the byte stream. This would produce an encoding which looks like the figure below. It’s impossible to tell where one list stops and the next begins just by looking at the stream.
This serializer is broken for the same reason as the broken nat
serializer, the information in a serialized
object must be entirely contained within the bitstream. Note that we don’t run into this problem with any
collection of fixed size, like a pair or vector. It is clear when to stop deserializing a Vec 5
because 5
elements have been deserialized. In this case, the information about the shape of the data in this case is encoded in the
type. Since the type is known to the serializer and the deserializer, it does not need to be encoded
in the bitstream.
Let’s start with solving this problem by including a “continue” bit before every element. If it is true an element follows, and if it is false, the end of the list has been reached. This appears as follows:
Let’s see what this looks like in code.
Fixpoint list_serialize_inter (l : list A) : list bool :=
match l with
| [] => [false]
| h :: t => [true] ++ serialize h ++ list_serialize_inter t
end.
With this scheme, deserialization again proves to be difficult. In the definition below, because bools_after_elem
is not a syntactic subterm of bools
, the termination checker refuses to accept this definition. The fact that bools_after_elem
is returned from a function hides the subterm property from the typechecker. When executed, the
definition does terminate, since bools_after_elem
is a strict suffix of bools
,
but the type system does not see this. An attempted definition is given below:
Fail Fixpoint list_deserialize_inter
(bools: list bool) : option (list A * list bool) :=
match bools with
| [] => None
| false :: bools => Some ([], bools)
| true :: bools =>
match deserialize bools with
| None => None
| Some (a, bools_after_elem) =>
match list_deserialize_em bools_after_elem with
| None => None
| Some (tail, bools_after_list) =>
Some (a :: tail, bools_after_list)
end
end
end.
It is intuitively impossible to define this deserialization function without using general recursion. To solve this recursion problem, the same information encoded in the continuation bits can be moved to the front of the list’s encoding in the form of a size. Then the rest of the deserializer can recurse on the number of elements remaining.
Programmatically,
Fixpoint list_serialize_elts (l : list A) : list bool :=
match l with
| [] => []
| h :: t => serialize h ++ list_serialize_elts t
end.
Definition list_serialize (l : list A) : list bool :=
nat_serialize (length l) ++ list_serialize_elts l.
Fixpoint list_deserialize_elts (size : nat) (bools : list bool)
: option (list A * list bool) :=
match size with
| O => Some ([], bools)
| S size =>
match deserialize bools with
| None => None
| Some (n, bools) =>
match list_deserialize_elts size bools with
| None => None
| Some (tail, bools) => Some (n :: tail, bools)
end
end
end.
Definition list_deserialize bools :=
match deserialize bools with
| None => None
| Some (size, bools) => list_deserialize_elts size bools
end.
This gives a definition which can be defined using only structural recursion, just
by moving the information around. It’s worth noting that because the size information
is grouped together instead of spread apart, it would be much easier to make the encoding
format more efficient by swapping in a more efficient nat
serializer. The only property
lost with this encoding is that it is now impossible to reason about any tail of the
list in isolation, the concept of a size must also be considered.
Binary Trees
To continue exploring this idea of serializing shape, we need to look at a more complicated data structure such as a binary tree. Our definition of a binary tree is straightforward:
Inductive tree: Type :=
| leaf : tree
| node : A -> tree -> tree -> tree.
Just as with lists, there are two general approaches to serializing trees: interleaved and up-front.
Interleaved Tree Serializer
For the interleaved shape tree serializer, the concept of a “path” is needed. A path is simply the list of
directions taken from the root to reach some node. We’ll use true
to represent left and false
to represent right. These directions are stored with the head at the top of the tree.
Below is the path [true, false]
.
Using the concept of a path, the position and data of any node can be serialized. When this is done for all nodes in the tree, all information captured by the original data structure has been encoded.2
Even though an interleaved structure is impossible to deserialize without general recursion, using an interleaved structure is still possible if there is just enough information up front to recurse on. The number of nodes in the tree provides a nice metric. Our serializer will not be truely interleaved since we require this header, but information about the shape will still be interleaved in the encoding.
The encoding using an interleaved structure looks like this:
Serialization is performed as follows:
Fixpoint tree_size (t : tree A) : nat :=
match t with
| leaf => 0
| node _ l r => 1 + tree_size l + tree_size r
end.
Fixpoint tree_serialize_subtree_inter
(t: tree A) (path: list bool) :=
match t with
| leaf => []
| node a l r => serialize path ++ serialize a
++ tree_serialize_subtree_inter l (path ++ [true])
++ tree_serialize_subtree_inter r (path ++ [false])
end.
Definition tree_serialize_inter (t: tree A) : list bool :=
nat_serialize (tree_size t) ++
tree_serialize_subtree_inter t [].
Deserialization is more complicated. As elements are parsed, they are inserted into the tree structure parsed already. The insertion function used is not particularly robust, however during deserialization as long as any given node is preceded by all of its parents no issues arise. This is the case with a preorder traversal, and other traversals like BFS, so it meets our needs.
Fixpoint tree_insert (into t: tree A)(path: list bool): tree A :=
match into with
| leaf => t
| node a l r =>
match path with
| [] => t (* not supported *)
| true :: path => node a (tree_insert l t path) r
| false :: path => node a l (tree_insert r t path)
end
end.
Fixpoint tree_deserialize_inter_impl
(remaining : nat) (root : tree A) (bools : list bool)
: option (tree A * list bool) :=
match remaining with
| S n =>
match deserialize bools with
| None => None
| Some (path, bools) =>
match deserialize bools with
| None => None
| Some (a, bools) =>
tree_deserialize_inter_impl
n
(tree_insert root (node a leaf leaf) path)
bools
end
end
| O => Some (root, bools)
end.
Definition tree_deserialize_inter bools :=
match nat_deserialize bools with
| Some (size, bools) =>
tree_deserialize_inter_impl size leaf bools
| None => None
end.
Because of this concept of a path, which is a global address of any particular node, reasoning about a tree becomes much more difficult. In particular, we must now prove that every insertion is made on a leaf of the tree so it does not overwrite data or fall off the end.
Fixpoint leaf_insertable (into: tree A)(path: list bool): Prop :=
match into with
| leaf =>
(* Only if the path and tree run out at the same time
should we be able to insert *)
match path with
| [] => True
| _ => False
end
| node a l r =>
match path with
| [] => False
| true :: path => (leaf_insertable l path)
| false :: path => (leaf_insertable r path)
end
end.
The proof for this serializer is quite large (about 150 lines) and uninteresting, so it has been omitted. It can be found here.
Up-front Tree Serializer
Alternatively, the structure may be recorded at the beginning and then filled in as the tree is parsed. To do this, a tree’s shape can be reasoned about as the type tree unit
, and it’s elements as the type list A
.
This technique requires serialization and deserialization to be a two step process, which has the advantage of better mapping to the information stored in the tree (shape and element data), but the disadvantage of being more complicated.
The shape is encoded similarly to HTML with three symbols:
[true; true]
: The beginning of anode
[true; false]
: The end of anode
[false]
: A leaf node
Each node
requires exactly two subtrees between its start and end marker. Storing the shape as tree unit
works because unit
contains no information, so tree unit
only contains the information that
the tree
portion of tree A
describes, which is the shape. Since the shape is recorded in a preorder
traversal, the elements are also encoded in the same order, which makes it easy to marry the two together.
A visual representation of this encoding:
And in code:
Fixpoint tree_serialize_shape (t : tree A) : list bool :=
match t with
| leaf => [false]
| node _ l r => [true; true] ++ tree_serialize_shape l ++
tree_serialize_shape r ++ [true; false]
end.
Fixpoint tree_serialize_data_preorder (t : tree A) : list bool :=
match t with
| leaf => [] (* No data contained within leaf nodes *)
| node a l r => serialize a ++
tree_serialize_data_preorder l ++
tree_serialize_data_preorder r
end.
Definition tree_serialize_front (t: tree A) : list bool :=
tree_serialize_shape t ++ tree_serialize_data_preorder t.
Fixpoint tree_deserialize_shape
(bools: list bool) (progress: list (list (tree unit)))
: option (tree unit * list bool) :=
match bools with
| false :: bools =>
match progress with
| [] => Some (leaf, bools)
| level :: progress =>
tree_deserialize_shape
bools
((leaf :: level) :: progress)
end
| true :: true :: bools =>
tree_deserialize_shape bools ([] :: progress)
| true :: false :: bools =>
match progress with
| [] => None (* end without a beginning *)
| level :: [] =>
match level with
| [r; l] => Some (node tt l r, bools)
| _ => None
end
| level :: parent :: progress =>
match level with
| [r; l] =>
tree_deserialize_shape
bools
((node tt l r :: parent) :: progress)
| _ => None
end
end
| _ => None
end.
Fixpoint tree_deserialize_front_elts
(shape : tree unit) (bools : list bool)
: option (tree A * list bool) :=
match shape with
| leaf => Some (leaf, bools)
| node _ l r =>
match deserialize bools with
| None => None
| Some (a, bools) =>
match tree_deserialize_front_elts l bools with
| None => None
| Some (l, bools) =>
match tree_deserialize_front_elts r bools with
| None => None
| Some (r, bools) => Some (node a l r, bools)
end
end
end
end.
Definition tree_deserialize_front (bools : list bool)
: option (tree A * list bool) :=
match tree_deserialize_shape bools [] with
| None => None
| Some (shape, bools) =>
tree_deserialize_front_elts shape bools
end.
Because of the more recursive nature of the encoding, reasoning is significantly easier. We can consider any portion of the shape in isolation from all others because there are no ties to any global state.
Again, the proof for this serializer is large (about 70 lines) and uninteresting, so it has been omitted. It can be found here.
Conclusion
It’s worth noting that possible encodings for a given type are restricted by information dependencies within that type. Imagine a list is encoded as follows:
Since the size of the list is at the end, rather than at the beginning, information about how to deserialize the structure isn’t known until its too late. Similarly, the size can’t be put anywhere in the middle (say after the first element), because of the possibility of an empty list. Before deserializing each element, it must be known that it actually is an element of the list, and not some other data coming after the list.
This is why the interleaved list serializer is able to work. Right before each element is deserialized, we mark that the list continues with the continue bit.
This is also why the tree serializers are able to encode the shape at the front or the end. In both cases, the size is known so deserializing additional elements is justified. The question of how to arrange these elements can be reasoned about independently of the elements themselves, therefore the shape of the tree can be encoded without regard to where the element data is located.
One might expect to be able to speculatively parse elements of the bitstream and stop when an invalid element is reached. But this requires that we don’t accidentally interpret whatever came after in the bit stream as an element. If the encoding of different types are guaranteed to not overlap, then this would be possible. But in our model, serializers can choose arbitrary encodings, so this is not possible.
Beyond practical necessity, serialization can be used as a forcing function to understand the information contained within data structures. By requiring a well defined format, the information contained in that structure may be deduced and formalized. For example, a list needs to have a length, and a tree needs to have a shape. From there, the encoding of this information is flexible, although some encodings are easier to work with than others.
-
A linked list of booleans is not computationally efficient, and could be replaced with another more sensible structure such as a stream of bytes. ↩
-
It’s worth noting that this representation could be made more efficient by recording locations relative to the previous node instead of absolute ones. However, this fact does not significantly change how hard it is to reason about the tree. Recording relative locations would allow us to reason about subtrees instead of parts of some tree, but we still must reason about insertions. ↩