Protocol Buffers (aka Protobuf) is an efficient and fast way to serialise data into a binary format. It is much more compact than Java serialisation or any text-based format (JSON, XML, CSV, …).
Protobuf is schema-based: it needs a description (in a .proto file) of the data structures to be serialised/deserialised.
On the JVM, protoc (the Protobuf compiler) reads the .proto description files and generates corresponding classes.
For Scala there is a very good sbt plugin, ScalaPB, that follows the same process and generates case classes corresponding to the .proto file definitions.
The .proto files are an easy way to describe a protocol between 2 components (e.g. services). However, there are some cases (e.g. writing to persistent storage) where the .proto definitions are simply unnecessary and add superfluous complexity. (Who likes to read auto-generated code?)
In such cases it would be much easier to serialise an object directly into protobuf (using its class definition as a schema). After all, this is what the protobuf Java binding does: it serialises (auto-generated) Java classes into the protobuf binary format.
To that end, let me introduce PBDirect, a Scala library that encodes Scala objects directly into protobuf.
Usage
Its usage is simple and straightforward. By importing pbdirect._ you add an implicit method .toPB to your case classes and another method .pbTo[Type] on byte arrays.
    import cats.instances.list._
    import cats.instances.option._
    import pbdirect._

    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phones: List[String]
    )

    val person = Person("John", 12, Some("john@doe.com"), Nil)

    // serialise to Protobuf
    val bytes = person.toPB

    // deserialise from Protobuf
    val john = bytes.pbTo[Person]
Principle
PBDirect provides 2 type-classes:
- PBWriter to serialise to protobuf
- PBReader to deserialise from protobuf
As the case class itself is used for the schema definition, it must obey some basic rules:
- a plain field (neither Option nor List) maps to a required field in protobuf
- an Option field maps to an optional field in protobuf
- a List field maps to a repeated field in protobuf

An example would probably make things clearer:
    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phones: List[String]
    )

This class corresponds to the following protobuf definition:

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
      repeated string phones = 4;
    }
As you can see the protobuf field numbering corresponds to the declaration order of the fields in the scala case class.
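To make the role of the field number concrete, here is a minimal sketch of how the first two fields of a Person end up on the wire. It hand-rolls the standard protobuf encoding (a key of `(field number << 3) | wire type`, followed by the value) and does not use PBDirect at all; the object and helper names (`WireSketch`, `varint`, `key`, `intField`, `stringField`) are made up for this illustration, and negative integers are deliberately left out.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

// Illustration only: hypothetical helpers, not part of PBDirect.
object WireSketch {
  // Encode a non-negative Int as a base-128 varint (7 bits per byte, low bits first).
  def varint(value: Int): Array[Byte] = {
    require(value >= 0, "this sketch only handles non-negative values")
    var v = value
    val out = ArrayBuffer.empty[Byte]
    while (v >= 0x80) {
      out += ((v & 0x7f) | 0x80).toByte // set the continuation bit
      v >>>= 7
    }
    out += v.toByte
    out.toArray
  }

  // Every field starts with a key: (field number << 3) | wire type.
  def key(fieldNumber: Int, wireType: Int): Array[Byte] =
    varint((fieldNumber << 3) | wireType)

  // Wire type 0: varint (used for int32, among others).
  def intField(fieldNumber: Int, value: Int): Array[Byte] =
    key(fieldNumber, 0) ++ varint(value)

  // Wire type 2: length-delimited (used for string, among others).
  def stringField(fieldNumber: Int, value: String): Array[Byte] = {
    val utf8 = value.getBytes(StandardCharsets.UTF_8)
    key(fieldNumber, 2) ++ varint(utf8.length) ++ utf8
  }

  // name is declared first (field 1), id second (field 2).
  val encoded: Array[Byte] = stringField(1, "John") ++ intField(2, 12)
}
```

The field names never appear in the output, only the numbers do. This is why reordering the fields of a case class changes the schema, whereas renaming them does not.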
You can also nest messages inside one another. For example, we can replace a Person's phone number (which is just a String) with a PhoneNumber type that holds the number itself plus the number type (e.g. Home, Work, Mobile, …):

    case class PhoneNumber(
      number: String,
      numberType: Option[String]
    )

    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phones: List[PhoneNumber]
    )

which corresponds to

    message PhoneNumber {
      required string number = 1;
      optional string numberType = 2;
    }

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
      repeated PhoneNumber phones = 4;
    }
Enum
In this example it would be more convenient to replace the numberType with an enum type (i.e. a list of possible values). We need something equivalent to this protobuf definition:

    enum PhoneType {
      Mobile = 0;
      Home = 1;
      Work = 2;
    }

    message PhoneNumber {
      required string number = 1;
      optional PhoneType numberType = 2;
    }

The obvious thing that comes to mind is probably to use a Scala Enumeration. However, in order to work with Shapeless's Generic, we need to declare our Enumeration as a case object.

    case object PhoneType extends Enumeration {
      val Mobile, Home, Work = Value
    }
A popular alternative approach is to use a sealed trait with a bunch of case objects.
    sealed trait PhoneType
    object PhoneType {
      case object Mobile extends PhoneType
      case object Home extends PhoneType
      case object Work extends PhoneType
    }
There is still one problem here: Shapeless doesn't provide us with the enum values in the correct order. In fact, the order is not deterministic.
We can use an implicit Ordering to sort the values in the correct order. For example, we can sort them alphabetically, but that's not always what we want. Imagine an enum with the days of the week: you probably don't want Friday to come first and Wednesday last.
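The days-of-the-week problem is easy to reproduce with plain strings. The snippet below is only an illustration of alphabetical sorting (the `DayOrdering` object is a made-up name and has nothing to do with PBDirect):

```scala
// Illustration only: alphabetical order is not calendar order.
object DayOrdering {
  val days: List[String] =
    List("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

  // Default String ordering sorts lexicographically.
  val alphabetical: List[String] = days.sorted
}
```

Here `DayOrdering.alphabetical` starts with Friday and ends with Wednesday, so an enum numbered this way would assign 0 to Friday rather than to Monday.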
How can we solve this? Well, we need to give a number to every member of the enum. That's why PBDirect has a Pos trait. The Pos trait can be used to indicate the position of an object within the enum.
    sealed trait PhoneType extends Pos
    object PhoneType {
      case object Mobile extends PhoneType with Pos._0
      case object Home extends PhoneType with Pos._1
      case object Work extends PhoneType with Pos._2
    }
Pos defines an implicit ordering so that the enum values appear at their specified position, and this definition is now equivalent to the protobuf definition we wanted.
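The idea of ordering by an explicit position can be sketched without PBDirect. The version below is a simplified stand-in and not how Pos is implemented: it stores the position as a plain constructor argument instead of mixing in markers such as Pos._0, and the `PositionSketch` object and `byPosition` ordering are hypothetical names.

```scala
// Illustration only: a simplified stand-in for position-based ordering.
object PositionSketch {
  // Each value carries its own position explicitly.
  sealed abstract class PhoneType(val position: Int)
  case object Mobile extends PhoneType(0)
  case object Home extends PhoneType(1)
  case object Work extends PhoneType(2)

  // Order the values by their declared position rather than by name.
  implicit val byPosition: Ordering[PhoneType] =
    Ordering.by[PhoneType, Int](_.position)

  // Whatever order the values arrive in, sorting restores the declared one.
  val unordered: List[PhoneType] = List(Work, Mobile, Home)
  val ordered: List[PhoneType] = unordered.sorted
}
```

Sorting yields Mobile, Home, Work, which matches the numbering in the protobuf enum, whereas an alphabetical sort would have put Home first.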
Limitation
PBDirect doesn't support default values for optional fields. When deserialising an object from protobuf, if an optional field is missing it will be set to None in the deserialised object (even if the case class declares a default value). This is simply because the Shapeless Generic representation that PBDirect relies on does not carry the default value (it's not part of the field's type).
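One possible workaround, sketched here under the assumption that you are free to keep a separate wire-facing class, is to apply the default yourself after deserialisation. All names below (`SettingsMessage`, `Settings`, `DefaultRetries`, `fromMessage`) are hypothetical and only serve the illustration.

```scala
// Illustration only: applying a default after deserialisation.
object DefaultsSketch {
  // Wire-facing shape: the optional field is modelled as an Option.
  final case class SettingsMessage(name: String, retries: Option[Int])

  // Domain shape: the default has already been applied.
  final case class Settings(name: String, retries: Int)

  val DefaultRetries: Int = 3

  // A missing optional field falls back to the default value.
  def fromMessage(message: SettingsMessage): Settings =
    Settings(message.name, message.retries.getOrElse(DefaultRetries))
}
```

The result of bytes.pbTo[SettingsMessage] would then be passed through fromMessage before the rest of the code sees it.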
PBDirect doesn't support oneof encoding. In protobuf it is defined as

    message Person {
      required string name = 1;
      required int32 id = 2;
      oneof contact {
        string email = 3;
        string phone = 4;
      }
    }

In this example a Person can have either an email or a phone number but not both. However, it is encoded as:

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
      optional string phone = 4;
    }

which corresponds to the following Scala case class

    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phone: Option[String]
    )
It’s only the generated code that ensures the oneof requirement. It is not encoded into the serialised message itself.
Using Shapeless it would have been possible to define something like this
    sealed trait Contact
    case class Email(address: String) extends Contact
    case class Phone(number: String) extends Contact

    case class Person(
      name: String,
      id: Int,
      contact: Contact
    )

which would correspond to

    message Email {
      required string address = 1;
    }

    message Phone {
      required string number = 1;
    }

    message Person {
      required string name = 1;
      required int32 id = 2;
      oneof contact {
        Email email = 3;
        Phone phone = 3;
      }
    }

As you can see, we have a problem because both fields map to position 3.
The best way to deal with oneof is to use optional fields and add validation after deserialisation.

    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phone: Option[String]
    )

    val bytes: Array[Byte] = ??? // get the protobuf encoded data

    val person = bytes.pbTo[Person] match {
      case p@Person(_, _, Some(_), None) => p // correct, email is defined
      case p@Person(_, _, None, Some(_)) => p // correct, phone is defined
      case _ => throw new Exception("Invalid oneof email or phone")
    }
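If you would rather not throw, the same check can return an Either. This is a sketch of one possible variant and not something PBDirect provides: `OneofValidation` and `validateContact` are hypothetical names, and the Person class is repeated so that the snippet stands on its own.

```scala
// Illustration only: the oneof check expressed as a pure function.
object OneofValidation {
  final case class Person(
    name: String,
    id: Int,
    email: Option[String],
    phone: Option[String]
  )

  // Right when exactly one of email/phone is set, Left with a reason otherwise.
  def validateContact(person: Person): Either[String, Person] =
    (person.email, person.phone) match {
      case (Some(_), None) | (None, Some(_)) => Right(person)
      case (Some(_), Some(_)) => Left("Invalid oneof: both email and phone are set")
      case (None, None) => Left("Invalid oneof: neither email nor phone is set")
    }
}
```

It would be applied to the result of bytes.pbTo[Person], leaving the caller to decide how to handle an invalid message.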
Performance
Under the hood, the serialisation/deserialisation is performed using the protobuf-java library, so you can expect the same serialisation performance as with ScalaPB or protoc-generated code.
Deserialisation might be slightly slower, as the byte array is traversed several times (once for each field to deserialise).
Conclusion
Once again Shapeless has proven useful to simplify the serialisation of case classes, this time into the protobuf binary format. It makes it possible to design a simple-to-use library while staying close to the performance of the original code.
If you go and give it a try, don't hesitate to leave some feedback in the comments below. You can also check the source code and the usage and setup instructions on GitHub.