PBDirect – Protobuf without the .proto files


Protocol Buffers (aka Protobuf) is an efficient and fast way to serialise data into a binary format. It is much more compact than Java serialisation or text-based formats (JSON, XML, CSV, …).

Protobuf is schema-based: it needs a description (in a .proto file) of the data structures to be serialised/deserialised.

On the JVM, protoc (the Protobuf compiler) reads the .proto description files and generates corresponding classes.
For Scala there is a very good sbt plugin, ScalaPB, that follows the same process and generates case classes corresponding to the definitions in the .proto files.

The .proto files are an easy way to describe a protocol between two components (e.g. services). However, there are some cases (e.g. writing to persistent storage) where the .proto definitions are simply unnecessary and add superfluous complexity. (Who likes to read auto-generated code?)

In such cases it would be much easier to serialise an object directly into protobuf, using its class definition as the schema. After all, this is what the protobuf Java binding does: it serialises (auto-generated) Java classes into the protobuf binary format.

To that end, let me introduce PBDirect, a Scala library that encodes Scala objects directly into protobuf.

Usage

Its usage is simple and straightforward. Importing pbdirect._ adds an implicit method .toPB to your case classes and another method .pbTo[Type] to byte arrays.

import cats.instances.list._
import cats.instances.option._
import pbdirect._

case class Person(
  name: String,
  id: Int,
  email: Option[String],
  phones: List[String]
)

val person = Person("John", 12, Some("john@doe.com"), Nil)
// serialise to Protobuf
val bytes  = person.toPB
// deserialise from Protobuf
val john = bytes.pbTo[Person]

Principle

PBDirect provides two type classes:

  • PBWriter to serialise to protobuf
  • PBReader to deserialise from protobuf

As the case class itself is used as the schema definition, it must obey some basic rules:

  • A regular field maps to a required field in protobuf
  • An Option field maps to an optional field in protobuf
  • A List field maps to a repeated field in protobuf
  • The field declaration order determines the field numbers in protobuf

An example would probably make things clearer:

    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phones: List[String]
    )
    

    This class corresponds to the following protobuf definition:

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
      repeated string phones = 4;
    }
    

    As you can see the protobuf field numbering corresponds to the declaration order of the fields in the scala case class.
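
To make the numbering rule concrete, here is a minimal sketch that hand-encodes a case class into the protobuf wire format using only the Scala standard library. It is not how PBDirect is implemented (PBDirect derives its writers with Shapeless and delegates to protobuf-java); the names WireSketch, writeVarint, writeTag, writeField and encode are hypothetical, and only Int, String, Option and List fields are handled. The point is simply that the position of a field in the case class becomes the field number stored in its tag.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

final case class Person(
  name: String,
  id: Int,
  email: Option[String],
  phones: List[String]
)

// Hypothetical helper, for illustration only: not part of PBDirect.
object WireSketch {
  // Base-128 varint: 7 payload bits per byte, high bit set while more bytes follow.
  def writeVarint(out: ByteArrayOutputStream, value: Long): Unit = {
    var v = value
    while ((v & ~0x7FL) != 0L) {
      out.write(((v & 0x7FL) | 0x80L).toInt)
      v = v >>> 7
    }
    out.write(v.toInt)
  }

  // A tag packs the field number and the wire type (0 = varint, 2 = length-delimited).
  def writeTag(out: ByteArrayOutputStream, fieldNumber: Int, wireType: Int): Unit =
    writeVarint(out, (fieldNumber.toLong << 3) | wireType.toLong)

  def writeField(out: ByteArrayOutputStream, fieldNumber: Int, value: Any): Unit =
    value match {
      case i: Int =>
        writeTag(out, fieldNumber, 0)
        writeVarint(out, i.toLong) // negative values are sign-extended to 64 bits
      case s: String =>
        val bytes = s.getBytes(StandardCharsets.UTF_8)
        writeTag(out, fieldNumber, 2)
        writeVarint(out, bytes.length.toLong)
        out.write(bytes, 0, bytes.length)
      case o: Option[_] => o.foreach(writeField(out, fieldNumber, _)) // None writes nothing
      case l: List[_]   => l.foreach(writeField(out, fieldNumber, _)) // one entry per element
      case other =>
        throw new IllegalArgumentException(s"Unsupported field type: $other")
    }

  // Field numbers start at 1 and follow the declaration order of the case class.
  def encode(p: Product): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    p.productIterator.zipWithIndex.foreach { case (value, index) =>
      writeField(out, index + 1, value)
    }
    out.toByteArray
  }
}
```

Encoding Person("John", 12, Some("john@doe.com"), Nil) with this sketch starts with the tag byte 0x0A (field 1, length-delimited), continues with 0x10 (field 2, varint) and 0x1A (field 3, length-delimited), and writes nothing at all for the empty phones list.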

    You can also nest messages inside one another. For example, we can replace a Person's phone number (which is just a String) with a PhoneNumber type that holds the number itself plus the number type (e.g. Home, Work, Mobile, …):

    case class PhoneNumber(
      number: String,
      numberType: Option[String]
    )
    
    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phones: List[PhoneNumber]
    )
    

    which corresponds to

    message PhoneNumber {
      required string number = 1;
      optional string numberType = 2;
    }
    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
      repeated PhoneNumber phones = 4;
    }
    

    Enum

    In this example it would be more convenient to replace the numberType with an enum type (i.e. a list of possible values). We need something equivalent to this protobuf definition:

    enum PhoneType {
      Mobile = 0;
      Home = 1;
      Work = 2;
    }
    message PhoneNumber {
      required string number = 1;
      optional PhoneType numberType = 2;
    }
    

    The obvious thing that comes to mind is probably to use a Scala Enumeration. However, in order to work with Shapeless's Generic, we need to declare our Enumeration as a case object.

    case object PhoneType extends Enumeration {
      val Mobile, Home, Work = Value
    }
    

    A popular alternative approach is to use a sealed trait with a bunch of case objects.

    sealed trait PhoneType
    object PhoneType {
      case object Mobile extends PhoneType
      case object Home   extends PhoneType
      case object Work   extends PhoneType
    }
    

    There is still one problem here: Shapeless doesn't provide us with the enum values in the correct order. In fact, the order is not deterministic at all.

    We can use an implicit Ordering to sort the values in the correct order. For example, we can sort them alphabetically, but that's not always what we want. Imagine an enum with the days of the week: you probably don't want Friday to come first and Wednesday last.

    How can we solve this? Well, we need to give a number to every member of the enum. That's why PBDirect has a Pos trait. The Pos trait can be used to indicate the position of an object within the enum.

    sealed trait PhoneType extends Pos
    object PhoneType {
      case object Mobile extends PhoneType with Pos._0
      case object Home   extends PhoneType with Pos._1
      case object Work   extends PhoneType with Pos._2
    }
    

    Pos defines an implicit ordering so that the enum values appear at their specified positions, and this definition is now equivalent to the protobuf definition we wanted.
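
The idea behind Pos can be illustrated without Shapeless. The sketch below is a simplified stand-in, not PBDirect's actual Pos: the trait Positioned, its pos member and the Day enum are hypothetical names, and the position is a plain Int rather than something encoded in the type. It shows why an explicit position is preferable to alphabetical sorting when the values arrive in an arbitrary order.

```scala
// Simplified stand-in for PBDirect's Pos (hypothetical names, plain Int position).
trait Positioned { def pos: Int }

object Positioned {
  // Found through the implicit scope of any subtype of Positioned.
  implicit def ordering[A <: Positioned]: Ordering[A] = Ordering.by[A, Int](_.pos)
}

sealed trait Day extends Positioned
object Day {
  case object Monday    extends Day { val pos = 0 }
  case object Wednesday extends Day { val pos = 2 }
  case object Friday    extends Day { val pos = 4 }

  // Deliberately shuffled, standing in for the unspecified order of derived values.
  val unordered: List[Day] = List(Wednesday, Friday, Monday)

  // Sorting by name puts Friday first and Wednesday last.
  def alphabetical: List[Day] = unordered.sortBy(_.toString)

  // Sorting with the implicit position-based Ordering restores the intended order.
  def byPosition: List[Day] = unordered.sorted
}
```

Here Day.alphabetical yields Friday, Monday, Wednesday, while Day.byPosition yields Monday, Wednesday, Friday regardless of the order in which the values were collected.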

    Limitations

    PBDirect doesn't support default values for optional fields. When deserialising an object from protobuf, a missing optional field is set to None in the deserialised object, even if the case class declares a default value for it. This is simply because there is no way to grab that default value using Shapeless (it's not part of the field's type).

    PBDirect doesn't support oneof encoding either. In protobuf a oneof is declared as:

    message Person {
      required string name = 1;
      required int32 id = 2;
      oneof contact {
        string email = 3;
        string phone = 4;
      }
    }
    

    In this example a Person can have either an email or a phone number, but not both. However, on the wire it is encoded exactly like:

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
      optional string phone = 4;
    }
    

    which corresponds to the following scala case class

    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phone: Option[String]
    )
    

    It’s only the generated code that ensures the oneof requirement. It is not encoded into the serialised message itself.

    Using Shapeless it would have been possible to define something like this

    sealed trait Contact
    case class Email(address: String) extends Contact
    case class Phone(number: String) extends Contact
    case class Person(
      name: String,
      id: Int,
      contact: Contact
    )
    

    which would correspond to

    message Email {
      required string address = 1; 
    }
    message Phone {
      required string number = 1;
    }
    message Person {
      required string name = 1;
      required int32 id = 2;
      oneof contact {
        Email email = 3;
        Phone phone = 3;
      }
    }
    

    As you can see, we have a problem because both fields map to position 3.
    The best way to deal with oneof is to use optional fields and add validation after deserialisation.

    case class Person(
      name: String,
      id: Int,
      email: Option[String],
      phone: Option[String]
    )
    
    val bytes: Array[Byte] = ??? // get the protobuf encoded data
    val person = bytes.pbTo[Person] match {
       case p@Person(_, _, Some(_), None) => p // correct, email is defined
       case p@Person(_, _, None, Some(_)) => p // correct, phone is defined
       case _ => throw new Exception("Invalid oneof email or phone")
    }
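
If throwing is undesirable, the same check can return an Either and, on success, rebuild the Contact hierarchy sketched earlier. This is only an illustration: PersonRecord (the flat shape read from protobuf), OneofValidation and toPerson are hypothetical names, and the bytes.pbTo call is left out so that the example runs without PBDirect.

```scala
sealed trait Contact
final case class Email(address: String) extends Contact
final case class Phone(number: String)  extends Contact

// Flat shape matching the protobuf message (two optional fields).
final case class PersonRecord(
  name: String,
  id: Int,
  email: Option[String],
  phone: Option[String]
)

// Domain shape enforcing "exactly one contact" in the type.
final case class Person(name: String, id: Int, contact: Contact)

// Hypothetical helper, for illustration only: not part of PBDirect.
object OneofValidation {
  def toPerson(record: PersonRecord): Either[String, Person] =
    (record.email, record.phone) match {
      case (Some(address), None) => Right(Person(record.name, record.id, Email(address)))
      case (None, Some(number))  => Right(Person(record.name, record.id, Phone(number)))
      case (Some(_), Some(_))    => Left("Invalid oneof: both email and phone are set")
      case (None, None)          => Left("Invalid oneof: neither email nor phone is set")
    }
}
```

The caller then decides how to handle a Left, which keeps the decoding step free of exceptions and moves the oneof guarantee into the Person type itself.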
    

    Performance

    Under the hood, serialisation and deserialisation are performed using the protobuf-java library, so you can expect serialisation performance comparable to ScalaPB or protoc-generated code.

    Deserialisation might be slightly slower, as the byte array is traversed several times (once for each field to deserialise).
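
To see where that extra cost could come from, here is a sketch of a reader that looks up one field at a time by scanning the whole buffer. It is an assumption about the general approach, not PBDirect's code: FieldScan, readVarint and stringsAt are hypothetical names, and only varint and length-delimited fields are handled. Reading n fields this way walks the buffer n times, whereas generated code typically dispatches on each tag in a single pass.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only: not part of PBDirect.
object FieldScan {
  // Reads a varint starting at `pos`; returns the value and the next position.
  def readVarint(bytes: Array[Byte], pos: Int): (Long, Int) = {
    var result = 0L
    var shift  = 0
    var i      = pos
    var more   = true
    while (more) {
      val b = bytes(i) & 0xFF
      result |= (b & 0x7F).toLong << shift
      shift += 7
      i += 1
      more = (b & 0x80) != 0
    }
    (result, i)
  }

  // Walks the entire buffer and collects the string values stored under `fieldNumber`.
  def stringsAt(bytes: Array[Byte], fieldNumber: Int): List[String] = {
    val found = List.newBuilder[String]
    var pos = 0
    while (pos < bytes.length) {
      val (tag, afterTag) = readVarint(bytes, pos)
      val number   = (tag >>> 3).toInt
      val wireType = (tag & 7L).toInt
      wireType match {
        case 0 => // varint: decode it only to skip over it
          val (_, next) = readVarint(bytes, afterTag)
          pos = next
        case 2 => // length-delimited: keep the payload if the field number matches
          val (length, start) = readVarint(bytes, afterTag)
          if (number == fieldNumber)
            found += new String(bytes, start, length.toInt, StandardCharsets.UTF_8)
          pos = start + length.toInt
        case other =>
          throw new IllegalArgumentException(s"Unsupported wire type: $other")
      }
    }
    found.result()
  }
}
```

Each call to stringsAt visits every tag in the message, so decoding a four-field case class with this strategy means four full scans of the same bytes.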

    Conclusion

    Once again Shapeless has proven useful to simplify the serialisation of case classes, this time into the protobuf binary format. It makes it possible to design a simple-to-use library while staying close to the performance of the original code.

    If you give it a try, don't hesitate to leave some feedback in the comments below. You can also check the source code and the usage and setup instructions on GitHub.