Protobuf and Go : Handling Large Data Sets

Google Protobuf development guide notes the following —

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

This blog walks through implementation of one such strategy. Basic familiarity with Go and Protobuf is assumed

The Message

Lets define a simple message, Book as following —

message Book {    string title = 1;    string author = 2;    string isbn = 3;    string overview =4;
}

Goal : We would like to marshal arbitrarily large number of Book messages in Protobuf encoding to a file, and later read(unmarshal) them successfully.

For the purpose of this blog, I downloaded a list of books from https://www.usabledatabases.com/database/books-isbn-covers/sample/#table_book in csv format. Its also available with the full source code of this blog

Each record in this CSV has many details of the Book including title, author, ISBN, publisher, pages, publication date etc. We chose 4 of those in our message definition (shown above)

Marshalling

The key idea is that we encode the length of the each marshalled message and prepend it to the marshalled message. While reading we can use the same information to unmarshal the message. The process is fairly simple —

  • Marshal the message and get the length of marshalled bytes
  • Encode the length in some form that results in fixed byte size. We chose to encode the length using Golang Binary ByteOrder
  • Append the marshalled slice to the encoded length slice and write it to the output stream (in this case a file)

So each message looks like below

Sequence of bytes with length and message encoded together

Here is the source code of the function that does it —

Line 28–32 show the process described earlier.

Using this method arbitrary large payloads can be marshalled without any restriction.

UnMarshalling

UnMarshalling follows the same principle.

  • Read first 4 bytes to find the length of encoded Message
  • Read the next “len” bytes and UnMarshal

The code

  • Reads 4 bytes and converts to uint32, the length of subsequent message
  • Allocates a buffer of that length, reads the file and then marshal’s it to Book struct.

Can we optimize it?

The code above works but it has couple of performance issues

  1. It needs to make two reads of the file, first to get the length and then to get the message. The number of reads is 2X number of messages. This incurs I/O cost
  2. Since each message has a different size, every read allocates and de-allocates a buffer which is also a performance (and memory) overhead

With can address both of these with a strategy — decouple the I/O from reading of the message —

The code above does the following

  • Reads a fixed size buffer (var rbuf of size 1KB) from the file
  • Keeps them in a bytes.Buffer, used to read variable size bytes on demand.
  • The code reads 4 bytes first to get length of the message and then the number of len bytes from the Buffer to unmarshal the message. If sufficient number of bytes are not present in the Buffer , then it reads from file again to fill the buffer.
  • Rinse and repeat until whole file is read

The full source code of this blog can be found at : https://github.com/monmohan/protobuf-examples

Software Factotum