matttproud.com (blog)

Retiring pbutil (github.com/matttproud/golang_protobuf_extensions)

Zürich, Schweiz

Today, I am announcing that the Go package pbutil (import paths rooted under github.com/matttproud/golang_protobuf_extensions) is on the road to retirement.

History of the Project

It was born of necessity. The year was 2012–13. I smelled of döner kebab and Pilsner.1 Julius and I were furiously hacking away at the pile of code that would become the Prometheus that everyone came to know.2

Prometheus needed a serialization/deserialization technology that was efficient, cross-platform, and battle-tested for its Client Data Exposition Format. At that time, JSON was too slow and inefficient for interchange, particularly because the prevailing implementations were naive. Moreover, streaming with JSON proved rather unwieldy — especially across the programming languages we wanted to target with our initial client libraries. Early Prometheus prototypes trialed JSON for both configuration and client–server data exchange, which bore out the aforementioned observations. We needed something more — and something with lighter dependencies.

That’s where Protocol Buffers came in. They were a great bedrock. With the right discipline in the data model design, one could evolve the message schema over time while preserving backward and forward compatibility. Knowing that telemetry would be used in the wild with clients and servers running drifting versions (sometimes many versions apart), such seamless schema and version management would be key. Prometheus being an early product, we needed both performance and forward-evolution capability. So the choice of Protocol Buffers was really driven by requirements.3

During my first stint of working at Google, I was introduced to RecordIO4, a sequential data serialization format that was often used with logs and other batch data. Google’s version was relatively robust, with record compression and record integrity checks, if I recall correctly. RecordIO would have been an excellent solution for piecewise data versus requiring metrics payloads to be decoded in full. But — alas — there was no general solution like RecordIO to be found in the public ecosystem. I needed a mechanism to stream partial data so that it could be acted on incrementally, without all of the data having to be available to the reader.

Consider this simplified message schema:

// telemetry.proto: a skeleton of a metric transmission data model.

message Sample {
  optional string name  = 1;
  optional double value = 2;
}

message Samples {
  repeated Sample samples = 1;
}

The server requests Samples from the client over HTTP. At the time, there was no gRPC, and Thrift and similar RPC protocols were extremely half-baked. So the ability to stream client exposition data back to the server was limited. For this reason, we wanted to avoid having the server receive Samples in full before it could start processing; instead, it should receive individual Sample values and begin work as soon as the first Sample arrived.

One can make a poor man’s version of RecordIO by using length-delimited Protocol Buffer message streams, and that’s where this package came in. The C++ and Java Protocol Buffer packages provided a convenience API for making these delimited streams through the facilities MessageLite#writeDelimitedTo and Message.Builder#mergeDelimitedFrom. Occasionally developers used these out of laziness.

I decided to be lazy and practical and attempt to do the same in Go. But — whoops — at the time, the Go implementation of Protocol Buffers had no such API.

It took no more than half an afternoon to prototype an initial implementation. I spent a good amount of time studying the Java implementation and trying to match constants and other library functions from that version with those available in Go.
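For flavor, here is a minimal sketch of that framing trick, written against today’s google.golang.org/protobuf/proto rather than the 2013-era library: each message is prefixed with its serialized length as a varint, which is all a reader needs to slice records out of a stream one at a time. The real pbutil functions are shaped similarly but differ in details, so treat this as illustrative only.

// A minimal sketch of length-delimited Protocol Buffer framing.
package delimited

import (
	"bufio"
	"encoding/binary"
	"io"

	"google.golang.org/protobuf/proto"
)

// writeDelimited writes m to w, prefixed with its serialized length as a
// varint, mirroring the C++/Java writeDelimitedTo behavior.
func writeDelimited(w io.Writer, m proto.Message) (int, error) {
	body, err := proto.Marshal(m)
	if err != nil {
		return 0, err
	}
	var prefix [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(prefix[:], uint64(len(body)))
	written, err := w.Write(prefix[:n])
	if err != nil {
		return written, err
	}
	more, err := w.Write(body)
	return written + more, err
}

// readDelimited reads one varint-prefixed message from r into m and reports
// how many payload bytes were consumed.
func readDelimited(r *bufio.Reader, m proto.Message) (int, error) {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return 0, err
	}
	body := make([]byte, size)
	n, err := io.ReadFull(r, body)
	if err != nil {
		return n, err
	}
	return n, proto.Unmarshal(body, m)
}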

I initially proposed having this capability added to the official Go version of Protocol Buffers as a language-idiomatic analogue of the C++ and Java API. It seemed reasonable to me, given that the official C++ and Java Protocol Buffer suites offered such an API. Gerrit Code Review change #9102043 was thus born on 2013-05-01.

Now it was a waiting game: would github.com/golang/protobuf/proto accept this change?

It turns out the maintainers weren’t interested. I registered my disagreement politely but moved on. I left the package in a free-standing repository so that I could gather usage statistics and later consider re-proposing its inclusion upstream. In the end, I never found the time to do that.

The API came to live in import path github.com/matttproud/golang_protobuf_extensions/pbutil.

About ten years later, the new maintainers of Protocol Buffers in Go saw the utility of this API and added it as a first-class citizen. And here we are, retiring this old package, as a good citizen would do. While it does feel slightly nice to have history vindicate my position that there should be such an API, it feels even better knowing that I am no longer on the hook for such a load-bearing API in the Go ecosystem.

Lessons Learned

It’s a small library, but don’t let its size fool you: I learned a lot along the way. I want to share these reflections.

Import Paths and Package Names

This package was created scarcely a year after Go’s 1.0 release. Even though a small but vibrant community had been programming with the language extensively before the 1.0 release, the body of knowledge, wisdom, and documentation for good design was scant and incomplete!

Both the package’s import path and package name are suboptimal in retrospect.

The import path:

The package name:

Test Data

Out of expedience, I used dot imports to pull in the Go Protocol Buffer project’s example test data for smoke testing my implementation. This was a bad idea. As the upstream project changed its test data, my tests broke; without continuous integration along the lines of TAP (for which there was no analogue on GitHub at the time), I wouldn’t have known until I next modified the project or ran its tests. Recall: this project predates Go Modules by many years! It was build-at-HEAD times, baby!

I eventually copied the upstream test data into this project. That reduced some toil. The Go Proverb “[a] little copying is better than a little dependency” was a sage koan that I hadn’t fully internalized yet.

Years later, I am delighted to see the project float a proposal to remove dot imports: #29326.

Module Versioning and Dependencies

Like many people, I viewed Go as needing a vendoring and version management solution. That said, I had no desire to pick a winning horse or involve myself much. That meant I sat on the sidelines even long after Go Modules debuted.5

So when it came time to apply modules to such an impactful project, I did so very late and rather amateurishly. I had long internalized Semantic Versioning, but I had not really considered the impact of changing the upstream Protocol Buffer library when I was amending a point release for the project, v1.0.3.

In spite of what I am about to describe, I am rather familiar with the implementation of Go Protocol Buffers (esp. within the confines of my employer’s monorepo), but I am less familiar with how they have evolved in the outside world (there are some differences). Namely, I had assumed that the public API was backwards compatible with old generated code from the IDL, as it had appeared to be within the monorepo.6

So when I upgraded from github.com/golang/protobuf/proto to google.golang.org/protobuf/proto in this change, I felt a great disturbance in the Build, as if millions of developers suddenly cried out in terror and were suddenly silenced.

All of the protoc-generated .go files checked into version control broke when they were recompiled against v1.0.3. The good news was that v1.0.3 underwent code review; the bad news was that we didn’t catch the problem.

To be fair to everyone, I was in the throes of being a parent of a young child again — not just one child, but twins! My faculties were not as sharp as I had wanted them to be.

API Signatures

The io.Reader and io.Writer approach to the API design was the right call. When working with the http.Response and http.ResponseWriter types, natively supporting io.Reader and io.Writer makes everyone’s life easier.
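To make that concrete, here is a minimal sketch of the kind of consuming loop this enabled on the server side, assuming the telemetry.proto schema above has been compiled into a hypothetical generated package telemetrypb (the import path and type are invented for illustration):

import (
	"errors"
	"fmt"
	"io"
	"net/http"

	"github.com/matttproud/golang_protobuf_extensions/pbutil"
	// "example.com/telemetry/telemetrypb" // hypothetical package generated from telemetry.proto
)

// scrape decodes length-delimited Sample messages straight off the HTTP
// response body, acting on each one as soon as it arrives rather than
// waiting for the whole payload.
func scrape(resp *http.Response) error {
	defer resp.Body.Close()
	for {
		var sample telemetrypb.Sample // hypothetical generated type
		if _, err := pbutil.ReadDelimited(resp.Body, &sample); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // stream exhausted cleanly
			}
			return err
		}
		fmt.Printf("%s = %g\n", sample.GetName(), sample.GetValue())
	}
}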

I wonder retrospectively whether the package should have also supported something like this that operates on plain []byte values:
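(These names are invented for illustration; nothing like this ever shipped in pbutil.)

package pbutil

import (
	"encoding/binary"
	"errors"

	"google.golang.org/protobuf/proto"
)

// AppendDelimited would append the varint-prefixed encoding of m to buf,
// returning the extended slice so callers could reuse the backing array.
func AppendDelimited(buf []byte, m proto.Message) ([]byte, error) {
	body, err := proto.Marshal(m)
	if err != nil {
		return buf, err
	}
	buf = binary.AppendUvarint(buf, uint64(len(body)))
	return append(buf, body...), nil
}

// ConsumeDelimited would decode one varint-prefixed message from buf into m
// and report how many bytes were consumed.
func ConsumeDelimited(buf []byte, m proto.Message) (int, error) {
	size, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, errors.New("pbutil: invalid length prefix")
	}
	end := n + int(size)
	if end > len(buf) {
		return 0, errors.New("pbutil: short buffer")
	}
	return end, proto.Unmarshal(buf[n:end], m)
}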

Pure speculation, but certain workloads could have benefited from the amortization of buffer allocation costs that comes with reuse. At the end of the day, nobody complained loudly enough to make this happen.

Going Forward

There is some work for everyone — including me.

Users of the v1 API

Please migrate your project from github.com/golang/protobuf/proto to google.golang.org/protobuf/proto. The newer Protocol Buffer API opens up many doors for your project in terms of ergonomics, efficiency, reliability, and future features. You won’t regret it.

After this migration, you will no longer be able to use the v1 branch of the project. You can either use the upstream support in google.golang.org/protobuf via package google.golang.org/protobuf/encoding/protodelim, or you can use the v2 branch. See the section on v2 below.

Important: I expect to make no major changes to v1 modulo urgent security or stability updates in the legacy Protocol Buffer library dependency.

Users of the v2 API

Very soon, the v2 APIs will be implemented internally in terms of google.golang.org/protobuf/encoding/protodelim. That reimplementation will become release v2.0.1. You are free to continue using v2 tagged at v2.0.0, which does not use package protodelim.

Important: I expect to make no major changes to v2 after v2.0.1. This, of course, assumes that v2.0.1 is correct and stable.

A post v2.0.1 migration pathway looks like this:

package pbutil (from)      package protodelim (to)
pbutil.WriteDelimited      protodelim.MarshalTo
pbutil.ReadDelimited       protodelim.UnmarshalFrom

Note: These APIs are not necessarily drop-in replacements, but they are close enough for all intents and purposes.
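For illustration, here is a hedged before-and-after sketch of that mapping, using a stock wrapper message; the main wrinkle is that protodelim.UnmarshalFrom wants a reader that is also an io.ByteReader (hence the bufio wrapper) and returns no byte count:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"

	"google.golang.org/protobuf/encoding/protodelim"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

func main() {
	var buf bytes.Buffer

	// Before: _, err := pbutil.WriteDelimited(&buf, msg)
	msg := wrapperspb.String("hello")
	if _, err := protodelim.MarshalTo(&buf, msg); err != nil {
		log.Fatal(err)
	}

	// Before: _, err := pbutil.ReadDelimited(&buf, out)
	// protodelim needs a reader that is also an io.ByteReader.
	out := &wrapperspb.StringValue{}
	if err := protodelim.UnmarshalFrom(bufio.NewReader(&buf), out); err != nil {
		log.Fatal(err)
	}
	fmt.Println(out.GetValue())
}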

My Tasks

Assuming all goes well with v2.0.1, I will eventually place deprecation notices on the old APIs in v1 and v2.

Closing Remarks

Thank you for your trust in these simple packages over all of these years — nearly 11. It was my pleasure serving you.

Over the years, more people than I can count (whom I’ve never met before) introduced themselves in meatspace with:

Oh, you’re Matt T. Proud; I used your package!

Fun as that was (even for introverted me), it’s more important to put old software to rest when it’s time.


  1. A logical consequence of my everyday life in Berlin at the time. My health is definitely better now that I live in Zürich. ↩︎

  2. This was one of the funnest periods in my life. Both Julius and I were working on opposite ends of the project and tunneling through the metaphorical mountain to meet each other and have our pieces come together. He was a great partner to work with in this; I’d work with him again without hesitation. As my career has gone on, I’ve had the experience of doing this kind of tunneling a few more times. It’s an exhilarating high — at least for me. ↩︎

  3. It’s important to contextualize the timeline of this: it was 2012–13. Protocol Buffers were a new thing in the outside world. JSON was king. Thrift existed, but it had numerous problems: a notable one was how the Thrift repository included two specifications that disagreed with each other on how the first enum value was to be represented in the wire format when the enum element lacked an explicit ordinal value assignment. Was it to be a 0 or a 1? I recall taking a Thrift release at the time (these releases bundled clients for multiple programming languages), creating echo servers and clients in those languages, and sending messages back and forth between them. I discovered that enums lacked fidelity. What sent me down that anal-retentive fidelity-checking rabbit hole was prototyping early Prometheus with Cassandra, the closest external analogue to Bigtable at that time (I wanted a columnar data store, and I knew that Bigtable would have fit that bill internally at Google). I discovered that Cassandra rejected some of my mutation operations with an error that the client requests were invalid. It claimed that some enum value was bar when I set it to foo. Stepping through both my prototype and Cassandra in a debugger led me to this discovery. So Thrift was out. The other alternative was Avro, but I recalled that Avro support was not universal, and we didn’t need some of its sundry features (e.g., self-encoding of the data schemas in the records). All that really left us with was Protocol Buffers. ↩︎

  4. RecordIO has never been formally described in published literature from Google, but it has been alluded to in https://news.ycombinator.com/item?id=16813030, https://code.google.com/archive/p/szl/wikis/Sawzall_Table_Types.wiki, https://github.com/google/or-tools/blob/stable/ortools/base/recordio.h, and https://github.com/google/szl/blob/master/src/utilities/recordio.cc. ↩︎

  5. It might sound strange to do this, but I found the dithering and acrimony around versioning to be especially painful. I remember the verbiage around dep being framed as an “official experiment.” I mean this without disparagement for anyone involved, but the experience and communications didn’t instill a lot of confidence. Rob Pike’s What We Got Right, What We Got Wrong talk alludes to this. ↩︎

  6. The reason this was so deceiving is all of the kind souls running large-scale changes (LSCs) and tending the code garden, who seamlessly forward-ported old Protocol Buffer API usages to the new implementation. This was done so well that it seemed transparent to me. ↩︎
