Kubernetes: The Road to 1.0

Brian Grant
Published in ITNEXT · Jun 7, 2024 · 11 min read

I spoke about the journey to Kubernetes 1.0 at the Kubernetes 1.0 celebration last night, across a creek from where I sat when the project started, but 10 minutes is very short and I could only scratch the surface. Kelsey wasn’t kidding that I had 30 slides, though that was partly because I wasn’t sure what the audience would want to hear about, and partly because they served as notes for me.

I’ve written about parts of the background of the design before, but this is more about how that design came about, and about the process of building the system, from the Mountain View side, where the Borg team resided. Craig McLuckie, Joe Beda, Brendan Burns, and Ville Aikas were in Seattle, where the Google Compute Engine team resided. I broke the story roughly into 4 semesters, plus the period before the project began.

Lessons from Borg and Omega: 2009–2013

I joined Google’s Borg control-plane team at the beginning of 2009, 15 years ago. I had worked on supercomputers more than 15 years before that, but Borg was a multi-user system with many other components layered on top of, underneath, and adjacent to it. My “starter project” was to improve scalability by handling requests concurrently, because for the 1.5 years prior to that I had worked on facilitating the move of Google’s many single-threaded C++ applications to multiple threads, with projects spanning Linux (NPTL wasn’t rolled out yet), g++ (thread-safety annotations), threading primitives (before C++11), a multi-threaded HTTP server, improved profiling, documentation, and more.

To improve performance, I not only needed to understand the implementation, but also had to figure out how the system was being used. Over that first year of working on Borg, I found that in a number of ways Borg’s control-plane architecture and API weren’t really designed for how the system was actually being used.

For example, Borg wasn’t really extensible, so additional functionality like rollouts, batch scheduling, cron scheduling, and horizontal and vertical autoscaling had to be built in separate services and clients. Those services would embed their data in Job resources and continuously poll for changes, such as new Jobs; that polling accounted for more than 99% of all API requests made to the Borg control plane. The ability to subscribe to changes via a Watch API was supported only for Job Task endpoints, by writing the dynamically scheduled host IP addresses and dynamically allocated host ports to Chubby, the key/value store that inspired ZooKeeper.

Borg Alloc and Job Tasks with allocated ports

As an aside, the use of Chubby for service discovery had pervasive effects on the workloads running on Borg because they couldn’t use standard mechanisms for service naming, discovery, load balancing, reverse proxies, authentication, and so on. We wanted existing applications to be able to run on Kubernetes, so we made dynamically allocated Pod IP addresses routable, which was a controversial decision at the time.

I started an R&D project in 2010 called Omega to redesign Borg for how it was being used and to better support the ecosystem around Borg. In many ways, Kubernetes is more “open-source Omega” than “open-source Borg”, but it benefited from the lessons learned from both Borg and Omega.

Omega had a Paxos-based key/value store with a Watch API at its center. The components, called controllers in Kubernetes, operated asynchronously, watching for desired-state objects and writing back observed state. Unlike in Kubernetes, desired and observed state were separate records in the store, which was good for optimistic concurrency but made them a bit harder to stitch together. We also never got around to wrapping the store with a unified API, though there was a proposal to do that.

Omega hub and spoke architecture

Another example of how Borg wasn’t being used the way it was designed: Allocs in Borg were collections of resource reservations scheduled across machines, horizontal slices of clusters, and Job Tasks could be scheduled into those slots. That was a fairly complex model that made a number of things, like debugging and horizontal autoscaling, more complicated, and few users took advantage of its flexibility. Most users of Allocs pinned specific sets of Job Tasks into the instances. This led to the idea of making those bundles of containers first-class units of replication and scheduling in Omega, called Scheduling Units, which were eventually named Pods in Kubernetes.

Scheduling units
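
In Kubernetes terms, that first-class unit is the Pod: a group of containers that are always scheduled onto the same machine and share an IP address and volumes. A minimal sketch of the modern shape (the names and images are illustrative):

```yaml
# A Pod bundles containers that are scheduled together onto one node and
# share a network namespace (one IP) and volumes. Names/images are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
      volumeMounts:
        - name: logs
          mountPath: /var/log/nginx
    - name: log-shipper            # sidecar: shares the Pod's IP and volumes
      image: fluent/fluent-bit:2.2
      volumeMounts:
        - name: logs
          mountPath: /logs
  volumes:
    - name: logs
      emptyDir: {}
```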

We made Labels a central concept in Kubernetes. Borg didn’t have labels originally. The idea was inspired by users trying to pack metadata about their Jobs into Job names up to 180 characters long and then parsing it out with regular expressions. The corresponding concept in Omega was more elaborate, but the additional substructure wasn’t needed. A simple map was enough. Similarly, Annotations were inspired by Borg clients trying to cram info into a single `notes` string, which was kind of like User-Agent (which we didn’t have in Google’s RPC library), but was persisted in Jobs.

Origin of labels (showing only label values for simplicity)
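
In the Kubernetes API, this ended up as a flat map of key/value labels on every object, queried by label selectors (as used by Services and ReplicationControllers), plus a separate map of annotations for non-identifying metadata. A small illustration (the keys and values here are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend-canary-1
  labels:                       # identifying metadata, selectable by label queries
    app: frontend
    tier: web
    track: canary
  annotations:                  # non-identifying metadata; the successor to Borg's `notes` string
    example.com/built-by: ci-pipeline
spec:
  containers:
    - name: web
      image: nginx:1.25
```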

The cpu and memory request and limit specification in Kubernetes was more consistent than Borg’s and simplified compared to Omega’s.
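
In Kubernetes, that specification ended up as per-container requests and limits, expressed in the same units for every workload; roughly like this (the values here are arbitrary):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:        # what the scheduler reserves on a node
          cpu: 250m      # a quarter of a CPU core
          memory: 128Mi
        limits:          # the enforcement ceiling
          cpu: "1"
          memory: 512Mi
```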

We were able to cherry-pick approaches that worked, discard ones that didn’t, simplify some that were a bit too complicated, and iterate one more time on others. A number of concepts from Omega, such as Scheduling Units, were reused pretty directly in Kubernetes. Some, like Taints and Tolerations, were simplified but kept their Omega names. The term “claim” also comes from Omega. The idea of disruption budgets came from Omega, inspired by a disruption-broker service from Borg.

Those 10 years of lessons learned gave Kubernetes a head start over projects like libswarm that were starting fresh with Docker. They also made Kubernetes more complex earlier than would otherwise have been the case, but most of those features ended up being used quite a lot.

Early Container Product API Design: 2H2013

All of that experience from Borg and Omega got us off to the races pretty quickly. In the second half of 2013, when we started to discuss what kind of container product to build, I started to sketch the API, and it already had a shape that would be recognizable to Kubernetes users today. This is a summary from the presentation I made in that period, in the same meeting as the first prototype demo (a sketch of how these ideas map onto today’s API follows below):

  • CRUD: same schema for config and APIs
  • scheduling units (sunits, aka molecules): bundles of resources, tasks, data
  • sunit prototype for new/updated instances
  • separate replication spec specifies # desired
  • potentially heterogeneous sets of sunits identified by labels, label queries; no indices
  • orthogonal features decoupled

Scheduling Unit (Pod) Overview
Multiple resource types linked together
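
Those bullets map almost directly onto the API that eventually shipped: the sunit prototype became the Pod template, the separate replication spec became the ReplicationController, and label queries became selectors. A rough sketch in the shape of the v1 API (the names are illustrative):

```yaml
# The replication spec is separate from the Pods it manages and identifies
# them by label query rather than by index.
apiVersion: v1
kind: ReplicationController
metadata:
  name: web
spec:
  replicas: 3            # "# desired"
  selector:
    app: web             # label query; no stable indices
  template:              # the "sunit prototype" used for new instances
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
```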

Ramp to Launch: 1H2014

Though we didn’t have approval to open source anything yet and we were still discussing what kind of product to build, we started working aggressively on the project during this period. We pulled in more people, quite a few, actually. I no longer have access to my internal notes, unfortunately, so I probably can’t name them all here, but will name a few.

Some, like Tim Hockin, Dawn Chen, and Eric Tune, worked on independent experiments and projects. For example, we didn’t yet know how feasible it would be to implement Pods on top of Docker. It wasn’t obvious how multiple containers could share an IP address without the network namespace being configurable, and there wasn’t a straightforward way to nest cgroups. We also explored whether we could adapt existing components such as the Omlet node agent and the lmctfy container runtime, and decided against it.

A few of us went to chat with Solomon Hykes and Ben Golub at Docker about embedding Docker in Kubernetes and some of the challenges we had discovered. That meeting led to the start of the libcontainer collaboration with Docker, to replace LXC in the stack. Libcontainer and cadvisor, which was released along with Kubernetes, were developed by Victor Marmol, Rohit Jnagal, and Vish Kannan.

Libcontainer’s position in Docker

Tim also developed the Python container-agent, which was released in May 2014, at which point we still didn’t have approval to open-source Kubernetes. The container manifest from this project was lifted verbatim into the initial v1beta1 Kubernetes Task API, and is where the term “manifest” comes from in Kubernetes.

container-agent container manifest
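
The container manifest was a YAML description of the containers to run on a machine. It looked roughly like the following; this is an approximate reconstruction for illustration, not the verbatim schema:

```yaml
# Approximate shape of an early container manifest (illustrative only).
version: v1beta1
containers:
  - name: web
    image: nginx
    ports:
      - name: http
        hostPort: 8080
        containerPort: 80
    volumeMounts:
      - name: data
        mountPath: /var/www
volumes:
  - name: data
```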

Others, like Ville Aikas and Daniel Smith, worked on the Go code. The only APIs were for Task (later renamed to Pod), ReplicationController, and Service. No Nodes. I initially documented the API by hand using RAML.

Below is a diagram from Ville’s design doc. Note that there’s no Kubelet and that Kube-proxy read directly from Etcd. Before we released Kubernetes, a minimal Kubelet was added. Kubelet also read directly from Etcd, and the apiserver called down to the Kubelet synchronously to retrieve Task status.

Early Seven Design

We wanted to launch at DockerCon, so we launched what we had (a fixed date is a helpful forcing function) and then iterated in the open. The key ideas were there: an API, desired state, multi-container instances, labels, controllers, scheduling/placement, service discovery. There were some cleanups, and the code was copied to a new repo. The repo it was copied from still exists, but the original code.google.com repo and its commit history were lost, as Ville mentioned in his presentation.

Finishing the Implementation of the Design: 2H2014

What we released didn’t have a coherent control plane, had an incomplete and inconsistent API, had an extremely minimal cloudcfg CLI, and was missing some basic features that users would need, so the 6–7 months after open-sourcing Kubernetes were spent fleshing out those areas, as well as incorporating ideas from Red Hat and others in the community.

To solidify the control plane, we implemented Watch in the apiserver. That enabled us to eliminate direct Etcd access in Kubelet and Kube-proxy. We also eliminated direct Etcd access from the scheduler.

To remove the need for the apiserver to call the Kubelet (for node and pod status) or other components (for replication controller status) to retrieve status information, we implemented the /status API endpoints.

We also split out the controller-manager and scheduler components from the apiserver, and secured inter-component communication (e.g., Kubelet->apiserver).

The API itself went through many changes. Task was renamed to Pod. A Minion API was added and later renamed to Node. The scheduler was changed to record node assignments in a field on the Pod. The Service API was overhauled, including adding support for multiple ports. I integrated go-restful into the apiserver in order to generate Swagger for the API, because keeping the hand-written documentation in sync with these changes was already unsustainable. Conversion between API versions and an internal representation was added in support of API versioning.
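
In today’s API, for example, the scheduler’s decision is recorded in the Pod’s spec.nodeName field, and a Service can expose multiple named ports, along these lines (names and port numbers are illustrative):

```yaml
# A Service selecting Pods by label and exposing more than one port.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
```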

Clayton Coleman drove a sweeping overhaul of the whole API surface. This is where the Kubernetes API as we know it today really took shape, with the separation of metadata, desired state (spec), and observed state (status). Annotations were added. Namespaces were inserted into resource paths. Consistency across resource types and fields was increased, and I wrote the first draft of the API conventions. We were so thorough that the v1 API included very few non-backward-compatible changes.

v1beta3 Pod example
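
The image above shows a v1beta3 Pod; the same three-part shape carried through to v1 and is still the shape of most Kubernetes objects today. A minimal sketch (status is written by the system, not by the user):

```yaml
apiVersion: v1
kind: Pod
metadata:                  # identity: name, namespace, labels, annotations
  name: example
  namespace: default
  labels:
    app: example
spec:                      # desired state, written by users and controllers
  containers:
    - name: app
      image: nginx:1.25
status:                    # observed state, written back by the system
  phase: Running
  podIP: 10.0.0.12
  conditions:
    - type: Ready
      status: "True"
```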

The command-line tool at the time we open-sourced Kubernetes was called cloudcfg. We soon renamed it to kubecfg, but it wasn’t well structured for expansion. Luckily, Sam Ghods volunteered to rewrite the CLI, which became kubectl. This is when the spf13/cobra CLI framework was integrated and the verb-noun pattern (e.g., kubectl get pods) was solidified.

We also created kubeconfig, spun out a client library, implemented bulk operations across multiple files and resource types, and laid the groundwork for declarative operations.
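
A kubeconfig file ties together clusters, users (credentials), and contexts that pair the two; a minimal sketch (the server address, paths, and names here are made up):

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: dev-cluster
    cluster:
      server: https://203.0.113.10:6443
      certificate-authority: /etc/kubernetes/ca.crt
users:
  - name: dev-user
    user:
      client-certificate: /home/me/.kube/dev-user.crt
      client-key: /home/me/.kube/dev-user.key
contexts:
  - name: dev
    context:
      cluster: dev-cluster
      user: dev-user
current-context: dev
```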

The features we added had multiple objectives. Some features were added to make the system more usable, such as container termination reason reporting and the ability to get logs through the apiserver. Some were for security, such as user authentication, service accounts, ABAC authorization, and namespaces. Others were to flesh out the model, such as service IPs and DNS and PersistentVolume and PersistentVolumeClaim. And some were there to demonstrate thought leadership in what was a crowded space at the time, such as liveness probes and readiness probes.
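
Liveness and readiness probes are still configured per container today; a small example (the paths and timings are arbitrary):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probed-app
spec:
  containers:
    - name: app
      image: nginx:1.25
      livenessProbe:         # the kubelet restarts the container if this keeps failing
        httpGet:
          path: /healthz
          port: 80
        periodSeconds: 10
      readinessProbe:        # Pods failing this are removed from Service endpoints
        httpGet:
          path: /ready
          port: 80
        periodSeconds: 5
```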

The Home Stretch: 1H2015

In early 2015 we started to discuss the idea of creating a foundation for Kubernetes and a broader cloud-native ecosystem. We decided to align the 1.0 milestone with the launch event date in July. The goal was to make the system ready for production use.

Now that we had a deadline, we had to decide which features to include, and which ones to push out. We instituted the first code freeze of the project. We even ripped some incomplete code out. We included features we felt would be important to real usage, like graceful termination and the ability to view logs of failed containers. We hardened the system for continuous operation with changes like cleaning up dead containers, restarting unhealthy components, and event deduplication.

Many important features were postponed until after 1.0: kubectl apply, Deployment, DaemonSet, StatefulSet, Job, CronJob, ConfigMap, HorizontalPodAutoscaler, node ports and Ingress, iptables for kube-proxy, resource metrics exposed through the apiserver, container QoS, most scheduling features, the Kubernetes dashboard, and third-party resources. This was absolutely the right call for the MVP.

We also fixed P0 bugs, addressed security issues like unauthenticated ports, implemented upgrade testing, added more thorough API validation, and instrumented the components with Prometheus’s client library for observability.

In the final months, we created the kubernetes.io website. We moved some existing documentation there, but we also wrote a new user guide. There was a glitch with the site on the announcement day, but we got it resolved just in time. The home page still includes some of the text I wrote back then, such as the “Production-Grade Container Orchestration” tag line, “Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications”, and some of the feature descriptions, though some were aspirational even at 1.0.

Dozens of people rallied behind the milestone, helping in all kinds of ways, from finding and fixing documentation errors, to organizing the event, to evangelizing the project, to a number of things I’m probably forgetting after all these years.

A lot of work had gone into the project by that point: about a year since the initial release, more than a year and a half since the start of the project, and years of R&D before that. That work played a part in its success.


Original lead architect of Kubernetes and its declarative model. Former Uber Tech Lead of Google Cloud's API standards, SDK, CLI, and Infrastructure as Code.