Kubernetes and Terraform two years on
Overkill in the past begins to pay off in the present
December 29, 2020 · 5 min read
In late 2018 we were looking at overhauling our infrastructure at Cronofy. We had some key pain points, mostly a lack of autoscaling on the compute side, but also an eye on some future challenges. We knew auditors were about to become a key stakeholder of the system. It also felt inevitable that we would be introducing additional services to the platform.
I like to think on longer-term horizons. Overhauling our infrastructure was going to be a significant undertaking, in both monetary and time terms. I didn’t want to be in a position where we would be having to do it all again in another couple of years.
Doing such work without using a tool like Terraform would have been madness. That seemed a no-brainer.
More controversial was Kubernetes. Kubernetes was probably overkill for where we were, and to some extent maybe still is for where we are. However, it had some very desirable features:
Wide community support
I’ve rarely seen any piece of software win over so many, quite so quickly. If anything new was being built on the infrastructure side of things it was either Kubernetes-first, or Kubernetes was an early deployment target.
This all meant we were unlikely to be alone when it came to challenges with our platform.
Leaning on containers would eliminate an existing chore of upgrading the Ruby version. This is now a single-line change to a Dockerfile.
Future services had a strong possibility of not running on Ruby. There are areas of our platform to which Ruby is not well suited, so being able to easily run different runtimes within each environment was a big plus.
We had this and did not want to lose it. The deployment primitives built into Kubernetes gave us everything we needed.
Autoscaling out of the box
Through a combination of horizontal pod autoscalers and node autoscalers we would resolve the problems of just-in-time capacity.
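To make the horizontal half of that concrete, here is a minimal sketch in Go of the scaling rule the Kubernetes documentation describes for the horizontal pod autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1 (0.1 by default). The function name, the example numbers, and the min/max bounds are illustrative assumptions, not our actual configuration.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is the documented default band around a ratio of 1.0
// inside which the autoscaler leaves the replica count alone.
const tolerance = 0.1

// DesiredReplicas is a hypothetical helper that applies the documented rule
//   desired = ceil(current * currentMetric / targetMetric)
// and clamps the result to [minReplicas, maxReplicas].
// It assumes targetMetric is greater than zero.
func DesiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	ratio := currentMetric / targetMetric
	if math.Abs(ratio-1.0) <= tolerance {
		return current // close enough to target, do not scale
	}
	desired := int(math.Ceil(float64(current) * ratio))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// 4 pods averaging 90% CPU against a 60% target scale out to 6.
	fmt.Println(DesiredReplicas(4, 90, 60, 2, 20))
	// 4 pods at 63% are within tolerance of the 60% target and stay at 4.
	fmt.Println(DesiredReplicas(4, 63, 60, 2, 20))
}
```

The node autoscaler then deals with the consequence: if the extra pods do not fit on the existing nodes, more nodes are added.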
Administering Kubernetes was reportedly full of danger, so we were more than happy to offload this to Amazon EKS. Support in their Frankfurt data centre arrived just in the nick of time for our needs.
Unlimited growth potential
Kubernetes is a platform for running a platform. From a product engineer’s perspective there’s an abstraction over the likes of compute instances and load balancers which allows you to focus on your application.
From an administrator’s perspective, there’s an environment abstracting the details of the application from the infrastructure.
There isn’t any limit to what can be built within a Kubernetes cluster. Conversely, the more container-as-a-service offerings like AWS Fargate do come with limitations, and you are also subservient to the tooling the provider supplies.
Kubernetes felt right-sized between the extremes of bare metal and full platform-as-a-service when seen through the lens of what would serve us for the next 5 years.
At the same time as switching to Kubernetes, we wanted to change our AWS account infrastructure. We had organically grown with One Account To Rule Them All. We wanted to switch to multiple accounts so there was a single identity account and then an account per environment. This doubled down on the segregation of our production environments by running them in entirely separate AWS accounts.
I really meant it when I said we were going to sort everything out!
After around three months we were running our compute within the Kubernetes clusters for our production environments, with bridges as necessary back to the old AWS account. After around three more months we had also switched from RDS Postgres to Aurora Postgres in the process of moving the data into the new AWS accounts.
This could have been done quicker, but that would have required “Big Bang” switchovers with potentially hours of downtime. Worse still, there would have been no way back if it was not successful. This multi-legged migration allowed us to gradually shift the workload, minimising the risk and avoiding virtually all downtime.
The autoscaling clusters dynamically handled our varying load as hoped. Aurora removed the impending IOPS-based limit we were heading towards. Our infrastructure was consistently, and almost completely, managed through Terraform across all environments.
For the immediate pain points, the project was a success.
Not all plain sailing
With a massive shift in infrastructure, there are always going to be things to learn. Unfortunately, some of those we had to learn the hard way.
DNS resolution was a recurring issue; speaking to thousands of external servers constantly puts more stress on it than most systems do. The first time we updated our worker AMIs we learned about Pod Disruption Budgets by suffering a brief outage in Germany.
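For anyone who has not met them, a Pod Disruption Budget caps how many pods of a workload may be voluntarily evicted at once, for example while nodes are drained for an AMI update. The Go sketch below is a simplified illustration of that arithmetic for a budget expressed as a minimum number of available pods. The function names and numbers are hypothetical, and it is not a description of our actual setup or of the Kubernetes implementation.

```go
package main

import "fmt"

// DisruptionsAllowed returns how many healthy pods may be evicted right now
// while still keeping at least minAvailable of them running.
func DisruptionsAllowed(healthy, minAvailable int) int {
	if healthy <= minAvailable {
		return 0
	}
	return healthy - minAvailable
}

// EvictUpTo models a drain that wants to evict `wanted` pods at once and
// returns how many of those evictions the budget would permit.
func EvictUpTo(wanted, healthy, minAvailable int) int {
	allowed := DisruptionsAllowed(healthy, minAvailable)
	if wanted < allowed {
		return wanted
	}
	return allowed
}

func main() {
	// With no effective budget (minAvailable of 0), replacing every node
	// at once can evict all three pods, leaving nothing to serve traffic.
	fmt.Println(EvictUpTo(3, 3, 0))
	// With minAvailable of 2, only one pod may go at a time; the rest wait
	// until a replacement is healthy.
	fmt.Println(EvictUpTo(3, 3, 2))
}
```

The second case is the behaviour you want during a rolling node replacement: evictions are paced by how quickly replacement pods become healthy.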
We had gone all-in on spot instances, as we generally lean into failure because it makes the platform more resilient. This ended up being the cause of mysterious 504 responses, and we spent a lot of time making absolutely certain our application wasn’t at fault before blaming the infrastructure. The root cause was that the AWS load balancer controller didn’t remove cordoned nodes from the target group, so traffic could still be in flight via a node as it was shut down. It’s never a compiler bug, until it is.
Whilst infrequent, switching Ruby versions is now really straightforward. Related to that, running canary releases alongside main releases is trivial, which derisks such switchovers and also served us well for Rails upgrades.
Through Terraform we also have much more flexibility in how we manage our infrastructure.
We’ve added three data centres to the Cronofy platform in 2020, with another being set up in early 2021. The first of those was a significant undertaking as we worked out the kinks and removed manual steps from the initial migration work, but the following two took under a week.
This agility has been of direct commercial benefit to the company.
Finally, we’re working on a new feature for the platform that will lean on a new service written in Go. Adding this to our continuous deployment pipeline was as trivial as deploying to an additional data centre.
It’s taken a while, but I feel we’re beginning to see the previous over-investment start to pay off. Kubernetes and its supporting tools only get more capable, so it feels like we’re in a great place to take on 2021 and beyond.
Hey, I’m Garry Shutler
CTO and co-founder of Cronofy.
Husband, father, and cyclist. Proponent of the Oxford comma.