Initially, I led that effort for the SOC2 Type 1 audit in Q4 2019. We then hired Karl Bagci as head of operations, who steered us through SOC2 Type 2 and ISO 27001 in Q1 and Q2 of 2020.
One of my main concerns entering this process was keeping it from slowing us down any more than necessary. However, I needn’t have worried, as compliance broadly boils down to:
Have justifiable policies in place, and prove you follow them.
The main policy that affects the day-to-day of an engineering department is the change management policy. This details the considerations and requirements for making changes to your environments.
Broadly, there are three variables to consider:
- the type of environment – dev, staging, production
- the type of change – major, minor, trivial
- the urgency of the change – urgent, non-urgent
The most interesting environment is usually production, so this is where your controls will be applied in full. In Cronofy’s case, staging is more a facsimile of production’s configuration, and so is not subject to many controls. If yours contains a restore of production data, it would likely need to be treated similarly to production.
The type of change broadly correlates to the impact of the change, not its size. Changing a single line relating to security or network configuration would be major, whereas adding pages of documentation would be a minor change, if not trivial.
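To make the three variables concrete, here is a minimal sketch of how they might map to required controls. The control names and the mapping itself are assumptions for illustration, not Cronofy’s actual policy; a real policy would define its own.

```python
from dataclasses import dataclass
from enum import Enum


class Environment(Enum):
    DEV = "dev"
    STAGING = "staging"
    PRODUCTION = "production"


class ChangeType(Enum):
    MAJOR = "major"
    MINOR = "minor"
    TRIVIAL = "trivial"


@dataclass(frozen=True)
class Change:
    environment: Environment
    change_type: ChangeType
    urgent: bool = False


def required_controls(change: Change) -> set[str]:
    """Return the (hypothetical) controls a change must satisfy before release."""
    # Assumption: only production carries mandatory controls.
    if change.environment is not Environment.PRODUCTION:
        return set()
    # Assumption: urgent changes may bypass the usual process but must be written up.
    if change.urgent:
        return {"post_hoc_documentation"}
    controls = {"peer_review", "passing_ci"}
    # Assumption: major changes need a more senior reviewer. Impact, not size, decides this.
    if change.change_type is ChangeType.MAJOR:
        controls.add("senior_review")
    return controls
```

A one-line security change to production would be classified as `ChangeType.MAJOR` and pick up the extra review, while the same change applied to staging would carry no mandatory controls under these assumptions.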
Guidelines for managing changes
At its simplest, a change management policy will require that changes are peer reviewed by someone with sufficient seniority or knowledge, especially before they can be applied to a sensitive environment such as production.
An important part of the change management policy is also how those changes are applied. The goal is generally to demonstrate that only the approved changes are made, and that the possibility of human error is minimised.
Source control as the system of truth
As we were already using GitHub for source control and pull requests, we had a framework of peer review and approval for all changes. Through the use of CI, we had automated proof of regression testing on every change.
Git provides the record of who made each change, and GitHub provides proof of the test suites passing for those changes, along with a record of who approved them.
This doesn’t just apply to application code. Using Terraform means virtually all of our infrastructure changes are automated too. Having AWS CloudTrail and similar enabled puts another level of auditing on top of this for good measure.
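As a rough illustration of what an auditor is looking for in that record, the sketch below checks a single change for an independent approval and a passing test run. The `ChangeRecord` shape and the `audit_findings` helper are hypothetical; in practice this evidence would come from the source control host rather than be built by hand.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """Hypothetical summary of the evidence held for one merged change."""
    author: str
    approvers: list[str] = field(default_factory=list)
    ci_passed: bool = False


def audit_findings(record: ChangeRecord) -> list[str]:
    """Return a list of problems with the record; an empty list means it is clean."""
    findings = []
    # An approval only counts as peer review if it came from someone other than the author.
    independent = [a for a in record.approvers if a != record.author]
    if not independent:
        findings.append("no approval from someone other than the author")
    if not record.ci_passed:
        findings.append("test suite did not pass")
    return findings
```

For example, `audit_findings(ChangeRecord("alice", ["alice"], ci_passed=True))` reports the missing independent approval, because a self-approval is not peer review.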
Continuous deployment to eliminate human error
Continuous deployment provides a single, well-trodden, recorded path for getting approved changes into production environments. There’s little room for human error to creep in when the human interaction is little more than a button press. Of course, the configuration of that path should also be source controlled so that changes to it are audited as well!
In case of emergency, break glass
One recommendation I would make is to have a documented way to go outside the approved change management process. You never know what might come up, and it is better to have a documented path to follow if things really go sideways, rather than making it up on the spot.
Broadly, this is triggered by classifying a change as urgent, and documenting that processes can be bypassed for urgent changes. We mandate that the use of this escape valve be documented in a similar way to a policy violation, but it does not count as one.
This allows you to do things such as bypassing CI to get a fix out a bit quicker, manually changing something in the AWS console to react during an outage, or releasing a change in the early hours because there isn’t anyone around to approve it. These aren’t common scenarios, so this isn’t used very frequently, if at all, but it provides a safety net for extreme situations.
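One way to picture the escape valve is as a record that must be written whenever the process is bypassed. The sketch below is an assumption about what such a record might hold; the field names and the `record_break_glass` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BreakGlassRecord:
    """Hypothetical write-up of an urgent change that bypassed the normal process."""
    who: str
    bypassed: tuple[str, ...]  # e.g. ("peer_review", "ci")
    reason: str
    recorded_at: datetime
    # Documented like a policy violation, but explicitly not counted as one.
    counts_as_violation: bool = False


def record_break_glass(who: str, bypassed: list[str], reason: str,
                       log: list[BreakGlassRecord]) -> BreakGlassRecord:
    """Append a record of the bypass to the log, refusing undocumented use."""
    if not reason.strip():
        raise ValueError("an urgent bypass must be justified")
    record = BreakGlassRecord(
        who=who,
        bypassed=tuple(bypassed),
        reason=reason.strip(),
        recorded_at=datetime.now(timezone.utc),
    )
    log.append(record)
    return record
```

The point of the design is that the bypass itself stays cheap during an incident, while the justification is mandatory and ends up in the same audit trail as everything else.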
What changed for Cronofy?
Not much changed for us at all in reality, but we made the correct path the easy path to follow. We enabled protected branches with mandatory reviews and passing CI in GitHub, and that’s about it. In practice we were already following this, but enforcing it ensured we would continue to do so going forward.
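For readers who prefer to manage this setting as code, the sketch below builds a request body of the kind accepted by GitHub’s REST endpoint for updating branch protection (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`). The check name, the approval count, and the choice to apply the rules to administrators are assumptions for illustration, and the field names should be confirmed against GitHub’s current documentation before use.

```python
import json


def branch_protection_payload(required_checks: list[str], approvals: int = 1) -> dict:
    """Build a branch protection request body requiring reviews and passing CI."""
    return {
        # Require the named CI checks to pass on an up-to-date branch.
        "required_status_checks": {"strict": True, "contexts": list(required_checks)},
        # Assumption: the rules apply to administrators as well.
        "enforce_admins": True,
        # Require at least `approvals` approving reviews before merging.
        "required_pull_request_reviews": {"required_approving_review_count": approvals},
        # No additional restriction on who may push to the branch.
        "restrictions": None,
    }


if __name__ == "__main__":
    # "ci/build" is a placeholder check name, not one taken from the article.
    print(json.dumps(branch_protection_payload(["ci/build"]), indent=2))
```

Keeping this payload in source control means changes to the protection rules themselves go through the same review and audit trail as any other change.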
Generally, we go above and beyond our policy, because we have made the most thorough path the easy path. Even during incidents we virtually always follow the processes as they don’t really slow us down.
Broadly, our existing strong engineering and operational practices translated directly to strong, auditable compliance practices.
As we grow the team we’ll no doubt encounter some challenges, but we should be able to overcome them with a bit more audited automation.