In holding various leadership positions in Operations engineering organizations over more than 20 years - either as a team lead, architect, or leading a global Site Reliability Engineering (SRE) organization - I developed a philosophy that I've used as guidance for my teams. I wanted an easy way for them to make decisions either about what to work on, or how to prioritize their work.
It comes down to two simple rules:
- Keep the site up
- Keep the developers moving as fast as possible
That's it. When Operations (often known as systems, DevOps, Site Reliability, SiteOps, Tech Ops, etc.) engineers keep those principles in mind, they will almost always make the right decisions for the business. Because I subscribe to David Marquet's philosophy that an organization achieves its highest performance with active, engaged contributors, this has worked extremely well over the years.
Engineers must decide regularly during their Agile planning meetings which work is most important to the business. Do they make an improvement which ensures faster, more reliable, and easier deployment of new databases? Do they work on being able to ship features faster? There are many considerations that need to be factored in. As those engineers are closest to the information, they are also in the best position to help make the right decisions. Often as we move "up the ladder" in an engineering organization, the information we get is filtered and interpreted before we receive it. This is why it's so important to do things like skip-level meetings. This culture is best described as Performance Oriented by Ron Westrum: one that encourages collaboration and cooperation among teams. When our engineers are empowered, high performance is the result.
I once worked for an organization where these directives were reversed. The organization valued feature delivery over making sure the site was available to end users. Certainly something to try to wrap your head around on your next commute.
So, how does this work in practice? What do these directives mean? Why are they ranked as such?
If you're in the business of running a website or service available over the Internet, the site has to remain available. Even if our primary mission is not developing software, as Watts S. Humphrey said: "Every business is a software business." This means that no matter what kind of software our business delivers, we cannot make money on that software unless it is available.
If we have downtime (even scheduled downtime), the business impacts can include fewer, lost, or delayed transactions, less awareness of our products, reputation damage, or frustrated customers. Regardless of the kind of damage, the software must not only be available, it must also be performant. As Charity Majors often points out: "Nines don't matter if users aren't happy."
I once worked for a company that used a metric for capacity planning that was something close to the average page load time when the site was up. The last five words in that sentence are not a typo. This is because the site was operating under the principle that scheduled downtime was ok, and therefore downtime was not to be counted. However, the metric that was used had no component referring to whether or not the downtime was scheduled. If we fell outside this measure, they would look at expanding the size of the database tier or moving some customers to different hardware. By this measure, however, if the site was down for 29 out of 30 days in a month but had very good load times for that single day, then everything was fine! Yes, we did manage to get that changed.
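To make the flaw concrete, here is a minimal sketch of that kind of metric next to a plain availability figure. The sample data, the 0.8-second load time, and the function names are all made up for illustration; this is not the company's actual calculation.

```python
from statistics import mean

# Hypothetical month: one (was_up, page_load_seconds) sample per day.
# The site is up only on the first day; load time is None while it is down.
samples = [(True, 0.8)] + [(False, None)] * 29


def avg_load_time_when_up(samples):
    """The flawed metric: average load time over 'up' samples only."""
    up_times = [load for was_up, load in samples if was_up]
    return mean(up_times)


def availability(samples):
    """Fraction of samples in which the site was up at all."""
    return sum(1 for was_up, _ in samples if was_up) / len(samples)


flawed = avg_load_time_when_up(samples)
uptime = availability(samples)
print(f"average load time when up: {flawed:.2f}s")
print(f"availability: {uptime:.1%}")
```

The first number looks healthy while the second shows the site was down for 29 of 30 days, which is why a load-time metric needs an availability measure alongside it.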
Instead, as operations engineers, we need to focus on both the stability and performance of the website. When working with a team that understands this is a priority, we can make significant changes with very little impact to our customers. Making major changes with a minimum of disruption often requires additional planning and additional steps, but good engineers understand the significance to the business.
I had the opportunity to work with an engineering team that was orchestrating a move from the old Amazon Web Services (AWS) Classic environment to the new Virtual Private Cloud (VPC) environment. Aside from some pretty tricky networking requirements, moving most of the services was a fairly straightforward process.
- Stand up services in the new environment
- Migrate traffic to the new services
- Let traffic drain from the old environment
- Terminate the old service
Pretty simple for stateless (and possibly even stateful) services, until you get to the primary data store. Unless you are going to run some kind of distributed database cluster across a complicated network topology, there will be downtime involved. At some point, database writes are going to have to move from the database in the old environment to that of the new one. Full stop.
The team moved all traffic so that it was running through database proxies, instead of directly from the applications to the database itself. If they had not placed such an emphasis on availability, they could have simply skipped this step. They could have spent an hour or more updating configurations and restarting services all over the infrastructure, with an associated amount of downtime. They also set up replica databases in the new environment, replicating from a replica that would be promoted to become the new primary. The process became:
- Ensure all read traffic was coming off the new replicas
- Break the replication from the old environment
- Point all write traffic to the database in the new environment
That's it. For that migration, we took 37 seconds of partial planned downtime. 37 seconds of planned downtime for the year is a pretty enviable achievement, but additionally:
- All read traffic continued uninterrupted
- All writes for data types that had been seen before were queued and written after the migration
- Only writes of brand-new data that had never been seen before in any capacity received an HTTP 503 error code for those 37 seconds.
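For a sense of scale, the arithmetic behind that figure can be sketched as follows. It assumes, as a worst case, that the 37 seconds counted as full downtime and was the only downtime in a non-leap year; the function names are my own.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def availability(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Fraction of the period during which the service was available."""
    return 1 - downtime_seconds / period_seconds


def allowed_downtime_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Downtime permitted per period at a given number of nines (3 means 99.9%)."""
    return period_seconds * 10 ** -nines


# Worst case: treat the 37 seconds of partial downtime as a full outage.
migration = availability(37)
print(f"availability for the year: {migration:.5%}")

for nines in (3, 4, 5, 6):
    budget = allowed_downtime_seconds(nines)
    print(f"{nines} nines allows about {budget:,.1f} seconds per year")
```

Under those assumptions, 37 seconds sits comfortably inside a five-nines allowance (roughly 315 seconds a year) and just outside a six-nines one (roughly 32 seconds).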
Being able to keep a site up, running, and performant is, of course, only part of the responsibility of an operations team. If we are not shipping software continually, then it is impossible to maintain parity with, or beat, the competition. As Dr. Mik Kersten points out in his book From Project to Product, the organizations that master software delivery will survive; those that do not need only look at the Killer B's of Blockbuster, Barnes & Noble, and Borders.
To that end, Kersten describes four types of work that engineering teams can spend their time on - what he calls flow items:
- Feature - new value
- Defect - quality problems, bugs, etc.
- Risk - security, compliance, etc.
- Debt - tech debt
Of course, keeping the developers moving as fast as possible is not merely about shipping features. As the litany of security breaches over the years has shown us, mitigating risk can also be extremely important! But regardless of the type of work we ship, the more simply, safely, and quickly an operations team can facilitate the flow of that work through the system, the greater the potential for the success of the business. This focus on the smooth flow of work through the system should be recognizable as the First Way of DevOps to anyone who's read The Phoenix Project.
As we've learned from the DORA State of DevOps reports, this success holds, regardless of the size of the business. Not just with vehicles, speed kills!
For startups, the ability to ship safely and often is how we find product-market fit. For anyone who's read Eric Ries' The Lean Startup, it's all about how many experiments we can run in a period of time. Ries calls this "validated learning", and if we can do this faster and better than our competitors, then we can find our fit first. The competition can try to imitate and copy, but if we are flat out better at delivering software, we can even make more mistakes and still maintain our lead position in the market.
For enterprises, the average age of an S&P 500 company is under 20 years, down from 60 years in the 1950s. Dr. Kersten would argue this is because of their inability to deliver software effectively, as well as, perhaps, our stage in the Deployment Age as proposed by Carlota Perez.
But a simpler explanation could also be a concept with which we are already familiar: the Innovator's Dilemma. Incumbent companies do not feel the same pressure to innovate as startups do because of their existing revenue base. From the Wikipedia article on the Innovator's Dilemma: "Value to innovation is an S-curve: Improving a product takes time and many iterations. The first of these iterations provide minimal value to the customer but in time the base is created and the value increases exponentially. Once the base is created then each iteration is drastically better than the last." This is the same rationale we applied to startups above. Those who have the ability to experiment quickly, and thus innovate, become the disrupters and push the incumbents out.
What better explanation for why it's critically important that Operations engineering enables the enterprise to innovate as quickly as possible?
Perhaps no organization better typifies this balance between stability and speed than Google SRE. As we learned in the Google SRE Book, Google SRE has a concept of error budgets.
These seem pretty easy to understand at first blush. Each product supported by the SRE team reaches an agreement for an amount of downtime the service is permitted each month. Engineering is free to deploy as often as they like until they burn up their error budget. At that time, new feature deployments stop. The budget has been exhausted and we do not want to give the customer a bad user experience. This seems like a great argument for stability!
But there is a flip side to the concept of the error budget. If your team does not use up a sufficient amount of its error budget in a month, there is also a conversation asking engineering why they did not use more of it. A conversation about being too stable? This is because Google recognizes the need for speed, the need to push innovation, the need to stay ahead of the competition.
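Both sides of that conversation can be sketched in a few lines. Everything specific here is an assumption for illustration: the 99.9% objective, the 30-day month, the function names, and especially the 50% "too stable" threshold, which is a number I made up and not something Google publishes.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assuming a 30-day month


def error_budget_minutes(slo, period_minutes=MINUTES_PER_MONTH):
    """Downtime allowed per period for an availability objective such as 0.999."""
    return (1 - slo) * period_minutes


def budget_status(slo, downtime_minutes, min_use_fraction=0.5):
    """Classify a month of downtime against the budget.

    min_use_fraction is a hypothetical threshold below which the team is
    asked why it was not taking more risk.
    """
    budget = error_budget_minutes(slo)
    used_fraction = downtime_minutes / budget
    if used_fraction >= 1:
        return "freeze"      # budget exhausted: stop new feature deployments
    if used_fraction < min_use_fraction:
        return "too_stable"  # budget barely touched: room to move faster
    return "ok"


monthly_budget = error_budget_minutes(0.999)
print(f"monthly budget at 99.9%: {monthly_budget:.1f} minutes")
for downtime in (5, 30, 50):
    print(f"{downtime} minutes of downtime -> {budget_status(0.999, downtime)}")
```

The point of the sketch is that the same number drives both outcomes: spending it all pauses feature work, and spending almost none of it prompts a conversation about shipping faster.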
Stability or speed...why not both?
Whether your team is called Operations/System/DevOps, or Site Reliability Engineering, those teams have a very important role to play in the success of the business. I argue that Operations Engineering can be a strategic differentiator for the business, allowing customers constant access to the product, while enabling the developers to outpace the competition.
I'm sometimes asked why the developers can't just do all this work themselves. Well, they can. But what are we giving up? I've seen some of the most talented development teams in the industry try to do Operations-type work "on the side" as they try to deliver the four flow items to production. It always looks as if it were done in that manner. I've also seen talented development teams spend more than 60% of their week maintaining the "plumbing" required to keep their websites up, running, and deployable, instead of using that time to ship product.
I look forward to exploring the various ways that we can enable our organizations to deliver with stability and speed in more detail in future columns.