<img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=299788&amp;fmt=gif">
Skip to content

A Foundational Guide to Site Reliability Engineering, Service Level Objectives, and Service Level Indicators


Transforming DevOps from Build to Run

DevOps is a belief system, set of principles, and mindset that informs how software development and operations teams organize to deliver customer value. Close collaboration between development and operations teams, a belief in shared responsibility for outcomes, and commitment to quality and continuous improvement are all central DevOps themes. When that belief system and mindset are adopted across an organization or even a subset of teams, it begins to drive a larger cultural shift, and consequently, DevOps is often described as a cultural movement.

The challenge with DevOps, though, is that while it embraces a distinct way of thinking, and even has job titles, processes, practices, and underlying tools associated with it, it is not particularly prescriptive. DevOps is focused on how work gets done rather than on what work needs to be done. So as DevOps has continued to evolve and gain momentum across the enterprise, it is creating both the need and the opportunity for organizations to become clearer about what work, exactly, their development and operations teams should focus on to drive that customer value. Traditionally, that decision has been the domain of product development or business teams, with an emphasis on developing new features and enhancements.

While it’s true that new features and enhancements drive all-important customer value, a DevOps-driven, breakneck pace of new releases can also create performance and reliability issues that negatively impact user experience, detracting from that value. Speedy delivery of new features can also put operations teams in a jam: they’re charged with maintaining and/or improving system performance, but instead are forced to spend time reacting to issues. Organizations need to balance innovation with site reliability: they need to make informed decisions about whether to focus their efforts on building and releasing new features or on running the system, because both are essential to customer happiness.

Site reliability engineering (SRE), a practice that entails applying software engineering thinking to the operational aspects of site reliability, solves this problem. SRE brings both business and operations teams together around objective data about system performance so that they can make collective decisions about whether to prioritize building and releasing new features or proactively ensuring the system is running optimally. While the objective data part of that equation isn’t new—KPIs and metrics have been around for a long time—this collaboration between business and technology teams is.

Ultimately, SRE is a methodology for putting DevOps into practice, or as Niall Murphy, co-author of Site Reliability Engineering: How Google Runs Production Systems, describes it: “SRE is an opinionated implementation of the DevOps philosophy.”

 


Key Concepts in Site Reliability Engineering

Many software organizations have Service Level Agreements (SLAs) with their customers, which are essentially commitments around aspects of system performance and can be tied to key metrics like uptime or how quickly issues will be resolved. SLAs are often written into contracts and financially backed, so organizations want their systems to be as reliable as possible. Of equal, if not greater, importance: they also want their customers to be happy with their user experience.

While SLAs reflect a commitment to customers around specific system performance, Service Level Objectives (SLOs) are internal goals or thresholds that organizations set for system performance to make sure that they are able to meet the parameters outlined in the SLA and/or keep their customers happy. SLOs serve as goals or targets that engineering teams aim for around system reliability.

If SLAs are commitments to customers for system performance, and SLOs are internal goals for system performance, then Service Level Indicators (SLIs) are actual measurements for how the system is performing. SLIs can be tied to almost anything that is an indicator of reliability—uptime, incident management, capacity planning—but it is important to identify the right ones for a given organization.

No service is perfect, and users will always tolerate some degree of error. An error budget is the amount of error an aspect of a given service, like latency or availability, can experience before the user experience is affected to a degree that customers become unhappy. For example, if an organization has an SLO of 99.99% availability, the error budget is 0.01%. Organizations can use error budgets to help determine whether they can continue to focus their efforts on new features and enhancements, or if they need to direct their efforts to site reliability to remain within their error budget.
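
To make the arithmetic concrete, the following is a minimal sketch, in Python, of how an availability SLO translates into an error budget. The 99.99% target comes from the example above; the 30-day window and request volume are illustrative assumptions, not recommendations.

```python
# Minimal sketch: turning an availability SLO into a concrete error budget.
# The 30-day window and request volume below are illustrative assumptions.

SLO_TARGET = 0.9999                 # 99.99% availability objective
WINDOW_DAYS = 30                    # assumed rolling window

error_budget = 1 - SLO_TARGET                        # 0.01% allowed unavailability
budget_minutes = WINDOW_DAYS * 24 * 60 * error_budget

requests_in_window = 10_000_000                      # assumed traffic volume
budget_failed_requests = requests_in_window * error_budget

print(f"Error budget: {error_budget:.2%} of the window")
print(f"Allowed downtime: about {budget_minutes:.1f} minutes per {WINDOW_DAYS} days")
print(f"Allowed failed requests: {budget_failed_requests:,.0f} of {requests_in_window:,}")
```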

Connecting the Concepts

Once SLOs, SLIs, and error budgets are established, organizations can track SLIs (actual measurements) across a system within a given time frame and measure their performance against an SLO—the predetermined goal that represents a positive user experience. Then, using mathematical calculations, they can determine what proportion of those events sit within the goal, and what proportion fall outside of it. By tracking increases in error rates and watching the slope of errors grow relative to the error budget, they can identify growing risks. If errors are increasing and the error budget is dwindling, then both business and engineering teams can agree to focus on improving performance and reliability instead of on new features and enhancements. This critical information also enables operations teams to proactively respond to issues before an outage occurs and user experience is negatively impacted.
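
As a rough illustration of that calculation, here is a minimal sketch in Python that compares a measured SLI against an SLO and reports how much of the error budget has been consumed. The request counts and the 99.9% target are illustrative assumptions:

```python
# Minimal sketch: compare an SLI (actual measurement) against an SLO (goal)
# and track error budget consumption. All numbers are illustrative assumptions.

slo_target = 0.999              # goal: 99.9% of requests meet the objective

total_events = 2_400_000        # all requests observed in the window
good_events = 2_398_200         # requests that met the objective

sli = good_events / total_events              # measured performance
error_budget = 1 - slo_target                 # allowed proportion of bad events
bad_ratio = 1 - sli                           # observed proportion of bad events
budget_consumed = bad_ratio / error_budget    # fraction of the budget already burned

print(f"SLI: {sli:.4%} against a target of {slo_target:.1%}")
print(f"Error budget consumed: {budget_consumed:.0%}")

if budget_consumed >= 1.0:
    print("Budget exhausted: prioritize reliability work over new features")
```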


Preventing Issues by Shifting Left/Shifting Right

It is important to understand that—while an ounce of prevention is always better than a pound of cure (i.e., it’s better to prevent issues than to resolve them)—issues, even critical ones, will undoubtedly arise because no system is perfect. Even though monitoring SLIs helps organizations identify issues before they happen, they must also invest in systems and processes that both minimize issues and make resolving them efficient. In an SRE approach, ensuring a system’s reliability includes both building reliability directly into the applications themselves early on and evaluating production metrics once the software is released. The concepts of shift left and shift right are critical to achieving this balance.

Shift left refers to the software development practice of focusing on quality in the initial development process in order to prevent issues, streamline testing, and ensure a better customer experience when the software is deployed. It is exactly how it sounds, in that it shifts the testing process to the left—or the beginning—of the development process. The overarching goal of shift left is to mitigate risk and reduce defects that might impact customer satisfaction by being proactive early on. 

Using the mantra “Test early and often,” shift left embraces a culture of shared responsibility—it prioritizes building the development pipeline so that the software is as close to “perfect” as possible in the initial release. This approach moves away from focusing exclusively on speed, passing the code off to the testing team, and letting them sort out the issues. Shift left also prioritizes continuous integration and continuous delivery (CI/CD). In CI/CD, building, testing, and deployment are automated, and manual effort is minimized so testing can be done quickly, early, and often.

Shift left’s lesser-known counterpart, shift right, enables teams to test in production and prepare for the undefined, unknown, and unexpected. The term refers to the practice of continuing to do thorough testing after the software is released. The reason for this is that, while staging environments can approximate production environments, they can never truly duplicate them, so post-release testing is necessary to truly understand what users are experiencing, to identify problems that might only exist in that real-life environment, and to get a better understanding of system dependencies.


A benefit of shifting right is that it creates a continuous feedback loop from actual user experience directly back into the development process. Teams are constantly shifting, meaning that they’re going through agile transformations and transitions all the time, and their feedback shifts and evolves with them. Maintaining an ongoing channel for real-world feedback is therefore imperative to the development process and, ultimately, to the success of the software being deployed.

To boil it all down into one cohesive sentence: shift left emphasizes prevention and risk reduction as the key to an eventual positive customer experience, whereas shift right emphasizes the user experience as critical information in the feedback loop, leading to ongoing improvements and, ultimately, satisfied customers.


SLO as a Service: Automating the Collection of SLO Metrics

For organizations to implement SRE practices and make objective, data-driven decisions about the health and reliability of their systems, they first need to set SLOs, identify the appropriate SLIs to track, then collect, calculate, and analyze that data. The reality is, that’s easier said than done. There is no limit to the number of SLIs an organization could track, and narrowing it down to the right ones—things that are true indicators of customer happiness—is daunting. Adding to the complexity is the difficulty of the mathematical calculations necessary to read the raw data and calculate error budgets. It’s a big lift that requires significant domain expertise.


When SRE was first introduced, it was largely the domain of enterprise-scale organizations that had the resources to commit staff and engineering hours to the problem and spin up custom SLO monitoring systems. However, as the practice spreads, organizations of all sizes are recognizing the need to implement SRE, and many do not have the resources to go it alone. Fortunately, the concept of SLOs as a service is emerging, and platforms are becoming available that collect data from monitoring systems, run the mathematical calculations, analyze and present data as actionable information, and even send alerts as error budgets are burned.
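
As an illustration of the kind of alerting such a platform might automate, the following is a minimal sketch, in Python, of a burn-rate check: it compares how quickly the error budget is being consumed against the rate at which it would be consumed if errors arrived exactly at the SLO threshold. The SLO target, the one-hour lookback, and the alert threshold of 10 are illustrative assumptions.

```python
# Minimal sketch of a burn-rate check. The SLO target, request counts,
# and alert threshold below are illustrative assumptions.

def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.
    A value of 1.0 means the budget would last precisely one SLO window."""
    observed_bad_ratio = 1 - (good / total)
    allowed_bad_ratio = 1 - slo_target
    return observed_bad_ratio / allowed_bad_ratio

# Assumed request counts from a short lookback window (e.g., the last hour).
rate = burn_rate(good=59_100, total=60_000, slo_target=0.999)

# An illustrative fast-burn threshold: page when the budget is burning
# ten times faster than the sustainable rate.
if rate >= 10:
    print(f"Page on-call: error budget burning {rate:.1f}x faster than sustainable")
else:
    print(f"Burn rate {rate:.1f}x: within tolerance")
```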

As organizations progress along their SRE journey, they should consider investing in an SLO platform and automating these practices, as they stand to reduce hours of labor-intensive work to minutes and increase the accuracy and effectiveness of their SRE efforts.

Following are several key considerations organizations should keep in mind as they explore and evaluate automated SLO platforms.


The Site Reliability Glossary

DevOps – DevOps is a cultural movement and set of principles and practices that informs how development and operations teams organize and collaborate in order to build, test, and deploy software faster and more reliably—objectives that drive customer value and result in a better user experience.

To achieve those objectives, DevOps practices incorporate automation and focus on continuous improvement, continuous integration (CI), and continuous delivery (CD).

Error Budget – Error budgets rely on the notion that no service is perfect, and users will tolerate some degree of error. An error budget represents the amount of error an aspect of a given service, like latency or availability, can experience before the user experience is affected to a degree that customers become unhappy. For example, if an organization has an SLA guaranteeing 99.99% availability, the error budget is 0.01%. Error budgets are used to help organizations determine whether they can continue to focus their efforts on new features and enhancements, or if they need to direct their efforts to work related to site reliability to remain within their error budget.

Service Level Agreement (SLA) – Service Level Agreements are commitments that organizations make to their customers around system performance, and are often tied to key metrics, such as uptime or how quickly issues will be resolved. SLAs are often written into contracts and are sometimes even financially backed.

Service Level Indicator (SLI) – If SLAs are commitments to customers for system performance, and SLOs are internal goals for system performance, then Service Level Indicators (SLIs) are actual measurements for how the system is performing. Organizations can compare SLIs—actual measurements—against SLOs—internal objectives—to determine if they need to focus work on improving performance and reliability or if they can continue to focus on new features and enhancements.

Service Level Objective (SLO) – While SLAs reflect a commitment to customers around a system’s performance, Service Level Objectives (SLOs) are internal thresholds that organizations set for system performance to make sure that they are able to meet parameters outlined in the SLA. In an organization that practices DevOps, SLOs serve as goals or commitments that development and operations agree to around system reliability. SLOs should be focused on the user experience and reflect the minimum performance necessary for a positive user experience.

Site Reliability Engineering (SRE) – Site Reliability Engineering is an outgrowth of DevOps that applies software engineering thinking to the operational aspects of site reliability. In practice, site reliability engineers collect data and use mathematical formulas to guide decisions about what to work on, in order to create a balance between releasing new features and maintaining and enhancing site reliability.

Shift Left – Shift left is a DevOps concept that refers to the software development practice of focusing on testing early in the development process in order to prevent issues and ensure a better customer experience when the software is initially deployed. Shift left also prioritizes continuous integration and continuous delivery (CI/CD). In CI/CD, building, testing, and deployment are automated so testing can be done quickly, early, and often.

Shift Right – Shift right refers to the practice of testing thoroughly in the later stages (i.e., the post-release phase) of the development process. The goal of shift right is to focus on user experience and production scenarios as important sources of information. Issues found in this post-release testing directly impact customer satisfaction and serve to inform developers about what types of changes need to be made to the software.


Getting Started with SLOs

To help your organization level up your SRE efforts and determine their effectiveness, Isos Technology is now offering SLO Bootcamps. Our SLO Bootcamps are hands-on workshops for cross-functional teams to define Service Level Objectives (SLOs) for the products and services they’re responsible for.

Each SLO Bootcamp includes:

  • 2 interactive, virtual sessions (2 hours each), or 2 on-site sessions (4 hours each)
  • Instruction and guidance from an Isos SRE Coach and Technical Advisor
  • Short educational lectures on reliability and SLO methodology
  • Small group exercises with designated "service owners" for the organization

What you'll walk away with:

  • A defined, achievable initial SLO
  • A plan for managing your SLO with an error budget
  • A clear Error Budget Policy
  • An understanding of how to create and manage additional SLOs

DevOps and SRE can be complicated, but SLOs are a clear way to define, measure, and manage reliability to ensure you are meeting customers’ expectations while building and running. If you’re interested in partnering with Isos to conduct an SLO Bootcamp or multiple Bootcamps with your teams, please provide us with your information here and we’ll be in touch.

Resources

Interested in learning even more about SRE, SLOs, and SLIs? One great place to start is at the birthplace of SRE itself: Google. It has a ton of great resources that can be found at SRE.Google. Following are a few of our other favorite sites for SRE resources:

We've also put together a list of some of our favorite books on the subject:


How Isos Technology Can Help

As a premier Atlassian Platinum and Enterprise Solution Partner with an Agile at Scale Specialization, we’re experts in change management for people and processes, as well as tool adoption to support agile practices. Our comprehensive agile consulting services help organizations increase customer and employee satisfaction, improve operations, and enhance their ability to deliver. We offer support for agile transformations, agile coaching, agile training and certification, agile software implementations, maturity assessments, and agile staffing.

To learn more about Isos Technology’s agile services, including Enterprise Agile Coaching, visit https://www.isostech.com/services/agile-services.
