DORA metrics are four key measurements that can help you gauge the productivity, velocity, and efficiency of your software development teams. One of these metrics is change failure rate, which measures the percentage of changes deployed to production that result in failure within a given period.
In this article, we’ll discuss how change failure rate is measured, what kinds of changes count as failures, and how to reduce your failure rate by instituting best practices.
What is change failure rate?
Change failure rate (CFR) indicates the stability of your website or application and serves as a measure of potential risks to that stability. It is calculated as the percentage of deployments that result in service impairment or a service outage, which typically means that some portion of your website is down.
Calculating CFR requires three measurements (combined into a simple formula in the sketch after this list):
- The number of deployments or releases in a given period of time
- The number of bug fixes or hotfixes you released after your initial deployment
- How many of your deployments, including both initial and subsequent deployments, caused failures or incidents
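Putting those three measurements together, here is a minimal sketch of the formula in Python; the function and parameter names are illustrative, not taken from any particular tool:

```python
# Minimal CFR sketch; names are illustrative, not tied to a specific tool.
def change_failure_rate(initial_deployments: int,
                        fix_deployments: int,
                        failures: int) -> float:
    """Return CFR as the percentage of all deployments that caused a failure."""
    total = initial_deployments + fix_deployments
    if total == 0:
        return 0.0
    return failures / total * 100
```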
This information is helpful not only for site reliability engineers (SREs), who are responsible for managing the incidents these failures cause, but also for developers engaged in continuous improvement, who have a vested interest in preventing repeated failures from the outset. Approaching the problem from both sides, SREs and developers can work together to increase the stability and reliability of their website.
A higher change failure rate can indicate that a website is frequently unresponsive to user input, loading slowly, or otherwise behaving in unusual or unintended ways. It can also point to security issues, which may put sensitive information at risk and cause other problems.
A lower CFR, however, could indicate that a website has fewer issues and is reliably responsive, providing users with valuable information and access to what they need.
Why is measuring change failure rate important?
Reliable websites are a huge part of providing a good customer or user experience. Users return to a good website, especially one that quickly and reliably helps them accomplish a task, perform an action, make a purchase, or find the information they need.
Measuring your change failure rate is important because it helps create a tangible definition of what a “good” website is. Failed changes reduce the quality and reliability of your website, making it difficult for users to accomplish their tasks.
In an e-commerce setting, a “sticky” website — or a website that has many return users — has a big impact on other business metrics, like:
- Average customer acquisition cost
- Average purchase size per customer
- Revenue gained from e-commerce vs. brick-and-mortar stores
- Net promoter score
With this in mind, your change failure rate can be seen as a good starting point for driving further organizational change or contributing to wider company objectives related to increasing revenue.
How to measure change failure rate
Before we can measure our change failure rate, we must define failure. The definition largely depends on the standards your organization has implemented and the services you control. For example, a breaking update pushed by a cloud provider is outside your control and does not count as a failure for your organization.
Some organizations count and track failures by watching for:
- Failed deployments from your CI/CD tools
- Alerts and incidents from your observability or incident management tools
- New bug tickets in your ticketing system
- Deployments with tags, such as “hotfix” or “rollback”
However, you may also want to include standards failures. For example, if your team has set a standard of updating all dependencies as soon as possible, you might count a deployment that ships with outdated dependencies as a failure. If your team hasn’t yet implemented standards based on industry benchmarks or internal needs, consider defining them as well so that your change failure rate is as comprehensive a measurement as possible.
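To make that concrete, here is one way a failure check could look in code. This is a sketch under assumptions: the `Deployment` record and its fields are hypothetical, not the schema of any real CI/CD or incident-management tool.

```python
from dataclasses import dataclass, field

# Hypothetical deployment record; field names are assumptions for illustration.
@dataclass
class Deployment:
    service: str
    tags: set[str] = field(default_factory=set)   # e.g. {"hotfix", "rollback"}
    linked_incidents: int = 0                     # incidents traced to this deploy
    pipeline_failed: bool = False                 # reported by your CI/CD tool
    meets_standards: bool = True                  # e.g. dependencies up to date

FAILURE_TAGS = {"hotfix", "rollback"}

def is_failure(d: Deployment) -> bool:
    """Apply the signals above: CI/CD failures, incidents, telltale tags,
    and standards violations all qualify a deployment as a failure."""
    return (
        d.pipeline_failed
        or d.linked_incidents > 0
        or bool(FAILURE_TAGS & d.tags)
        or not d.meets_standards
    )
```

A deployment then counts toward your failure total if any of these signals fires.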
Once you have qualified the failures you’ll count toward CFR, determine the period of time over which you want to measure performance; CFR is not a usable metric unless it accounts for time. Teams typically calculate CFR monthly or quarterly, which covers multiple sprints per calculation and gives a fuller picture of the development team’s performance.
It also helps to gather as much information as possible about each release at the time it ships, along with any subsequent hotfixes. Let’s say your team recently released a large piece of code to production, and it led to four separate incidents. Each incident was then addressed by a hotfix, one of which failed in turn, requiring more code and another deployment.
The initial deployment could be considered a single failure, but for a more precise measurement, we can instead count:
- Each incident that occurred as a result of the initial deployment (4)
- The subsequent hotfix that did not successfully resolve one issue (1)
The same can be said for counting total deployments — in our case, we had the initial deployment (1), the four hotfix deployments (4), and the final deployment of a second hotfix (1), for a total of six deployments. Measuring CFR in this more granular way allows for deeper insight into the health of your software development lifecycle.
To calculate your quarterly CFR, divide your count of failures by the total number of deployments to production made during the quarter, then multiply by 100 to get a percentage. In our example, five failures across six deployments results in a CFR of roughly 83% for this period.
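As a quick sanity check, here is the same arithmetic in Python, using the numbers from the example above:

```python
# Numbers from the example: one initial release, four hotfixes,
# and one follow-up deployment to fix the hotfix that failed.
total_deployments = 1 + 4 + 1   # six deployments in the period
failures = 4 + 1                # four incidents plus one failed hotfix

cfr = failures / total_deployments * 100
print(f"CFR: {cfr:.0f}%")       # prints "CFR: 83%"
```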
This is quite a high CFR, and based on our understanding of the value a stable website provides, this is a significant indicator that this development team could use some organizing and standardization. A lower CFR, which indicates a healthy, stable website, should be the goal for most organizations, with the ideal resting at around 5%.
Improve your change failure rate
Measuring CFR is just the beginning. Once you’ve collected this data about the health of your SDLC, you can figure out where to make changes and institute standards to avoid failures.
There are several ways you can improve your CFR, including:
- Automating code testing: Automated tests reduce the risk of human error and are typically faster and more thorough than manual testing.
- Automating deployment: Deployment automation abstracts away the more complex aspects of your deployment process, which not only protects it from human error but also ensures that each deployment follows your best practices.
- Implementing standards: Engineering standards are a means for unifying development teams around a set of research-based expectations for high-quality software. Setting and raising standards improves communication between teams, developers, and managers by providing guidance on what counts as acceptable code for deployment.
- Adopting an open internal developer portal: An internal developer portal makes it easier for engineering managers and developers to improve CFR, with standards management and self-service actions baked in.
Many of the methods for improving DORA metrics are similar, so the ways to improve CFR may already be familiar to you. If so, CFR may be a helpful addition to the metrics you already track to demonstrate stability.
Use Port’s internal developer portal to track your change failure rate
Port offers an open internal developer portal that centralizes all of the data you need to measure your change failure rate. The portal contains:
- Your service catalog, complete with a graphical depiction of your dependencies and the health and status of your services, making it easy to determine what is and isn’t a failure
- Your scorecards, which help manage standards and ensure teams are submitting high-quality code upfront
- Your rollbacks, hotfixes, and bug counts
The portal provides everything you need to calculate your CFR, and it also lets you take action directly within the portal to make changes and improvements. Port’s open, unopinionated data model gives you the flexibility to build custom metrics tailored to your company’s interests and its specific definitions of failure.
Want to learn more about internal developer portals and how to improve your change failure rate? Take a look at Port’s open demo and read more about Port’s Insights.