Automated incident management: what it is, benefits, and implementation
September 3, 2024
Introduction
It’s 3 a.m. and the on-call engineer gets an alert about a critical incident. They have to perform a number of manual, time-consuming tasks at precisely the moment when resolving the incident quickly matters most. They’re exhausted and have to rely on their own know-how to find the right information, including:
- Who has the relevant expertise to help
- Where the relevant dashboards are and how to access them
- What the process is to get the relevant permissions for access or to perform actions
This knowledge is usually not documented, and when it is, the documentation quickly becomes outdated, has gaps, or is unclear.
The ideal scenario is a world where incident management is less confusing, information is readily available, and remediation steps are clear. For instance, a Slack channel is created immediately for the responder team, the owner of a service and its documentation are easy to find, and the team is notified automatically when the incident is resolved.
This is where automated incident response comes in.
What is automated incident management?
Automated incident management is a holistic approach to managing and responding to incidents. The approach leverages automation and self-service tools to streamline the incident response process and empower everyone in your organization to take ownership of the incidents relevant to them.
Automation reduces the time it takes to:
- detect an incident
- notify the on-call engineer of the incident
- resolve the incident
- ensure that this type of incident does not recur
This eases the pressure on engineers by lowering their stress levels, and it helps the engineering organization reduce mean time to resolution (MTTR). Swift incident resolution also means happier customers: uptime is correlated with customer retention, which directly impacts revenue.
Site reliability engineers (SREs) are traditionally tasked with managing incidents, as their remit is to ensure the reliability, availability and performance of systems. Automated incident management is not about replacing SREs but rather augmenting their capabilities and distributing responsibility. By doing so, the engineering team should be able to reduce the time to detect, diagnose and resolve incidents, ultimately improving system reliability and reducing the burden on the SRE team.
Automating documentation and response steps is a positive move, but a key concern for DevOps and platform engineers is building in standards so that the on-call engineer doesn’t veer off course when attempting to resolve the incident. In other words, automated incident response should also factor in golden paths: platform engineers embed standards into tasks that on-call engineers can accomplish independently through automation or self-service, which gives developers freedom while ensuring the standards are adhered to.
Building an incident response framework with a portal
An internal developer portal provides a framework that factors in automation, golden paths, standards and defined workflows. With a portal, engineers can establish workflows that automate the incident response process: detecting incidents, notifying the on-call team, and initiating targeted remediation actions. For instance, if a monitoring system identifies a surge in CPU usage, a workflow could automatically alert the DevOps team, increase resource allocation, and generate a ticket in the incident management system, ensuring a prompt and coordinated response.
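The CPU surge example can be sketched in a few lines of Python. This is only an illustration of the idea of a workflow as an ordered list of steps, not how any particular portal implements it: the `Alert` class, the step functions and the threshold are all hypothetical, and the steps simply record what a real integration with your chat, orchestration and ticketing tools would do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    """A hypothetical alert emitted by a monitoring system."""
    service: str
    metric: str
    value: float      # observed value, e.g. CPU utilization in percent
    threshold: float  # value above which the workflow should fire


# Each step receives the alert and appends a record of what it did.
# Real steps would call out to your own tools instead of writing to a log.
Step = Callable[[Alert, List[str]], None]


def notify_devops(alert: Alert, log: List[str]) -> None:
    log.append(f"notify: DevOps team alerted about {alert.service}")


def scale_up(alert: Alert, log: List[str]) -> None:
    log.append(f"scale: requested more resources for {alert.service}")


def open_ticket(alert: Alert, log: List[str]) -> None:
    log.append(
        f"ticket: incident created for {alert.service} "
        f"({alert.metric}={alert.value})"
    )


CPU_SURGE_STEPS: List[Step] = [notify_devops, scale_up, open_ticket]


def run_workflow(alert: Alert, steps: List[Step]) -> List[str]:
    """Run every step in order, but only if the alert breaches its threshold."""
    log: List[str] = []
    if alert.value <= alert.threshold:
        return log  # metric is within bounds, nothing to do
    for step in steps:
        step(alert, log)
    return log


surge = Alert(service="cart-service", metric="cpu_usage", value=93.0, threshold=80.0)
for line in run_workflow(surge, CPU_SURGE_STEPS):
    print(line)
```

Keeping the steps as a plain list makes the order explicit and lets a platform team add or swap steps without touching the trigger logic.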
Here is how the framework for incident management can be built with a developer portal:
- Centralized hub
The portal is the go-to place for developers to get real-time information via the software catalog, such as:
- Who the owner of a service is
- Who has the specific domain expertise required
- Health metrics about infrastructure and applications
Having all of this information (and more) centralized ensures that on-call engineers have everything they need, so they don’t have to scramble through outdated documentation to find owners, dependencies, and the like.
The on-call engineer would get a link sent automatically in the incident channel to easily open the portal catalog, check out the asset's dependencies, see the latest deployments, and dive into monitoring dashboards for further investigation.
- Centralized communication
As well as ensuring that the right on-call engineer is notified of an incident instantly, the portal will open a Slack or Teams channel to handle the incident. It will automatically add the relevant stakeholders to this channel, including owners, experts and the affected customers’ dedicated customer success managers, taking into account domain expertise, time zones and more. This ensures coordinated communication throughout.
- Centralized remediation
One of the key delays in dealing with incidents is the time on-call engineers spend waiting for approvals, for example for access to production or for rollbacks. These can take hours if the incident happens at night, which significantly increases the time taken to resolve it and could cause customers significant downtime.
The portal’s automations enable users to streamline processes, for example by automatically granting production access to whoever is on call. Meanwhile, self-service actions can be used to streamline hotfixes such as reverting to a previous version, scaling out, or restarting a pod, reducing reliance on error-prone manual runbooks.
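As a rough sketch of what "automated access for whoever is on call" could look like, the Python below grants access only when the requester matches the current on-call engineer. Everything here is hypothetical: the function name, the `AccessGrant` record and, in particular, the four-hour expiry, which is an assumption added for illustration rather than something a portal necessarily enforces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AccessGrant:
    engineer: str
    environment: str
    expires_at: datetime


def grant_on_call_access(
    requester: str,
    current_on_call: str,
    now: datetime,
    environment: str = "production",
    ttl: timedelta = timedelta(hours=4),  # assumed time box, not from the article
) -> Optional[AccessGrant]:
    """Grant time-boxed access without manual approval, but only to the
    engineer who is on call right now. Anyone else gets None and would
    fall back to the normal approval process."""
    if requester != current_on_call:
        return None
    return AccessGrant(engineer=requester, environment=environment, expires_at=now + ttl)


now = datetime(2024, 9, 3, 3, 0, tzinfo=timezone.utc)
print(grant_on_call_access("dana", current_on_call="dana", now=now))
print(grant_on_call_access("sam", current_on_call="dana", now=now))
```

The point of the sketch is that the approval rule is encoded once by the platform team, so the on-call engineer does not need to wake anyone up to apply it.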
- Centralized monitoring
SREs can build scorecards to define standards and monitor production readiness of different services and applications.
Rather than having to manually verify readiness or compliance, they can glance at a scorecard to ensure best practices are being upheld, such as checking that monitoring services are active, documentation is available, owner details are established, and the correct number of replica sets is in place.
As the software catalog is connected to all of a company’s monitoring tools, it can be used to trigger alerts based on health metrics or other types of metrics. For example, a low operational readiness score could trigger an email to the service owner.
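To make the scorecard idea concrete, here is a minimal Python sketch that scores a service against the four checks mentioned above and produces a notification when the score is low. The `Service` fields, the rule names, the minimum of two replicas and the 75% threshold are all assumptions chosen for illustration; a real scorecard would read these values from the catalog.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Service:
    name: str
    owner: Optional[str]
    has_monitoring: bool
    has_docs: bool
    replicas: int


# Hypothetical production-readiness rules, one per check named in the text.
RULES: Dict[str, Callable[[Service], bool]] = {
    "monitoring is active": lambda s: s.has_monitoring,
    "documentation is available": lambda s: s.has_docs,
    "owner is set": lambda s: bool(s.owner),
    "at least 2 replicas": lambda s: s.replicas >= 2,
}


def failed_rules(service: Service) -> List[str]:
    """Names of the rules this service does not satisfy."""
    return [name for name, check in RULES.items() if not check(service)]


def readiness_score(service: Service) -> float:
    """Fraction of rules passed, between 0.0 and 1.0."""
    return 1 - len(failed_rules(service)) / len(RULES)


def alert_if_not_ready(service: Service, min_score: float = 0.75) -> Optional[str]:
    """Return a notification message when the score is below min_score."""
    score = readiness_score(service)
    if score >= min_score:
        return None
    recipient = service.owner or "platform-team"  # fallback is an assumption
    failing = ", ".join(failed_rules(service))
    return f"To {recipient}: {service.name} readiness is {score:.0%}; failing: {failing}"


cart = Service(name="cart-service", owner="team-cart",
               has_monitoring=True, has_docs=False, replicas=1)
print(alert_if_not_ready(cart))
```

In practice the returned message would be handed to an email or chat integration rather than printed.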
Benefits of automated incident management
Faster incident resolution - reduction in MTTR
By automating the different stages of incident response, a portal helps the on-call engineer resolve the incident much faster. Instead of manual methods and waiting for approvals, a portal can provide:
- Immediate targeted alerts
- Automated incident creation (pre-filling the incident management system with relevant details from the Port catalog)
- Automated opening of a Slack/Teams channel
- An automated link provided to an on-call engineer to open the catalog
- Self-service actions for remediation
- Automated incident status updates (including resolution) in the dedicated Slack channel and the incident management system
Reduced on-call fatigue
Fewer tedious manual tasks and less waiting for approvals can significantly reduce on-call fatigue, too. For example, without automated incident response, an incident detected in the middle of the night may take the whole night to resolve rather than a few hours. The on-call engineer is then likely to be more exhausted, which can impair their productivity afterwards.
Improved team collaboration
With everyone who needs to know about the incident automatically alerted and updated in the selected communication channel, there’s a greater sense of teamwork. There’s also more transparency, and colleagues are less likely to get frustrated with constant requests (for access or for information), time spent waiting for approvals, and a general sense of being overwhelmed by reminders, interruptions and follow-ups. Better team collaboration means better productivity and morale.
Enhanced system reliability
A portal helps to enhance system reliability in a number of ways. First, maturity scorecards ensure that best practices are being upheld. Second, key performance indicators like the number of outages, mean time to resolution (MTTR) and failed deployments can be defined, monitored and analyzed through dashboards. Third, because all your engineering metrics and standards are tracked in the portal, you can keep tabs on how many incidents are open versus resolved, and even understand how metrics are connected and correlated. For instance, you may notice a correlation between a slowdown in deployment frequency and a spike in incidents.
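For readers who want to see how two of these numbers are derived, the sketch below computes MTTR and the open-versus-resolved count from a list of incident timestamps. The `Incident` tuple shape and the sample data are made up for illustration; a dashboard would pull the same fields from your incident management system.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (opened_at, resolved_at); resolved_at is None while the incident is still open.
Incident = Tuple[datetime, Optional[datetime]]


def mean_time_to_resolution(incidents: List[Incident]) -> Optional[timedelta]:
    """Average of (resolved_at - opened_at) over resolved incidents only."""
    durations = [resolved - opened for opened, resolved in incidents if resolved is not None]
    if not durations:
        return None  # MTTR is undefined until something has been resolved
    return sum(durations, timedelta()) / len(durations)


def open_vs_resolved(incidents: List[Incident]) -> Tuple[int, int]:
    """Return (number still open, number resolved)."""
    resolved = sum(1 for _, resolved_at in incidents if resolved_at is not None)
    return len(incidents) - resolved, resolved


# Illustrative data only.
incidents: List[Incident] = [
    (datetime(2024, 9, 1, 3, 0), datetime(2024, 9, 1, 4, 0)),    # resolved in 1 hour
    (datetime(2024, 9, 2, 14, 0), datetime(2024, 9, 2, 17, 0)),  # resolved in 3 hours
    (datetime(2024, 9, 3, 9, 0), None),                          # still open
]

print("MTTR:", mean_time_to_resolution(incidents))
print("open, resolved:", open_vs_resolved(incidents))
```

Note that open incidents are excluded from the average here; whether to count them (for example, up to the current time) is a definition each team has to settle for itself.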
Automated incident response best practices
- Start small: Choose a specific incident type or service to pilot your chosen automated incident management solution. You could survey your on-call engineers or use scorecards to identify which type of incident you should kick off with.
- Help yourself to hit your OKR: Does your team have a specific OKR in place? Is one of the KPIs to reduce MTTR? The best way of hitting your goals is to focus on tasks that will get you there quickly. So instead of automating each stage of incident management immediately, you can begin with a mix of automations and self-service actions, and add more options as your on-call engineers get accustomed to the portal.
- Train your team: Educate your team on incident response best practices and how to use the new tools and processes. That can be through documentation on how to set up or use automations and self-service actions, or by providing details on specific topics such as code reusability and error handling.
- Iterate and improve: Continuously gather feedback, measure your success, and refine your solution over time using a combination of surveys, metrics, monitoring tools and testing.
Automated incident response using an internal developer portal
In a nutshell, automated incident management is a game-changer for handling incidents more smoothly and efficiently. By automating key parts of the process using an internal developer portal, teams can resolve issues faster, cut down on the stress of being on-call, and work together more seamlessly. This approach doesn't just make life easier for engineers; it also strengthens system reliability and helps teams hit important goals like reducing MTTR. The payoff for automated incident response is an upgrade in performance, an improved developer experience, better collaboration, and more satisfied customers.
Further reading: How internal developer portals improve incident management