Automated incident management: what is it, benefits, implementation

September 3, 2024

Introduction

It’s 3 AM and the on-call engineer gets an alert about a critical incident. They have to perform a number of manual, time-consuming tasks at exactly the moment when resolving the incident matters most. They’re exhausted and have to rely on their own know-how to find the right information, including:

  • Who has the relevant expertise to help
  • Where the relevant dashboards are and how to access them
  • What the process is to get the relevant permissions for access or to perform actions

This knowledge is usually not documented. When it is, the documentation quickly becomes outdated, has gaps, or is unclear.

The ideal scenario is one where incident management is less confusing, information is readily available, and remediation steps are clear. For instance, responders could immediately create a Slack channel for the response team, easily find the owner of a service or its documentation, and have the team notified automatically when the incident is resolved.

This is where automated incident response comes in. 

What is automated incident management

Automated incident management is a holistic approach to managing and responding to incidents. The approach leverages automation and self-service tools to streamline the incident response process and empower everyone in your organization to take ownership of their part of an incident.

Automation reduces the time it takes to:

  • detect an incident
  • notify the on-call engineer of the incident
  • resolve the incident
  • ensure that this type of incident does not recur

This eases the pressure on engineers by reducing their stress levels, and it helps the engineering organization reduce mean time to resolution (MTTR). Swift incident resolution also means happier customers: uptime is correlated with customer retention, which directly impacts revenue.

Site reliability engineers (SREs) are traditionally tasked with managing incidents because their remit is to ensure the reliability, availability and performance of systems. Automated incident management is not about replacing SREs but rather augmenting their capabilities and distributing responsibility. By doing so, the engineering team should be able to reduce the time to detect, diagnose and resolve incidents, ultimately improving system reliability and reducing the burden on the SRE team.

Automating documentation and remediation steps is a positive move, but a key concern for DevOps and platform engineers is building in standards so that the on-call engineer doesn’t veer off course when attempting to resolve the incident. In other words, automated incident response should also factor in golden paths: platform engineers embed standards into the tasks that on-call engineers can accomplish independently through automation or self-service, which gives developers freedom while ensuring those standards are adhered to.

Building an incident response framework with a portal

An internal developer portal provides the appropriate framework that factors in automation, golden paths, standards and defined workflows. With a portal, engineers can establish workflows to automate the incident response process, streamlining tasks such as incident detection and on-call team notifications and initiating targeted remediation actions. For instance, if a monitoring system identifies a surge in CPU usage, a workflow could automatically alert the DevOps team, increase resource allocation, and generate a ticket in the incident management system, ensuring a prompt and coordinated response.
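
To make the CPU example concrete, here is a minimal sketch of such a workflow in Python. It is an illustration only: the `Workflow` class, the 90% threshold, the two extra replicas and the step names are all assumptions made for this sketch, and the steps are recorded in a list rather than sent to any real monitoring, scaling or ticketing system.

```python
from dataclasses import dataclass, field

CPU_THRESHOLD = 0.90  # assumed threshold: 90% utilization


@dataclass
class Workflow:
    """Records the steps a workflow would trigger (no real integrations)."""
    steps: list[str] = field(default_factory=list)

    def alert_team(self, team: str, service: str) -> None:
        self.steps.append(f"alert:{team}:{service}")

    def scale_out(self, service: str, extra_replicas: int) -> None:
        self.steps.append(f"scale:{service}:+{extra_replicas}")

    def open_ticket(self, service: str, summary: str) -> None:
        self.steps.append(f"ticket:{service}:{summary}")


def handle_cpu_sample(workflow: Workflow, service: str, cpu: float) -> bool:
    """Run the three response steps if CPU usage reaches the threshold."""
    if cpu < CPU_THRESHOLD:
        return False
    workflow.alert_team("devops", service)
    workflow.scale_out(service, extra_replicas=2)
    workflow.open_ticket(service, f"CPU at {cpu:.0%}")
    return True


workflow = Workflow()
handle_cpu_sample(workflow, "cart-service", 0.97)
print(workflow.steps)
```

Keeping the three steps in one function means the alert, the scaling change and the ticket always happen together and in the same order, which is the "coordinated response" the paragraph above describes.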

Here is how the framework for incident management can be built with a developer portal:

  1. Centralized hub

The portal is the go-to place for developers to get real-time information via the software catalog, such as:

  • Who the owner of a service is
  • Who has the specific domain expertise required
  • Health metrics about infrastructure and applications

Having all of this information (and more) centralized ensures that on-call engineers have what they need, so they don’t have to scramble through outdated documentation to find owners, dependencies, and the like.
The on-call engineer gets a link sent automatically to the incident channel, from which they can open the portal catalog, check the asset's dependencies, see the latest deployments, and dive into monitoring dashboards for further investigation.

  2. Centralized communication

As well as ensuring that the right on-call engineer is notified of an incident instantly, the portal can open a Slack or Teams channel to handle the incident. It automatically adds the relevant stakeholders to this channel, including owners, experts and the affected customers’ dedicated customer success managers. Channel membership takes into account domain expertise, time zones and more, which keeps communication coordinated throughout.
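
As a rough sketch of how stakeholder selection could take time zones into account, the Python below picks channel members from a list of candidates. Everything here is hypothetical: the `Person` fields, the 08:00 to 22:00 "waking hours" window, and the rule that owners are always invited regardless of local time are assumptions for illustration, not a description of how any particular portal decides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Person:
    name: str
    role: str              # e.g. "owner", "expert", "csm"
    utc_offset_hours: int  # simplified time zone: fixed offset from UTC


def is_waking_hours(person: Person, now_utc: datetime,
                    start: int = 8, end: int = 22) -> bool:
    """True if the person's local hour falls in [start, end)."""
    local = now_utc + timedelta(hours=person.utc_offset_hours)
    return start <= local.hour < end


def pick_channel_members(candidates: list[Person], now_utc: datetime) -> list[str]:
    """Owners are always invited; others only during their waking hours."""
    return [
        p.name
        for p in candidates
        if p.role == "owner" or is_waking_hours(p, now_utc)
    ]


now = datetime(2024, 9, 3, 3, 0, tzinfo=timezone.utc)  # 3 AM UTC
team = [
    Person("Ana", "owner", 0),
    Person("Ben", "expert", 0),
    Person("Chen", "expert", 8),
    Person("Dee", "csm", -5),
]
print(pick_channel_members(team, now))
```

In this example the owner is included even at 3 AM local time, while the other candidates are only added if the incident falls within their working day.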

  3. Centralized remediation

One of the key time lags in dealing with incidents is the wait for approvals, for example for access to production or for rollbacks. These approvals can take hours if the incident happens at night, which significantly increases the time taken to resolve it and could cause customers significant downtime.

The portal’s automations enable users to streamline processes, for example by automatically granting production access to whoever is on call. Meanwhile, self-service actions can be used to streamline hotfixes like reverting to a previous version, scaling out or restarting a pod, reducing reliance on the error-prone manual process of following runbooks.
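
The idea of granting production access to whoever is on call can be sketched as a time-boxed grant. This is a simplified illustration under stated assumptions: the roster is a plain set of usernames, the four-hour lifetime is an arbitrary choice, and the grant is just a dictionary rather than a call to a real identity or cloud provider.

```python
from datetime import datetime, timedelta, timezone

ACCESS_TTL = timedelta(hours=4)  # assumed lifetime of a grant


def grant_production_access(user: str, on_call_roster: set[str], now: datetime):
    """Return an expiring grant if `user` is on call, otherwise None."""
    if user not in on_call_roster:
        return None
    return {"user": user, "scope": "production", "expires_at": now + ACCESS_TTL}


def is_grant_active(grant, now: datetime) -> bool:
    """A grant is active only if it exists and has not yet expired."""
    return grant is not None and now < grant["expires_at"]


now = datetime(2024, 9, 3, 3, 0, tzinfo=timezone.utc)
grant = grant_production_access("dana", {"dana"}, now)
print(is_grant_active(grant, now + timedelta(hours=1)))
print(is_grant_active(grant, now + timedelta(hours=5)))
```

Tying access to the roster and giving it an expiry removes the overnight wait for a human approver while still limiting who holds production access and for how long.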

  4. Centralized monitoring

SREs can build scorecards to define standards and monitor production readiness of different services and applications.
Rather than having to manually verify readiness or compliance, they can glance at a scorecard to ensure best practices are being upheld, such as checking that monitoring services are active, documentation is available, owner details are established, and the correct number of replica sets are in place.

As the software catalog is connected to all of a company’s monitoring tools, it can be used to trigger alerts based on health metrics or other types of metrics. For example, a low operational readiness score could trigger an email to the service owner.
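
A scorecard of this kind can be thought of as a set of pass/fail rules evaluated against each service. The sketch below is a hypothetical model, not Port's scorecard format: the rule names, the service fields, the minimum of two replicas and the 0.75 alert threshold are all assumptions chosen to mirror the checks listed above.

```python
# Each rule takes a service record (a dict) and returns True if it passes.
RULES = {
    "monitoring_active": lambda s: s.get("monitoring") is True,
    "docs_available": lambda s: bool(s.get("docs_url")),
    "owner_set": lambda s: bool(s.get("owner")),
    "enough_replicas": lambda s: s.get("replicas", 0) >= 2,  # assumed minimum
}


def score_service(service: dict) -> tuple[float, dict]:
    """Return the fraction of rules passed and the per-rule results."""
    results = {name: rule(service) for name, rule in RULES.items()}
    return sum(results.values()) / len(RULES), results


def services_needing_alert(services: list[dict], minimum: float = 0.75) -> list[str]:
    """Names of services whose readiness score is below `minimum`."""
    return [s["name"] for s in services if score_service(s)[0] < minimum]


catalog = [
    {"name": "cart", "monitoring": True, "docs_url": "https://example.com/cart",
     "owner": "cart-team", "replicas": 3},
    {"name": "products", "monitoring": True, "replicas": 1},
]
print(services_needing_alert(catalog))
```

The list returned by `services_needing_alert` is where an email or other notification to the service owner would be triggered.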

Benefits of automated incident management:

Faster incident resolution - reduction in MTTR

Automating the different stages of incident response lets the on-call engineer resolve the incident much faster. Rather than relying on manual methods and waiting for approvals, the engineer gets the following from a portal:

  • Immediate targeted alerts
  • Automated incident creation (pre-filling the incident management system with relevant details from the Port catalog)
  • Automated opening of a Slack/Teams channel
  • An automated link provided to an on-call engineer to open the catalog
  • Self-service actions for remediation
  • Automated incident status updates (including resolution) in the dedicated Slack channel and the incident management system

Reduced on-call fatigue

The reduction in tedious manual tasks and waiting for approvals can significantly reduce on-call fatigue, too. For example, without automated incident response, an incident detected in the middle of the night may take the whole night to resolve rather than a few hours. The on-call engineer is then likely to be more exhausted, which can impair their productivity going forward.

Improved team collaboration

With everyone who needs to know about the incident automatically alerted and updated in the selected communication channel, there’s a greater sense of teamwork. There’s also more transparency, and colleagues are less likely to get frustrated with constant requests (for access or for information), time spent waiting for approvals, and a general sense of being overwhelmed by reminders, interruptions and follow-ups. Better team collaboration means better productivity and morale.

Enhanced system reliability

A portal helps to enhance system reliability in a number of ways. First, maturity scorecards ensure that best practices are being upheld, and as a result, system reliability is enhanced. Second, key performance indicators like the number of outages, MTTR and failed deployments can be defined, monitored and analyzed through dashboards. Third, you can track and measure all your engineering metrics and standards in the portal, meaning you can monitor MTTR, keep tabs on how many incidents are open vs resolved, and even understand how metrics are connected and correlated. For instance, you may notice a correlation between a slowdown in deployment frequency and a spike in incidents.
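
For readers who want to see what sits behind two of these numbers, here is a small Python sketch that computes MTTR and the open vs resolved count from a list of incident records. The record shape (`opened_at` and `resolved_at` keys, with `None` for unresolved incidents) is an assumption for this example; a real incident management system will have its own schema.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[dict]):
    """Average of (resolved_at - opened_at) over resolved incidents, or None."""
    durations = [
        i["resolved_at"] - i["opened_at"]
        for i in incidents
        if i.get("resolved_at") is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def open_vs_resolved(incidents: list[dict]) -> dict:
    """Count incidents that are still open and those already resolved."""
    resolved = sum(1 for i in incidents if i.get("resolved_at") is not None)
    return {"open": len(incidents) - resolved, "resolved": resolved}


start = datetime(2024, 9, 3, 3, 0)
incidents = [
    {"opened_at": start, "resolved_at": start + timedelta(hours=1)},
    {"opened_at": start, "resolved_at": start + timedelta(hours=3)},
    {"opened_at": start, "resolved_at": None},  # still open
]
print(mean_time_to_resolution(incidents))
print(open_vs_resolved(incidents))
```

Open incidents are left out of the MTTR average here; whether to include them (for example, by measuring up to the current time) is a design choice that changes the number and should be made explicit on any dashboard.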

Automated incident response best practices 

  1. Start small: Choose a specific incident type or service to pilot your chosen automated incident management solution. You could survey your on-call engineers or use scorecards to identify which type of incident you should kick off with.

  2. Help yourself to hit your OKR: Does your team have a specific OKR in place? Is one of the KPIs to reduce MTTR? The best way of hitting your goals is to focus on tasks that will get you there quickly. So instead of automating each stage of incident management immediately, you can begin with a mix of automations and self-service actions, and add more options as your on-call engineers get accustomed to the portal.

  3. Train your team: Educate your team on incident response best practices and how to use the new tools and processes. That can be through documentation on how to set up or use automations, self-service actions, or by providing details on specific topics such as code reusability and error handling.

  4. Iterate and improve: Continuously gather feedback, measure your success, and refine your solution over time using a combination of surveys, metrics, monitoring tools and testing. 

Automated incident response using an internal developer portal

In a nutshell, automated incident management is a game-changer for handling incidents more smoothly and efficiently. By automating key parts of the process using an internal developer portal, teams can resolve issues faster, cut down on the stress of being on-call, and work together more seamlessly. This approach doesn't just make life easier for engineers; it also strengthens system reliability and helps teams hit important goals like reducing MTTR. The payoff for automated incident response is an upgrade in performance, an improved developer experience, better collaboration, and more satisfied customers.

Further reading: How internal developer portals improve incident management
