Automated incident management: what is it, benefits, implementation

Sooraj Shah

Jenny Salem

September 3, 2024

Introduction

It’s 3 a.m. and the on-call engineer gets an alert about a critical incident. They have to perform a number of manual, time-consuming tasks at the very moment when resolving the incident quickly matters most. They’re exhausted and have to rely on their own know-how to find the right information, including:

Who has the relevant expertise to help
Where the relevant dashboards are and how to access them
What the process is to get the relevant permissions for access or to perform actions

This knowledge is usually not documented, and when it is, the documentation quickly becomes outdated, has gaps, or is unclear.

The ideal scenario is a world where incident management is less confusing, information is readily available and remediation steps are clear. For instance, a Slack channel is created immediately for the responder team, the owner of a service or its documentation is easy to find, and the team is notified automatically when the incident is resolved.

This is where automated incident response comes in.

What is automated incident management

Automated incident management is a holistic approach to managing and responding to incidents. The approach leverages automation and self-service tools to streamline the incident response process and empower everyone in your organization to take ownership of the parts of an incident that are relevant to them.

Automation reduces the time it takes to:

detect an incident
notify the on-call engineer of the incident
resolve the incident
ensure that this type of incident does not recur

This eases the stress on under-pressure engineers and helps the engineering organization reduce mean time to resolution (MTTR). Swift incident resolution also means happier customers, as uptime is correlated with customer retention, which directly impacts revenue.

Site reliability engineers (SREs) are traditionally tasked with managing incidents, as their remit is to ensure the reliability, availability and performance of systems. Automated incident management is not about replacing SREs but rather augmenting their capabilities and distributing responsibility. By doing so, the engineering team should be able to reduce the time to detect, diagnose and resolve incidents, ultimately improving system reliability and reducing the burden on the SRE team.

Automating documentation and remediation steps is a positive move, but a key concern for DevOps and platform engineers is building in standards so that the on-call engineer doesn’t veer off course when attempting to resolve the incident. In other words, automated incident response should also factor in golden paths: platform engineers embed standards into the tasks that on-call engineers can carry out independently through automation or self-service, which gives developers freedom while ensuring the standards are adhered to.

Building an incident response framework with a portal

An internal developer portal provides a framework that factors in automation, golden paths, standards and defined workflows. With a portal, engineers can establish workflows to automate the incident response process, streamlining tasks such as incident detection, on-call team notifications and targeted remediation actions. For instance, if a monitoring system identifies a surge in CPU usage, a workflow could automatically alert the DevOps team, increase resource allocation, and generate a ticket in the incident management system, ensuring a prompt and coordinated response.
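
To make the CPU surge example concrete, here is a minimal Python sketch of such a workflow. It is not a portal API: the `Alert` type, the three step functions and the 80% threshold are all hypothetical stand-ins for whatever monitoring, scaling and ticketing integrations a team actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    metric: str
    value: float      # observed value, e.g. CPU utilization in percent
    threshold: float  # value above which the workflow should fire


@dataclass
class WorkflowResult:
    steps: list[str] = field(default_factory=list)


def notify_team(team: str, alert: Alert) -> str:
    # Hypothetical stand-in for a chat or paging integration.
    return f"notified {team}: {alert.metric} at {alert.value}% on {alert.service}"


def scale_up(service: str, extra_replicas: int) -> str:
    # Hypothetical stand-in for an orchestrator or cloud API call.
    return f"added {extra_replicas} replica(s) to {service}"


def open_ticket(alert: Alert) -> str:
    # Hypothetical stand-in for the incident management system.
    return f"ticket opened for {alert.service} ({alert.metric})"


def run_cpu_surge_workflow(alert: Alert) -> WorkflowResult:
    """Run the three steps in order, but only when the threshold is exceeded."""
    result = WorkflowResult()
    if alert.metric != "cpu" or alert.value <= alert.threshold:
        return result  # nothing to do
    result.steps.append(notify_team("devops", alert))
    result.steps.append(scale_up(alert.service, extra_replicas=1))
    result.steps.append(open_ticket(alert))
    return result


if __name__ == "__main__":
    surge = Alert(service="cart-service", metric="cpu", value=93.0, threshold=80.0)
    for step in run_cpu_surge_workflow(surge).steps:
        print(step)
```

Keeping each step as its own function mirrors the idea of a defined workflow: the order is fixed by the platform team, and each step can be swapped for a real integration without touching the others.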

Here is how the framework for incident management can be built with a developer portal:

Centralized hub

The portal is the go-to place for developers to get real-time information via the software catalog, such as:

Who the owner of a service is
Who has the specific domain expertise required
Health metrics about infrastructure and applications

Centralizing all of this information (and more) means on-call engineers have what they need without scrambling through outdated documentation to find owners, dependencies, and the like.
A link is sent automatically to the incident channel so the on-call engineer can open the portal catalog, check the asset's dependencies, see the latest deployments, and dive into monitoring dashboards for further investigation.
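
As a rough illustration of what a catalog lookup gives the on-call engineer, the sketch below models a tiny in-memory catalog in Python. The `Service` fields, the sample entries and the dashboard URL are invented for the example; a real software catalog would be populated from integrations rather than hard-coded.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    owner: str
    depends_on: list[str] = field(default_factory=list)
    last_deployment: str = "unknown"
    dashboard_url: str = ""


# Hypothetical catalog entries, for illustration only.
CATALOG = {
    "checkout": Service(
        "checkout",
        "team-payments",
        ["cart", "postgres"],
        "2024-09-02T22:15Z",
        "https://dashboards.example.com/checkout",
    ),
    "cart": Service("cart", "team-cart", ["postgres"]),
    "postgres": Service("postgres", "team-data"),
}


def transitive_dependencies(name: str, catalog: dict[str, Service]) -> list[str]:
    """Return every service reachable from `name`, in breadth-first order."""
    seen: list[str] = []
    queue = list(catalog[name].depends_on)
    while queue:
        dep = queue.pop(0)
        if dep in seen or dep not in catalog:
            continue  # skip duplicates and entries missing from the catalog
        seen.append(dep)
        queue.extend(catalog[dep].depends_on)
    return seen


def incident_context(name: str, catalog: dict[str, Service]) -> str:
    """Build the message that would be posted to the incident channel."""
    svc = catalog[name]
    deps = ", ".join(transitive_dependencies(name, catalog)) or "none"
    return (
        f"Service: {svc.name}\n"
        f"Owner: {svc.owner}\n"
        f"Dependencies: {deps}\n"
        f"Last deployment: {svc.last_deployment}\n"
        f"Dashboard: {svc.dashboard_url or 'not set'}"
    )


if __name__ == "__main__":
    print(incident_context("checkout", CATALOG))
```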

Centralized communication

As well as ensuring that the right on-call engineer is notified of an incident instantly, the portal will open a Slack or Teams channel to handle the incident. It will automatically add the relevant stakeholders to this channel - including owners, experts and the affected customers’ dedicated customer success managers. The choice of who to add takes into consideration domain expertise, time zones and more. This will ensure coordinated communication throughout.
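
One way to picture the stakeholder selection is the Python sketch below. It covers only the on-call engineer, the service owner and domain experts, and it assumes a simple rule that is not taken from any particular portal: experts are added only if they have a needed skill and it is currently between 08:00 and 20:00 in their local time. The `Person` type and fixed UTC offsets are simplifications for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Person:
    name: str
    expertise: frozenset[str]
    utc_offset_hours: int  # simplified time zone: a fixed offset from UTC


def is_working_hours(person: Person, now_utc: datetime, start: int = 8, end: int = 20) -> bool:
    """True if the person's local hour falls inside [start, end)."""
    local = now_utc.astimezone(timezone(timedelta(hours=person.utc_offset_hours)))
    return start <= local.hour < end


def channel_members(
    on_call: Person,
    owner: Person,
    people: list[Person],
    needed: set[str],
    now_utc: datetime,
) -> list[str]:
    """On-call and owner are always added; experts only if awake and relevant."""
    members = [on_call.name]
    if owner.name not in members:
        members.append(owner.name)
    for person in people:
        if person.name in members:
            continue
        if needed & person.expertise and is_working_hours(person, now_utc):
            members.append(person.name)
    return members


if __name__ == "__main__":
    now = datetime(2024, 9, 3, 3, 0, tzinfo=timezone.utc)  # 3 a.m. UTC
    dana = Person("Dana", frozenset({"kubernetes"}), 0)
    eli = Person("Eli", frozenset({"cart"}), 0)
    ana = Person("Ana", frozenset({"postgres"}), 0)      # asleep at 03:00 local
    kenji = Person("Kenji", frozenset({"postgres"}), 9)  # 12:00 local
    print(channel_members(dana, eli, [ana, kenji], {"postgres"}, now))
```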

Centralized remediation

One of the biggest delays in dealing with incidents is waiting for approvals, such as approval for access to production or for a rollback. These can take hours if the incident occurs at night, which significantly lengthens the time taken to resolve it - and could cause customers significant downtime.

The portal’s automations enable users to streamline processes - for example, giving automated access to production to whoever is on-call. Meanwhile, self-service actions can be used to streamline hotfixes like reverting to a previous version, scaling out or restarting a pod - replacing the error-prone manual process of following runbooks.
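
The sketch below shows, in Python, how automated access and self-service actions might fit together. Everything in it is an assumption made for illustration: the four-hour expiry, the `AccessGrant` type and the action names are hypothetical, and the action functions only return strings where a real setup would call a deployment or orchestration tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    user: str
    environment: str
    expires_at: datetime


def grant_on_call_access(
    user: str,
    on_call_roster: set[str],
    now: datetime,
    ttl: timedelta = timedelta(hours=4),  # assumed expiry, not a recommendation
) -> AccessGrant | None:
    """Grant time-limited production access, but only to someone who is on-call."""
    if user not in on_call_roster:
        return None
    return AccessGrant(user, "production", now + ttl)


def is_active(grant: AccessGrant, now: datetime) -> bool:
    return now < grant.expires_at


def rollback(service: str, version: str) -> str:
    # Hypothetical stand-in for a deployment tool.
    return f"{service} rolled back to {version}"


def restart_pod(service: str) -> str:
    # Hypothetical stand-in for an orchestrator call.
    return f"{service} pod restarted"


ACTIONS = {"rollback": rollback, "restart_pod": restart_pod}


def run_action(name: str, grant: AccessGrant | None, now: datetime, **params: str) -> str:
    """Run a predefined action; refuse if there is no unexpired grant."""
    if grant is None or not is_active(grant, now):
        raise PermissionError("no active production access")
    return ACTIONS[name](**params)


if __name__ == "__main__":
    now = datetime(2024, 9, 3, 3, 0, tzinfo=timezone.utc)
    grant = grant_on_call_access("dana", {"dana"}, now)
    print(run_action("rollback", grant, now, service="cart", version="v41"))
```

Because the on-call engineer can only pick from the predefined `ACTIONS`, the standards set by the platform team stay in force even when nobody is awake to approve a request.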

Centralized monitoring

SREs can build scorecards to define standards and monitor production readiness of different services and applications.
Rather than having to manually verify readiness or compliance, they can glance at a scorecard to ensure best practices are being upheld - such as checking that monitoring services are active, documentation is available, owner details are established, and the correct number of replicas is in place.

As the software catalog is connected to all of a company’s monitoring tools, it can be used to trigger alerts based on health metrics or other types of metrics. For example, a low operational readiness score could trigger an email to the service owner.
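
A scorecard can be thought of as a set of named rules evaluated against what the catalog knows about a service. The Python sketch below assumes four rules and a 75% passing score; both are illustrative choices, as is the minimum of two replicas, and the "alert" is just a formatted string rather than a real email.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServiceFacts:
    name: str
    owner: str | None
    has_monitoring: bool
    has_docs: bool
    replicas: int


# Hypothetical production-readiness rules; the replica minimum is an assumption.
RULES = {
    "monitoring is active": lambda s: s.has_monitoring,
    "documentation is available": lambda s: s.has_docs,
    "owner is set": lambda s: bool(s.owner),
    "at least two replicas": lambda s: s.replicas >= 2,
}


def score(service: ServiceFacts) -> tuple[float, list[str]]:
    """Return the fraction of rules passed and the names of the failed rules."""
    failed = [name for name, rule in RULES.items() if not rule(service)]
    passed = len(RULES) - len(failed)
    return passed / len(RULES), failed


def readiness_alert(service: ServiceFacts, minimum: float = 0.75) -> str | None:
    """Return an alert message for the owner when the score is below `minimum`."""
    value, failed = score(service)
    if value >= minimum:
        return None
    recipient = service.owner or "unassigned"
    return f"To {recipient}: {service.name} readiness is {value:.0%}; failing: {', '.join(failed)}"


if __name__ == "__main__":
    cart = ServiceFacts("cart", "team-cart", has_monitoring=True, has_docs=False, replicas=1)
    print(readiness_alert(cart))
```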

Benefits of automated incident management:

Faster incident resolution - reduction in MTTR

By automating the different stages of incident response, the on-call engineer can resolve the incident much faster. Rather than relying on manual methods and waiting for approvals, the engineer gets the following from the portal:

Immediate targeted alerts
Automated incident creation (pre-filling the incident management system with relevant details from the Port catalog)
Automated opening of a Slack/Teams channel
An automated link provided to an on-call engineer to open the catalog
Self-service actions for remediation
Automated incident status updates (including resolution) in the dedicated Slack channel and the incident management system

Reduced on-call fatigue

The reduction in tedious manual tasks and waiting for approvals can significantly reduce on-call fatigue, too. For example, without automated incident response, an incident that is detected in the middle of the night may take the whole night to resolve rather than a few hours. The on-call engineer ends up more exhausted, which can impair their productivity in the days that follow.

Improved team collaboration

With everyone who needs to know about the incident automatically alerted and updated in the selected communication channel, there’s a greater sense of teamwork. There’s also more transparency - and colleagues are less likely to get frustrated with constant requests (for access or for information), time spent waiting for approvals and a general sense of being overwhelmed by reminders, interruptions and follow-ups. Better team collaboration means better productivity and morale.

Enhanced system reliability

A portal helps to enhance system reliability in a number of ways. First, maturity scorecards show whether best practices are being upheld, so gaps can be fixed before they cause an incident. Second, key performance indicators like the number of outages, Mean Time To Resolution (MTTR) and failed deployments can be defined, monitored and analyzed through dashboards. Third, you can track and measure all your engineering metrics and standards in the portal, meaning you can monitor MTTR, keep tabs on how many incidents are open vs resolved, and even understand how metrics are connected and correlated. For instance, you may notice a correlation between a slowdown in deployment frequency and a spike in incidents.
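
For reference, MTTR is the total time spent resolving incidents divided by the number of resolved incidents. The short Python sketch below computes it along with the open vs resolved count; the `Incident` type and the sample timestamps are made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened_at: datetime
    resolved_at: datetime | None = None  # None while the incident is still open


def mttr(incidents: list[Incident]) -> timedelta | None:
    """Mean time to resolution over resolved incidents; None if there are none."""
    durations = [i.resolved_at - i.opened_at for i in incidents if i.resolved_at is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def open_vs_resolved(incidents: list[Incident]) -> tuple[int, int]:
    """Return (open_count, resolved_count)."""
    resolved = sum(1 for i in incidents if i.resolved_at is not None)
    return len(incidents) - resolved, resolved


if __name__ == "__main__":
    incidents = [
        Incident(datetime(2024, 9, 1, 3, 0), datetime(2024, 9, 1, 5, 0)),    # 2 hours
        Incident(datetime(2024, 9, 2, 14, 0), datetime(2024, 9, 2, 18, 0)),  # 4 hours
        Incident(datetime(2024, 9, 3, 3, 0)),                                # still open
    ]
    print("MTTR:", mttr(incidents))
    print("open, resolved:", open_vs_resolved(incidents))
```

Open incidents are left out of the average on purpose: including them would require guessing a resolution time that does not exist yet.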

Automated incident response best practices

Start small: Choose a specific incident type or service to pilot your chosen automated incident management solution. You could survey your on-call engineers or use scorecards to identify which type of incident you should kick off with.
Help yourself to hit your OKR: Does your team have a specific OKR in place? Is one of the KPIs to reduce MTTR? The best way of hitting your goals is to focus on tasks that will get you there quickly. So instead of automating each stage of incident management immediately, you can begin with a mix of automations and self-service actions, and add more options as your on-call engineers get accustomed to the portal.
Train your team: Educate your team on incident response best practices and how to use the new tools and processes. That can be through documentation on how to set up or use automations and self-service actions, or by providing details on specific topics such as code reusability and error handling.
Iterate and improve: Continuously gather feedback, measure your success, and refine your solution over time using a combination of surveys, metrics, monitoring tools and testing.

Automated incident response using an internal developer portal

In a nutshell, automated incident management is a game-changer for handling incidents more smoothly and efficiently. By automating key parts of the process using an internal developer portal, teams can resolve issues faster, cut down on the stress of being on-call, and work together more seamlessly. This approach doesn't just make life easier for engineers; it also strengthens system reliability and helps teams hit important goals like reducing MTTR. The payoff for automated incident response is an upgrade in performance, an improved developer experience, better collaboration, and more satisfied customers.

Further reading: How internal developer portals improve incident management

Tags:

Internal Developer Portal
