Overview of what’s missing from your incident management program
To combat incidents, organizations usually put an incident management framework in place. The first port of call is an incident management tool. These tools let organizations trigger, escalate and manage incidents effectively. They let you configure the on-call rotation, trigger alerts based on events sent from monitoring tools, notify the on-call engineer, and give responders a place to communicate about the incident. Then, when the incident is resolved, the tool enables you to close the incident and issue a summary.
What about actually resolving the incident? While incident management tools can handle incident logistics, the onus is on the on-call engineer to resolve it. Instead of having on-call engineers access multiple tools to resolve the incident, a portal can provide a great start, with everything from code owners, to monitoring links, downstream and upstream dependencies and more. On-call engineers need support to deal with the incident beyond what an incident management tool alone can provide, and internal developer portals can do just that.
What on-call engineers need is:
A better overall experience - so they are able to find and access the information they need with ease, from ‘what are the upstream/downstream dependencies’ to underlying infrastructure health metrics, and understanding who the owner of a service is and if a service is being monitored correctly.
Context - with all of the information in one place, so they can understand what’s going on across the organization in real-time, without having to switch between different tools.
Autonomy - so that when they’re investigating the incident, they can perform quick easy-to-execute actions without needing to ask for help. For instance, actionable playbooks with pre-configured self-service actions would enable engineers to use day-2 operations as part of the incident remediation process.
How does the portal support incident management?
While there are many reasons to adopt an internal developer portal as part of a platform engineering initiative, a portal is especially valuable in the context of incident management, providing:
A better experience because:
- It is the primary tool developers already use across the SDLC. It offers an easy-to-use interface for finding the information they need and performing the actions they need. This makes incident management more efficient, as engineers have an easier way to act, remediate and prevent further incidents.
- SREs who build the framework to deal with incidents can build a big part of it in the portal. SREs can build scorecards to monitor production readiness of different services and applications, preventing incidents in the first place. SREs can also create self-service actions that on-call can use during the incident resolution process.
Context by using:
A software catalog that reflects the entire engineering ecosystem for the developer, from services through APIs, CI/CD and more, including their chosen incident management tool’s information and capabilities. The integrations and data feed into the portal’s software catalog providing a complete picture. By searching through the software catalog, users can find the information they need about a service, application or owner.
Autonomy through:
- Self-service actions, so that engineers can use day-2 operations and don’t need to rely on DevOps or ticketops.
- Dashboards tailored to different teams or individuals, showing the data each of them needs.
Four-step strategy to manage incidents using a portal
So how do you actually use the portal to benefit your incident management program?
1. Have your foundations in place
How often have you discovered that critical services aren’t monitored? Or had to send several Slack messages just to find out who owns a specific service? This should never happen, and during an incident the lost time can be critical. In this section, we will explain how you can ensure that you have all your foundations in place.
With a software catalog in place, it’s pretty easy to answer the following questions:
- Who’s on call right now?
- Who’s the owner of this service?
- Is this service properly monitored? And where exactly?
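To make these questions concrete, here is a minimal sketch of a catalog lookup. It is not Port's data model or API: the `ServiceEntry` fields, the `CATALOG` contents, the people and the URL are all hypothetical, chosen only to show how one record per service can answer all three questions at once.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceEntry:
    """One catalog record. Every field name here is an illustrative assumption."""
    name: str
    owner_team: str
    on_call: str  # engineer currently on call for this service
    monitors: Dict[str, str] = field(default_factory=dict)  # tool -> dashboard URL


# Hypothetical sample data standing in for a real software catalog.
CATALOG: Dict[str, ServiceEntry] = {
    "cart-service": ServiceEntry(
        name="cart-service",
        owner_team="checkout",
        on_call="dana",
        monitors={"metrics": "https://monitoring.example.com/d/cart"},
    ),
    "products-service": ServiceEntry(
        name="products-service",
        owner_team="catalog",
        on_call="lee",
    ),
}


def describe(service_name: str) -> str:
    """Answer: who is on call, who owns it, and where is it monitored?"""
    entry = CATALOG[service_name]
    if entry.monitors:
        monitoring = ", ".join(f"{tool}: {url}" for tool, url in entry.monitors.items())
    else:
        monitoring = "NOT MONITORED"
    return (
        f"{entry.name}: on call = {entry.on_call}, "
        f"owner = {entry.owner_team}, monitoring = {monitoring}"
    )


def unmonitored_services() -> List[str]:
    """List services with no monitoring link, so gaps surface before an incident."""
    return sorted(name for name, entry in CATALOG.items() if not entry.monitors)


if __name__ == "__main__":
    print(describe("cart-service"))
    print("Unmonitored:", unmonitored_services())
```

The point of the sketch is that ownership, on-call and monitoring live on the same record, so finding a gap becomes a query rather than a round of messages.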
For production readiness questions, SREs can use scorecards to define standards and track the compliance of services, so they can understand if something is missing and resolve this.
For instance, before a service goes to production, they may check that monitoring is active and that the service has API documentation, an established on-call rotation and runbooks. SRE teams can just glance at the scorecard to check whether a service is ready, and what is required to make it ready, rather than having to manually verify this. Likewise, they can use the portal to communicate initiatives and easily track them by user, team and more.
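A production readiness scorecard can be thought of as a set of named rules evaluated against each service. The sketch below assumes a service is a plain dictionary and that the four checks mentioned above map to hypothetical fields (`monitoring_active`, `api_docs_url`, `on_call_rotation`, `runbook_url`). A real portal would define these rules in its own configuration format.

```python
from typing import Callable, Dict, List

# Each rule pairs a human-readable name with a check on the service record.
# Rule names and record fields are illustrative assumptions.
PRODUCTION_READINESS_RULES: Dict[str, Callable[[dict], bool]] = {
    "monitoring is active": lambda svc: bool(svc.get("monitoring_active")),
    "API documentation exists": lambda svc: bool(svc.get("api_docs_url")),
    "on-call rotation is defined": lambda svc: bool(svc.get("on_call_rotation")),
    "runbook exists": lambda svc: bool(svc.get("runbook_url")),
}


def evaluate(service: dict) -> Dict[str, bool]:
    """Run every rule and report pass/fail per rule."""
    return {name: check(service) for name, check in PRODUCTION_READINESS_RULES.items()}


def missing_requirements(service: dict) -> List[str]:
    """Return the names of the rules the service currently fails."""
    return [name for name, passed in evaluate(service).items() if not passed]


# Hypothetical service record with one gap: no runbook yet.
cart_service = {
    "name": "cart-service",
    "monitoring_active": True,
    "api_docs_url": "https://docs.example.com/cart",
    "on_call_rotation": "checkout-primary",
    "runbook_url": None,
}

if __name__ == "__main__":
    gaps = missing_requirements(cart_service)
    status = "ready" if not gaps else "not ready, missing: " + ", ".join(gaps)
    print(f"{cart_service['name']}: {status}")
```

Because the output names the failing rules, the scorecard tells a team what to fix, not just that something is wrong.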
2. Investigate with context you wouldn’t otherwise have
Once an incident is detected - thanks to the integration with the incident management tool - the on-call engineer is notified. They’ll first try to implement a quick fix in order to mitigate (which can also be done using day-2 ops in the portal, more about this later) but then they need to investigate. Of course, they’ll start looking at their monitoring tools, and they’ll get useful information such as network latency, CPU usage etc. But they’re missing important context.
Perhaps a recent deployment caused the incident - this is not necessarily something you can find in the monitoring tool. Next, they’ll want to check the downstream dependencies of the faulty service, to see whether those are impacted too. Again, this information isn’t always available in the incident management tool.
As the portal connects with monitoring, Git, CD and security, the on-call engineer will have all the information side-by-side making it quicker and clearer for them to understand what’s happening. Even the monitoring information can be displayed in the portal so that they don't have to jump between multiple tools.
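The two investigation questions above, "what depends on the faulty service?" and "what was deployed recently?", can be sketched as a graph walk plus a time-window filter. Everything in this example is assumed for illustration: the `DEPENDENTS` map (service to the services that rely on it), the `DEPLOYMENTS` records, the timestamps and the two-hour window. A portal would populate this from its Git and CD integrations.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set

# service -> services that rely on it and may be impacted when it fails.
# Hypothetical sample data.
DEPENDENTS: Dict[str, List[str]] = {
    "products-service": ["cart-service"],
    "cart-service": ["checkout-service"],
    "checkout-service": ["order-service"],
}

# Hypothetical deployment history, as a CD integration might report it.
DEPLOYMENTS = [
    {"service": "cart-service", "version": "v42", "at": datetime(2024, 5, 1, 9, 40)},
    {"service": "order-service", "version": "v7", "at": datetime(2024, 4, 28, 15, 0)},
    {"service": "checkout-service", "version": "v13", "at": datetime(2024, 5, 1, 9, 55)},
]


def impacted_services(faulty: str) -> List[str]:
    """Breadth-first walk over dependents, excluding the faulty service itself."""
    seen: Set[str] = {faulty}
    queue = deque([faulty])
    ordered: List[str] = []
    while queue:
        current = queue.popleft()
        for dependent in DEPENDENTS.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                ordered.append(dependent)
                queue.append(dependent)
    return ordered


def recent_deployments(
    services: Iterable[str],
    incident_start: datetime,
    window: timedelta = timedelta(hours=2),
) -> List[dict]:
    """Deployments to the given services within `window` before the incident."""
    in_scope = set(services)
    return [
        d for d in DEPLOYMENTS
        if d["service"] in in_scope and incident_start - window <= d["at"] <= incident_start
    ]


if __name__ == "__main__":
    incident_start = datetime(2024, 5, 1, 10, 0)
    impacted = impacted_services("cart-service")
    print("Possibly impacted:", impacted)
    for d in recent_deployments(["cart-service"] + impacted, incident_start):
        print(f"Recent deploy: {d['service']} {d['version']} at {d['at']:%H:%M}")
```

Neither answer comes from a metrics dashboard alone, which is why having deployments and dependencies next to the monitoring data shortens the investigation.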
3. Automate actions and remediation
Once the on-call engineer understands what is happening, they need to take action. Sure, they have detailed playbooks explaining step-by-step what to do. These require the engineer to log in to instances, perhaps copy and paste some scripts or change the configuration. While these playbooks are invaluable, they can be enhanced using a portal.
As part of the preparation stage, SREs can create self-service actions directly in the portal that allow responders to execute their existing incident management playbooks for every scenario. The engineer can then start the remediation process using day-2 operations that have been pre-configured for them in the portal - for example:
- Request permission to access a cluster
- Roll back a service
- Scale up a cloud resource
- Toggle off a feature flag
When engineers can act directly in the portal, they no longer need to switch between different tools or copy and paste scripts. This increases efficiency, improves the developer experience and reduces cognitive load. Because self-service actions are centralized and the complexity is abstracted away, they reduce the risk of errors and simplify the execution of the necessary remediation steps.
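One way to picture a pre-configured self-service action is as a named function with declared inputs that are validated before anything runs. The sketch below is a generic illustration of that idea, not how Port implements actions. The `action` decorator, `run_action`, and the two example actions are hypothetical, and the action bodies only return a message where a real action would call the deployment or feature flag system.

```python
from typing import Callable, Dict, List

# Registry of available actions: name -> function and its required inputs.
ACTIONS: Dict[str, dict] = {}


def action(name: str, required_inputs: List[str]):
    """Register a function as a named self-service action (hypothetical helper)."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        ACTIONS[name] = {"run": func, "required_inputs": required_inputs}
        return func
    return decorator


@action("rollback_service", required_inputs=["service", "target_version"])
def rollback_service(service: str, target_version: str) -> str:
    # A real action would trigger the CD pipeline; this only reports intent.
    return f"rolled back {service} to {target_version}"


@action("toggle_feature_flag", required_inputs=["flag", "enabled"])
def toggle_feature_flag(flag: str, enabled: bool) -> str:
    state = "on" if enabled else "off"
    return f"feature flag {flag} turned {state}"


def run_action(name: str, **inputs) -> str:
    """Validate the inputs, then execute the registered action."""
    if name not in ACTIONS:
        raise KeyError(f"unknown action: {name}")
    spec = ACTIONS[name]
    missing = [key for key in spec["required_inputs"] if key not in inputs]
    if missing:
        raise ValueError(f"missing inputs for {name}: {', '.join(missing)}")
    return spec["run"](**inputs)


if __name__ == "__main__":
    print(run_action("rollback_service", service="cart-service", target_version="v41"))
    print(run_action("toggle_feature_flag", flag="new-checkout", enabled=False))
```

Validating inputs up front is what reduces the risk of errors: the on-call engineer fills in a short form instead of editing a script under pressure.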
Now, imagine having one place with everything you need to solve your incident.
The relevant information, your health metrics and quick actions all at your fingertips.
4. Learn and prevent (and put that on loop)
Continuous learning is crucial for preventing future issues; it enables you to reduce frequent issues and build better products. Accessing useful data in an easy-to-consume way is important, so that your engineering team can go from being reactive and fixing issues, to being proactive and improving the way you develop and respond.
- Maturity scorecards to prevent incidents:
By creating maturity scorecards, an engineering team can ensure best practices are always upheld - for example, ensuring you have the right number of ReplicaSets or that critical vulnerabilities are being remediated.
- Dashboards to evaluate and analyze performance:
SREs can define the performance metrics they want to track. These could include the number of outages, MTTR, and failed deployments. These metrics, as well as others, can then be visualized in the portal’s dashboards.
With continuous monitoring, you can identify which resources are prone to incidents and cause the highest number of outages. You can check how long it takes to recover per team and per service, and establish initiatives to improve. These initiatives may be to train underperforming teams, fine-tune processes or resolve issues with technical debt. You can use the portal to communicate the initiatives and track them by developer, team and service.
This ongoing evaluation helps in building a more resilient and efficient system.
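As a small worked example of the per-team and per-service analysis described above, the sketch below computes the number of incidents and the mean time to recovery (MTTR) from a list of incident records. The records, field names and timestamps are made-up sample data. In practice they would come from the incident management tool's integration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical incident history.
INCIDENTS = [
    {"service": "cart-service", "team": "checkout",
     "opened": datetime(2024, 5, 1, 10, 0), "resolved": datetime(2024, 5, 1, 10, 30)},
    {"service": "cart-service", "team": "checkout",
     "opened": datetime(2024, 5, 3, 14, 0), "resolved": datetime(2024, 5, 3, 15, 30)},
    {"service": "products-service", "team": "catalog",
     "opened": datetime(2024, 5, 2, 8, 0), "resolved": datetime(2024, 5, 2, 8, 20)},
]


def mttr_by(key: str, incidents: List[dict]) -> Dict[str, timedelta]:
    """Mean time from open to resolve, grouped by `key` ("service" or "team")."""
    durations = defaultdict(list)
    for incident in incidents:
        durations[incident[key]].append(incident["resolved"] - incident["opened"])
    return {
        group: sum(values, timedelta()) / len(values)
        for group, values in durations.items()
    }


def incident_counts(key: str, incidents: List[dict]) -> Dict[str, int]:
    """Number of incidents per group, to spot incident-prone resources."""
    return dict(Counter(incident[key] for incident in incidents))


if __name__ == "__main__":
    counts = incident_counts("service", INCIDENTS)
    for service, mttr in mttr_by("service", INCIDENTS).items():
        print(f"{service}: {counts[service]} incident(s), MTTR {mttr}")
```

Grouping by `"team"` instead of `"service"` gives the per-team view, which is the kind of breakdown a dashboard would chart over time.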
A more efficient, effective and informed incident management approach
By coupling the power of incident management tools with the simplicity and rich context of a developer portal, engineering teams can deal with incidents better at every level. Because the portal is customizable, everyone involved in the incident management program - SREs, developers and managers - can benefit in different ways, which streamlines the overall approach and improves the program's chances of success.
Want to use your internal developer portal for incident management but don’t know where to start? Check out the following materials:
- Check out Port's pre-populated demo and see what it's all about (no email required).
- Contact sales for a technical product walkthrough.
- Open a free Port account (no credit card required).
- Watch Port live coding videos on setting up an internal developer portal and platform.