Observability in platform engineering

August 7, 2024

Internal developer platforms exist so that you can ship software faster and better.

  • Faster, by providing developers with golden paths, giving them autonomy and guardrails.
  • Better, by improving production readiness and compliance and reducing MTTR.

One way to do this is to bake observability into your platform engineering practices. Doing so improves reliability, compliance and velocity.

What is observability?

At its core, observability is a method for understanding what's happening inside your systems. It's often mentioned alongside monitoring. The two are closely related, and both help identify the root cause of issues, but they serve different purposes in maintaining the health of your software.

  • Observability is the more proactive of the two. It's like watching a stock ticker, where you see stock prices fluctuate over time. Just as a stock trader might infer trends and make decisions based on these changes, observability helps you understand what’s happening in your system in real time, or even anticipate future issues.
  • Monitoring is more reactive. It alerts you when something has already gone wrong, for example a 504 error. Monitoring provides a snapshot of the moment, notifying you of issues that need immediate attention.

In short, observability is a live stream of system information. It lets you understand what happened leading up to and immediately after an event, so you can follow how the state of a system progresses over time, whereas monitoring captures point-in-time information.
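
The contrast between a point-in-time check and a view over time can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool works: the latency samples, the 250 ms threshold and the function names are all assumptions made up for the example, and the trend is a crude linear estimate.

```python
import math
from statistics import mean

# Hypothetical latency samples in milliseconds, oldest first (assumed data).
latency_ms = [120, 125, 131, 140, 152, 167, 185, 210]

THRESHOLD_MS = 250  # assumed alert threshold


def monitoring_check(samples, threshold):
    """Point-in-time view: is the latest sample over the threshold right now?"""
    return samples[-1] > threshold


def trend_per_sample(samples):
    """Average change between consecutive samples (a crude linear trend)."""
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    return mean(deltas)


def samples_until_breach(samples, threshold):
    """Over-time view: estimate how many more samples until the threshold
    is crossed, assuming the recent trend simply continues.
    Returns None when the series is flat or falling."""
    slope = trend_per_sample(samples)
    if slope <= 0:
        return None
    remaining = threshold - samples[-1]
    return max(0, math.ceil(remaining / slope))


if __name__ == "__main__":
    print("alert firing now:", monitoring_check(latency_ms, THRESHOLD_MS))
    print("samples until breach:", samples_until_breach(latency_ms, THRESHOLD_MS))
```

With this data the threshold check stays quiet, while the trend estimate already shows the service heading toward a breach.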

What is an internal developer portal and platform?

An internal developer platform provides golden paths for developers and managers. It consists of many tools and the reusable self-service actions that run through them. Its goal is to reduce cognitive load on developers without abstracting away context and underlying technologies.

Internal developer portals are the central hub for the internal developer platform, providing a microservice catalog, a way to set and maintain software standards, and developer self-service.

If your internal developer platform is a collection of technologies that your enterprise has assembled to operate the business, an internal developer portal is the interface through which various users (like developers, operators, and product managers) interact with these technologies. 

The portal is designed to simplify this interaction by abstracting away the complexity and specialized knowledge needed to manage these technologies. It also reduces the number of tools that developers need to interact with, in this case observability and monitoring tools.

In the context of observability, internal developer portals can:

  • Ensure better practices around standards and compliance, driving better software and baking in observability; and
  • Make incident management processes better and simpler for developers.

How internal developer portals make observability better

If observability is about building reliably and anticipating failures in advance, internal developer portals can drive better reliability by ensuring all software is built with compliance, resilience and standards in mind. Specifically, internal developer portals support the following:

Ensuring all software is built with observability inside

Internal developer portals create golden paths to ensure that standards are met whenever a service is built. One such requirement can be that observability is baked into any new service or self-service action. This will ensure that:

  • All assets are monitored
  • Every critical asset has an owner and an on-call
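
The two requirements above can be expressed as a simple readiness check. The sketch below is only an illustration in Python: the `Service` fields and the `readiness_gaps` function are hypothetical names invented for this example and do not reflect any specific portal's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Service:
    """Hypothetical catalog record for one asset (assumed fields)."""
    name: str
    critical: bool = False
    owner: Optional[str] = None
    on_call: Optional[str] = None
    monitors: List[str] = field(default_factory=list)


def readiness_gaps(service: Service) -> List[str]:
    """Return the observability requirements this service does not yet meet."""
    gaps = []
    if not service.monitors:
        gaps.append("no monitoring configured")
    if service.critical and not service.owner:
        gaps.append("critical asset has no owner")
    if service.critical and not service.on_call:
        gaps.append("critical asset has no on-call")
    return gaps


if __name__ == "__main__":
    services = [
        Service("cart", critical=True, owner="team-cart",
                on_call="cart-rotation", monitors=["latency"]),
        Service("legacy-batch", critical=True),
    ]
    for svc in services:
        print(svc.name, readiness_gaps(svc) or "ready")
```

A check like this could run as part of a golden path, so a new service is flagged before it ships rather than after its first incident.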

Acting as a central system of record

The service catalog is at the center of the internal developer portal. It holds various entities, from microservices to cloud resources, running services and APIs, and additional data, such as AppSec data or cost, can be added to those entities. There is tremendous value in connecting information from observability platforms to software catalog entities. Doing so takes data from outside tools and immediately adds the context that helps you understand what’s going on with a service in real time.

Instead of tracking standards across spreadsheets, CMDBs, checklists, SRE reviews and other methods of checking compliance, the internal developer portal has all this data in one place.
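
As a rough illustration of connecting observability data to catalog entities, the Python sketch below joins alerts from outside tools onto catalog records by service identifier. The catalog contents, the alert fields and the `attach_alerts` function are all assumptions made up for the example, not a real integration.

```python
from collections import defaultdict

# Hypothetical catalog entities keyed by identifier (assumed data).
catalog = {
    "cart-service": {"owner": "team-cart", "tier": "critical"},
    "products-service": {"owner": "team-products", "tier": "standard"},
}

# Hypothetical alerts as they might arrive from separate tools (assumed shape).
alerts = [
    {"service": "cart-service", "source": "metrics", "summary": "high memory"},
    {"service": "cart-service", "source": "logs", "summary": "OOM kill"},
    {"service": "unknown-service", "source": "traces", "summary": "slow span"},
]


def attach_alerts(catalog, alerts):
    """Return a copy of the catalog with each entity's open alerts attached,
    plus the alerts that match no known entity."""
    by_service = defaultdict(list)
    for alert in alerts:
        by_service[alert["service"]].append(alert)

    enriched = {
        identifier: {**entity, "open_alerts": by_service.get(identifier, [])}
        for identifier, entity in catalog.items()
    }
    unmatched = [a for a in alerts if a["service"] not in catalog]
    return enriched, unmatched


if __name__ == "__main__":
    enriched, unmatched = attach_alerts(catalog, alerts)
    for identifier, entity in enriched.items():
        print(identifier, entity["owner"], len(entity["open_alerts"]), "open alerts")
    print("alerts with no catalog entity:", len(unmatched))
```

The unmatched list is worth keeping: an alert that maps to no catalog entity usually points at an asset nobody has registered or owns.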

How portals and observability work together

Here’s an example: without an internal developer portal, when an issue arises, such as a memory error, a developer is typically assigned a ticket to fix it.

At this point they may need to dive into multiple observability tools like Datadog, Grafana, and New Relic to understand what occurred. This can be difficult because many organizations operate with a ticketing system that requires approval to access these platforms. Once granted access, they need to navigate between dozens of different dashboards and may experience difficulty determining the right dashboard to use. This can result in prolonged war room sessions as teams work through the night to identify the root cause.

An internal developer portal can bridge these gaps by connecting the dots between different systems and giving users the autonomy to access the information they need quickly. This reduces the complexity and time needed to diagnose and resolve issues.

How a portal can help in incident management with the right information

Observability and monitoring tools are crucial, but they are just one part of the broader chain of events needed to resolve issues. The journey to resolution often involves navigating through various systems—you're in Splunk for logs, Prometheus for metrics, and Honeycomb for tracing. This complex web of tools can be time-consuming and cumbersome to sift through to find answers.

This is where an internal developer portal becomes incredibly powerful. Imagine you detect an issue with your recommendation service. With a portal, you can immediately understand the problem's context through a unified view. The portal’s graph allows you to see how services are deployed, their relationships, and the cloud environments they operate in. 

The visualization is just one part of the equation. The underlying metrics, logs, and traces—often spread across different systems—are the meat of observability. A portal’s promise lies in its ability to truly integrate these components, making it easier to bring all this data together for context and quick issue resolution.

What SREs can build with an internal developer portal

Site Reliability Engineers (SREs) can use the internal developer portal to drive better incident management outcomes:

A better on-call experience: SREs can use the portal to build a framework that will make on-call work better.

For instance, SRE and platform engineering teams focused on uptime and reliability may have already created a suite of dashboards for critical services. The portal can link to these dashboards so that users or on-call personnel can quickly access the necessary information, whether it's in Grafana, Datadog, or another tool. By applying specific filters to dashboards and associating them with the relevant service, individual contributors can easily find the data they need.

These contributors should already have the appropriate permissions to view this information, as it pertains to their services. Ultimately, an internal developer portal aims to break down barriers and make information readily accessible.
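
One way to picture dashboards that are pre-filtered and associated with a service is a small link builder. The Python sketch below is hypothetical throughout: the base URLs use a placeholder domain, and the `service` and `env` query parameter names are assumptions, since each observability tool defines its own filter syntax.

```python
from urllib.parse import urlencode

# Hypothetical dashboard base URLs (placeholder domain, assumed paths).
DASHBOARDS = {
    "latency": "https://dashboards.example.com/d/latency",
    "errors": "https://dashboards.example.com/d/errors",
}


def dashboard_links(service, environment="production"):
    """Build one pre-filtered link per dashboard for the given service.
    The query parameter names are assumptions, not a real tool's API."""
    query = urlencode({"service": service, "env": environment})
    return {name: f"{base}?{query}" for name, base in DASHBOARDS.items()}


if __name__ == "__main__":
    for name, url in dashboard_links("cart-service").items():
        print(f"{name}: {url}")
```

Stored alongside the service in the catalog, links like these let an on-call engineer land on the right view without first working out which dashboard applies.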

Permissions: SREs can also create dynamic or just-in-time permissions to ensure on-call engineers can easily self-serve.

This is especially critical during off-hours. If an issue arises at 3 AM and the on-call SRE is working alone, having a portal to access all necessary information can be crucial. 

For example, many companies rely on a Confluence page or other documentation tool that provides a step-by-step guide to the troubleshooting process. However, these steps can include tools that the SRE doesn’t have permission to use, or may point to a dashboard that doesn’t make sense to them. A portal can automate and enhance these steps in two ways: by providing self-service actions to request access to tools or databases, and by presenting relevant information in a form the SRE will understand. For instance, it can show dashboards specifically tailored to them, rather than giving them access to observability and monitoring tools that are tailored to DevOps engineers.
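
A just-in-time permission can be thought of as a grant that is issued only to whoever is on call and that expires on its own. The Python sketch below illustrates that idea under stated assumptions: the `Grant` class, the `request_access` function, the roster and the four-hour default lifetime are all hypothetical, and a real setup would delegate to the organization's identity and access tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """Hypothetical time-boxed access grant (assumed model)."""
    user: str
    resource: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        return now < self.expires_at


def request_access(user, resource, on_call_roster, now, ttl=timedelta(hours=4)):
    """Self-service action: issue a temporary grant if the user is on call.
    The 4-hour default is an assumption chosen for illustration."""
    if user not in on_call_roster:
        raise PermissionError(f"{user} is not on call for {resource}")
    return Grant(user=user, resource=resource, expires_at=now + ttl)


if __name__ == "__main__":
    now = datetime(2024, 8, 7, 3, 0, tzinfo=timezone.utc)  # the 3 AM page
    grant = request_access("dana", "production-logs", {"dana"}, now)
    print("active after 1h:", grant.is_active(now + timedelta(hours=1)))
    print("active after 5h:", grant.is_active(now + timedelta(hours=5)))
```

Because the grant carries its own expiry, nobody needs to remember to revoke access once the incident is over.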

Internal developer portals give you the freedom to change the underlying observability tools

Internal developer portals are loosely coupled with the underlying internal developer platform. This means that the underlying platform tools can be replaced without hurting or changing the developer experience. In the case of sometimes costly observability tools, internal developer portals allow you to change tools as you need, while ensuring that the developers have the same experience addressing observability issues. 

This future-proof approach ensures that regardless of the tools an organization uses, an internal developer portal can adapt. If the technology stack changes, the portal can accommodate those changes effortlessly. Being observability-tooling agnostic, the portal ensures that the right tools are used for the right jobs, all while simplifying the user’s interaction with the system.
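
Loose coupling of this kind is commonly achieved by coding against a small interface and hiding each tool behind an adapter. The Python sketch below shows the pattern only: `MetricsBackend`, the two adapter classes and `service_health` are hypothetical names, the backends are in-memory stand-ins rather than real vendor clients, and the 1% error budget is an assumed value.

```python
from typing import Dict, Protocol


class MetricsBackend(Protocol):
    """What the portal-facing code needs from any observability tool."""
    def error_rate(self, service: str) -> float:
        """Fraction of failed requests, between 0.0 and 1.0."""
        ...


class FractionBackend:
    """Stand-in for a tool that already reports error rate as a fraction."""
    def __init__(self, data: Dict[str, float]):
        self._data = data

    def error_rate(self, service: str) -> float:
        return self._data.get(service, 0.0)


class PercentBackend:
    """Stand-in for a tool that reports percentages; the adapter converts."""
    def __init__(self, percent: Dict[str, float]):
        self._percent = percent

    def error_rate(self, service: str) -> float:
        return self._percent.get(service, 0.0) / 100


def service_health(backend: MetricsBackend, service: str,
                   max_error_rate: float = 0.01) -> str:
    """Portal-facing view: identical no matter which backend is plugged in."""
    if backend.error_rate(service) <= max_error_rate:
        return "healthy"
    return "degraded"


if __name__ == "__main__":
    print(service_health(FractionBackend({"cart": 0.002}), "cart"))
    print(service_health(PercentBackend({"cart": 5.0}), "cart"))
```

Swapping tools then means writing a new adapter, while `service_health` and everything developers see on top of it stay unchanged.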
