Debugging K8s with K8sGPT in your internal developer portal
December 6, 2024
Editor’s note: This post was originally published on The New Stack on 21 November 2024.
Quickly identifying and resolving issues is a constant challenge for DevOps and SRE teams, who often find themselves navigating a complex web of commands, logs, and dashboards that is unique to each problem. This fragmented approach delays resolutions: developers frequently report spending nearly 40% of their time just troubleshooting, and the manual work exposes software environments to the risk of human error.
Platform engineering emerged as a way to overcome this DevOps complexity, and at the heart of platform engineering is the internal developer portal. An internal developer portal streamlines incident response, reduces manual toil, and empowers DevOps teams to resolve issues faster. It offers a unified space for managing infrastructure, code repositories, and deployments.
Portals also centralize all the data related to your software development lifecycle (SDLC) in one accessible place. Integrating AI into your portal can help you proactively identify potential system degradations and provide instant guidance on remediation, which can sometimes cut your average incident resolution time by 50%.
In this article, I’ll walk you through how to accelerate issue resolution using AI to enrich portal data, and how to display the enriched data within the portal to reduce time-to-resolution.
Using K8sGPT to enrich portal data
K8sGPT is an AI agent specifically designed for Kubernetes environments. It surfaces actionable insights from historical data, providing quick recommendations that significantly reduce resolution times. By pinpointing anomalies or misconfigurations and offering intelligent solutions, K8sGPT transforms a traditionally reactive process into a proactive one. Plus, when tightly integrated with your portal, these insights are presented in a single pane of glass that is fully aligned with your operational workflows.
While our example will focus on Kubernetes only, AI can assist across multiple domains in more advanced scenarios, such as cloud infrastructure, where issues often span different layers of the stack. Our ultimate goal is not just to equip AI to handle multiple domains but to empower it to fully automate the remediation process, resolving issues independently.
In the context of an internal developer portal, you can use K8sGPT to gather data from all of your workflows across your entire SDLC and draw insights from them. With that vision in mind, let’s start with small steps and explore how a single-domain workflow can improve efficiency.
Deploying an automatic AI enrichment process
Let’s say we want to create an automated workflow to enrich our internal developer portal with a real-time view of failing Kubernetes workloads. This workflow involves several key components that, working together, use AI to create an automated process for solving observed K8s issues from within the portal.
These components are:
- Kubernetes (K8s) cluster: This represents our workload infrastructure. There are multiple ways to deploy Kubernetes clusters; the most common are managed services such as EKS, AKS, and GKE. Whichever you use, you will also need an integration between the cluster and the portal so that the portal has both a listing of the workloads and their health status.
- Internal developer portal: This is where all the data about our Kubernetes cluster is centralized, correlated, and refined. It gives us easy access to deployment data and AI insights on how to fix unhealthy Kubernetes workloads. We’ll be using Port for this example.
- K8sGPT: This is our main AI “consultant.” It is in charge of both issuing commands against the K8s API to collect data and communicating back and forth with an LLM that provides insights.
  - K8sGPT can be deployed either outside the cluster or in-cluster. To deploy the K8sGPT REST API server, follow the installation guide.
  - Use this command to serve the REST API: k8sgpt serve --http
  - To deploy the in-cluster K8sGPT, follow the installation guide.
- Communication facilitator: The communication facilitator is crucial for bridging the gap between the portal and K8sGPT. It ensures that commands, queries, and insights flow seamlessly between these systems. Depending on your organization's security and compliance requirements, you can use:
  - Kafka topics: In our example, we use Kafka topics: when a workload is recognized as failing, a message is produced to a topic, and the communication facilitator (in our case, a Python script) checks the topics and consumes messages in a pull-based fashion.
  - Alternatively, the script can skip Kafka and continuously check workload health locally, enriching the check with AI insights whenever workloads fail.
- Large language model (LLM): The core intelligence behind K8sGPT, which uses natural language processing to interpret Kubernetes data and provide actionable recommendations.
Note that K8sGPT currently supports 11 different AI backends. I have tested it with Ollama; you can follow the instructions to download a model here. These guides on using an OpenAI API token and on deploying an on-prem LLM using Ollama can also help.
Once deployed, you can configure K8sGPT to use Ollama like so (Ollama exposes an OpenAI-compatible API, which is why the openai backend is used here):
k8sgpt serve --http -b openai
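To make this concrete, here is a minimal sketch of querying the K8sGPT REST API from Python. The /v1/analyze route, default port, and query parameters are assumptions based on K8sGPT's HTTP gateway; verify them against your own deployment.

```python
# Minimal sketch: query a running K8sGPT REST API for explained analysis
# results. The route, port, and parameters below are assumptions -- check
# the K8sGPT serve documentation for the exact contract.
import requests

K8SGPT_URL = "http://localhost:8080/v1/analyze"  # assumed default serve address

def get_insights(namespace: str) -> list:
    """Ask K8sGPT to analyze a namespace and explain any problems it finds."""
    response = requests.get(
        K8SGPT_URL,
        params={"namespace": namespace, "explain": "true"},
        timeout=120,  # LLM-backed explanations can take a while
    )
    response.raise_for_status()
    return response.json().get("results", [])

if __name__ == "__main__":
    for result in get_insights("default"):
        print(result)
```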
Understanding the flow of events
Let’s take a look at the diagram below to better understand how we enable a fully automated flow of events that serves AI insights. The numbers beside each element in the tech stack correspond to the following steps in the flow:
1. A K8s integration updates the portal with the health status of a workload
2. An automated workflow issues a message to a Kafka topic
3. Our Python script picks up the topic message data
4. Our Python script polls K8sGPT for insights
5. K8sGPT communicates with our K8s cluster and LLM
6. K8sGPT replies to our Python script with insights
7. Our Python script uses our portal API (in our case, Port) to populate the Kubernetes workload entity with insights on how to fix the issue
Configuring the automated workflow in the portal
Let's now configure the portal to facilitate the automated workflow. First things first, we need to make room for AI insights data as part of the Kubernetes workload dashboard. We can add an Insights property to our workload blueprint:
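For illustration, the new property could look roughly like the snippet below. The property name, the markdown format, and the schema shape are assumptions for this sketch; Port's blueprint documentation has the authoritative structure.

```python
# Illustrative shape of the Insights property inside the workload blueprint's
# schema. Names and structure are assumptions, not Port's exact contract.
workload_blueprint_schema = {
    "schema": {
        "properties": {
            "insights": {
                "title": "AI Insights",
                "type": "string",
                "format": "markdown",  # lets the portal render the advice nicely
            }
        }
    }
}
```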
Next, we define a workflow that will automatically notify our communication facilitator of an unhealthy workload. We need to select the datapoint that will trigger the workflow automation (reporting on the health of our workloads) and we need to define what will be triggered (workload data will be sent to Kafka):
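Roughly, the automation pairs a trigger (the workload's health property turning unhealthy) with a Kafka invocation. The sketch below approximates that definition as a Python dict; the field names, the healthStatus property, and the condition syntax are assumptions, so consult Port's automation docs for the exact schema.

```python
# Illustrative automation: fire when a workload entity's health turns
# "Unhealthy" and forward the event to Kafka. Field names and the JQ
# condition are assumptions modeled on Port's automation schema.
unhealthy_workload_automation = {
    "identifier": "notify_unhealthy_workload",
    "trigger": {
        "type": "automation",
        "event": {
            "type": "ENTITY_UPDATED",
            "blueprintIdentifier": "workload",
        },
        "condition": {
            "type": "JQ",
            "expressions": ['.diff.after.properties.healthStatus == "Unhealthy"'],
            "combinator": "and",
        },
    },
    "invocationMethod": {"type": "KAFKA"},
    "enabled": True,
}
```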
Last but not least, we need to create the facilitator itself (a minimal sketch follows this list), which will:
- Continuously listen to Kafka topics
- Consume relevant messages (with the right type) to poll K8sGPT regarding the identified, unhealthy workloads
- Populate the portal with insights from K8sGPT for the relevant unhealthy Kubernetes workloads
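Here is a minimal sketch of such a facilitator in Python. The topic name, message shape, Port API routes, and the kafka-python client are all stand-ins for whatever your own setup uses; the K8sGPT endpoint matches the earlier sketch.

```python
# Facilitator sketch: consume unhealthy-workload messages from Kafka, ask
# K8sGPT for insights, and patch them onto the portal entity. Topic name,
# message fields, and API routes are illustrative assumptions.
import json
import os

import requests
from kafka import KafkaConsumer  # pip install kafka-python

K8SGPT_URL = "http://localhost:8080/v1/analyze"
PORT_API = "https://api.getport.io/v1"
KAFKA_TOPIC = "unhealthy-workloads"  # hypothetical topic name

def port_token() -> str:
    """Exchange client credentials for a Port API access token."""
    creds = {
        "clientId": os.environ["PORT_CLIENT_ID"],
        "clientSecret": os.environ["PORT_CLIENT_SECRET"],
    }
    resp = requests.post(f"{PORT_API}/auth/access_token", json=creds, timeout=30)
    resp.raise_for_status()
    return resp.json()["accessToken"]

def k8sgpt_insights(namespace: str) -> str:
    """Collect K8sGPT explanations for a namespace as a single string."""
    resp = requests.get(
        K8SGPT_URL, params={"namespace": namespace, "explain": "true"}, timeout=120
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    # "details" is where explanations typically land; adjust to the real shape.
    return "\n\n".join(r.get("details", "") for r in results) or "No findings."

def update_entity(entity_id: str, insights: str) -> None:
    """Write the AI insights back to the workload entity in the portal."""
    headers = {"Authorization": f"Bearer {port_token()}"}
    requests.patch(
        f"{PORT_API}/blueprints/workload/entities/{entity_id}",
        json={"properties": {"insights": insights}},
        headers=headers,
        timeout=30,
    ).raise_for_status()

def main() -> None:
    consumer = KafkaConsumer(
        KAFKA_TOPIC,
        bootstrap_servers=os.environ.get("KAFKA_BROKERS", "localhost:9092"),
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:  # blocks, pulling messages as they arrive
        workload = message.value  # assumed: {"identifier": ..., "namespace": ...}
        insights = k8sgpt_insights(workload["namespace"])
        update_entity(workload["identifier"], insights)

if __name__ == "__main__":
    main()
```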
The portal then displays the K8s AI insights on the workload entity itself, alongside its other properties.
Now, when this workflow is implemented, you receive regular, real-time updates on the health of your Kubernetes workloads, alongside recommendations for resolving any issues, which reduces the time it takes to locate problems and figure out how to fix them.
Additional considerations for your workflow
Though this is a good start to automating your workflow, there are other ways to keep improving it and boosting its efficiency.
We could simplify the flow of events in multiple ways. For example, we could bypass the Kafka event trigger entirely and modify the script to continuously monitor the health of the cluster and distribute insights autonomously, as sketched below.
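The event-less variant could look roughly like this, using the official Kubernetes Python client; the health check (desired versus available replicas) is deliberately naive and just a placeholder for your own criteria.

```python
# Sketch of the event-less variant: poll deployment health directly instead
# of reacting to Kafka messages. The replica comparison is a deliberately
# simplistic stand-in for a real health check.
import time

from kubernetes import client, config  # pip install kubernetes

def unhealthy_deployments() -> list:
    """Return (namespace, name) pairs for deployments missing replicas."""
    apps = client.AppsV1Api()
    unhealthy = []
    for dep in apps.list_deployment_for_all_namespaces().items:
        desired = dep.spec.replicas or 0
        available = dep.status.available_replicas or 0
        if available < desired:
            unhealthy.append((dep.metadata.namespace, dep.metadata.name))
    return unhealthy

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    while True:
        for namespace, name in unhealthy_deployments():
            # Here we would call K8sGPT and push insights to the portal,
            # reusing the helpers from the facilitator sketch above.
            print(f"{namespace}/{name} is unhealthy")
        time.sleep(60)
```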
Note that K8sGPT's command-line output is more refined than the output its REST API returns, so in our example we applied some additional formatting to the REST API output before pushing it to the portal.
Using alternative GPTs
While K8sGPT offers impressive capabilities, it is focused on Kubernetes. As discussed earlier, the ultimate goal is AI that assists across multiple domains, such as cloud infrastructure, and ultimately automates the remediation process end to end.
I recently came across another emerging open-source AI project, HolmesGPT, which I believe complements and even extends K8sGPT’s functionality.
HolmesGPT offers AI-driven insights, supports Kubernetes as well as other deployment architectures, and works with multiple AI models. In my experiments, it stood out for its advanced capabilities and superior performance.
One of HolmesGPT’s standout features is its ability to understand and respond to natural language queries. Here are two examples that illustrate its prowess:
- Simple question: identifying Kubernetes pods with issues (for example, asking "which pods in my cluster are unhealthy?")
- Complex inquiry: requesting solutions (for example, "why are these pods failing, and how do I fix them?")
But HolmesGPT doesn’t stop at Kubernetes. It extends its analytical capabilities to a wide range of platforms and tools, including PagerDuty, OpsGenie, Prometheus, and Jira, among others. This cross-domain functionality is a game-changer, allowing users to set up workflows that analyze and interpret logs and data from many different sources.
Central to this capability is the concept of runbooks, which can be defined in natural language. These runbooks enable users to create cross-domain workflows for comprehensive issue analysis and resolution, making the entire troubleshooting process more coherent and streamlined.
In essence, HolmesGPT isn’t just an AI tool for Kubernetes — it’s a holistic solution for modern DevOps environments, empowering teams to resolve issues more efficiently and effectively.
Summary
- Debugging and resolving issues often consumes significant time and involves error-prone manual processes for engineers.
- Reducing time-to-resolution is crucial for improving service quality and allowing teams to focus on innovation.
- Internal developer portals represent a significant step towards reducing time-to-resolution by providing refined, contextual information.
- Portals can be further enhanced by leveraging AI insights across various domains.
- The ultimate goal is to achieve cross-domain insights and automated remediation, streamlining problem-solving processes.
Want to see how it could work for you? Check out Port’s live demo or read about driving developer self-service with Crossplane, Kubernetes and a portal, here.