Debugging K8s with K8sGPT in your internal developer portal

December 6, 2024

Editor’s note: This post was originally published on The New Stack on 21 November 2024.

Quickly identifying and resolving issues is a constant challenge for DevOps and SRE teams, who often find themselves navigating a complex web of commands, logs, and dashboards unique to each problem. This fragmented approach delays resolution: developers frequently report spending nearly 40% of their time just troubleshooting, and all that manual work also exposes software environments to the risk of human error.

Platform engineering emerged as a way to overcome this DevOps complexity, and at the heart of platform engineering is the internal developer portal. An internal developer portal streamlines incident response, reduces manual toil, and empowers DevOps teams to resolve issues faster. It offers a unified space for managing infrastructure, code repositories, and deployments. 

Portals also centralize all the data related to your software development lifecycle (SDLC) in one accessible place. Integrating AI into your portal can help you proactively identify potential system degradations and provide instant guidance on remediation, which can sometimes cut your average incident resolution time by 50%.

In this article, I’ll walk you through how to accelerate issue resolution by using AI to enrich portal data, and how to display that enriched data within the portal to reduce time-to-resolution.

Using K8sGPT to enrich portal data

K8sGPT is an AI agent specifically designed for Kubernetes environments. It surfaces actionable insights from historical data, providing quick recommendations that significantly reduce resolution times. By pinpointing anomalies or misconfigurations and offering intelligent solutions, K8sGPT transforms a traditionally reactive process into a proactive one. Plus, by tightly integrating with your portal, these insights are presented in a single pane of glass that is fully aligned with your operational workflows.
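To get a feel for the raw material these insights are built from, you can run K8sGPT’s analyzer directly from the command line. A minimal example (the --explain flag asks the configured LLM backend for remediation advice):

# Scan the cluster for problems and have the LLM backend explain each one
k8sgpt analyze --explain

# Narrow the scan to pod-level issues in a specific namespace
k8sgpt analyze --explain --filter Pod --namespace default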

While our example will focus on Kubernetes only, AI can assist across multiple domains in more advanced scenarios, such as cloud infrastructure, where issues often span different layers of the stack. Our ultimate goal is not just to equip AI to handle multiple domains but to empower it to fully automate the remediation process, resolving issues independently. 

In the context of an internal developer portal, you can use K8sGPT to gather data from all of your workflows across your entire SDLC and draw insights from them. With that vision in mind, let’s start with small steps and explore how a single-domain workflow can improve efficiency.

Deploying an automatic AI enrichment process

Let’s say we want to create an automated workflow that enriches our internal developer portal with a real-time view of failing Kubernetes workloads. This workflow involves several key components that, working together, use AI to surface fixes for observed K8s issues directly in the portal.

These components are:

  • Kubernetes (K8s) cluster: This represents our workload infrastructure. There are multiple ways to deploy Kubernetes clusters; the most common are managed offerings such as EKS, AKS, and GKE. Whichever you use, you also need an integration between the cluster and the portal so that the portal can list the workloads and track their health status.
  • Internal developer portal: This is where all the data about our Kubernetes cluster is centralized and can be correlated and refined. It gives us easy access to deployment data and AI insights on how to fix unhealthy Kubernetes workloads. We’ll be using Port for this example.
  • K8sGPT: This is our main AI “consultant.” It is in charge of both issuing commands against the K8s API to collect data and communicating back and forth with an LLM that provides insights.
  • Communication facilitator: The communication facilitator is crucial for bridging the gap between the portal and K8sGPT. It ensures that commands, queries, and insights flow seamlessly between these systems. Depending on your organization's security and compliance requirements, you can use:
    • Kafka topics: In our example we use Kafka topics, which means that when a workload is recognized as failing, a message is published to a Kafka topic. The communication facilitator (in our case, a Python script) handles checking the topics and consuming messages on a pull basis.
    • Local polling: Alternatively, the script can simply check workload health locally on a schedule and enrich that check with AI insights whenever workloads fail.
  • Large language model (LLM): The core intelligence behind K8sGPT, using natural language processing to interpret Kubernetes data and provide actionable recommendations.

You should note that K8sGPT currently supports 11 different types of AI backends. I have tested it with Ollama, whose documentation covers downloading a model. There are also guides for using an OpenAI API token and for deploying an on-prem LLM using Ollama.

Once deployed, you can start K8sGPT in server mode. Since Ollama exposes an OpenAI-compatible API, we point K8sGPT at the openai backend:

k8sgpt serve --http -b openai
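Before serving, the backend also needs to be registered. A minimal sketch, assuming Ollama is listening on its default port and a model such as llama3 has been pulled (the model name and base URL are assumptions; adjust them to your deployment):

# Register the OpenAI-compatible backend, pointing it at the local Ollama endpoint
k8sgpt auth add --backend openai --model llama3 --baseurl http://localhost:11434/v1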

Understanding the flow of events

Let’s take a look at the diagram below to better understand how we enable a fully automated flow of events to serve AI insights. The numbers beside each element in the tech stack correspond with an explanation of how they are involved in the flow:

  1. A K8s integration updates the portal with health status of a workload
  2. An automated workflow issues a message to a Kafka Topic
  3. Our Python script picks up the Topic message data
  4. Our Python script polls K8sGPT for insights
  5. K8sGPT communicates with our K8s cluster and LLM
  6. K8sGPT replies to our Python script with insights
  7. Our Python script leverages our portal API — in our case, Port — to populate the Kubernetes workload entity with insights on how to fix the issue

Configuring the automated workflow in the portal

Let's now configure the portal to facilitate the automated workflow. First things first, we need to make room for AI insights data as part of the Kubernetes workload dashboard. We can add an Insights property to our workload blueprint:

A JSON representation of our K8s workload blueprint in data mode (see this JSON code in GitHub)
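The linked file contains the full blueprint. As an illustrative sketch (the property name and markdown format are assumptions), the addition boils down to a single string property on the workload blueprint:

{
  "identifier": "workload",
  "schema": {
    "properties": {
      "insights": {
        "title": "AI Insights",
        "type": "string",
        "format": "markdown",
        "description": "K8sGPT recommendations for fixing this workload"
      }
    }
  }
}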

Next, we define a workflow that will automatically notify our communication facilitator of an unhealthy workload. We need to select the data point that triggers the automation (reporting on the health of our workloads) and define what is triggered (workload data being sent to Kafka):

JSON representation of the automation workflow (see this code in GitHub)
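Again, the full definition is in GitHub. Conceptually, it looks something like this sketch (the identifiers, the health property, and the exact JQ condition are assumptions based on Port’s automation schema):

{
  "identifier": "notify_unhealthy_workload",
  "trigger": {
    "type": "automation",
    "event": {
      "type": "ENTITY_UPDATED",
      "blueprintIdentifier": "workload"
    },
    "condition": {
      "type": "JQ",
      "expressions": [
        ".diff.after.properties.healthStatus == \"Unhealthy\""
      ],
      "combinator": "and"
    }
  },
  "invocationMethod": {
    "type": "KAFKA",
    "payload": {
      "entity": "{{ .event.diff.after }}"
    }
  },
  "publish": true
}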

Last but not least, we need to create the facilitator itself (a minimal sketch follows the list below) that will:

  • Continuously listen to Kafka topics
  • Consume the relevant messages (those with the right type) and poll K8sGPT about the identified unhealthy workloads
  • Populate the portal with insights from K8sGPT for the relevant unhealthy Kubernetes workloads
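Here is a minimal sketch of such a facilitator, assuming K8sGPT is running in serve mode with its HTTP gateway enabled (as configured above) and writing back through Port’s entity API. The topic name, the K8sGPT endpoint and response fields, and the message shape are assumptions to adapt to your setup:

import json
import requests
from kafka import KafkaConsumer  # kafka-python

K8SGPT_URL = "http://localhost:8080/v1/analyze"  # assumed K8sGPT HTTP gateway endpoint
PORT_API = "https://api.getport.io/v1"
PORT_TOKEN = "<port-api-token>"  # placeholder

# Listen continuously for unhealthy-workload messages published by the portal automation
consumer = KafkaConsumer(
    "unhealthy-workloads",  # placeholder topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    entity = message.value

    # Poll K8sGPT for an AI-explained analysis of the failing workload's namespace
    resp = requests.get(
        K8SGPT_URL,
        params={"namespace": entity["namespace"], "explain": "true"},
    )
    results = resp.json().get("results", [])
    insights = "\n\n".join(r.get("details", "") for r in results)

    # Populate the workload entity in the portal with the insights
    requests.patch(
        f"{PORT_API}/blueprints/workload/entities/{entity['identifier']}",
        headers={"Authorization": f"Bearer {PORT_TOKEN}"},
        json={"properties": {"insights": insights}},
    )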

Here is an example of what K8s AI insights look like in the portal:

JSON representation of our insights on Kubernetes workloads (see this code in GitHub)
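Trimmed down, the payload written back to the entity looks roughly like this (the identifier and insight text are illustrative):

{
  "identifier": "cart-service-default",
  "properties": {
    "insights": "Error: the pod is in CrashLoopBackOff. Solution: check the container logs, verify the image tag exists, and confirm resource limits are not causing OOM kills."
  }
}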

Now, when this workflow is implemented, you can receive regular, real-time updates on the health of your Kubernetes workloads, alongside recommendations for how to resolve any issues, reducing the time it takes to locate problems and figure out how to fix them.

Additional considerations for your workflow

Though this is a good start to automating your workflow, there are other ways to keep improving it and boost efficiency.

We could simplify the flow of events in multiple ways. For example, we could bypass the event trigger entirely and modify the script to continuously monitor the health of the cluster, distributing insights autonomously, as in the sketch below.
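A minimal polling variant, reusing the same assumed K8sGPT endpoint as the consumer above:

import time
import requests

K8SGPT_URL = "http://localhost:8080/v1/analyze"  # assumed K8sGPT HTTP gateway endpoint

while True:
    # Scan the whole cluster on a schedule instead of waiting for Kafka events
    results = requests.get(K8SGPT_URL, params={"explain": "true"}).json().get("results", [])
    for result in results:
        # Same enrichment step as before: patch the matching portal entity
        print(result.get("name"), result.get("details", ""))
    time.sleep(300)  # poll every five minutes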

We also found that K8sGPT’s command-line output is more refined than what its REST API returns, so we made some additional modifications in the script to clean up the REST-generated output.

Using alternative GPTs

While K8sGPT offers impressive capabilities, as noted earlier, more advanced scenarios call for AI that can assist across multiple domains, such as cloud infrastructure, where issues often span different layers of the stack. Our ultimate goal remains to empower AI to fully automate the remediation process and resolve issues independently.

I recently came across another emerging open-source AI project, HolmesGPT, which I believe complements and even extends K8sGPT’s functionality. 

HolmesGPT offers AI-driven insights and supports both Kubernetes and other, more flexible deployment architectures, along with multiple AI models. Based on my experiments, it stands out for its advanced capabilities and strong performance.

One of HolmesGPT’s standout features is its ability to understand and respond to natural language queries. Two examples, sketched as CLI queries below, illustrate its prowess:

  • Simple question: Identifying Kubernetes pods with issues
  • Complex inquiry: Requesting solutions
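With HolmesGPT’s holmes ask command, those two queries might look like the following (the prompts and service name are illustrative):

# Simple question: identify Kubernetes pods with issues
holmes ask "which pods in my cluster are unhealthy?"

# Complex inquiry: request a solution
holmes ask "what is causing the payments pods to crash, and how do I fix it?"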

But HolmesGPT doesn’t stop at Kubernetes. It extends its analytical capabilities to a wide range of platforms and tools, including PagerDuty, OpsGenie, Prometheus, and Jira, among others. This cross-domain functionality is a game-changer, allowing users to set up workflows that analyze and interpret logs and data from many different sources.

Central to this capability is the concept of runbooks, which can be defined in natural language. These runbooks enable users to create cross-domain workflows for comprehensive issue analysis and resolution, making the entire troubleshooting process more coherent and streamlined.

In essence, HolmesGPT isn’t just an AI tool for Kubernetes — it’s a holistic solution for modern DevOps environments, empowering teams to resolve issues more efficiently and effectively.

Summary

  • Debugging and resolving issues often consumes significant time and involves error-prone manual processes for engineers.
  • Reducing time-to-resolution is crucial for improving service quality and allowing teams to focus on innovation.
  • Internal developer portals represent a significant step towards reducing time-to-resolution by providing refined, contextual information.
  • Portals can be further enhanced by leveraging AI insights across various domains.
  • The ultimate goal is to achieve cross-domain insights and automated remediation, streamlining problem-solving processes.

Want to see how it could work for you? Check out Port’s live demo, or read about driving developer self-service with Crossplane, Kubernetes, and a portal.
