Debugging K8s with K8sGPT in your internal developer portal
December 6, 2024
Editor’s note: This post was originally published on The New Stack on 21 November 2024.
Quickly identifying and resolving issues is a constant challenge for DevOps and SRE teams, who often find themselves navigating a complex web of commands, logs, and dashboards that is unique to each problem. This fragmented approach delays resolutions: developers frequently report spending nearly 40% of their time just troubleshooting, and the manual work exposes software environments to the risk of human error.
Platform engineering emerged as a way to overcome this DevOps complexity, and at the heart of platform engineering is the internal developer portal. An internal developer portal streamlines incident response, reduces manual toil, and empowers DevOps teams to resolve issues faster. It offers a unified space for managing infrastructure, code repositories, and deployments.
Portals also centralize all the data related to your software development lifecycle (SDLC) in one accessible place. Integrating AI into your portal can help you proactively identify potential system degradations and provide instant guidance on remediation, which can sometimes cut your average incident resolution time by 50%.
In this article, I’ll walk you through how to accelerate issue resolution using AI to enrich portal data, and how to display the enriched data within the portal to reduce time-to-resolution.
Using K8sGPT to enrich portal data
K8sGPT is an AI agent specifically designed for Kubernetes environments. It surfaces actionable insights from historical data, providing quick recommendations that significantly reduce resolution times. By pinpointing anomalies or misconfigurations and offering intelligent solutions, K8sGPT transforms a traditionally reactive process into a proactive one. Plus, when tightly integrated with your portal, these insights are presented in a single pane of glass that is fully aligned with your operational workflows.
While our example will focus on Kubernetes only, AI can assist across multiple domains in more advanced scenarios, such as cloud infrastructure, where issues often span different layers of the stack. Our ultimate goal is not just to equip AI to handle multiple domains but to empower it to fully automate the remediation process, resolving issues independently.
In the context of an internal developer portal, you can use K8sGPT to gather data from all of your workflows across your entire SDLC and draw insights from them. With that vision in mind, let’s start with small steps and explore how a single-domain workflow can improve efficiency.
Deploying an automatic AI enrichment process
Let’s say we want to create an automated workflow to enrich our internal developer portal with a real-time view of failing Kubernetes workloads. This workflow involves several key components that, working together, use AI to create an automated process for solving observed K8s issues from within the portal.
These components are:
- Kubernetes (K8s) cluster: This represents our workload infrastructure. There are multiple ways to deploy Kubernetes clusters; the most common are managed services such as EKS, AKS, and GKE. Whichever you use, you will also need an integration between the cluster and the portal so that the portal has both a listing of the workloads and their health status.
- Internal developer portal: This is where all the data about our Kubernetes cluster is centralized, correlated, and refined. It gives us easy access to deployment data and AI insights on how to fix unhealthy Kubernetes workloads. We’ll be using Port for this example.
- K8sGPT: This is our main AI “consultant.” It is in charge of both issuing commands against the K8s API to collect data and communicating back and forth with an LLM that provides insights.
  - K8sGPT can be deployed either outside the cluster or in-cluster. To deploy the K8sGPT REST API server, follow the installation guide.
  - Use this command to serve the REST API: k8sgpt serve --http
  - To deploy the in-cluster K8sGPT, follow the installation guide.
- Communication facilitator: The communication facilitator is crucial for bridging the gap between the portal and K8sGPT. It ensures that commands, queries, and insights flow seamlessly between these systems. Depending on your organization's security and compliance requirements, you can use:
  - Kafka topics: In our example, we use Kafka topics: when a workload is recognized as failing, a message is produced to a topic, and the communication facilitator (in our case, a Python script) checks the topics and consumes messages in a pull-based fashion.
  - Alternatively, the script can skip Kafka and continuously check workload health locally, enriching the check with AI insights whenever workloads fail.
- Large language model (LLM): The core intelligence behind K8sGPT, which uses natural language processing to interpret Kubernetes data and provide actionable recommendations.
Note that K8sGPT currently supports 11 different AI backends. I have tested it with Ollama; you can follow the instructions to download a model here. These guides on using an OpenAI API token and on deploying an on-prem LLM using Ollama can also help.
Once deployed, you can configure K8sGPT to use Ollama like so (Ollama exposes an OpenAI-compatible API, which is why the openai backend is used here):
k8sgpt serve --http -b openai
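To make this concrete, here is a minimal sketch of querying the K8sGPT REST API from Python. The /v1/analyze route, default port, and query parameters are assumptions based on K8sGPT's HTTP gateway; verify them against your own deployment.

```python
# Minimal sketch: query a running K8sGPT REST API for explained analysis
# results. The route, port, and parameters below are assumptions -- check
# the K8sGPT serve documentation for the exact contract.
import requests

K8SGPT_URL = "http://localhost:8080/v1/analyze"  # assumed default serve address

def get_insights(namespace: str) -> list:
    """Ask K8sGPT to analyze a namespace and explain any problems it finds."""
    response = requests.get(
        K8SGPT_URL,
        params={"namespace": namespace, "explain": "true"},
        timeout=120,  # LLM-backed explanations can take a while
    )
    response.raise_for_status()
    return response.json().get("results", [])

if __name__ == "__main__":
    for result in get_insights("default"):
        print(result)
```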
Understanding the flow of events
Let’s take a look at the diagram below to better understand how we enable a fully automated flow of events that serves AI insights. The numbers beside each element in the tech stack correspond to the following steps in the flow:
1. A K8s integration updates the portal with the health status of a workload
2. An automated workflow issues a message to a Kafka topic
3. Our Python script picks up the topic message data
4. Our Python script polls K8sGPT for insights
5. K8sGPT communicates with our K8s cluster and LLM
6. K8sGPT replies to our Python script with insights
7. Our Python script uses our portal API (in our case, Port) to populate the Kubernetes workload entity with insights on how to fix the issue
Configuring the automated workflow in the portal
Let's now configure the portal to facilitate the automated workflow. First things first, we need to make room for AI insights data as part of the Kubernetes workload dashboard. We can add an Insights property to our workload blueprint:
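For illustration, the new property could look roughly like the snippet below. The property name, the markdown format, and the schema shape are assumptions for this sketch; Port's blueprint documentation has the authoritative structure.

```python
# Illustrative shape of the Insights property inside the workload blueprint's
# schema. Names and structure are assumptions, not Port's exact contract.
workload_blueprint_schema = {
    "schema": {
        "properties": {
            "insights": {
                "title": "AI Insights",
                "type": "string",
                "format": "markdown",  # lets the portal render the advice nicely
            }
        }
    }
}
```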
Next, we define a workflow that will automatically notify our communication facilitator of an unhealthy workload. We need to select the datapoint that will trigger the workflow automation (reporting on the health of our workloads) and we need to define what will be triggered (workload data will be sent to Kafka):
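Roughly, the automation pairs a trigger (the workload's health property turning unhealthy) with a Kafka invocation. The sketch below approximates that definition as a Python dict; the field names, the healthStatus property, and the condition syntax are assumptions, so consult Port's automation docs for the exact schema.

```python
# Illustrative automation: fire when a workload entity's health turns
# "Unhealthy" and forward the event to Kafka. Field names and the JQ
# condition are assumptions modeled on Port's automation schema.
unhealthy_workload_automation = {
    "identifier": "notify_unhealthy_workload",
    "trigger": {
        "type": "automation",
        "event": {
            "type": "ENTITY_UPDATED",
            "blueprintIdentifier": "workload",
        },
        "condition": {
            "type": "JQ",
            "expressions": ['.diff.after.properties.healthStatus == "Unhealthy"'],
            "combinator": "and",
        },
    },
    "invocationMethod": {"type": "KAFKA"},
    "enabled": True,
}
```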
Last but not least, we need to create the facilitator itself (a minimal sketch follows this list), which will:
- Continuously listen to Kafka topics
- Consume relevant messages (with the right type) to poll K8sGPT regarding the identified, unhealthy workloads
- Populate the portal with insights from K8sGPT for the relevant unhealthy Kubernetes workloads
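Here is a minimal sketch of such a facilitator in Python. The topic name, message shape, Port API routes, and the kafka-python client are all stand-ins for whatever your own setup uses; the K8sGPT endpoint matches the earlier sketch.

```python
# Facilitator sketch: consume unhealthy-workload messages from Kafka, ask
# K8sGPT for insights, and patch them onto the portal entity. Topic name,
# message fields, and API routes are illustrative assumptions.
import json
import os

import requests
from kafka import KafkaConsumer  # pip install kafka-python

K8SGPT_URL = "http://localhost:8080/v1/analyze"
PORT_API = "https://api.getport.io/v1"
KAFKA_TOPIC = "unhealthy-workloads"  # hypothetical topic name

def port_token() -> str:
    """Exchange client credentials for a Port API access token."""
    creds = {
        "clientId": os.environ["PORT_CLIENT_ID"],
        "clientSecret": os.environ["PORT_CLIENT_SECRET"],
    }
    resp = requests.post(f"{PORT_API}/auth/access_token", json=creds, timeout=30)
    resp.raise_for_status()
    return resp.json()["accessToken"]

def k8sgpt_insights(namespace: str) -> str:
    """Collect K8sGPT explanations for a namespace as a single string."""
    resp = requests.get(
        K8SGPT_URL, params={"namespace": namespace, "explain": "true"}, timeout=120
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    # "details" is where explanations typically land; adjust to the real shape.
    return "\n\n".join(r.get("details", "") for r in results) or "No findings."

def update_entity(entity_id: str, insights: str) -> None:
    """Write the AI insights back to the workload entity in the portal."""
    headers = {"Authorization": f"Bearer {port_token()}"}
    requests.patch(
        f"{PORT_API}/blueprints/workload/entities/{entity_id}",
        json={"properties": {"insights": insights}},
        headers=headers,
        timeout=30,
    ).raise_for_status()

def main() -> None:
    consumer = KafkaConsumer(
        KAFKA_TOPIC,
        bootstrap_servers=os.environ.get("KAFKA_BROKERS", "localhost:9092"),
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:  # blocks, pulling messages as they arrive
        workload = message.value  # assumed: {"identifier": ..., "namespace": ...}
        insights = k8sgpt_insights(workload["namespace"])
        update_entity(workload["identifier"], insights)

if __name__ == "__main__":
    main()
```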
The portal then displays the K8s AI insights on the workload entity itself, alongside its other properties.
Now, when this workflow is implemented, you receive regular, real-time updates on the health of your Kubernetes workloads, alongside recommendations for resolving any issues, which reduces the time it takes to locate problems and figure out how to fix them.
Additional considerations for your workflow
Though this is a good start to automating your workflow, there are other ways to keep improving it and boosting its efficiency.
We could simplify the flow of events in multiple ways. For example, we could bypass the Kafka event trigger entirely and modify the script to continuously monitor the health of the cluster and distribute insights autonomously, as sketched below.
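The event-less variant could look roughly like this, using the official Kubernetes Python client; the health check (desired versus available replicas) is deliberately naive and just a placeholder for your own criteria.

```python
# Sketch of the event-less variant: poll deployment health directly instead
# of reacting to Kafka messages. The replica comparison is a deliberately
# simplistic stand-in for a real health check.
import time

from kubernetes import client, config  # pip install kubernetes

def unhealthy_deployments() -> list:
    """Return (namespace, name) pairs for deployments missing replicas."""
    apps = client.AppsV1Api()
    unhealthy = []
    for dep in apps.list_deployment_for_all_namespaces().items:
        desired = dep.spec.replicas or 0
        available = dep.status.available_replicas or 0
        if available < desired:
            unhealthy.append((dep.metadata.namespace, dep.metadata.name))
    return unhealthy

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    while True:
        for namespace, name in unhealthy_deployments():
            # Here we would call K8sGPT and push insights to the portal,
            # reusing the helpers from the facilitator sketch above.
            print(f"{namespace}/{name} is unhealthy")
        time.sleep(60)
```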
Note that K8sGPT's command-line output is more refined than the output its REST API returns, so in our example we applied some additional formatting to the REST API output before pushing it to the portal.
Using alternative GPTs
While K8sGPT offers impressive capabilities, it is focused on Kubernetes. As discussed earlier, the ultimate goal is AI that assists across multiple domains, such as cloud infrastructure, and ultimately automates the remediation process end to end.
I recently came across another emerging open-source AI project, HolmesGPT, which I believe complements and even extends K8sGPT’s functionality.
HolmesGPT offers AI-driven insights, supports Kubernetes as well as other deployment architectures, and works with multiple AI models. In my experiments, it stood out for its advanced capabilities and superior performance.
One of HolmesGPT’s standout features is its ability to understand and respond to natural language queries. Here are two examples that illustrate its prowess:
- Simple question: identifying Kubernetes pods with issues (for example, asking "which pods in my cluster are unhealthy?")
- Complex inquiry: requesting solutions (for example, "why are these pods failing, and how do I fix them?")
But HolmesGPT doesn’t stop at Kubernetes. It extends its analytical capabilities to a wide range of platforms and tools, including PagerDuty, OpsGenie, Prometheus, and Jira, among others. This cross-domain functionality is a game-changer, allowing users to set up workflows that analyze and interpret logs and data from many different sources.
Central to this capability is the concept of runbooks, which can be defined in natural language. These runbooks enable users to create cross-domain workflows for comprehensive issue analysis and resolution, making the entire troubleshooting process more coherent and streamlined.
In essence, HolmesGPT isn’t just an AI tool for Kubernetes — it’s a holistic solution for modern DevOps environments, empowering teams to resolve issues more efficiently and effectively.
Summary
- Debugging and resolving issues often consumes significant time and involves error-prone manual processes for engineers.
- Reducing time-to-resolution is crucial for improving service quality and allowing teams to focus on innovation.
- Internal developer portals represent a significant step towards reducing time-to-resolution by providing refined, contextual information.
- Portals can be further enhanced by leveraging AI insights across various domains.
- The ultimate goal is to achieve cross-domain insights and automated remediation, streamlining problem-solving processes.
Want to see how it could work for you? Check out Port’s live demo or read about driving developer self-service with Crossplane, Kubernetes and a portal, here.