Production readiness checklist: ensuring smooth deployments

June 11, 2024

Ready to start?

Production readiness checklist: ensuring smooth deployments

What is production readiness?

Production readiness can be defined as the process that ensures that specific software components are secure, reliable and are able to perform at the level expected. Achieving production readiness can help reduce the chances of downtime, minimize the number of critical incidents or failures, and provide users with a better experience.

{{cta-demo}}

The idea stemmed from the production readiness review process in Google’s SRE book. The process relies on numerous factors that all play a role in making software production ready; but these factors look different in different engineering organizations.

Production readiness is difficult to achieve, partly because it requires incorporating a number of standards across the software development lifecycle such as reviewing code, testing, monitoring, security and access controls, documentation, deployment workflows and more. The process touches everything from code to post-production ops and additionally, the organization’s production engineering requirements can change over time.

In simpler terms, production readiness is similar to ‘definition of done’; an idea born from the product management world. 

The idea is that ‘the definition of done’ means that different stages need to be completed in their entirety to be considered ‘done’, but nothing in the software world is ever ‘done’, as it has to be continually and consistently reviewed, monitored and maintained. For example, if a service was production-ready when it was scaffolded, it won’t necessarily remain that way over time because requirements change and services (and their components) can degrade. 

There are different ways to actually ensure the review of the production readiness checklist:

  • By the DevOps engineer that is performing the action (for instance, when scaffolding a service)
  • By the developer
  • Using manual lists, stored in excel sheets or Jira
  • Using automated checks, such as using scorecards or self-service actions in an internal developer portal.

Ensuring that reviewing and maintaining lists is easy and doesn’t involve manual work is important; even more when services that have already deployed need to be reviewed for production readiness.

Importance of a production readiness checklist

A production readiness checklist is exactly what it sounds like - a ready-made list of everything that you need to check about your software for production readiness.

Ensuring that software is production-ready is closely tied to software standardization; all of which encompass the necessary steps to ensure smooth operation in a live environment.

The checklist is important because it means that from an engineering viewpoint the service has the appropriate resilience, security and performance; a slowdown or shutdown of your software or a breach could have a hugely detrimental impact on your business’ reputation and bottom line. As an engineering team, it could also have negative consequences internally. On the flip side, using a checklist can improve the user experience, and as a byproduct you can retain (and grow) customer trust and revenue. 

Getting started with a production readiness checklist 

Before we get into what you have to include in your list, there are some things you should bear in mind when planning your production readiness checklist:

The checklist shouldn’t be static

The software development life cycle is continually evolving and so new frameworks, dependencies and technologies should be factored into any checks. In fact, because of this evolution, the checklist, just like ‘the definition of done’ should be considered as the first steps of your production readiness checks, but not your final steps as even if you’ve ensured there are no bugs or vulnerabilities before deployment, there needs to be a way to check that new vulnerabilities have not appeared and that there is an approach to resolving such vulnerabilities embedded in the organization’s standards. This is where ongoing review of readiness comes into place.

Automated production readiness checks are vital

While the whole idea of using a checklist sounds like a manual approach in itself - to improve the efficiency and accuracy, it should evolve to automated checks. The only real manual approach of the checklist is compiling the list itself and verifying that it makes sense with all stakeholders.

When it comes to the checks themselves, each organization will have their own approach but it’s clear that manual checks - using spreadsheets, project management software or Configuration Management Databases (CMDBs) are inefficient and may not be up-to-date, which can subsequently hinder the trust that engineers have in the process. Automated checks, which rely on scorecards of production readiness, using internal developer portals, can monitor and validate readiness criteria on a continuous basis, and can consistently perform checks without human error. The automated aspect also enables these checks to go a step further; providing alerts when issues arise, and then enforcing policies, triggering tests and validating configurations; all providing a more efficient and reliable process. 

Checklists vary greatly and are difficult to put together

Creating a production readiness checklist is challenging due to the diverse requirements of different software components (e.g., APIs, microservices). These standards vary based on numerous factors, including the infrastructure, underlying technology, and the role of each component within the overall engineering ecosystem.

Each organization requires its own set of production readiness metrics and checklists tailored to its unique:

  • Business needs (eg. highly regulated industries handling sensitive data); and
  • Technical environments (eg. externally exposed services needing robust security measures, or adherence to specific Kubernetes standards).

Core components of a production readiness checklist

A comprehensive production readiness checklist for a service addresses multiple factors to ensure it’s good to go. These include: 

Security

  • Conduct vulnerability scans (are you connected to relevant scanners?)
  • Identify vulnerabilities through a security audit
  • Ensure you have SLOs and maximums set for vulnerabilities
  • Put role-based access controls in place
  • Ensure authentication and authorization methods are in place for each service
  • Static application security testing (SAST) using tools like Snyk to monitor code in the CI/CD pipeline.  
  • Make sure secrets are properly managed
  • Perform penetration tests and dynamic application security testing (DAST) at the appropriate times
  • Check all dependencies are using the correct versions using scanning tools.
  • Implement data encryption for both data at rest and in transit
  • Verify compliance with industry security standards
  • Checks for other common malicious activities

Scalability

  • Ensure the architecture is designed to handle increased loads efficiently
  • Stress test the application’s components to check their limits
  • Check whether your application can handle user or data growth
  • Use performance monitoring for SLOs
  • Establish performance benchmarks and then check these are met
  • Automate the CI/CD release process to enhance scalability
  • Execute automated unit and integration tests that require passing

Reliability

  • Define and monitor compliance with service-level objectives (SLOs), service-level indicators (SLIs) and service-level agreements (SLAs) 
  • Ensure disaster recovery plans are documented and tested
  • Keep regular backups of data
  • Ensure redundancy mechanisms are in place
  • Include automated rollback capabilities to revert to a stable version if needed.

Observability:

  • Implement monitoring with comprehensive KPI and health metrics, logging, and tracing
  • Ensure you are alerted via preferred method (Slack, email, etc)  if the status of your services change (through broken thresholds or inconsistencies)
  • Use dashboards for real-time status 
  • Use logging for incidents and errors

Ownership

  • Identify owners of services and components, include easily discoverable contact information and methods
  • Map upstream and downstream dependencies
  • Identify and make discoverable related teams, stakeholders, and team members

Incident Management

  • Ensure runbooks have been documented and are accessible.
  • Assign on-call responsibilities for incidents
  • Designate owning teams for each service
  • Establish escalation policies
  • Test incident response process with a drill
  • Ensure on-call is able to find the information they need easily during resolution

Addressing these areas ensures software is production-ready, capable of meeting user demands and maintaining reliability throughout its lifecycle.

Not all services need to track every metric listed; additional metrics might include FinOps, specific Kubernetes standards, or application security standards, which aren't always part of SRE activities.

Where to store the production readiness checklist

Where you store your checklist matters because it may impact how easy it is to find, use, update and even delete (by mistake!). Often, companies will store the checklist inside the GitHub repo as a markdown (.md) file; the benefit of this is that it is in the same space as code, and won’t get lost, but the downside is that it might not be as easily accessible. Alternatives include spreadsheets, which, just like the checks themselves, can be a painstaking exercise to use and manually update.

Key takeaways

In conclusion, a production readiness checklist is essential for guaranteeing that your services are secure, scalable, reliable, and observable. It also plays a critical role in implementing continuous integration and deployment (CI/CD), setting service level objectives (SLOs), and establishing robust disaster recovery and rollback plans. Incorporating these elements from the initial launch and throughout subsequent updates ensures the ongoing health and effectiveness of your services.

Learn how you can manage production readiness in an internal developer portal in this guide

{{cta_1}}

Check out Port's pre-populated demo and see what it's all about.

Check live demo

No email required

{{cta_2}}

Contact sales for a technical product walkthrough

Let’s start
{{cta_3}}

Open a free Port account. No credit card required

Let’s start
{{cta_4}}

Watch Port live coding videos - setting up an internal developer portal & platform

{{cta_5}}

Check out Port's pre-populated demo and see what it's all about.

(no email required)

Let’s start
{{cta_6}}

Contact sales for a technical product walkthrough

Let’s start
{{cta_7}}

Open a free Port account. No credit card required

Let’s start
{{cta_8}}

Watch Port live coding videos - setting up an internal developer portal & platform

{{cta-demo}}
{{reading-box-backstage-vs-port}}

Example JSON block

{
  "foo": "bar"
}

Order Domain

{
  "properties": {},
  "relations": {},
  "title": "Orders",
  "identifier": "Orders"
}

Cart System

{
  "properties": {},
  "relations": {
    "domain": "Orders"
  },
  "identifier": "Cart",
  "title": "Cart"
}

Products System

{
  "properties": {},
  "relations": {
    "domain": "Orders"
  },
  "identifier": "Products",
  "title": "Products"
}

Cart Resource

{
  "properties": {
    "type": "postgress"
  },
  "relations": {},
  "icon": "GPU",
  "title": "Cart SQL database",
  "identifier": "cart-sql-sb"
}

Cart API

{
 "identifier": "CartAPI",
 "title": "Cart API",
 "blueprint": "API",
 "properties": {
   "type": "Open API"
 },
 "relations": {
   "provider": "CartService"
 },
 "icon": "Link"
}

Core Kafka Library

{
  "properties": {
    "type": "library"
  },
  "relations": {
    "system": "Cart"
  },
  "title": "Core Kafka Library",
  "identifier": "CoreKafkaLibrary"
}

Core Payment Library

{
  "properties": {
    "type": "library"
  },
  "relations": {
    "system": "Cart"
  },
  "title": "Core Payment Library",
  "identifier": "CorePaymentLibrary"
}

Cart Service JSON

{
 "identifier": "CartService",
 "title": "Cart Service",
 "blueprint": "Component",
 "properties": {
   "type": "service"
 },
 "relations": {
   "system": "Cart",
   "resources": [
     "cart-sql-sb"
   ],
   "consumesApi": [],
   "components": [
     "CorePaymentLibrary",
     "CoreKafkaLibrary"
   ]
 },
 "icon": "Cloud"
}

Products Service JSON

{
  "identifier": "ProductsService",
  "title": "Products Service",
  "blueprint": "Component",
  "properties": {
    "type": "service"
  },
  "relations": {
    "system": "Products",
    "consumesApi": [
      "CartAPI"
    ],
    "components": []
  }
}

Component Blueprint

{
 "identifier": "Component",
 "title": "Component",
 "icon": "Cloud",
 "schema": {
   "properties": {
     "type": {
       "enum": [
         "service",
         "library"
       ],
       "icon": "Docs",
       "type": "string",
       "enumColors": {
         "service": "blue",
         "library": "green"
       }
     }
   },
   "required": []
 },
 "mirrorProperties": {},
 "formulaProperties": {},
 "calculationProperties": {},
 "relations": {
   "system": {
     "target": "System",
     "required": false,
     "many": false
   },
   "resources": {
     "target": "Resource",
     "required": false,
     "many": true
   },
   "consumesApi": {
     "target": "API",
     "required": false,
     "many": true
   },
   "components": {
     "target": "Component",
     "required": false,
     "many": true
   },
   "providesApi": {
     "target": "API",
     "required": false,
     "many": false
   }
 }
}

Resource Blueprint

{
 “identifier”: “Resource”,
 “title”: “Resource”,
 “icon”: “DevopsTool”,
 “schema”: {
   “properties”: {
     “type”: {
       “enum”: [
         “postgress”,
         “kafka-topic”,
         “rabbit-queue”,
         “s3-bucket”
       ],
       “icon”: “Docs”,
       “type”: “string”
     }
   },
   “required”: []
 },
 “mirrorProperties”: {},
 “formulaProperties”: {},
 “calculationProperties”: {},
 “relations”: {}
}

API Blueprint

{
 "identifier": "API",
 "title": "API",
 "icon": "Link",
 "schema": {
   "properties": {
     "type": {
       "type": "string",
       "enum": [
         "Open API",
         "grpc"
       ]
     }
   },
   "required": []
 },
 "mirrorProperties": {},
 "formulaProperties": {},
 "calculationProperties": {},
 "relations": {
   "provider": {
     "target": "Component",
     "required": true,
     "many": false
   }
 }
}

Domain Blueprint

{
 "identifier": "Domain",
 "title": "Domain",
 "icon": "Server",
 "schema": {
   "properties": {},
   "required": []
 },
 "mirrorProperties": {},
 "formulaProperties": {},
 "calculationProperties": {},
 "relations": {}
}

System Blueprint

{
 "identifier": "System",
 "title": "System",
 "icon": "DevopsTool",
 "schema": {
   "properties": {},
   "required": []
 },
 "mirrorProperties": {},
 "formulaProperties": {},
 "calculationProperties": {},
 "relations": {
   "domain": {
     "target": "Domain",
     "required": true,
     "many": false
   }
 }
}
{{tabel-1}}

Microservices SDLC

  • Scaffold a new microservice

  • Deploy (canary or blue-green)

  • Feature flagging

  • Revert

  • Lock deployments

  • Add Secret

  • Force merge pull request (skip tests on crises)

  • Add environment variable to service

  • Add IaC to the service

  • Upgrade package version

Development environments

  • Spin up a developer environment for 5 days

  • ETL mock data to environment

  • Invite developer to the environment

  • Extend TTL by 3 days

Cloud resources

  • Provision a cloud resource

  • Modify a cloud resource

  • Get permissions to access cloud resource

SRE actions

  • Update pod count

  • Update auto-scaling group

  • Execute incident response runbook automation

Data Engineering

  • Add / Remove / Update Column to table

  • Run Airflow DAG

  • Duplicate table

Backoffice

  • Change customer configuration

  • Update customer software version

  • Upgrade - Downgrade plan tier

  • Create - Delete customer

Machine learning actions

  • Train model

  • Pre-process dataset

  • Deploy

  • A/B testing traffic route

  • Revert

  • Spin up remote Jupyter notebook

{{tabel-2}}

Engineering tools

  • Observability

  • Tasks management

  • CI/CD

  • On-Call management

  • Troubleshooting tools

  • DevSecOps

  • Runbooks

Infrastructure

  • Cloud Resources

  • K8S

  • Containers & Serverless

  • IaC

  • Databases

  • Environments

  • Regions

Software and more

  • Microservices

  • Docker Images

  • Docs

  • APIs

  • 3rd parties

  • Runbooks

  • Cron jobs

Starting with Port is simple, fast and free.

Let’s start