Platform Engineering Case Study

Modernizing and Migrating a 20,000-CPU Mission System with Kubernetes

From Overprovisioned and Fragile to Efficient, Scalable, and Observable in a Secure, Air-Gapped Environment

Client: Government - U.S. National Security

About the Client (Government - U.S. National Security)

Our client relied on a massive mission system, and what they called a single “application” was, in reality, a complex system-of-systems consuming roughly 20,000 CPUs across virtual machines and bare metal. Over 15-20 years, subsystems had been passed from contractor to contractor. Environments were snowflaked, deployments were highly manual and slow, and performance tuning was largely guesswork due to a lack of consistent metrics. To keep the mission running, teams overprovisioned compute and scheduled long maintenance windows, sacrificing efficiency and agility.

Outcomes

  • Reduced each subsystem’s compute footprint by ~40-70% through containerization and aggressive rightsizing, recapturing thousands of CPUs while maintaining mission performance
  • Cut major production deployment windows from 4-5 hours of scheduled downtime to ~15 seconds of container restarts, effectively achieving zero-downtime updates for key services
  • Shrunk worst-case build-and-deploy cycles from ~10 hours to ~30 minutes (and many services from ~5 minutes to ~30 seconds), enabling multiple production-quality builds per day
  • Decreased new Kubernetes environment provisioning from an all-day manual effort to 15-20 minutes using Git-driven, repeatable cluster definitions
  • Introduced end-to-end observability and autoscaling for the most compute-intensive subsystems, turning vague “we need more CPU” requests into data-driven tuning and targeted infrastructure fixes
  • Worked in a collaborative model that accelerated results and gave the client ownership of the platform and resulting migration, reducing dependency on third-party contractors

This wasn’t just incremental improvement. It was a fundamental shift to a scalable, observable, and cost-effective Kubernetes platform that could keep pace with mission needs, inside a secure, air-gapped national security environment.

Problem | Before working with us

Before the engagement, the client’s “single application” was actually a large, tightly coupled mission system split into around 12 subsystems, each owned by different contractor teams. Some of those subsystems were 15-20 years old, built on legacy assumptions and rarely revisited architectural decisions.

The environment looked like this:

  • Legacy patterns and technical debt - Some teams had made real progress toward modern practices, but others were stuck with hard-coded IP addresses, assumptions about fixed CPU counts, and applications that would catastrophically fail on receiving a simple SIGTERM. Many services weren’t built to scale horizontally or handle rolling updates.

  • Massive scale, brittle architecture - The system consumed roughly 20,000 CPUs across virtual machines and bare metal. Individual VMs often ran 16 or more processes, with scaling tied to the number of VMs rather than actual service demand.

  • Limited observability and overprovisioning - Platform teams lacked consistent metrics. When performance issues surfaced, the default answer was “add more CPU.” In one case, a subsystem was allocated 400 CPUs per environment, but later metrics showed an average usage of only ~18 CPUs with a peak around 25.

  • Manual, risky deployments - Production releases required 4-5 hours of scheduled downtime. The process was ticket-based, fragile, and stressful. Teams accepted the risk and disruption as the cost of doing business.

  • Air-gapped constraints - Everything ran inside secure, high-side environments without direct Internet access. Patching, image updates, and software distribution were slow, heavily controlled, and lacked modern automation patterns. Some edge environments were effectively sneaker-net only, further complicating operations.

The client knew they had to modernize this environment, but a big-bang migration was impossible. They needed a strategy that respected mission risk, worked within federal security constraints, and could be repeated across many subsystems.

Solution | After working with us

The client had already completed a bake-off between two primary container options: OpenShift and VMware Kubernetes Service (VKS). OpenShift’s opinionated model made it difficult to run several of their older workloads without significant rework. VKS, by contrast, offered a more flexible, unopinionated Kubernetes experience on top of the existing vSphere stack and a clear path to scale.

We stepped in after that decision to help them implement the following:

  • Development Team Consulting - We identified an early adopter team and paired closely with them to modernize their subsystem for containerized deployment. This included reworking legacy patterns (hardcoded IPs, fixed CPU assumptions, brittle signal handling) and building in graceful shutdown, horizontal scaling, and health checks. As that team built confidence and demonstrated results, they became internal advocates, cross-pollinating knowledge and modern practices to the other subsystem teams. This organic spread proved far more effective than top-down mandates for an environment with multiple independent contractor-owned subsystems, each with their own technical debt and institutional habits.

  • Distributed Monitoring - We deployed an observability stack across all environments, giving teams consistent visibility into resource consumption, application health, and performance bottlenecks for the first time. This replaced the guesswork behind “add more CPU” decisions with actual data, exposing cases like CPU overallocation and bandwidth bottlenecks. Development teams gained self-service access to dashboards and alerting, enabling them to troubleshoot issues independently rather than escalating to platform teams.

  • Platform Team Consulting - We upskilled the platform engineering team through collaborative architecture sessions and knowledge transfer, building well-rounded Kubernetes expertise across the group. This reduced reliance on any single decision-maker and gave the team confidence to own infrastructure choices independently. We also helped them establish patterns for managing the complexity of multiple subsystems owned by different contractors, creating shared standards without requiring lock-step coordination.

  • Airgapped-First Approach - We designed every component of the platform assuming disconnected, high-side deployment from day one. Image pipelines, GitOps workflows, and monitoring infrastructure all operated without upstream Internet dependencies. For edge environments with sneaker-net constraints, we built offline-capable distribution patterns that maintained the same automation and auditability as connected clusters.

Together, these approaches preserved security requirements while still enabling modern platform practices and frequent updates.
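The container-readiness work described above (graceful shutdown on SIGTERM, draining in-flight work instead of failing catastrophically) follows a well-known pattern. A minimal Python sketch of that pattern, illustrative rather than taken from the client's codebase: Kubernetes sends SIGTERM when it terminates a pod (rolling update, scale-down), then waits a grace period before SIGKILL, so a service that catches SIGTERM can finish its current work and exit cleanly.

```python
import os
import signal


class Worker:
    """Minimal graceful-shutdown pattern for a containerized service.

    Kubernetes delivers SIGTERM on pod termination and waits
    terminationGracePeriodSeconds before sending SIGKILL. Catching
    SIGTERM lets the service drain instead of losing data mid-write,
    the failure mode some of the legacy subsystems exhibited.
    """

    def __init__(self):
        self.running = True
        self.processed = []
        signal.signal(signal.SIGTERM, self._stop)

    def _stop(self, signum, frame):
        # Flip a flag; the main loop finishes the current item, then stops.
        self.running = False

    def run(self, work_items):
        for item in work_items:
            if not self.running:
                break  # drain point: take no new work after SIGTERM
            self.processed.append(item)  # stand-in for real processing
        # flush state / close connections here before the process exits
        return self.processed


# Simulate Kubernetes terminating the pod:
w = Worker()
os.kill(os.getpid(), signal.SIGTERM)  # handler runs; w.running is now False
# w.run(...) now drains instead of taking new work
```

The same flag also makes a natural readiness-probe signal: once it flips, the service can start failing its readiness check so the load balancer stops routing new traffic to the terminating pod.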

Services Provided

Our partnership with the client was not just about Kubernetes clusters; it was about delivering the right combination of platform engineering, security, and coaching to make the change stick. We used flexible scoping so our team could focus on what mattered most at any given moment, from cluster internals to application architecture to security posture.

Platform Engineering

We treated the Kubernetes platform as a product for internal developers, not just a collection of clusters. Platform capabilities (observability, ingress, CI/CD integration, environment bootstrap, autoscaling support) were prioritized based on direct developer and mission impact. We applied agile practices so platform work was always aligned with the most pressing needs: the next subsystem migration, the next performance test, or the next security requirement. The platform was treated as a living product, continuously refined as teams learned more about operating large-scale Kubernetes in a national security context.

Developer Enablement

We worked directly with development teams to modernize their applications for containerized deployment. This included coaching on cloud-native best practices: twelve-factor app principles, externalized configuration, graceful shutdown handling, and horizontal scaling patterns. We gave teams hands-on access to observability tooling and taught them how to instrument, monitor, and optimize their own services rather than relying on platform teams to diagnose issues. By partnering closely with an early adopter team, we created internal advocates who spread these practices organically across the other subsystem teams.

Migration Services

We used the strangler pattern to guide the transition from VM-based deployments to containers, training teams on the operational and security implications along the way. We partnered with the client’s security organization to adapt traditional Risk Management Framework (RMF) processes to a containerized environment, mapping platform components to relevant STIGs and controls. We shifted security thinking from host-based scanning on every VM to image scanning, least privilege, immutable images, and frequent rotation, preserving the same or stronger security outcomes while creating a path toward faster, more continuous approvals. Throughout, we focused on transfer of ownership, ensuring teams had the skills and confidence to operate in the new model long after the engagement ended.

Kubernetes Engineering

We served as resident platform engineers and SREs, working side-by-side with the client’s platform team on cluster internals and day-two operations. We helped design and implement a Git-based deployment model that integrated with the client’s existing deployment controller rather than forcing a wholesale replacement. We established baseline operational patterns (monitoring, alerting, capacity planning, and incident response) tuned for large-scale Kubernetes in air-gapped environments.

How we worked together

We didn’t just deliver manifests and walk away. We worked as an extension of the client’s teams across disciplines and clearance boundaries.

Together, we:

  • Split our time roughly 60% pairing with developers and platform engineers, 30% workshops and teaching, and 10% solo implementation to keep momentum high
  • Coached application teams on container-friendly architecture, including signal handling, configuration management, and horizontal scaling
  • Worked directly with security, QA, and operations teams to update their mental models for how to test, approve, and operate containerized services
  • Facilitated regular sessions with the chief architects for both the application and platform, aligning technical execution with long-term mission and modernization goals
  • Built cross-team trust so that platform engineers understood application constraints and developers understood how the platform was operated, resulting in a degree of cross-training rarely seen in similar environments

This collaborative model accelerated results and gave the client real ownership of both the platform and the migration process, rather than dependency on external consultants.

Tech Stack Leveraged

Platform Engineering: VMware Kubernetes Service (VKS), Kubernetes, VCF, ESXi, vCenter, NSX Advanced Load Balancer (Avi), GitLab, Artifactory, Jenkins, Prometheus, Grafana, cert-manager, Contour

Detailed Engagement Summary

Over the engagement, we delivered a repeatable, low-risk playbook for modernizing a mission-critical distributed system at massive scale. Working side-by-side with the client’s chief architects, platform engineers, security stakeholders, and developers, we:

1. Designed a migration strategy using the strangler pattern

Instead of a risky, all-at-once cutover, we helped the client adopt a strangler pattern:

  • Treat the existing mission system as a monolith composed of subsystems.
  • Identify one subsystem as the first modernization candidate.
  • Break that subsystem into independently scalable services on Kubernetes.
  • Plug it back into the larger system, verify behavior and performance, then repeat.

We worked closely with the chief application architect to understand:

  • How each subsystem was wired.
  • What external touchpoints and dependencies existed (e.g., shared notification services).
  • Which technology stacks and deployment patterns were in play.

We then met with each development team to assess:

  • Their familiarity with containers and 12-factor-style principles.
  • How they currently built and deployed code.
  • Their appetite and enthusiasm for leading the modernization effort.

Some subsystems were quickly deprioritized as early candidates:

  • One required exactly four CPUs to function correctly.
  • Another was riddled with hard-coded IP addresses.
  • Another would lose data after receiving a SIGTERM due to zero crash handling.

All of these could be addressed over time, but not as the first exemplar.

Instead, we chose a high-value, high-impact subsystem:

  • It represented roughly 40% of the total 20,000-CPU footprint.
  • It was highly compute-intensive and already needed to spread workloads across multiple VMs.
  • The team had adopted queues, environment-specific configuration, and other practices aligned with 12-factor principles.
  • The lead developer was sharp, motivated, and already experimenting with containers.

This became our flagship modernization workload and the internal champion for the broader transformation.

2. Built a platform team and secure-by-design foundation

In parallel, we helped form a dedicated platform team from existing infrastructure engineers:

  • A core group of three engineers took ownership of Kubernetes operations, including provisioning clusters, patching, scaling, and building the platform layer.
  • We paired with the client’s security representatives to translate VM-era controls into container-native patterns.

Because this was a federal, national security environment, we:

  • Mapped Broadcom’s VKS and underlying stack documentation to relevant Kubernetes and Photon STIGs.
  • Explained why host-based scanning inside every container would harm stability, and instead shifted focus to image scanning, immutability, and frequent rotation.
  • Clarified new inheritance models: what the platform secured vs. what each application team remained responsible for.

The goal wasn’t just to pass audits. It was to give the security team a clear, repeatable model for approving containerized workloads without slowing modernization to a crawl.

3. Re-architected the first subsystem for Kubernetes

We then dove deep into the first candidate subsystem:

  • Each legacy VM ran 10 or more distinct processes.
  • Scaling was tied to VM count, not actual per-process load.

We:

  • Mapped those processes into separate Kubernetes Deployments, allowing each to scale independently.
  • Deployed the subsystem into a dedicated Kubernetes cluster (via VKS) and namespace.
  • Initially matched the VM-based CPU and memory footprint to ensure a fair performance comparison.

We then ran the migrated subsystem through the client’s performance lab:

  • This lab mirrored production load patterns and was the gate for any mission-critical change.
  • Criteria for success were clear: hit or exceed existing performance metrics without regression or instability.

The first large-scale test validated the approach and surfaced a networking bottleneck in newly provisioned infrastructure, not in the application itself. Because we had robust metrics in place, we could quickly differentiate between application issues and network saturation and work with the client to fix the underlying switches.

4. Introduced observability, autoscaling, and data-driven tuning

To support both the platform and the application teams, we deployed:

  • Prometheus and Grafana for cluster and workload metrics.
  • Dashboards to visualize CPU, memory, and network usage across all services and nodes.

For the flagship subsystem:

  • We implemented Horizontal Pod Autoscaling (HPA) to handle highly variable, spiky workloads.
  • Autoscaling allowed the service to scale up under heavy load and scale down during quiet periods, without manual intervention or chronic overprovisioning.
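The scaling rule behind the HPA is worth understanding when tuning spiky workloads like this one. Per the Kubernetes HPA algorithm, the controller targets ceil(currentReplicas × currentMetric / targetMetric); a sketch with illustrative numbers:

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """Kubernetes HPA scaling rule:
    desired = ceil(current_replicas * current_metric / target_metric),
    where current_metric is e.g. average CPU utilization across pods
    and target_metric is the utilization target set on the HPA."""
    return math.ceil(current_replicas * current_metric / target_metric)


# Spiky load: 10 pods averaging 90% CPU against a 60% target scale up...
print(desired_replicas(10, 90, 60))  # -> 15
# ...and scale back down during quiet periods.
print(desired_replicas(15, 20, 60))  # -> 5
```

In practice, tuning meant picking a utilization target low enough to absorb the subsystem's bursts before new pods finished starting, while high enough to avoid chronic overprovisioning.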

Once the first containerized service went to production, we observed it for 30+ days:

  • Using our deployed Grafana instance, the development team saw that their CPU and memory were significantly overprovisioned; they then corrected their resource requests, cutting the team’s consumption by ~40%.
  • With hard data in hand, the developer confidently reduced CPU reservations, allowing the platform team to retire multiple hypervisors that had previously been dedicated solely to this subsystem.

We repeated this pattern across additional subsystems: migrate, test, harden, observe, and then aggressively rightsize.
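The "observe, then rightsize" step reduces to simple arithmetic once real metrics exist. A sketch using the overallocated subsystem cited earlier (400 CPUs allocated per environment vs. ~18 average and ~25 peak observed); the 30% headroom factor is our illustrative assumption, not the client's actual sizing policy:

```python
import math


def rightsized_request(peak_usage, headroom=0.30):
    """Size the CPU request from observed peak plus a safety margin,
    rather than from a guess. The 30% headroom is an illustrative
    default, not a universal rule."""
    return math.ceil(peak_usage * (1 + headroom))


allocated = 400     # CPUs reserved per environment before metrics existed
avg, peak = 18, 25  # observed over 30+ days of Prometheus data

request = rightsized_request(peak)    # -> 33 CPUs
recaptured = allocated - request      # -> 367 CPUs returned to the pool
```

With an HPA in place, the request can sit even closer to the observed average, since bursts are absorbed by adding replicas rather than by idle reserved capacity.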

Over the year-long engagement, we:

  • Modernized almost every subsystem in the system.
  • Left a small number of deeply legacy services on VMs, with a path for eventual full rewrites where auto-scaling or containerization would require starting from scratch.

5. Modernized build, deploy, and environment provisioning

Kubernetes alone wasn’t enough; the surrounding delivery machinery needed to evolve as well.

We helped the client:

  • Adopt Git-first patterns for both application manifests and cluster configuration.
  • Keep all Kubernetes resources in Git, while integrating with an existing, long-lived deployment controller instead of forcing an immediate switch to Argo CD or Flux.
  • Trigger deployments via Git commits, allowing the legacy deployment controller to push resources to clusters while still following GitOps-like principles.

For platform provisioning:

  • VKS made cluster creation largely declarative.
  • We stored cluster definitions and day-two add-ons (monitoring, ingress, cert management, etc.) in a dedicated Git repository.
  • Standing up a new environment went from an all-day manual effort driven by scattered scripts in home directories to a 15-20 minute, fully scripted bootstrap.

For application builds:

  • The worst-case build-and-deploy pipeline shrank from ~10 hours (kicked off at night, checked the next morning) to about 30 minutes, fast enough for iterative work during the day.
  • Other services saw rebuild times drop from 5 minutes to ~30 seconds, enabling multiple builds and tests per day.

The result: development teams could iterate quickly, platform teams could provision environments on demand, and leadership could trust that both were happening in a controlled, auditable way.

6. Operated effectively in a secure, air-gapped environment

Operating all of this inside an environment with no direct Internet required careful architectural and process design.

We helped the client:

  • Prototype changes in unclassified lab environments on the “low side,” where engineers could experiment, iterate, and validate.
  • Package validated container images and manifests for ingestion into the high-side environment.
  • Run a hub-and-spoke model using Artifactory as the central hub:
    • The hub ingested and scanned container images.
    • Controlled replication pushed images and artifacts out to disconnected spoke environments, even those that still relied on sneaker-net.
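Auditability across sneaker-net spokes depends on verifying bundles after physical transfer. A minimal sketch of the kind of checksum-manifest check such offline distribution patterns rely on (artifact names are hypothetical):

```python
import hashlib


def bundle_manifest(files):
    """Build a SHA-256 manifest for an offline transfer bundle so the
    high-side import can verify nothing was corrupted or altered in
    transit. `files` maps artifact names to their raw bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def verify(bundle, manifest):
    """Re-hash on the receiving side and compare against the shipped manifest."""
    return bundle_manifest(bundle) == manifest


# Hypothetical artifacts packaged and hashed on the low side...
outgoing = {"app-image.tar": b"...image bytes...", "manifests.tar": b"...yaml..."}
manifest = bundle_manifest(outgoing)
# ...then re-verified after sneaker-net transfer on the high side.
assert verify(outgoing, manifest)
```

In the hub-and-spoke model, Artifactory performed this role for connected spokes; the same verify-on-ingest discipline carried over to the disconnected ones.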

This approach preserved security requirements while still enabling modern platform practices and frequent updates.