Kubernetes manifests look simple. They're just YAML. A model that can write valid YAML should be able to write valid Kubernetes manifests, right? In practice, Kubernetes is one of the most semantically complex YAML dialects in the ecosystem — with version-specific API changes, subtle interactions between resource types, security contexts that are easy to misconfigure, and networking behavior that depends on cluster-specific plugins. The surface area for plausible-looking-but-wrong outputs is enormous.
Models that perform well on Kubernetes tasks have internalized not just the manifest schema but the operational logic: when to use a Deployment vs a StatefulSet, how to correctly configure liveness vs readiness probes, how RBAC policies compose, what resource requests and limits actually mean for scheduling.
Kubernetes-Specific Failure Modes
API version staleness. Kubernetes has deprecated and removed many APIs across versions. extensions/v1beta1 is gone. networking.k8s.io/v1beta1 Ingress is gone. A model trained on older data will confidently generate configurations using removed APIs. The best models know not just the current APIs but when and why things changed.
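The Ingress migration is the canonical example. The sketch below contrasts the removed beta API with the current one; the resource names are illustrative:

```yaml
# Stale: networking.k8s.io/v1beta1 Ingress was removed in Kubernetes v1.22,
# yet models trained on older data still emit it confidently.
# Current: networking.k8s.io/v1 (GA since v1.19) — note the structural changes, too.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress        # hypothetical name
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix   # required in v1; did not exist in v1beta1
            backend:
              service:         # v1 nests name/port; v1beta1 used serviceName/servicePort
                name: app
                port:
                  number: 80
```

A model emitting the old flat `serviceName`/`servicePort` backend is a reliable tell that its Kubernetes knowledge predates v1.22.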
Security context mistakes. The difference between a secure and insecure pod spec often comes down to a few fields: runAsNonRoot, allowPrivilegeEscalation, readOnlyRootFilesystem, capabilities. Models with shallow Kubernetes knowledge generate working pods that fail security scanning. This matters enormously in production clusters.
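A minimal hardened pod spec, as a sketch (names and image are placeholders; exact values depend on your workload and admission policies):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                  # hypothetical
spec:
  securityContext:                    # pod-level settings
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:                # container-level settings
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```

A shallow model typically omits all of these fields; the pod still runs, which is exactly why the gap goes unnoticed until a scanner flags it.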
Resource limit misconfiguration. Missing resource requests break the scheduler. Missing limits create noisy neighbors. Setting CPU limits too aggressively causes throttling that looks like a memory problem. Getting resource configuration right requires understanding how the Kubernetes scheduler and kubelet actually use these values.
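One commonly recommended pattern (not a universal rule) is to set memory limit equal to memory request and omit the CPU limit, since CPU is compressible and hard limits trigger CFS throttling:

```yaml
resources:
  requests:            # the scheduler bin-packs pods onto nodes using requests
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 256Mi      # limit == request keeps memory accounting predictable
    # No cpu limit: omitting it avoids CFS throttling under bursty load;
    # the request still guarantees a proportional CPU share under contention.
```

The values here are placeholders; the point is the shape, and that requests and limits serve different mechanisms (scheduling vs enforcement).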
Service and networking topology errors. ClusterIP, NodePort, and LoadBalancer Services have different exposure implications that many models conflate. Network policies are easy to get directionally wrong (blocking traffic you meant to allow). In-cluster DNS names follow specific patterns that models sometimes misremember.
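Both failure modes are concrete. In-cluster DNS follows `<service>.<namespace>.svc.<cluster-domain>` (default domain `cluster.local`), and a NetworkPolicy's direction is always relative to the pods it selects. A sketch with hypothetical names:

```yaml
# DNS: Service "api" in namespace "payments" resolves to
#   api.payments.svc.cluster.local
# NetworkPolicy: policyTypes are relative to the *selected* pods —
# Ingress restricts traffic TO them, Egress traffic FROM them.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api    # hypothetical names throughout
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api                   # the policy applies to these pods
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend      # only frontend pods may connect
```

Getting the direction backwards (an Egress rule on the frontend when you meant an Ingress rule on the API) is the typical "blocking traffic you meant to allow" mistake. Note that enforcement also requires a CNI plugin that supports NetworkPolicy.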
Helm chart generation is harder than raw manifest generation. Helm requires understanding template syntax, values schema design, named templates, and the difference between what should be configurable vs hardcoded. Models that score well on manifest tasks don't always score proportionally well on Helm chart design.
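The configurable-vs-hardcoded distinction is where weaker models drift. A fragment of an illustrative chart (chart name, helper templates, and port are all hypothetical):

```yaml
# templates/deployment.yaml (fragment) — illustrative, not a complete chart
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "myapp.fullname" . }}   # named template defined in _helpers.tpl
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}     # operator-tunable: exposed via values.yaml
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          ports:
            - containerPort: 8080          # app-intrinsic: hardcoded on purpose
```

The judgment call is in the comments: replica count and image are deployment-time decisions, so they belong in values; the container port is a property of the application itself, so templating it only adds noise.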
Current Rankings
Kubernetes manifest generation
| # | Model | Score |
|---|---|---|
| 1 | anthropic/claude-sonnet-4 | 36.7 |
| 2 | gemini-2.5-pro | 31.7 |
| 3 | gpt-5-2025-08-07 | 29.1 |
| 4 | gpt-4.1-20250414 | 27.8 |
| 5 | o3-20250416 | 24.4 |
| 6 | gpt-5-mini-2025-08-07 | 23.6 |
| 7 | gpt-5.2-2025-12-11 | 23.0 |
| 8 | gemini-3-pro-preview | 22.7 |
| 9 | Grok-4-0709 | 22.2 |
| 10 | claude-opus-4-5-20251101 | 21.5 |
| 11 | google/gemini-3.1-pro-preview | 20.1 |
| 12 | gpt-4.1-mini-20250414 | 19.3 |
| 13 | kimi/kimi-k2.5-thinking | 18.4 |
| 14 | gemini-2.5-flash | 16.9 |
| 15 | o4-mini | 16.8 |
What the Data Shows
The same models that perform well on Terraform also tend to perform well on Kubernetes. Both tasks reward deep API knowledge, precise instruction following, and security awareness. If a model is in the top tier for Terraform, check its Kubernetes ranking — it's usually competitive.
Models with strong code generation AND strong instruction following outperform models strong on only one. Kubernetes work requires both writing correct YAML (a code generation task) and interpreting operational requirements correctly (an instruction following task). Models that are very good coders but mediocre at parsing complex natural language requirements miss the operational intent.
For day-to-day usefulness, debugging beats generation. Anecdotally, AI assistance with Kubernetes is more valuable for understanding error messages and diagnosing issues than for generating configs from scratch. Models with strong reasoning and broad Kubernetes knowledge provide more value explaining a CrashLoopBackOff than generating a boilerplate Deployment.
Practical Notes
Provide the output of kubectl explain for the resource types you're working with. Even top-ranked models benefit from being given the current schema, particularly for newer resources or CRDs. Don't assume the model's training data reflects your cluster version.
Always run kubeconform (or its older, now-unmaintained predecessor kubeval) on generated manifests. Schema validation catches API version errors and required field omissions before you kubectl apply. Add kube-score for security and best-practice analysis.
For Helm, provide your existing values.yaml schema. Models generate much better chart templates when they can see the values structure they're expected to consume. Without it, they tend to invent their own naming conventions that won't match your existing patterns.
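Even a fragment of the values file is enough to anchor the model's naming. A hypothetical example of what to paste into the prompt:

```yaml
# values.yaml (fragment) — hand the model your real structure so generated
# templates reference .Values paths that actually exist. Names here are illustrative.
replicaCount: 2
image:
  repository: registry.example.com/app
  tag: ""              # empty → templates typically fall back to .Chart.AppVersion
resources:
  requests:
    cpu: 100m
    memory: 128Mi
```

Without this, a model will happily invent `.Values.deployment.replicas` or `.Values.container.image` and the rendered chart diverges from every other chart in your repo.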
Related Use Cases
- Terraform & IaC — complementary SRE tooling; often the same models
- Log triage — for the operational side of cluster management
- Agentic incident response — for autonomous cluster remediation workflows
Full methodology at /methodology.