Session 050 — EKS: Karpenter — Dynamic Node Provisioning and NodePools
Prerequisite: session-049 (EKS add-ons, VPC CNI, EBS CSI)
Session objectives
- Understand the Karpenter architecture and how it differs from Cluster Autoscaler
- Install Karpenter via Helm with the correct IAM policies
- Create NodePool and EC2NodeClass with instance constraints, taint/toleration and disruption budgets
- Verify that Karpenter provisions and consolidates nodes in response to Pending/removed pods
- Decide when to use Karpenter vs Cluster Autoscaler vs managed node groups
1. Karpenter vs Cluster Autoscaler — Comparison
1.1 Mental model
[FACT] The Cluster Autoscaler (CAS) operates at the Auto Scaling group (ASG) level: when the Kubernetes scheduler cannot place a pod, CAS checks which ASGs could absorb it and raises the chosen group's desiredCapacity. Karpenter operates at the pod level: when the scheduler cannot place a pod, Karpenter calls the EC2 API directly to launch the instance that best fits the requested resources.
┌────────────────────────────────────────────────────────────────────┐
│  Cluster Autoscaler                                                │
│                                                                    │
│  Pod Pending                                                       │
│      │                                                             │
│      ▼                                                             │
│  CAS checks the preconfigured node groups                          │
│  (each node group = 1 instance type or a narrow family)            │
│      │                                                             │
│      ▼                                                             │
│  Raises desiredCapacity on the best-fitting ASG                    │
│      │                                                             │
│      ▼                                                             │
│  EC2 Auto Scaling launches the instance (can take 2-5 min)         │
│      │                                                             │
│      ▼                                                             │
│  Node registers with the cluster → pod is scheduled                │
└────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│  Karpenter                                                         │
│                                                                    │
│  Pod Pending (watched via the Kubernetes API)                      │
│      │                                                             │
│      ▼                                                             │
│  Karpenter reads the pod requirements (resources, nodeSelector,    │
│  affinity, tolerations, topologySpread)                            │
│      │                                                             │
│      ▼                                                             │
│  Picks the cheapest instance that satisfies the requirements       │
│  of all pending pods at once (bin-packing)                         │
│      │                                                             │
│      ▼                                                             │
│  Calls the EC2 Fleet API directly (RunInstances / CreateFleet)     │
│  Creates a NodeClaim to track its state                            │
│      │                                                             │
│      ▼                                                             │
│  Node registers → pod scheduled (typically in < 60s)               │
└────────────────────────────────────────────────────────────────────┘
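The "cheapest instance that satisfies all pending pods" step can be sketched as follows. This is a deliberately simplified illustration, not Karpenter's actual algorithm: the instance names, sizes and prices in `CATALOG` are invented, and the sketch ignores DaemonSet overhead, allocatable versus raw capacity, and the case where the pods must be split across several nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str          # hypothetical name, not a real EC2 type
    vcpu: float
    memory_gib: float
    hourly_usd: float  # illustrative price, not real AWS pricing


def cheapest_fit(pods, catalog):
    """Return the cheapest instance whose capacity covers the summed requests.

    `pods` is a list of (cpu, memory_gib) request tuples.
    Returns None when no single instance in the catalog is large enough.
    """
    need_cpu = sum(cpu for cpu, _ in pods)
    need_mem = sum(mem for _, mem in pods)
    candidates = [
        i for i in catalog
        if i.vcpu >= need_cpu and i.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda i: i.hourly_usd, default=None)


CATALOG = [
    InstanceType("small", 2, 8, 0.10),
    InstanceType("medium", 4, 16, 0.19),
    InstanceType("large", 8, 32, 0.38),
    InstanceType("compute-large", 8, 16, 0.34),
]

pending = [(1.0, 2.0)] * 5  # five pods requesting 1 vCPU and 2 GiB each
choice = cheapest_fit(pending, CATALOG)
print(choice.name if choice else "no single instance fits")
```

With a node-group model, the same five pods would land on whichever instance type the matching ASG was preconfigured with, regardless of whether a cheaper shape would have been enough.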
[FACT] Structural comparison:
╔══════════════════════════╦═══════════════════════════╦═══════════════════════════╗
║ Dimension                ║ Cluster Autoscaler (CAS)  ║ Karpenter                 ║
╠══════════════════════════╬═══════════════════════════╬═══════════════════════════╣
║ Scaling unit             ║ Node group (ASG)          ║ Individual EC2 instance   ║
║ Instance selection       ║ Preconfigured per NG      ║ At runtime (best fit)     ║
║ Number of node groups    ║ Many (1 per workload)     ║ Few (1-3 NodePools)       ║
║ Scale-up speed           ║ 2-5 min (ASG trigger)     ║ < 60s (direct EC2 API)    ║
║ Scale-down/consolidation ║ 10 min idle by default    ║ Configurable (1m+)        ║
║ K8s versioning           ║ Coupled (version-specific)║ Decoupled                 ║
║ Spot diversity           ║ Manual (1 NG per family)  ║ Automatic (broad pool)    ║
║ Bin-packing              ║ Limited (per ASG)         ║ Global (all pods)         ║
║ AWS API                  ║ Auto Scaling API          ║ EC2 Fleet / RunInstances  ║
╚══════════════════════════╩═══════════════════════════╩═══════════════════════════╝
[CONSENSUS] When to prefer Karpenter: clusters with variable/spiky demand, heterogeneous workloads, need for Spot with high instance diversity, or when the overhead of maintaining dozens of node groups is excessive.
[CONSENSUS] When to prefer CAS or static node groups: stable and predictable workloads, when organizational constraints prevent IAM with broad RunInstances/TerminateInstances powers, or clusters requiring compliance with very specific node configurations.
2. Karpenter Architecture
2.1 Components
[FACT] Karpenter runs as a Deployment with 2 replicas by default, typically in kube-system. The replicas use leader election, so only one controller is active at a time and the second is a standby. It is not a managed EKS add-on: it is installed via Helm chart from the OCI registry public.ecr.aws/karpenter/karpenter.
[FACT] CRDs created by Karpenter:
karpenter.sh/v1:
  NodePool      scheduling constraints and disruption policies
  NodeClaim     represents one EC2 instance being provisioned or already active
karpenter.k8s.aws/v1:
  EC2NodeClass  AWS-specific configuration (AMI, subnets, security groups, role)
karpenter.sh/v1alpha1 (alpha, behind a feature gate):
  NodeOverlay   overrides the price and capacity that Karpenter assumes for matching instance types
[FACT] Karpenter must run on a node not managed by itself — on a managed node group or on Fargate. If the only node in the cluster is provisioned by Karpenter and it needs to be removed (consolidation), the Karpenter controller would have nowhere to run.
2.2 Provisioning flow
1. A pod becomes Pending (the scheduler finds no suitable node)
2. Karpenter detects the pod through a watch on the Kubernetes API
3. Karpenter groups the pending pods that can be co-located
4. It selects the matching NodePool and EC2NodeClass
5. It chooses the best instance (bin-packing + cost + availability)
6. It creates a NodeClaim (tracks state)
7. It calls the EC2 Fleet API to launch the instance
8. The instance bootstraps: nodeadm/userdata configures the kubelet
9. The Node shows up in kubectl get nodes
10. Karpenter links the NodeClaim to the Node
11. The scheduler places the pods on the new node
2.3 IAM — Karpenter Policies
[FACT] The official installation CloudFormation creates 6 separate IAM policies (v1.13):
KarpenterControllerNodeLifecyclePolicy     → RunInstances, TerminateInstances,
                                             CreateFleet, CreateLaunchTemplate,
                                             DeleteLaunchTemplate, ...
KarpenterControllerIAMIntegrationPolicy    → iam:PassRole (for KarpenterNodeRole),
                                             iam:AddRoleToInstanceProfile,
                                             iam:CreateInstanceProfile, ...
KarpenterControllerEKSIntegrationPolicy    → eks:DescribeCluster
KarpenterControllerInterruptionPolicy      → sqs:ReceiveMessage, sqs:DeleteMessage,
                                             sqs:GetQueueUrl,
                                             events:CreateEventBus, ...
KarpenterControllerResourceDiscoveryPolicy → ec2:Describe* (instances, AZs, subnets,
                                             SGs), pricing:GetProducts
KarpenterControllerZonalShiftPolicy        → arc-zonal-shift:GetManagedResource
[FACT] The KarpenterNodeRole-<cluster> is the IAM Role assigned to EC2 nodes created by Karpenter. It must have: AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly, AmazonSSMManagedInstanceCore.
[FACT] Tag security risk: Karpenter uses 3 tags to associate EC2 instances with NodeClaims:
- karpenter.sh/managed-by: <cluster-name>
- karpenter.sh/nodepool: <nodepool-name>
- kubernetes.io/cluster/<cluster-name>: owned
Any user with ec2:CreateTags/ec2:DeleteTags on these tags for i-* instances can manipulate Karpenter. The recommendation is to use tag-based IAM policies to restrict CreateTags/DeleteTags only to the Karpenter role.
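One way to express that restriction is an explicit Deny that applies to every principal except the Karpenter controller role. The sketch below builds such a statement as a plain Python dict. The Sid, the function name and the example role ARN are illustrative assumptions, and where the statement is attached (an SCP, a permissions boundary, or the identity policies of other roles) depends on the organization.

```python
import json


def tag_guard_statement(cluster_name: str, karpenter_role_arn: str) -> dict:
    """Deny tampering with Karpenter's tags unless the caller is the Karpenter role."""
    protected_keys = [
        "karpenter.sh/managed-by",
        "karpenter.sh/nodepool",
        f"kubernetes.io/cluster/{cluster_name}",
    ]
    return {
        "Sid": "DenyKarpenterTagTampering",  # hypothetical name
        "Effect": "Deny",
        "Action": ["ec2:CreateTags", "ec2:DeleteTags"],
        # Assumes the standard "aws" partition
        "Resource": "arn:aws:ec2:*:*:instance/*",
        "Condition": {
            # Matches when the request touches at least one protected tag key
            "ForAnyValue:StringEquals": {"aws:TagKeys": protected_keys},
            # ...and the caller is not the Karpenter controller role
            "ArnNotEquals": {"aws:PrincipalArn": karpenter_role_arn},
        },
    }


statement = tag_guard_statement(
    "checkout-prod",
    "arn:aws:iam::111122223333:role/checkout-prod-karpenter",  # placeholder account ID
)
print(json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2))
```

Both condition operators must match for the Deny to apply, so the Karpenter role keeps its tagging ability while other principals lose it only for these three keys.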
3. NodePool and EC2NodeClass — Complete Anatomy
3.1 NodePool
[FACT] The NodePool defines the constraints on nodes that Karpenter can create. Each Pending pod is evaluated against the available NodePools in descending weight order, and the new node is created from the first NodePool whose requirements and taints are compatible with the pod.
# Annotated NodePool with every relevant field
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-compute
spec:
  # ── Template: configuration of the nodes to be created ───────────
  template:
    metadata:
      labels:
        team: platform  # labels propagated to the Kubernetes Node
      annotations:
        example.com/owner: platform-team
    spec:
      # Reference to the EC2NodeClass (AWS-specific configuration)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      # Taints on the node: pods must tolerate them to be scheduled here
      taints: []
      # startupTaints: applied to the node, but pods do NOT need to tolerate them.
      # Used to wait for initialization (e.g. the Cilium CNI agent).
      # A DaemonSet or external controller must remove the taint.
      startupTaints: []
      # Node expiration (TTL): after 720h the node is drained and terminated.
      # Useful to force rotation and pick up OS/K8s patches.
      # 'Never' disables expiration.
      expireAfter: 720h
      # Maximum drain time before termination is forced
      terminationGracePeriod: 48h
      # Requirements: scheduling constraints (intersected with the pod spec)
      # Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt, Gte, Lte
      requirements:
        # Architecture
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        # OS
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        # Capacity type
        # Automatic priority: reserved > spot > on-demand
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Instance categories (c=compute, m=general, r=memory)
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
          # minValues: require at least N distinct categories among the candidates
          # (avoids overfitting to a single family for Spot)
          minValues: 2
        # Minimum generation (avoids old instances)
        - key: karpenter.k8s.aws/instance-generation
          operator: Gte
          values: ["3"]
        # Exclude bare-metal instances (usually not needed)
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]
  # ── Disruption: consolidation and rotation control ────────────────
  disruption:
    # WhenEmptyOrUnderutilized: consolidates empty AND underutilized nodes
    # WhenEmpty: consolidates only nodes with no workload pods
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Wait 1 min after the last pod was added to or removed from the node
    consolidateAfter: 1m
    # Budgets: cap how many nodes may be disrupted at the same time
    budgets:
      - nodes: "10%"  # at most 10% of the nodes disrupted at once
      # During business hours (Mon-Fri 09:00-17:00, evaluated in UTC): no disruption
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"
  # ── Limits: resource ceiling this NodePool may consume ────────────
  limits:
    cpu: "1000"      # 1000 vCPUs in total
    memory: 1000Gi   # 1000 GiB of memory in total (about 0.98 TiB)
    # nodes: 50      # optional: maximum number of nodes
  # ── Weight: priority when several NodePools are candidates ────────
  weight: 10
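To see how the two budgets in this NodePool interact, the sketch below computes the number of nodes that may be disrupted at a given moment. It rests on two assumptions that should be checked against the Disruption documentation for the Karpenter version in use: percentage budgets round up, and when several budgets are active the most restrictive one wins. The function names are illustrative. The sketch also ignores nodes that are already being deleted or are not ready, which Karpenter subtracts from the allowance.

```python
import math
from datetime import datetime, timedelta, timezone


def window_active(now: datetime, start_hour: int = 9, duration_hours: int = 8) -> bool:
    """True while the budget opened by "0 9 * * mon-fri" (UTC) is still in effect.

    Simplification: only handles windows that end before midnight.
    """
    opened = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    return now.weekday() < 5 and opened <= now < opened + timedelta(hours=duration_hours)


def allowed_disruptions(total_nodes: int, now: datetime, percent: int = 10) -> int:
    """Most restrictive of the active budgets: the percentage always, 0 inside the window."""
    budgets = [math.ceil(total_nodes * percent / 100)]  # assumed to round up
    if window_active(now):
        budgets.append(0)  # the scheduled budget with nodes: "0"
    return min(budgets)


wednesday_10h = datetime(2024, 6, 5, 10, 0, tzinfo=timezone.utc)
wednesday_18h = datetime(2024, 6, 5, 18, 0, tzinfo=timezone.utc)
print(allowed_disruptions(25, wednesday_10h))  # inside the freeze window
print(allowed_disruptions(25, wednesday_18h))  # outside the window
```

Because the schedule is evaluated in UTC, a team in another time zone has to shift the cron hour to cover its local business hours.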
3.2 EC2NodeClass
[FACT] The EC2NodeClass contains all AWS-specific configuration. Multiple NodePools can reference the same EC2NodeClass.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # IAM role for the EC2 nodes (must exist with the worker-node policies)
  role: "KarpenterNodeRole-checkout-prod"
  # AMI: 'alias' selects the EKS-optimized AMI; @latest tracks the newest release
  # Formats: al2023@latest, al2023@v20240101, al2@latest, bottlerocket@latest
  amiSelectorTerms:
    - alias: "al2023@latest"
  # Subnets: Karpenter discovers the available subnets through tags
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "checkout-prod"
    # Alternative: by subnet ID
    # - id: subnet-0abc123
  # Security groups: same tag-based logic
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "checkout-prod"
  # Kubelet: kubelet configuration for the nodes (moved from NodePool to EC2NodeClass)
  kubelet:
    # Without prefix delegation the ENI limits cap pods per node
    # (m5.xlarge: 4 ENIs x 14 IPs + 2 = 58; 737 applies to m5.24xlarge).
    # With prefix delegation, AWS recommends 110 below 30 vCPUs and 250 otherwise.
    maxPods: 110
    systemReserved:
      cpu: "100m"
      memory: "100Mi"
      ephemeral-storage: "1Gi"
    kubeReserved:
      cpu: "100m"
      memory: "200Mi"
      ephemeral-storage: "3Gi"
    evictionHard:
      memory.available: "5%"
      nodefs.available: "10%"
  # Block device mapping: size and encryption of the root volume
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
        iops: 3000
        throughput: 125
  # Additional tags on every node created
  tags:
    Environment: production
    ManagedBy: karpenter
  # userData: extra script executed at bootstrap (rare; prefer a custom AMI)
  # userData: |
  #   #!/bin/bash
  #   echo "custom init" >> /var/log/init.log
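The value chosen for kubelet.maxPods should respect what the VPC CNI can actually address on the instance. The sketch below reproduces the usual arithmetic: ENIs × (IPs per ENI − 1) + 2 without prefix delegation, and 16 addresses per slot with it, capped at the values AWS recommends. The function name is an illustrative assumption. The ENI and IP counts passed in are the published limits for m5.xlarge and m5.24xlarge.

```python
def max_pods(enis: int, ips_per_eni: int, vcpus: int,
             prefix_delegation: bool = False) -> int:
    """Pods per node supported by the VPC CNI for a given instance shape."""
    if not prefix_delegation:
        # One IP per ENI is reserved for the ENI itself; +2 covers
        # host-network pods (kube-proxy and the CNI agent)
        return enis * (ips_per_eni - 1) + 2
    # Each IP slot holds a /28 prefix (16 addresses) with prefix delegation
    raw = enis * (ips_per_eni - 1) * 16 + 2
    # AWS recommendation: 110 below 30 vCPUs, 250 otherwise
    cap = 110 if vcpus < 30 else 250
    return min(raw, cap)


print(max_pods(4, 15, 4))                            # m5.xlarge
print(max_pods(4, 15, 4, prefix_delegation=True))    # m5.xlarge with prefix delegation
print(max_pods(15, 50, 96))                          # m5.24xlarge
print(max_pods(15, 50, 96, prefix_delegation=True))  # m5.24xlarge with prefix delegation
```

A fixed maxPods in a shared EC2NodeClass applies to every instance type Karpenter may pick, so a value above the smallest instance's ENI limit only makes sense when prefix delegation is enabled.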
4. CDK Python — Karpenter Installation
"""
CDK stack that installs Karpenter on an existing EKS cluster.
Uses Pod Identity (preferred in v1.13) for the controller role.
"""
from aws_cdk import (
    Stack, CfnOutput, Duration,
    aws_eks as eks,
    aws_events as events,
    aws_events_targets as targets,
    aws_iam as iam,
    aws_sqs as sqs,
)
from constructs import Construct


class KarpenterStack(Stack):
    def __init__(self, scope: Construct, construct_id: str,
                 cluster: eks.Cluster, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        CLUSTER_NAME = cluster.cluster_name

        # ──────────────────────────────────────────────────────────────
        # 1. IAM role for the NODES created by Karpenter
        #    (not to be confused with the Karpenter controller role)
        # ──────────────────────────────────────────────────────────────
        node_role = iam.Role(self, "KarpenterNodeRole",
            role_name=f"KarpenterNodeRole-{CLUSTER_NAME}",
            description="IAM role for EC2 nodes created by Karpenter",
            assumed_by=iam.ServicePrincipal("ec2.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEKSWorkerNodePolicy"),
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEKS_CNI_Policy"),
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEC2ContainerRegistryReadOnly"),
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSSMManagedInstanceCore"),
            ],
        )

        # Instance profile. When the EC2NodeClass sets spec.role, Karpenter
        # creates and manages its own instance profile (hence the
        # iam:*InstanceProfile permissions below); this pre-created profile is
        # only needed if the EC2NodeClass uses spec.instanceProfile instead.
        node_instance_profile = iam.CfnInstanceProfile(self, "KarpenterNodeInstanceProfile",
            instance_profile_name=f"KarpenterNodeRole-{CLUSTER_NAME}",
            roles=[node_role.role_name],
        )

        # Register the node role through an EKS access entry (the API-based
        # replacement for the aws-auth ConfigMap). Node roles use the type
        # EC2_LINUX, which takes no access policies. Requires the cluster
        # authentication mode API or API_AND_CONFIG_MAP.
        eks.CfnAccessEntry(self, "KarpenterNodeAccess",
            cluster_name=CLUSTER_NAME,
            principal_arn=node_role.role_arn,
            type="EC2_LINUX",
        )

        # ──────────────────────────────────────────────────────────────
        # 2. SQS queue for interruption handling (Spot + maintenance)
        # ──────────────────────────────────────────────────────────────
        interruption_queue = sqs.Queue(self, "KarpenterInterruptionQueue",
            queue_name=CLUSTER_NAME,  # must match settings.interruptionQueue below
            # Stale interruption events are useless; 300 s mirrors the
            # Karpenter reference CloudFormation template
            retention_period=Duration.seconds(300),
        )

        # Allow EventBridge and SQS to publish interruption events to the queue
        interruption_queue.add_to_resource_policy(iam.PolicyStatement(
            principals=[
                iam.ServicePrincipal("sqs.amazonaws.com"),
                iam.ServicePrincipal("events.amazonaws.com"),
            ],
            actions=["sqs:SendMessage"],
            resources=[interruption_queue.queue_arn],
        ))

        # EventBridge rules that route the interruption events to the queue;
        # without them the queue never receives a message
        for rule_id, source, detail_type in [
            ("ScheduledChangeRule", "aws.health", "AWS Health Event"),
            ("SpotInterruptionRule", "aws.ec2", "EC2 Spot Instance Interruption Warning"),
            ("RebalanceRule", "aws.ec2", "EC2 Instance Rebalance Recommendation"),
            ("InstanceStateChangeRule", "aws.ec2", "EC2 Instance State-change Notification"),
        ]:
            events.Rule(self, rule_id,
                event_pattern=events.EventPattern(source=[source], detail_type=[detail_type]),
                targets=[targets.SqsQueue(interruption_queue)],
            )

        # ──────────────────────────────────────────────────────────────
        # 3. IAM role for the Karpenter controller (via Pod Identity)
        # ──────────────────────────────────────────────────────────────
        controller_role = iam.Role(self, "KarpenterControllerRole",
            role_name=f"{CLUSTER_NAME}-karpenter",
            description="Karpenter controller role: calls the EC2 API to create/terminate nodes",
            assumed_by=iam.ServicePrincipal("pods.eks.amazonaws.com"),
        )
        # Pod Identity also needs sts:TagSession on the trust policy
        controller_role.assume_role_policy.add_statements(
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                principals=[iam.ServicePrincipal("pods.eks.amazonaws.com")],
                actions=["sts:AssumeRole", "sts:TagSession"],
            )
        )

        # Node lifecycle policy (RunInstances, TerminateInstances, etc.).
        # Simplified: the official policies scope these actions by resource and tag.
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="NodeLifecycle",
            actions=[
                "ec2:RunInstances",
                "ec2:CreateFleet",
                "ec2:CreateLaunchTemplate",
                "ec2:DeleteLaunchTemplate",
                "ec2:TerminateInstances",
                "ec2:CreateTags",
                "ec2:DeleteTags",
            ],
            resources=["*"],
            conditions={
                "StringEquals": {
                    "aws:RequestedRegion": self.region,
                }
            },
        ))

        # Resource discovery policy
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="ResourceDiscovery",
            actions=[
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSpotPriceHistory",
                "ec2:DescribeSubnets",
                "ssm:GetParameter",
                "pricing:GetProducts",
            ],
            resources=["*"],
        ))

        # Passing the node role to instances (IAMIntegration)
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="IAMIntegration",
            actions=["iam:PassRole"],
            resources=[node_role.role_arn],
            conditions={
                "StringEquals": {"iam:PassedToService": "ec2.amazonaws.com"},
            },
        ))
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="IAMInstanceProfile",
            actions=[
                "iam:AddRoleToInstanceProfile",
                "iam:CreateInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:GetInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:TagInstanceProfile",
                "iam:UntagInstanceProfile",
            ],
            resources=["*"],
        ))

        # Access to the interruption queue
        interruption_queue.grant_consume_messages(controller_role)
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="InterruptionQueue",
            actions=["sqs:GetQueueUrl", "sqs:GetQueueAttributes"],
            resources=[interruption_queue.queue_arn],
        ))

        # EKS DescribeCluster
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="EKSIntegration",
            actions=["eks:DescribeCluster"],
            resources=[cluster.cluster_arn],
        ))

        # ──────────────────────────────────────────────────────────────
        # 4. Pod Identity association for the controller
        #    (requires the eks-pod-identity-agent add-on on the cluster)
        # ──────────────────────────────────────────────────────────────
        pod_identity = eks.CfnPodIdentityAssociation(self, "KarpenterPodIdentity",
            cluster_name=CLUSTER_NAME,
            namespace="kube-system",
            service_account="karpenter",
            role_arn=controller_role.role_arn,
        )

        # ──────────────────────────────────────────────────────────────
        # 5. Tags on subnets and security groups for discovery
        #    (Karpenter discovers resources by tag through the EC2 API)
        # ──────────────────────────────────────────────────────────────
        # NOTE: in CDK, add the tags via vpc.select_subnets() + tags.
        # In practice it is easier with eksctl or the CLI:
        #   aws ec2 create-tags --resources <subnet-ids> \
        #     --tags Key=karpenter.sh/discovery,Value=<cluster-name>

        # ──────────────────────────────────────────────────────────────
        # 6. Karpenter Helm chart
        # ──────────────────────────────────────────────────────────────
        karpenter_chart = cluster.add_helm_chart("Karpenter",
            chart="karpenter",
            repository="oci://public.ecr.aws/karpenter/karpenter",
            version="1.13.0",
            namespace="kube-system",
            create_namespace=False,
            values={
                "settings": {
                    "clusterName": CLUSTER_NAME,
                    "interruptionQueue": CLUSTER_NAME,
                    "enableZonalShift": True,
                },
                "controller": {
                    "resources": {
                        "requests": {"cpu": "1", "memory": "1Gi"},
                        "limits": {"cpu": "1", "memory": "1Gi"},
                    },
                },
                # Use dnsPolicy: Default if CoreDNS runs on Karpenter nodes
                # "dnsPolicy": "ClusterFirst",  # chart default
            },
            wait=True,
        )
        # The controller needs its credentials before it starts
        karpenter_chart.node.add_dependency(pod_identity)

        CfnOutput(self, "KarpenterControllerRoleArn",
            value=controller_role.role_arn)
        CfnOutput(self, "KarpenterInterruptionQueueUrl",
            value=interruption_queue.queue_url)
5. Python — NodePool Generation by Workload Profile
"""
Generates NodePool + EC2NodeClass manifests for different workload profiles.
Useful for kubectl apply or for cluster.add_manifest() in CDK.
"""
import yaml
from typing import Literal


def render_ec2_node_class(
    name: str,
    cluster_name: str,
    root_volume_size_gi: int = 50,
    max_pods: int = 110,
    custom_kubelet: dict | None = None,
) -> dict:
    """EC2NodeClass shared by multiple NodePools."""
    kubelet_config = {
        "maxPods": max_pods,
        "systemReserved": {"cpu": "100m", "memory": "100Mi"},
        "kubeReserved": {"cpu": "100m", "memory": "200Mi"},
        "evictionHard": {"memory.available": "5%", "nodefs.available": "10%"},
    }
    if custom_kubelet:
        # Shallow merge: a nested key such as evictionHard is replaced as a whole
        kubelet_config.update(custom_kubelet)
    return {
        "apiVersion": "karpenter.k8s.aws/v1",
        "kind": "EC2NodeClass",
        "metadata": {"name": name},
        "spec": {
            "role": f"KarpenterNodeRole-{cluster_name}",
            "amiSelectorTerms": [{"alias": "al2023@latest"}],
            "subnetSelectorTerms": [{"tags": {"karpenter.sh/discovery": cluster_name}}],
            "securityGroupSelectorTerms": [{"tags": {"karpenter.sh/discovery": cluster_name}}],
            "kubelet": kubelet_config,
            "blockDeviceMappings": [{
                "deviceName": "/dev/xvda",
                "ebs": {
                    "volumeSize": f"{root_volume_size_gi}Gi",
                    "volumeType": "gp3",
                    "encrypted": True,
                },
            }],
            "tags": {"ManagedBy": "karpenter", "Cluster": cluster_name},
        },
    }


def render_nodepool(
    name: str,
    node_class_name: str,
    capacity_types: list[str],
    instance_categories: list[str],
    min_instance_categories: int = 2,
    taints: list[dict] | None = None,
    labels: dict | None = None,
    expire_after: str = "720h",
    consolidation_policy: Literal["WhenEmpty", "WhenEmptyOrUnderutilized"] = "WhenEmptyOrUnderutilized",
    consolidate_after: str = "1m",
    cpu_limit: str = "200",
    weight: int = 10,
) -> dict:
    """Generic NodePool with parameterized settings."""
    requirements = [
        {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
        {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]},
        {"key": "karpenter.sh/capacity-type", "operator": "In", "values": capacity_types},
        {
            "key": "karpenter.k8s.aws/instance-category",
            "operator": "In",
            "values": instance_categories,
            "minValues": min_instance_categories,
        },
        {"key": "karpenter.k8s.aws/instance-generation", "operator": "Gte", "values": ["3"]},
        {"key": "karpenter.k8s.aws/instance-hypervisor", "operator": "In", "values": ["nitro"]},
    ]
    spec_node = {
        "nodeClassRef": {"group": "karpenter.k8s.aws", "kind": "EC2NodeClass", "name": node_class_name},
        "expireAfter": expire_after,
        "requirements": requirements,
    }
    if taints:
        spec_node["taints"] = taints
    template: dict = {"spec": spec_node}
    if labels:
        template["metadata"] = {"labels": labels}
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": template,
            "disruption": {
                "consolidationPolicy": consolidation_policy,
                "consolidateAfter": consolidate_after,
                "budgets": [
                    {"nodes": "10%"},
                    {"schedule": "0 9 * * mon-fri", "duration": "8h", "nodes": "0"},
                ],
            },
            "limits": {"cpu": cpu_limit},
            "weight": weight,
        },
    }


def generate_cluster_nodepools(cluster_name: str) -> list[dict]:
    """
    Generates 3 NodePools + 2 EC2NodeClasses for a typical production cluster:
    1. general-od: on-demand, for critical workloads
    2. general-spot: spot, for interruption-tolerant workloads
    3. gpu: on-demand p3/g4dn, for ML workloads (tainted)
    """
    manifests = []
    # Shared EC2NodeClass
    manifests.append(render_ec2_node_class("default", cluster_name))
    # EC2NodeClass for GPU nodes (larger root volume)
    manifests.append(render_ec2_node_class("gpu", cluster_name, root_volume_size_gi=100))

    # NodePool 1: on-demand for critical workloads
    manifests.append(render_nodepool(
        name="general-od",
        node_class_name="default",
        capacity_types=["on-demand"],
        instance_categories=["c", "m", "r"],
        min_instance_categories=2,
        expire_after="720h",
        cpu_limit="500",
        weight=10,
    ))

    # NodePool 2: Spot for tolerant workloads (batch, workers)
    manifests.append(render_nodepool(
        name="general-spot",
        node_class_name="default",
        capacity_types=["spot"],
        instance_categories=["c", "m", "r"],
        min_instance_categories=3,   # more diversity = lower interruption risk
        expire_after="168h",         # 7 days (shorter for Spot)
        consolidation_policy="WhenEmptyOrUnderutilized",
        consolidate_after="30s",     # consolidate faster for Spot
        cpu_limit="1000",
        # Custom label (placeholder domain): Karpenter rejects NodePool labels
        # in the restricted karpenter.sh domain
        labels={"example.com/capacity-preference": "spot"},
        weight=5,                    # lower weight = second choice
    ))

    # NodePool 3: GPU, tainted to isolate ML workloads
    manifests.append({
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": "gpu"},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {"group": "karpenter.k8s.aws", "kind": "EC2NodeClass", "name": "gpu"},
                    "taints": [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}],
                    "expireAfter": "Never",  # GPU nodes do not expire (expensive to replace)
                    "requirements": [
                        {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]},
                        {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
                        {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["on-demand"]},
                        {
                            "key": "node.kubernetes.io/instance-type",
                            "operator": "In",
                            "values": ["p3.2xlarge", "p3.8xlarge", "g4dn.xlarge", "g4dn.2xlarge"],
                        },
                    ],
                },
            },
            "disruption": {
                "consolidationPolicy": "WhenEmpty",
                "consolidateAfter": "5m",
            },
            "limits": {"cpu": "128"},
            "weight": 20,  # higher weight = priority for GPU pods
        },
    })
    return manifests


if __name__ == "__main__":
    manifests = generate_cluster_nodepools("checkout-prod")
    for m in manifests:
        print("---")
        # sort_keys=False keeps apiVersion/kind/metadata/spec in the usual order
        print(yaml.safe_dump(m, default_flow_style=False, sort_keys=False))
6. CLI — Karpenter Installation and Operation
# ═══════════════════════════════════════════════════════════════
# Initial setup
# ═══════════════════════════════════════════════════════════════
export CLUSTER_NAME="checkout-prod"
export KARPENTER_VERSION="1.13.0"
export K8S_VERSION="1.36"
export AWS_REGION="us-east-1"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_PARTITION="aws"
export KARPENTER_NAMESPACE="kube-system"
# Resolve the EKS-optimized AMI version (for the EC2NodeClass alias)
export ALIAS_VERSION=$(aws ssm get-parameter \
  --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/x86_64/standard/recommended/image_id" \
  --query Parameter.Value \
  | xargs aws ec2 describe-images --query 'Images[0].Name' --image-ids \
  | sed -r 's/^.*(v[[:digit:]]+).*$/\1/')
echo "AMI alias version: al2023@${ALIAS_VERSION}"
# Create the service-linked role for Spot (needed if Spot was never used in the account)
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com || true
# ─────────────────────────────────────────────────────────────
# Tag subnets and security groups for Karpenter discovery
# ─────────────────────────────────────────────────────────────
# Get the cluster's subnets. describe-cluster returns every subnet registered
# with the cluster, public ones included; filter the list if nodes must only
# land in private subnets.
SUBNET_IDS=$(aws eks describe-cluster \
  --name "$CLUSTER_NAME" \
  --query 'cluster.resourcesVpcConfig.subnetIds[]' \
  --output text)
# Tag the subnets for discovery
aws ec2 create-tags \
  --resources $SUBNET_IDS \
  --tags Key=karpenter.sh/discovery,Value="$CLUSTER_NAME"
# Tag the cluster security group
CLUSTER_SG=$(aws eks describe-cluster \
  --name "$CLUSTER_NAME" \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
  --output text)
aws ec2 create-tags \
  --resources "$CLUSTER_SG" \
  --tags Key=karpenter.sh/discovery,Value="$CLUSTER_NAME"
# ─────────────────────────────────────────────────────────────
# Install Karpenter via Helm (OCI registry)
# ─────────────────────────────────────────────────────────────
# Log out first so the pull from public ECR is anonymous
helm registry logout public.ecr.aws 2>/dev/null || true
helm upgrade --install karpenter \
  oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "settings.enableZonalShift=true" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
# Verify the installation
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter
kubectl get crd | grep karpenter
# ─────────────────────────────────────────────────────────────
# Create the EC2NodeClass and NodePool
# ─────────────────────────────────────────────────────────────
cat <<EOF | envsubst | kubectl apply -f -
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  amiSelectorTerms:
    - alias: "al2023@${ALIAS_VERSION}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  kubelet:
    maxPods: 110
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
  tags:
    ManagedBy: karpenter
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gte
          values: ["3"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    cpu: "1000"
EOF
# Check the NodePool status
kubectl get nodepool
kubectl describe nodepool default
# Check: the condition with type=Ready under status.conditions must be True
# ═══════════════════════════════════════════════════════════════
# Test scale-up and scale-down
# ═══════════════════════════════════════════════════════════════
# Test deployment (pause containers, 1 CPU each)
kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "1"
EOF
# Scale up: 5 pods = 5 vCPUs requested → Karpenter provisions one or more nodes
kubectl scale deployment inflate --replicas 5
# Follow the Karpenter logs in real time
kubectl logs -f -n kube-system \
  -l app.kubernetes.io/name=karpenter \
  -c controller \
  --since=1m | grep -E "provisioned|launched|registered|scheduled"
# Check the NodeClaim that was created
kubectl get nodeclaims
kubectl describe nodeclaim <name> # shows which instance type was chosen
# Check the new node
kubectl get nodes -L karpenter.sh/nodepool,node.kubernetes.io/instance-type
# Scale down: Karpenter consolidates and terminates the node
kubectl scale deployment inflate --replicas 0
# After ~1 min (consolidateAfter), the node should be terminated
kubectl get nodes --watch
# ─────────────────────────────────────────────────────────────
# Maintenance operations
# ─────────────────────────────────────────────────────────────
# List nodes by NodePool
kubectl get nodes -l karpenter.sh/nodepool=default
# Resources currently consumed by the NodePool
kubectl get nodepool default -o jsonpath='{.status.resources}' | python3 -m json.tool
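# Force a rolling replacement of a NodePool's nodes by triggering drift.
# Annotating the NodePool object itself has no effect; drift is detected when
# the node template (spec.template) changes. The annotation key and value
# below are placeholders. Replacement still honors the disruption budgets.
kubectl patch nodepool default --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"example.com/rollout":"2024-01-01"}}}}}'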
# Protect a specific pod from disruption
kubectl annotate pod <pod-name> karpenter.sh/do-not-disrupt="true"
# Delete a Karpenter node gracefully (drain + terminate the EC2 instance)
kubectl delete node <node-name>
# Karpenter sets a finalizer that guarantees the drain before the instance is terminated
# View the Karpenter metrics (the endpoint is what Prometheus scrapes)
kubectl port-forward -n kube-system \
  deployment/karpenter 8080:8080
# Open http://localhost:8080/metrics
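# Relevant metrics (names changed across releases; confirm against /metrics):
# karpenter_scheduler_scheduling_duration_seconds
#   (karpenter_provisioner_scheduling_duration_seconds before v1)
# karpenter_nodes_total_daemon_requests
# karpenter_nodes_total_pod_requests
# karpenter_disruption_consolidation_timeouts_total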
7. Pitfalls
[FACT] Karpenter must not manage the nodes it runs on: if the controller runs on a node that Karpenter itself provisioned, consolidating or expiring that node evicts the controller, and with no other capacity available nothing can launch a replacement (deadlock). Keep at least 2 nodes in a system managed node group (or use Fargate for kube-system).
[FACT] Overlapping NodePools without defined weight cause random behavior: if two NodePools can schedule the same pod and don't have different weights, Karpenter chooses randomly. Use spec.weight or taints/requirements to make them mutually exclusive.
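A quick check for this pitfall can be run over the manifests before they are applied. The sketch below orders NodePools by weight and reports weights shared by more than one pool. It assumes manifests shaped like the dicts in section 5 and treats a missing spec.weight as 0, which is an assumption of this sketch. It does not test whether the requirements of the tied pools actually overlap, so a reported tie is a prompt to review, not proof of a conflict.

```python
from collections import defaultdict


def nodepool_priority(manifests):
    """Return (names ordered by weight, highest first; weights shared by several pools)."""
    pools = [m for m in manifests if m.get("kind") == "NodePool"]
    by_weight = defaultdict(list)
    for pool in pools:
        # Assumption: a NodePool without spec.weight counts as weight 0
        by_weight[pool["spec"].get("weight", 0)].append(pool["metadata"]["name"])
    order = [
        name
        for weight in sorted(by_weight, reverse=True)
        for name in by_weight[weight]
    ]
    ties = {w: names for w, names in by_weight.items() if len(names) > 1}
    return order, ties


# Hypothetical pools: two have explicit weights, two rely on the default
pools = [
    {"kind": "NodePool", "metadata": {"name": "general-od"}, "spec": {"weight": 10}},
    {"kind": "NodePool", "metadata": {"name": "general-spot"}, "spec": {"weight": 5}},
    {"kind": "NodePool", "metadata": {"name": "batch-a"}, "spec": {}},
    {"kind": "NodePool", "metadata": {"name": "batch-b"}, "spec": {}},
]
order, ties = nodepool_priority(pools)
print(order)
print(ties)
```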
[FACT] Spot without instance diversity causes InsufficientInstanceCapacity: fixing only 1 or 2 Spot instance types is risky. Karpenter uses price-capacity-optimized — leave at least 10 eligible instance types. The minValues on instance-family forces minimum diversity.
[FACT] Pods without defined requests cause incorrect bin-packing: Karpenter sizes nodes based on pod requests, not limits. Pods without requests are treated as if they consume no resources, leading to undersized nodes and OOM kills. Use LimitRange to define defaults per namespace.
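A LimitRange that injects default requests can be generated in the same style as the manifests in section 5. The sketch below emits JSON, which kubectl accepts as well as YAML. The object name, the namespace and the default values are placeholders to adapt per namespace.

```python
import json


def default_requests_limitrange(namespace: str,
                                cpu: str = "100m",
                                memory: str = "128Mi") -> dict:
    """LimitRange that gives containers without requests a default request."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "default-requests", "namespace": namespace},
        "spec": {
            "limits": [{
                "type": "Container",
                # Applied only to containers that declare no request of their own
                "defaultRequest": {"cpu": cpu, "memory": memory},
            }],
        },
    }


manifest = default_requests_limitrange("checkout")
print(json.dumps(manifest, indent=2))
```

The output can be piped to `kubectl apply -f -`. Defaults only apply to pods created after the LimitRange exists, so running pods keep their empty requests until they are recreated.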
[FACT] expireAfter: Never on general workload NodePools accumulates CVEs: long-lived nodes accumulate vulnerabilities. The default of 720h (30 days) ensures periodic rotation with OS patches.
[FACT] Karpenter and Node Termination Handler (NTH) must not coexist: NTH and Karpenter's interruption handling mechanism can conflict when handling Spot events. If Karpenter is configured with interruptionQueue, uninstall NTH.
[FACT] Controller DNS policy: Karpenter uses ClusterFirst by default, which creates a circular dependency if CoreDNS runs on nodes managed by Karpenter. Solution: keep CoreDNS on a fixed node group, OR use dnsPolicy: Default on Karpenter.
Reflection Exercise
An EKS cluster with 3 teams has the following workloads:
- API Team: critical service checkout-api, 20 replicas with requests: cpu=500m, memory=512Mi. Cannot be interrupted during business hours (Mon-Fri 8am–8pm). Must run on c or m instances, generation 5+.
- ML Team: training jobs that run at night, tolerate interruption (checkpoints every 5 min), need p3.2xlarge or g4dn.xlarge GPUs, and must have minimized cost.
- Platform Team: DaemonSets and observability tools that must run on all nodes.
Answer:
- How many NodePools would you create and what is the justification for each one? Describe the requirements, taints, weight, and disruption for each.
- The ML Team wants to use Spot for their GPU jobs. What specific risks exist with Spot for GPUs and how would Karpenter mitigate these risks with the SQS interruption queue?
- For checkout-api, how would you ensure that Karpenter does not consolidate nodes during business hours? (Describe using the correct field from the NodePool spec).
- The platform team wants their observability tools (DaemonSet) to run on all nodes created by Karpenter, including GPU nodes. Why do DaemonSets not need tolerations for NoSchedule taints created by Karpenter? (Hint: startupTaints vs taints).
- An architect proposes migrating from Karpenter to EKS Auto Mode. What are the fundamental differences between EKS Auto Mode and Karpenter for this scenario with multiple workload profiles?
References
- [FACT] Getting Started with Karpenter (v1.13) — karpenter.sh
- [FACT] NodePools — Karpenter v1.13 — karpenter.sh
- [FACT] NodeClasses — Karpenter v1.13 — karpenter.sh
- [FACT] Disruption — Karpenter v1.13 — karpenter.sh
- [FACT] Karpenter Best Practices — EKS Best Practices — aws.github.io
- [FACT] Scale cluster compute with Karpenter and Cluster Autoscaler — Amazon EKS — docs.aws.amazon.com