Session 050 — EKS: Karpenter — Dynamic Node Provisioning and NodePools

June 19, 2026

Prerequisite: session-049 (EKS add-ons, VPC CNI, EBS CSI)

Session objectives

Understand the Karpenter architecture and how it differs from Cluster Autoscaler
Install Karpenter via Helm with the correct IAM policies
Create NodePool and EC2NodeClass with instance constraints, taint/toleration and disruption budgets
Verify that Karpenter provisions and consolidates nodes in response to Pending/removed pods
Decide when to use Karpenter vs Cluster Autoscaler vs managed node groups

1. Karpenter vs Cluster Autoscaler — Comparison

1.1 Mental model

[FACT] The Cluster Autoscaler (CAS) operates at the Auto Scaling group (ASG) level: when the Kubernetes scheduler cannot place a pod, CAS checks which ASGs could absorb it and increases the desiredCapacity of one of them. Karpenter operates at the pod level: when the scheduler cannot place a pod, Karpenter calls the EC2 API directly to create the instance that best fits the requested resources.

┌─────────────────────────────────────────────────────────────────────┐
│ Cluster Autoscaler                                                  │
│                                                                     │
│  Pod Pending                                                        │
│       │                                                             │
│       ▼                                                             │
│  CAS checks the preconfigured node groups                           │
│  (each node group = 1 instance type or a restricted family)         │
│       │                                                             │
│       ▼                                                             │
│  Increases desiredCapacity of the best-fitting ASG                  │
│       │                                                             │
│       ▼                                                             │
│  EC2 Auto Scaling creates the instance (can take 2-5 min)           │
│       │                                                             │
│       ▼                                                             │
│  Node registers with the cluster → pod is scheduled                 │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Karpenter                                                           │
│                                                                     │
│  Pod Pending (watch via K8s API)                                    │
│       │                                                             │
│       ▼                                                             │
│  Karpenter reads the pod's requirements (resources, nodeSelector,   │
│  affinity, tolerations, topologySpread)                             │
│       │                                                             │
│       ▼                                                             │
│  Selects the cheapest instance that satisfies the requirements      │
│  of all pending pods simultaneously (bin-packing)                   │
│       │                                                             │
│       ▼                                                             │
│  Calls the EC2 Fleet API directly (RunInstances / CreateFleet)      │
│  Creates a NodeClaim object to track state                          │
│       │                                                             │
│       ▼                                                             │
│  Node registers → pod scheduled (usually in < 60s)                  │
└─────────────────────────────────────────────────────────────────────┘
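The selection step in the Karpenter box can be pictured with a toy model. The sketch below is a deliberate simplification, not Karpenter's actual algorithm: it sums the requests of all pending pods and picks the cheapest single instance type that fits them. The catalogue, the prices and the `cheapest_fit` helper are hypothetical, and DaemonSet overhead and reserved resources are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    memory_gib: float
    hourly_usd: float  # illustrative price, not real AWS pricing


# Hypothetical catalogue; Karpenter itself discovers types and prices via AWS APIs.
CATALOGUE = [
    InstanceType("m5.large", 2, 8, 0.10),
    InstanceType("c5.xlarge", 4, 8, 0.17),
    InstanceType("m5.xlarge", 4, 16, 0.19),
    InstanceType("r5.xlarge", 4, 32, 0.25),
]


def cheapest_fit(pods, catalogue=CATALOGUE):
    """Return the cheapest instance type that fits every pending pod at once.

    Each pod is a (cpu_request, memory_gib_request) pair. Returns None when
    no single instance type in the catalogue is large enough.
    """
    cpu = sum(p[0] for p in pods)
    mem = sum(p[1] for p in pods)
    candidates = [i for i in catalogue if i.vcpu >= cpu and i.memory_gib >= mem]
    return min(candidates, key=lambda i: i.hourly_usd, default=None)


pending = [(1.0, 2.0), (1.5, 4.0), (0.5, 1.0)]  # 3 vCPU and 7 GiB in total
choice = cheapest_fit(pending)
print(choice.name if choice else "no single instance fits")  # c5.xlarge
```

With CAS, the same three pods would instead trigger a scale-up of whichever preconfigured node group can hold them, regardless of whether a cheaper type exists outside that group.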

[FACT] Structural comparison:

╔══════════════════════════╦═══════════════════════════╦═══════════════════════════╗
║ Dimension                ║ Cluster Autoscaler (CAS)  ║ Karpenter                 ║
╠══════════════════════════╬═══════════════════════════╬═══════════════════════════╣
║ Scaling unit             ║ Node group (ASG)          ║ Individual EC2 instance   ║
║ Instance choice          ║ Preconfigured in the NG   ║ Runtime (best fit)        ║
║ No. of node groups       ║ Many (1 per workload)     ║ Few (1-3 NodePools)       ║
║ Scale-up speed           ║ 2-5 min (ASG trigger)     ║ < 60s (direct EC2 API)    ║
║ Scale-down/consolidation ║ 10 min idle by default    ║ Configurable (1m+)        ║
║ K8s versioning           ║ Coupled (version-specific)║ Decoupled                 ║
║ Spot diversity           ║ Manual (1 NG per family)  ║ Automatic (broad pool)    ║
║ Bin-packing              ║ Limited (per ASG)         ║ Global (all pods)         ║
║ AWS API                  ║ Auto Scaling API          ║ EC2 Fleet / RunInstances  ║
╚══════════════════════════╩═══════════════════════════╩═══════════════════════════╝

[CONSENSUS] When to prefer Karpenter: clusters with variable/spiky demand, heterogeneous workloads, need for Spot with high instance diversity, or when the overhead of maintaining dozens of node groups is excessive.

[CONSENSUS] When to prefer CAS or static node groups: stable and predictable workloads, when organizational constraints prevent IAM with broad RunInstances/TerminateInstances powers, or clusters requiring compliance with very specific node configurations.

2. Karpenter Architecture

2.1 Components

[FACT] Karpenter runs as a Deployment with 2 replicas by default in kube-system. Both replicas are controller pods, and leader election keeps only one of them active at a time. It is not a managed EKS add-on: it is installed via the Helm chart published in the OCI registry public.ecr.aws/karpenter/karpenter.

[FACT] CRDs created by Karpenter:

karpenter.sh/v1:
  NodePool      — scheduling constraints and disruption policies
  NodeClaim     — represents an EC2 instance being provisioned or active

karpenter.k8s.aws/v1:
  EC2NodeClass  — AWS-specific configuration (AMI, subnet, SG, role)

karpenter.sh/v1alpha1 (alpha, feature-gated):
  NodeOverlay   — adjusts the price and capacity Karpenter assumes for matching instance types

[FACT] Karpenter must run on a node not managed by itself — on a managed node group or on Fargate. If the only node in the cluster is provisioned by Karpenter and it needs to be removed (consolidation), the Karpenter controller would have nowhere to run.

2.2 Provisioning flow

1. Pod becomes Pending (the scheduler finds no suitable node)
2. Karpenter detects the pod through a watch on the K8s API
3. Karpenter groups pending pods that can be co-located
4. Selects a suitable NodePool and its EC2NodeClass
5. Chooses the optimal instance (bin-packing + cost + availability)
6. Creates a NodeClaim object (tracks state)
7. Calls the EC2 Fleet API to create the instance
8. Instance bootstraps: nodeadm/userdata configures the kubelet
9. Node appears in kubectl get nodes
10. Karpenter associates the NodeClaim with the Node
11. Scheduler places the pods on the new node

2.3 IAM — Karpenter Policies

[FACT] The official installation CloudFormation creates 6 separate IAM policies (v1.13):

KarpenterControllerNodeLifecyclePolicy     → RunInstances, TerminateInstances,
                                             CreateFleet, CreateLaunchTemplate,
                                             DeleteLaunchTemplate, ...
KarpenterControllerIAMIntegrationPolicy    → iam:PassRole (for KarpenterNodeRole),
                                             iam:AddRoleToInstanceProfile,
                                             iam:CreateInstanceProfile, ...
KarpenterControllerEKSIntegrationPolicy    → eks:DescribeCluster
KarpenterControllerInterruptionPolicy      → sqs:ReceiveMessage, sqs:DeleteMessage,
                                             sqs:GetQueueUrl, ...
KarpenterControllerResourceDiscoveryPolicy → ec2:Describe* (instances, AZs, subnets,
                                             SGs), pricing:GetProducts
KarpenterControllerZonalShiftPolicy        → arc-zonal-shift:GetManagedResource

[FACT] The KarpenterNodeRole-<cluster> is the IAM Role assigned to EC2 nodes created by Karpenter. It must have: AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly, AmazonSSMManagedInstanceCore.

[FACT] Tag security risk: Karpenter uses 3 tags to associate EC2 instances with NodeClaims:
- karpenter.sh/managed-by: <cluster-name>
- karpenter.sh/nodepool: <nodepool-name>
- kubernetes.io/cluster/<cluster-name>: owned

Any user with ec2:CreateTags/ec2:DeleteTags on these tags for i-* instances can manipulate Karpenter. The recommendation is to use tag-based IAM policies to restrict CreateTags/DeleteTags only to the Karpenter role.
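A minimal sketch of such a tag-based restriction, built as a Python dict and printed as JSON. It assumes the guardrail is attached as an explicit Deny (for example in a service control policy or permissions boundary) and that the controller role ARN is known. The function name, the Sid and the example account ID are hypothetical.

```python
import json

# The two fixed tag keys listed above; the third depends on the cluster name.
KARPENTER_TAG_KEYS = ["karpenter.sh/managed-by", "karpenter.sh/nodepool"]


def deny_tag_tampering(cluster_name: str, controller_role_arn: str) -> dict:
    """Build a policy that denies changing Karpenter's tags on instances
    for every principal except the Karpenter controller role."""
    tag_keys = KARPENTER_TAG_KEYS + [f"kubernetes.io/cluster/{cluster_name}"]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyKarpenterTagTampering",
            "Effect": "Deny",
            "Action": ["ec2:CreateTags", "ec2:DeleteTags"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                # Matches when any tag key in the request is one of Karpenter's.
                "ForAnyValue:StringEquals": {"aws:TagKeys": tag_keys},
                # Exempts the controller role itself.
                "ArnNotEquals": {"aws:PrincipalArn": controller_role_arn},
            },
        }],
    }


policy = deny_tag_tampering(
    "checkout-prod",
    "arn:aws:iam::111122223333:role/checkout-prod-karpenter",  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```

Both conditions must hold for the Deny to apply, so ordinary tagging of other keys and the controller's own tagging are unaffected.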

3. NodePool and EC2NodeClass — Complete Anatomy

3.1 NodePool

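[FACT] The NodePool defines the constraints on nodes that Karpenter can create. Karpenter evaluates each Pending pod against the available NodePools and provisions from one whose requirements, taints and limits are compatible with the pod. When several NodePools match, the one with the highest weight is tried first.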
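One way to picture this matching, using the weight field that appears at the end of the manifest below. This is a simplified sketch, not Karpenter's scheduler: it only models `In` requirements against a pod's nodeSelector and ignores taints, limits and affinity. The helper names and the two example pools are hypothetical.

```python
def compatible(node_selector: dict, requirements: dict) -> bool:
    """A pool is compatible when every label the pod pins is either left
    unconstrained by the pool or allowed by the pool's value set."""
    return all(
        key not in requirements or value in requirements[key]
        for key, value in node_selector.items()
    )


def pick_nodepool(node_selector: dict, pools: list):
    """Among compatible pools, prefer the one with the highest weight."""
    matches = [p for p in pools if compatible(node_selector, p["requirements"])]
    return max(matches, key=lambda p: p["weight"], default=None)


POOLS = [
    {"name": "general-od", "weight": 10,
     "requirements": {"karpenter.sh/capacity-type": {"on-demand"}}},
    {"name": "general-spot", "weight": 5,
     "requirements": {"karpenter.sh/capacity-type": {"spot"}}},
]

# A pod with no nodeSelector matches both pools; the higher weight wins.
print(pick_nodepool({}, POOLS)["name"])  # general-od
# A pod that pins Spot only matches the Spot pool.
print(pick_nodepool({"karpenter.sh/capacity-type": "spot"}, POOLS)["name"])  # general-spot
```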

# NodePool annotated with all relevant fields
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-compute
spec:
  # ── Template: configuration of the nodes that will be created ─────
  template:
    metadata:
      labels:
        team: platform        # labels propagated to the K8s Node
      annotations:
        example.com/owner: platform-team
    spec:
      # Reference to the EC2NodeClass (AWS-specific configuration)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

      # Taints on the node: pods must tolerate them to be scheduled here
      taints: []

      # startupTaints: applied to the node, but pods do NOT need to tolerate them.
      # Used to wait for initialization (e.g. the Cilium CNI agent).
      # A DaemonSet or external controller must remove the taint.
      startupTaints: []

      # Node expiration (TTL): after 720h the node is drained and terminated.
      # Useful for forcing rotation and applying OS/K8s patches.
      # 'Never' disables expiration.
      expireAfter: 720h

      # Maximum drain time before termination is forced
      terminationGracePeriod: 48h

      # Requirements: scheduling constraints (intersected with the pod spec)
      # Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt, Gte, Lte
      requirements:
        # Architecture
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]

        # OS
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]

        # Capacity type
        # Automatic priority: reserved > spot > on-demand
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

        # Instance categories (c=compute, m=general purpose, r=memory)
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
          # minValues: requires at least N distinct categories in the pool
          # (avoids over-reliance on a single family for Spot)
          minValues: 2

        # Minimum generation (avoids old instances)
        - key: karpenter.k8s.aws/instance-generation
          operator: Gte
          values: ["3"]

        # Exclude bare-metal instances (usually not needed)
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]

  # ── Disruption: consolidation and rotation control ────────────────
  disruption:
    # WhenEmptyOrUnderutilized: consolidates empty AND underutilized nodes
    # WhenEmpty: consolidates only nodes with no workload pods
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m    # waits 1 min of inactivity before consolidating

    # Budgets: limit how many nodes may be disrupted at the same time
    budgets:
      - nodes: "10%"        # at most 10% of nodes disrupted at once
      # During business hours (Mon-Fri 09:00-17:00 UTC): no disruption
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"

  # ── Limits: resource ceiling this NodePool may consume ────────────
  limits:
    cpu: "1000"       # 1000 vCPUs in total
    memory: 1000Gi    # 1000 GiB of memory in total (just under 1 TiB)
    # nodes: 50       # optional: maximum number of nodes

  # ── Weight: priority when multiple NodePools are candidates ───────
  weight: 10
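How the two budgets in the disruption block combine can be sketched as follows. The sketch assumes the behaviour described in the Karpenter disruption documentation: percentages are rounded up, nodes already deleting or not ready are subtracted, and the most restrictive active budget wins. Deciding which budgets are active (cron schedule plus duration) is left to the caller, and `allowed_disruptions` is a hypothetical helper, not a Karpenter API.

```python
def allowed_disruptions(total_nodes: int, active_budgets: list,
                        deleting: int = 0, not_ready: int = 0) -> int:
    """Number of nodes that may be disrupted right now.

    active_budgets holds the 'nodes' values of the budgets currently in
    effect, e.g. ["10%"] outside business hours or ["10%", "0"] inside them.
    """
    if not active_budgets:
        raise ValueError("expected at least one active budget")
    limits = []
    for raw in active_budgets:
        if raw.endswith("%"):
            pct = int(raw[:-1])
            # Integer ceiling of total_nodes * pct / 100
            limits.append((total_nodes * pct + 99) // 100)
        else:
            limits.append(int(raw))
    # The most restrictive budget applies; never return a negative count.
    return max(0, min(limits) - deleting - not_ready)


print(allowed_disruptions(25, ["10%"]))              # 3 (2.5 rounded up)
print(allowed_disruptions(25, ["10%", "0"]))         # 0 (business-hours freeze)
print(allowed_disruptions(25, ["10%"], deleting=1))  # 2
```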

3.2 EC2NodeClass

[FACT] The EC2NodeClass contains all AWS-specific configuration. Multiple NodePools can reference the same EC2NodeClass.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # IAM role for the EC2 nodes (must exist with the worker-node policies)
  role: "KarpenterNodeRole-checkout-prod"

  # AMI: 'alias' selects the latest EKS-optimized AMI
  # Formats: al2023@latest, al2023@v20240101, al2@latest, bottlerocket@latest
  amiSelectorTerms:
    - alias: "al2023@latest"

  # Subnets: Karpenter uses tags to discover the available subnets
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "checkout-prod"
    # Alternative: by subnet ID
    # - id: subnet-0abc123

  # Security groups: same tag-based logic
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "checkout-prod"

  # Kubelet: kubelet configuration for the nodes (moved from NodePool to EC2NodeClass)
  kubelet:
    maxPods: 110      # with Prefix Delegation, AWS suggests capping at 110 (< 30 vCPUs) or 250 (larger instances)
    systemReserved:
      cpu: "100m"
      memory: "100Mi"
      ephemeral-storage: "1Gi"
    kubeReserved:
      cpu: "100m"
      memory: "200Mi"
      ephemeral-storage: "3Gi"
    evictionHard:
      memory.available: "5%"
      nodefs.available: "10%"

  # Block device mapping: size and encryption of the root volume
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
        iops: 3000
        throughput: 125

  # Additional tags applied to every node created
  tags:
    Environment: production
    ManagedBy: karpenter

  # userData: extra script run at bootstrap (rarely needed; prefer a custom AMI)
  # userData: |
  #   #!/bin/bash
  #   echo "custom init" >> /var/log/init.log
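The maxPods value above depends on the VPC CNI addressing mode. A minimal sketch of the arithmetic, assuming the usual formula ENIs × (IPs per ENI − 1) + 2 and the 110/250 caps that AWS recommends with Prefix Delegation. The function names are illustrative, and the ENI figures are the published EC2 limits for each instance type.

```python
def eni_max_pods(enis: int, ips_per_eni: int) -> int:
    """Max pods with secondary-IP mode: one IP per ENI is reserved for the
    ENI itself, plus 2 for host-network pods (aws-node, kube-proxy)."""
    return enis * (ips_per_eni - 1) + 2


def prefix_delegation_max_pods(enis: int, ips_per_eni: int, vcpus: int) -> int:
    """Max pods with Prefix Delegation: each slot holds a /28 prefix (16 IPs),
    capped at 110 below 30 vCPUs and at 250 otherwise."""
    raw = enis * (ips_per_eni - 1) * 16 + 2
    cap = 110 if vcpus < 30 else 250
    return min(raw, cap)


# (ENIs, IPv4 addresses per ENI, vCPUs)
M5_XLARGE = (4, 15, 4)
M5_24XLARGE = (15, 50, 96)

print(eni_max_pods(4, 15))                      # 58
print(eni_max_pods(15, 50))                     # 737
print(prefix_delegation_max_pods(*M5_XLARGE))   # 110
print(prefix_delegation_max_pods(*M5_24XLARGE)) # 250
```

Under these assumptions, 737 is the secondary-IP limit of the largest m5 sizes, not a Prefix Delegation value for m5.xlarge, and an m5.xlarge without Prefix Delegation can address only 58 pods even if maxPods is set to 110.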

4. CDK Python — Karpenter Installation

"""
CDK stack that installs Karpenter on an existing EKS cluster.
Uses Pod Identity (preferred in v1.13) for the controller role.
"""
from aws_cdk import (
    Stack, CfnOutput,
    aws_eks as eks,
    aws_iam as iam,
    aws_sqs as sqs,
    aws_ec2 as ec2,
)
from constructs import Construct


class KarpenterStack(Stack):
    def __init__(self, scope: Construct, construct_id: str,
                 cluster: eks.Cluster, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        CLUSTER_NAME = cluster.cluster_name

        # ──────────────────────────────────────────────────────────────
        # 1. IAM role for the NODES created by Karpenter
        #    (not to be confused with the Karpenter controller role)
        # ──────────────────────────────────────────────────────────────
        node_role = iam.Role(self, "KarpenterNodeRole",
            role_name=f"KarpenterNodeRole-{CLUSTER_NAME}",
            description="IAM role for EC2 nodes created by Karpenter",
            assumed_by=iam.ServicePrincipal("ec2.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEKSWorkerNodePolicy"),
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEKS_CNI_Policy"),
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEC2ContainerRegistryReadOnly"),
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSSMManagedInstanceCore"),
            ],
        )

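        # Instance profile: only needed when an EC2NodeClass sets spec.instanceProfile;
        # with spec.role (as in section 3.2) Karpenter creates and manages its own profile.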
        node_instance_profile = iam.CfnInstanceProfile(self, "KarpenterNodeInstanceProfile",
            instance_profile_name=f"KarpenterNodeRole-{CLUSTER_NAME}",
            roles=[node_role.role_name],
        )

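        # Register the node role through an EKS access entry so the nodes can join.
        # Node roles use the EC2_LINUX entry type, which carries no access policies
        # (AmazonEKSWorkerNodePolicy is an IAM managed policy, not an EKS access policy).
        # Requires cluster authentication mode API or API_AND_CONFIG_MAP.
        eks.AccessEntry(self, "KarpenterNodeAccess",
            cluster=cluster,
            principal=node_role.role_arn,
            access_entry_type=eks.AccessEntryType.EC2_LINUX,
            access_policies=[],
        )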

        # ──────────────────────────────────────────────────────────────
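        # 2. SQS queue for interruption handling (Spot + maintenance)
        # ──────────────────────────────────────────────────────────────
        # The default retention period applies here. The official CloudFormation
        # template uses 300 s; to match it, pass retention_period=Duration.seconds(300)
        # (Duration is imported from aws_cdk).
        interruption_queue = sqs.Queue(self, "KarpenterInterruptionQueue",
            queue_name=CLUSTER_NAME,    # must match settings.interruptionQueue in the Helm values
        )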

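        # Allows EventBridge and SQS to publish interruption events to the queue.
        # The EventBridge rules that forward Spot interruption, rebalance, state-change
        # and health events to this queue are not created by this stack.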
        interruption_queue.add_to_resource_policy(iam.PolicyStatement(
            principals=[
                iam.ServicePrincipal("sqs.amazonaws.com"),
                iam.ServicePrincipal("events.amazonaws.com"),
            ],
            actions=["sqs:SendMessage"],
            resources=[interruption_queue.queue_arn],
        ))

        # ──────────────────────────────────────────────────────────────
        # 3. IAM role for the Karpenter controller (via Pod Identity)
        # ──────────────────────────────────────────────────────────────
        controller_role = iam.Role(self, "KarpenterControllerRole",
            role_name=f"{CLUSTER_NAME}-karpenter",
            description="Karpenter controller role: calls the EC2 API to create/terminate nodes",
            assumed_by=iam.ServicePrincipal("pods.eks.amazonaws.com"),
        )
        controller_role.assume_role_policy.add_statements(
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                principals=[iam.ServicePrincipal("pods.eks.amazonaws.com")],
                actions=["sts:AssumeRole", "sts:TagSession"],
            )
        )

        # Node lifecycle policy (RunInstances, TerminateInstances, etc.)
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="NodeLifecycle",
            actions=[
                "ec2:RunInstances",
                "ec2:CreateFleet",
                "ec2:CreateLaunchTemplate",
                "ec2:DeleteLaunchTemplate",
                "ec2:TerminateInstances",
                "ec2:CreateTags",
                "ec2:DeleteTags",
            ],
            resources=["*"],
            conditions={
                "StringEquals": {
                    "aws:RequestedRegion": self.region,
                }
            },
        ))

        # Resource discovery policy
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="ResourceDiscovery",
            actions=[
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSpotPriceHistory",
                "ec2:DescribeSubnets",
                "ssm:GetParameter",
                "pricing:GetProducts",
            ],
            resources=["*"],
        ))

        # Passing the node role to instances (IAMIntegration)
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="IAMIntegration",
            actions=["iam:PassRole"],
            resources=[node_role.role_arn],
            conditions={
                "StringEquals": {"iam:PassedToService": "ec2.amazonaws.com"},
            },
        ))

        controller_role.add_to_policy(iam.PolicyStatement(
            sid="IAMInstanceProfile",
            actions=[
                "iam:AddRoleToInstanceProfile",
                "iam:CreateInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:GetInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:TagInstanceProfile",
                "iam:UntagInstanceProfile",
            ],
            resources=["*"],
        ))

        # Access to the interruption queue
        interruption_queue.grant_consume_messages(controller_role)
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="InterruptionQueue",
            actions=["sqs:GetQueueUrl", "sqs:GetQueueAttributes"],
            resources=[interruption_queue.queue_arn],
        ))

        # EKS DescribeCluster
        controller_role.add_to_policy(iam.PolicyStatement(
            sid="EKSIntegration",
            actions=["eks:DescribeCluster"],
            resources=[cluster.cluster_arn],
        ))

        # ──────────────────────────────────────────────────────────────
        # 4. Pod Identity association for the controller
        # ──────────────────────────────────────────────────────────────
        eks.CfnPodIdentityAssociation(self, "KarpenterPodIdentity",
            cluster_name=CLUSTER_NAME,
            namespace="kube-system",
            service_account="karpenter",
            role_arn=controller_role.role_arn,
        )

        # ──────────────────────────────────────────────────────────────
        # 5. Tags on subnets and security groups for discovery
        #    (Karpenter uses tags to discover resources via the EC2 API)
        # ──────────────────────────────────────────────────────────────
        # NOTE: in CDK, add the tags via vpc.select_subnets() + Tags
        # In practice it is easier via eksctl or the CLI:
        # aws ec2 create-tags --resources <subnet-ids> \
        #   --tags Key=karpenter.sh/discovery,Value=<cluster-name>

        # ──────────────────────────────────────────────────────────────
        # 6. Karpenter Helm chart
        # ──────────────────────────────────────────────────────────────
        karpenter_chart = cluster.add_helm_chart("Karpenter",
            chart="karpenter",
            repository="oci://public.ecr.aws/karpenter/karpenter",
            version="1.13.0",
            namespace="kube-system",
            create_namespace=False,
            values={
                "settings": {
                    "clusterName": CLUSTER_NAME,
                    "interruptionQueue": CLUSTER_NAME,
                    "enableZonalShift": True,
                },
                "controller": {
                    "resources": {
                        "requests": {"cpu": "1", "memory": "1Gi"},
                        "limits":   {"cpu": "1", "memory": "1Gi"},
                    },
                },
                # dnsPolicy: Default if CoreDNS runs on Karpenter nodes
                # "dnsPolicy": "ClusterFirst",  # default
            },
            wait=True,
        )

        CfnOutput(self, "KarpenterControllerRoleArn",
            value=controller_role.role_arn)
        CfnOutput(self, "KarpenterInterruptionQueueUrl",
            value=interruption_queue.queue_url)

5. Python — NodePool Generation by Workload Profile

"""
Generates NodePool + EC2NodeClass manifests for different profiles.
Useful for applying via kubectl apply or via cluster.add_manifest() in CDK.
"""
import yaml
from typing import Literal


def render_ec2_node_class(
    name: str,
    cluster_name: str,
    root_volume_size_gi: int = 50,
    max_pods: int = 110,
    custom_kubelet: dict | None = None,
) -> dict:
    """EC2NodeClass shared by multiple NodePools."""
    kubelet_config = {
        "maxPods": max_pods,
        "systemReserved": {"cpu": "100m", "memory": "100Mi"},
        "kubeReserved":   {"cpu": "100m", "memory": "200Mi"},
        "evictionHard": {"memory.available": "5%", "nodefs.available": "10%"},
    }
    if custom_kubelet:
        kubelet_config.update(custom_kubelet)

    return {
        "apiVersion": "karpenter.k8s.aws/v1",
        "kind": "EC2NodeClass",
        "metadata": {"name": name},
        "spec": {
            "role": f"KarpenterNodeRole-{cluster_name}",
            "amiSelectorTerms": [{"alias": "al2023@latest"}],
            "subnetSelectorTerms": [{"tags": {"karpenter.sh/discovery": cluster_name}}],
            "securityGroupSelectorTerms": [{"tags": {"karpenter.sh/discovery": cluster_name}}],
            "kubelet": kubelet_config,
            "blockDeviceMappings": [{
                "deviceName": "/dev/xvda",
                "ebs": {
                    "volumeSize": f"{root_volume_size_gi}Gi",
                    "volumeType": "gp3",
                    "encrypted": True,
                },
            }],
            "tags": {"ManagedBy": "karpenter", "Cluster": cluster_name},
        },
    }


def render_nodepool(
    name: str,
    node_class_name: str,
    capacity_types: list[str],
    instance_categories: list[str],
    min_instance_categories: int = 2,
    taints: list[dict] | None = None,
    labels: dict | None = None,
    expire_after: str = "720h",
    consolidation_policy: Literal["WhenEmpty", "WhenEmptyOrUnderutilized"] = "WhenEmptyOrUnderutilized",
    consolidate_after: str = "1m",
    cpu_limit: str = "200",
    weight: int = 10,
) -> dict:
    """Generic NodePool with parameterized settings."""
    requirements = [
        {"key": "kubernetes.io/arch",   "operator": "In", "values": ["amd64"]},
        {"key": "kubernetes.io/os",     "operator": "In", "values": ["linux"]},
        {"key": "karpenter.sh/capacity-type", "operator": "In", "values": capacity_types},
        {
            "key": "karpenter.k8s.aws/instance-category",
            "operator": "In",
            "values": instance_categories,
            "minValues": min_instance_categories,
        },
        {"key": "karpenter.k8s.aws/instance-generation", "operator": "Gte", "values": ["3"]},
        {"key": "karpenter.k8s.aws/instance-hypervisor", "operator": "In", "values": ["nitro"]},
    ]

    spec_node = {
        "nodeClassRef": {"group": "karpenter.k8s.aws", "kind": "EC2NodeClass", "name": node_class_name},
        "expireAfter": expire_after,
        "requirements": requirements,
    }
    if taints:
        spec_node["taints"] = taints

    template: dict = {"spec": spec_node}
    if labels:
        template["metadata"] = {"labels": labels}

    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": template,
            "disruption": {
                "consolidationPolicy": consolidation_policy,
                "consolidateAfter": consolidate_after,
                "budgets": [
                    {"nodes": "10%"},
                    {"schedule": "0 9 * * mon-fri", "duration": "8h", "nodes": "0"},
                ],
            },
            "limits": {"cpu": cpu_limit},
            "weight": weight,
        },
    }


def generate_cluster_nodepools(cluster_name: str) -> list[dict]:
    """
    Generates 3 NodePools + 2 EC2NodeClasses for a typical production cluster:
    1. general-od  — on-demand, for critical workloads
    2. general-spot — spot, for interruption-tolerant workloads
    3. gpu         — on-demand p3/g4dn, for ML workloads (tainted)
    """
    manifests = []

    # Shared EC2NodeClass
    manifests.append(render_ec2_node_class("default", cluster_name))
    # EC2NodeClass for GPU (larger root volume)
    manifests.append(render_ec2_node_class("gpu", cluster_name, root_volume_size_gi=100))

    # NodePool 1: on-demand for critical workloads
    manifests.append(render_nodepool(
        name="general-od",
        node_class_name="default",
        capacity_types=["on-demand"],
        instance_categories=["c", "m", "r"],
        min_instance_categories=2,
        expire_after="720h",
        cpu_limit="500",
        weight=10,
    ))

    # NodePool 2: Spot for tolerant workloads (batch, workers)
    manifests.append(render_nodepool(
        name="general-spot",
        node_class_name="default",
        capacity_types=["spot"],
        instance_categories=["c", "m", "r"],
        min_instance_categories=3,   # more diversity = lower interruption risk
        expire_after="168h",         # 7 days (shorter for Spot)
        consolidation_policy="WhenEmptyOrUnderutilized",
        consolidate_after="30s",     # consolidate faster for Spot
        cpu_limit="1000",
        # Custom labels must avoid the restricted karpenter.sh domain
        labels={"example.com/capacity-type-preference": "spot"},
        weight=5,                    # lower weight = second choice
    ))

    # NodePool 3: GPU with a taint to isolate ML workloads
    manifests.append({
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": "gpu"},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {"group": "karpenter.k8s.aws", "kind": "EC2NodeClass", "name": "gpu"},
                    "taints": [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}],
                    "expireAfter": "Never",   # GPU nodes never expire (expensive to replace)
                    "requirements": [
                        {"key": "kubernetes.io/os",  "operator": "In", "values": ["linux"]},
                        {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
                        {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["on-demand"]},
                        {
                            "key": "node.kubernetes.io/instance-type",
                            "operator": "In",
                            "values": ["p3.2xlarge", "p3.8xlarge", "g4dn.xlarge", "g4dn.2xlarge"],
                        },
                    ],
                },
            },
            "disruption": {
                "consolidationPolicy": "WhenEmpty",
                "consolidateAfter": "5m",
            },
            "limits": {"cpu": "128"},
            "weight": 20,   # higher weight = priority for GPU pods
        },
    })

    return manifests


if __name__ == "__main__":
    manifests = generate_cluster_nodepools("checkout-prod")
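    for m in manifests:
        print("---")
        # sort_keys=False keeps apiVersion/kind/metadata/spec in the order written above
        print(yaml.safe_dump(m, default_flow_style=False, sort_keys=False))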

6. CLI — Karpenter Installation and Operation

# ═══════════════════════════════════════════════════════════════
# Initial setup
# ═══════════════════════════════════════════════════════════════

export CLUSTER_NAME="checkout-prod"
export KARPENTER_VERSION="1.13.0"
export K8S_VERSION="1.36"
export AWS_REGION="us-east-1"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_PARTITION="aws"
export KARPENTER_NAMESPACE="kube-system"

# Get the EKS-optimized AMI version (for the EC2NodeClass alias)
export ALIAS_VERSION=$(aws ssm get-parameter \
  --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/x86_64/standard/recommended/image_id" \
  --query Parameter.Value \
  | xargs aws ec2 describe-images --query 'Images[0].Name' --image-ids \
  | sed -r 's/^.*(v[[:digit:]]+).*$/\1/')

echo "AMI alias version: al2023@${ALIAS_VERSION}"

# Create the service-linked role for Spot (needed if Spot was never used in the account)
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com || true

# ─────────────────────────────────────────────────────────────
# Tag subnets and security groups for Karpenter discovery
# ─────────────────────────────────────────────────────────────

# Get the cluster subnets (filter out public subnets first if the cluster has any)
SUBNET_IDS=$(aws eks describe-cluster \
  --name "$CLUSTER_NAME" \
  --query 'cluster.resourcesVpcConfig.subnetIds[]' \
  --output text)

# Tag the subnets for discovery
aws ec2 create-tags \
  --resources $SUBNET_IDS \
  --tags Key=karpenter.sh/discovery,Value="$CLUSTER_NAME"

# Tag the cluster security group
CLUSTER_SG=$(aws eks describe-cluster \
  --name "$CLUSTER_NAME" \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
  --output text)

aws ec2 create-tags \
  --resources "$CLUSTER_SG" \
  --tags Key=karpenter.sh/discovery,Value="$CLUSTER_NAME"

# ─────────────────────────────────────────────────────────────
# Install Karpenter via Helm (OCI registry)
# ─────────────────────────────────────────────────────────────

# Log out first so the pull from public ECR is anonymous
helm registry logout public.ecr.aws 2>/dev/null || true

helm upgrade --install karpenter \
  oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "settings.enableZonalShift=true" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

# Verify the installation
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter
kubectl get crd | grep karpenter

# ─────────────────────────────────────────────────────────────
# Create the EC2NodeClass and NodePool
# ─────────────────────────────────────────────────────────────

cat <<EOF | envsubst | kubectl apply -f -
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  amiSelectorTerms:
    - alias: "al2023@${ALIAS_VERSION}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  kubelet:
    maxPods: 110
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
  tags:
    ManagedBy: karpenter
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gte
          values: ["3"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    cpu: "1000"
EOF

# Check NodePool status
kubectl get nodepool
kubectl describe nodepool default
# Expect: the condition with type=Ready in status.conditions must be True

# ═══════════════════════════════════════════════════════════════
# Test scale-up and scale-down
# ═══════════════════════════════════════════════════════════════

# Test deployment (pause containers, 1 CPU each)
kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: inflate
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: "1"
EOF

# Scale up: 5 pods x 1 CPU = 5 vCPUs requested → Karpenter provisions a node
kubectl scale deployment inflate --replicas 5

# Watch the Karpenter logs in real time
kubectl logs -f -n kube-system \
  -l app.kubernetes.io/name=karpenter \
  -c controller \
  --since=1m | grep -E "provisioned|launched|registered|scheduled"

# Check the NodeClaim that was created
kubectl get nodeclaims
kubectl describe nodeclaim <name>   # shows which instance type was chosen

# Check the new node
kubectl get nodes -L karpenter.sh/nodepool,node.kubernetes.io/instance-type

# Scale down: Karpenter consolidates and terminates the node
kubectl scale deployment inflate --replicas 0

# After ~1 min (consolidateAfter: 1m), the node should be terminated
kubectl get nodes --watch

# ─────────────────────────────────────────────────────────────
# Maintenance operations
# ─────────────────────────────────────────────────────────────

# List nodes by NodePool
kubectl get nodes -l karpenter.sh/nodepool=default

# Resources currently consumed by the NodePool
kubectl get nodepool default -o jsonpath='{.status.resources}' | python3 -m json.tool

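# Force a rolling replacement of a NodePool's nodes (drift).
# Annotating the NodePool object itself does not trigger disruption; changing
# spec.template is what marks existing NodeClaims as drifted.
# The annotation key below is an arbitrary example, not a Karpenter API.
kubectl patch nodepool default --type merge \
  -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"example.com/rotated-at\":\"$(date +%s)\"}}}}}"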

# Protect a specific pod from disruption
kubectl annotate pod <pod-name> karpenter.sh/do-not-disrupt="true"

# Gracefully delete a Karpenter node (drain + terminate the EC2 instance)
kubectl delete node <node-name>
# Karpenter sets a finalizer that guarantees a drain before the instance is terminated

# Inspect Karpenter metrics (the controller serves them in Prometheus format)
kubectl port-forward -n kube-system \
  deployment/karpenter 8080:8080

# Open http://localhost:8080/metrics
# Relevant metrics (names as of Karpenter v1; confirm against the /metrics output):
# karpenter_scheduler_scheduling_duration_seconds
# karpenter_nodes_total_daemon_requests
# karpenter_nodes_total_pod_requests
# karpenter_voluntary_disruption_consolidation_timeouts_total
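
The /metrics endpoint returns every controller metric in Prometheus text format, so it helps to filter the output down to the series of interest. The following is a minimal Python sketch of that filtering, not part of Karpenter itself. The helper name filter_metrics is hypothetical, and the SAMPLE text (including its label sets and values) is invented for illustration; in practice the text would come from http://localhost:8080/metrics while the port-forward is running.

```python
def filter_metrics(exposition_text, prefixes):
    """Return {series: value} for samples whose name starts with one of the prefixes.

    Assumes the plain "name{labels} value" form without trailing timestamps.
    """
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.startswith(tuple(prefixes)):
            samples[series] = float(value)
    return samples


# Invented sample input: label names and values are placeholders.
SAMPLE = """\
# TYPE karpenter_nodes_total_pod_requests gauge
karpenter_nodes_total_pod_requests{nodepool="default",resource_type="cpu"} 5
karpenter_nodes_total_daemon_requests{nodepool="default",resource_type="cpu"} 0.35
go_goroutines 412
"""

if __name__ == "__main__":
    for series, value in filter_metrics(SAMPLE, ["karpenter_nodes_total_"]).items():
        print(series, value)
```

Passing a broader prefix such as "karpenter_" keeps all Karpenter series and drops only the Go runtime and process metrics.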

7. Pitfalls

[FACT] Karpenter must not manage the nodes it runs on: if the controller pods run on nodes that Karpenter itself provisions (for example, after the only managed node group is removed) and Karpenter consolidates those nodes, nothing is left to provision replacement capacity and the cluster deadlocks. Keep at least 2 nodes in the system managed node group (or use Fargate for kube-system).

[FACT] Overlapping NodePools without a defined weight cause random behavior: if two NodePools can schedule the same pod and do not have different weights, Karpenter chooses between them randomly. Set spec.weight to give them an explicit priority, or use taints/requirements to make them mutually exclusive.
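
The weight rule can be illustrated with a small model. This is a sketch for reasoning about a set of NodePools, not Karpenter's actual code: each NodePool is reduced to a name and an optional weight, a missing weight is treated as 0 (an assumption of this model), and the pool names are hypothetical. It deliberately ignores requirements and taints, so a tie it reports only matters if the tied NodePools can really schedule the same pod.

```python
from itertools import groupby


def weight_of(pool):
    # Modelling assumption: an unweighted NodePool ranks below any weighted one.
    return pool.get("weight", 0)


def evaluation_order(nodepools):
    """Group NodePool names by descending weight."""
    ordered = sorted(nodepools, key=lambda p: -weight_of(p))
    return [
        (weight, sorted(p["name"] for p in group))
        for weight, group in groupby(ordered, key=weight_of)
    ]


def ambiguous_groups(nodepools):
    """Return the groups of NodePools that share a weight (no guaranteed order)."""
    return [names for _, names in evaluation_order(nodepools) if len(names) > 1]


# Hypothetical NodePools for illustration.
POOLS = [
    {"name": "spot-general", "weight": 50},
    {"name": "on-demand-fallback", "weight": 10},
    {"name": "batch"},
    {"name": "default"},
]

if __name__ == "__main__":
    for weight, names in evaluation_order(POOLS):
        print(weight, names)
    print("ties:", ambiguous_groups(POOLS))
```

Here batch and default end up in the same group, which is the situation the pitfall describes: giving one of them a weight, or separating them with a taint, removes the ambiguity.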

[FACT] Spot without instance diversity causes InsufficientInstanceCapacity: pinning only 1 or 2 Spot instance types is risky. Karpenter launches Spot capacity with the price-capacity-optimized allocation strategy, which works best with many candidates, so leave at least 10 eligible instance types. Setting minValues on the instance-family requirement enforces a minimum level of diversity.
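
A requirements list that encodes this guidance could be generated as follows. This is a hedged sketch: the function name spot_diversity_requirements is hypothetical, the default of 5 instance families is an assumed value chosen for illustration, and only the 10 instance types figure comes from the text above. The output is JSON, which kubectl accepts in place of YAML.

```python
import json


def spot_diversity_requirements(min_families=5, min_instance_types=10):
    """Build NodePool requirements that keep Spot launches diversified."""
    return [
        {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["spot"]},
        {"key": "karpenter.k8s.aws/instance-category", "operator": "In",
         "values": ["c", "m", "r"]},
        # minValues: minimum number of distinct values that must remain
        # eligible for this key when Karpenter picks instance types.
        {"key": "karpenter.k8s.aws/instance-family", "operator": "Exists",
         "minValues": min_families},
        {"key": "node.kubernetes.io/instance-type", "operator": "Exists",
         "minValues": min_instance_types},
    ]


if __name__ == "__main__":
    fragment = {"spec": {"template": {"spec": {
        "requirements": spot_diversity_requirements()}}}}
    print(json.dumps(fragment, indent=2))
```

The printed fragment is meant to be merged into a full NodePool manifest. If it is applied as a JSON merge patch, the requirements list replaces the existing one rather than being appended to it.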

[FACT] Pods without defined requests cause incorrect bin-packing: Karpenter sizes nodes based on pod requests, not limits. Pods without requests are treated as if they consume no resources, leading to undersized nodes and OOM kills. Use LimitRange to define defaults per namespace.
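
The effect on sizing is easy to see with arithmetic. The sketch below sums CPU requests the way the paragraph above describes, with containers that declare no request contributing zero. It is a simplified model, not Karpenter's scheduler: the pod dictionaries and names are hypothetical, and only the "m" suffix and plain decimal CPU quantities are handled.

```python
def cpu_millicores(quantity):
    """Convert a CPU quantity such as "500m", "1" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def requested_cpu(pods):
    """Total CPU requested in millicores; a missing request counts as 0."""
    return sum(
        cpu_millicores(container.get("requests", {}).get("cpu", "0"))
        for pod in pods
        for container in pod["containers"]
    )


def pods_without_cpu_request(pods):
    """Names of pods with at least one container that has no CPU request."""
    return [
        pod["name"]
        for pod in pods
        if any("cpu" not in c.get("requests", {}) for c in pod["containers"])
    ]


# Hypothetical workload: 5 pods requesting 1 CPU each, plus 2 with no requests.
PODS = (
    [{"name": f"inflate-{i}", "containers": [{"requests": {"cpu": "1"}}]}
     for i in range(5)]
    + [{"name": f"no-requests-{i}", "containers": [{}]} for i in range(2)]
)

if __name__ == "__main__":
    print("requested millicores:", requested_cpu(PODS))
    print("missing requests:", pods_without_cpu_request(PODS))
```

Seven pods will run, but only 5000m of CPU is visible for sizing, so the two pods without requests compete for whatever headroom the node happens to have. A LimitRange with default requests closes that gap at the namespace level.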

[FACT] expireAfter: Never on general-workload NodePools lets CVEs pile up: long-lived nodes keep running the AMI they were launched with and never receive newer OS patches. The default of 720h (30 days) forces periodic rotation, so nodes are regularly replaced and can pick up patched AMIs.

[FACT] Karpenter and Node Termination Handler (NTH) must not coexist: NTH and Karpenter's interruption handling mechanism can conflict when handling Spot events. If Karpenter is configured with interruptionQueue, uninstall NTH.

[FACT] Controller DNS policy: the Karpenter controller uses dnsPolicy: ClusterFirst by default, which creates a circular dependency if CoreDNS runs on nodes managed by Karpenter (Karpenter needs DNS to launch the nodes that DNS is waiting for). Solution: keep CoreDNS on a fixed node group, or set dnsPolicy: Default on the Karpenter controller.

Reflection Exercise

An EKS cluster with 3 teams has the following workloads:

API Team: critical service checkout-api, 20 replicas with requests: cpu=500m, memory=512Mi. Cannot be interrupted during business hours (Mon-Fri 8am–8pm). Must run on c or m instances generation 5+.
ML Team: training jobs that run at night, tolerate interruption (checkpoints every 5 min), need p3.2xlarge or g4dn.xlarge GPU instances, and must run at the lowest possible cost.
Platform Team: DaemonSets and observability tools that must run on all nodes.

Answer:

How many NodePools would you create and what is the justification for each one? Describe the requirements, taints, weight, and disruption for each.
The ML Team wants to use Spot for their GPU jobs. What specific risks exist with Spot for GPUs and how would Karpenter mitigate these risks with the SQS interruption queue?
For checkout-api, how would you ensure that Karpenter does not consolidate nodes during business hours? (Describe using the correct field from the NodePool spec).
The platform team wants their observability tools (DaemonSet) to run on all nodes created by Karpenter, including GPU nodes. Why do DaemonSets not need tolerations for NoSchedule taints created by Karpenter? (Hint: startupTaints vs taints).
An architect proposes migrating from Karpenter to EKS Auto Mode. What are the fundamental differences between EKS Auto Mode and Karpenter for this scenario with multiple workload profiles?

References

[FACT] Getting Started with Karpenter (v1.13) — karpenter.sh
[FACT] NodePools — Karpenter v1.13 — karpenter.sh
[FACT] NodeClasses — Karpenter v1.13 — karpenter.sh
[FACT] Disruption — Karpenter v1.13 — karpenter.sh
[FACT] Karpenter Best Practices — EKS Best Practices — aws.github.io
[FACT] Scale cluster compute with Karpenter and Cluster Autoscaler — Amazon EKS — docs.aws.amazon.com