luizmachado.dev


Session 047 — EKS: Managed Node Groups vs Fargate, Upgrades and Node Drain

Prerequisite: session-046 (EKS cluster provisioning, eksctl, VPC CNI)


Session Objectives

  • Configure managed node groups with labels and taints to separate workloads by profile
  • Create Fargate Profiles with selectors by namespace/labels and understand all their limitations
  • Execute cluster version upgrade in the correct order (control plane → add-ons → node groups)
  • Understand the 4 phases of managed node group update (Setup, Scale Up, Upgrade, Scale Down)
  • Execute kubectl drain on a node without workload downtime

1. Managed Node Groups — Fundamental Concepts

[FACT] Every managed node group is implemented as an EC2 Auto Scaling Group managed by EKS, running inside the customer's account. EKS provides the high-level API; the EC2 and ASG resources are in the customer's account and visible in the EC2 console.

[FACT] EKS automatically applies the following labels to each node in a managed node group (prefixed with eks.amazonaws.com/):

eks.amazonaws.com/nodegroup=<nodegroup-name>
eks.amazonaws.com/nodegroup-image=<ami-id>
eks.amazonaws.com/capacityType=ON_DEMAND  # or SPOT

[FACT] Additional labels and taints can be applied and updated via update-nodegroup-config without needing to recreate the node group. The label/taint update is applied to all new nodes; existing nodes receive the update on the next rolling update (when the AMI is updated).
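The label and taint update above can be sketched with boto3. This is a minimal illustration, not a definitive implementation: the helper names `build_label_taint_update` and `apply_update`, and the cluster and node group names, are assumptions made for the example. The payload shape follows the boto3 EKS client's `update_nodegroup_config` call.

```python
def build_label_taint_update(cluster, nodegroup, labels, taints):
    """Build the kwargs for eks.update_nodegroup_config (no API call here)."""
    return {
        "clusterName": cluster,
        "nodegroupName": nodegroup,
        "labels": {"addOrUpdateLabels": dict(labels)},
        "taints": {
            "addOrUpdateTaints": [
                {"key": key, "value": value, "effect": effect}
                for (key, value, effect) in taints
            ]
        },
    }


def apply_update(kwargs):
    """Send the update. Needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the builder above has no dependencies

    return boto3.client("eks").update_nodegroup_config(**kwargs)


# Hypothetical names, used only for illustration.
params = build_label_taint_update(
    "checkout-prod",
    "ng-batch-spot",
    labels={"tier": "batch"},
    # The EKS API spells effects NO_SCHEDULE, not NoSchedule as in Pod specs.
    taints=[("tier", "batch", "NO_SCHEDULE")],
)
print(params["taints"]["addOrUpdateTaints"])
```

Keeping the payload builder separate from the API call makes it easy to review or unit test the change before anything is sent to EKS.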

1.1 Workload separation with labels and taints

Strategy: multiple node groups per workload profile

┌──────────────────────────────────────────────────────────────────┐
│  EKS Cluster                                                     │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ ng-app-ondemand │  │  ng-spot-batch  │  │  ng-gpu-ml      │  │
│  │                 │  │                 │  │                 │  │
│  │ m5.xlarge       │  │ c5.xlarge +     │  │ p3.2xlarge      │  │
│  │ capacity: OD    │  │ c5a/n.xlarge    │  │ capacity: OD    │  │
│  │                 │  │ capacity: SPOT  │  │                 │  │
│  │ label:          │  │ label:          │  │ label:          │  │
│  │  tier=app       │  │  tier=batch     │  │  tier=ml        │  │
│  │                 │  │                 │  │  accelerator=gpu│  │
│  │ taint:          │  │ taint:          │  │                 │  │
│  │  (none)         │  │  tier=batch     │  │ taint:          │  │
│  │                 │  │  :NoSchedule    │  │  tier=ml        │  │
│  └─────────────────┘  └─────────────────┘  │  :NoSchedule    │  │
│                                             └─────────────────┘  │
│  Workload app:          Workload batch:      Workload ML:         │
│  nodeSelector:          nodeSelector:        nodeSelector:        │
│    tier: app              tier: batch          accelerator: gpu   │
│                         tolerations:          tolerations:        │
│                         - key: tier             - key: tier       │
│                           value: batch            value: ml       │
│                           effect: NoSchedule      effect: NOS..  │
└──────────────────────────────────────────────────────────────────┘
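How the diagram's nodeSelector and tolerations interact can be sketched as a small, simplified model of the scheduler's checks. The function names and the dict shapes are assumptions for illustration; the real scheduler evaluates many more predicates (resources, affinity, other taint effects).

```python
def tolerates(taint, tolerations):
    """Return True if at least one toleration matches the taint."""
    for tol in tolerations:
        # An unset effect on the toleration matches every taint effect.
        if tol.get("effect") not in (None, taint["effect"]):
            continue
        operator = tol.get("operator", "Equal")
        if operator == "Exists":
            # An unset key with Exists tolerates every taint key.
            if tol.get("key") in (None, taint["key"]):
                return True
        elif tol.get("key") == taint["key"] and tol.get("value") == taint.get("value"):
            return True
    return False


def can_schedule(pod, node):
    """nodeSelector must match the node labels; NoSchedule taints must be tolerated."""
    selector = pod.get("nodeSelector", {})
    if any(node["labels"].get(k) != v for k, v in selector.items()):
        return False
    hard = [t for t in node.get("taints", []) if t["effect"] == "NoSchedule"]
    return all(tolerates(t, pod.get("tolerations", [])) for t in hard)


ng_app = {"labels": {"tier": "app"}, "taints": []}
ng_batch = {
    "labels": {"tier": "batch"},
    "taints": [{"key": "tier", "value": "batch", "effect": "NoSchedule"}],
}
batch_pod = {
    "nodeSelector": {"tier": "batch"},
    "tolerations": [{"key": "tier", "operator": "Equal",
                     "value": "batch", "effect": "NoSchedule"}],
}
plain_pod = {}  # no nodeSelector, no tolerations

print(can_schedule(batch_pod, ng_batch))  # True
print(can_schedule(plain_pod, ng_batch))  # False: taint not tolerated
print(can_schedule(plain_pod, ng_app))    # True
```

A toleration only permits a pod to land on a tainted node; it is the nodeSelector that steers it there. That is why the batch and ML workloads in the diagram carry both.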

1.2 Capacity Types: On-Demand and Spot in node groups

[FACT] A managed node group can only be ON_DEMAND or SPOT — it is not possible to mix both types in the same node group. For workloads that use both, create two separate node groups (one OD and one Spot).

[FACT] For Spot, the allocation strategy is:
- price-capacity-optimized (PCO) — K8s clusters 1.28+ (most recent, recommended)
- capacity-optimized (CO) — K8s clusters 1.27 and earlier (does not change for existing node groups)

[FACT] Managed Spot node groups have Capacity Rebalancing enabled by default: when a Spot node receives a Rebalance Recommendation, EKS tries to launch a replacement node before draining the original.


2. Fargate Profiles — Selectors and Limitations

2.1 Fargate execution model

[FACT] Each Pod on Fargate has its own dedicated VM (isolated kernel, CPU, memory, and ENI). There is no shared host between Pods. This provides stronger isolation than EC2, but removes capabilities that depend on the host.

[FACT] On Fargate, each Pod = one node in kubectl get nodes. The node name follows the pattern fargate-ip-<ip>.region.compute.internal.

kubectl get nodes
# NAME                                                       STATUS  VERSION
# fargate-ip-10-0-1-20.us-east-1.compute.internal           Ready   v1.33.0-eks-xxx
# fargate-ip-10-0-1-21.us-east-1.compute.internal           Ready   v1.33.0-eks-xxx
# ip-10-0-2-10.us-east-1.compute.internal                   Ready   v1.33.0-eks-xxx  ← EC2 node

2.2 Fargate Limitations (complete table)

[FACT] The following limitations apply to Fargate on EKS:

╔═══════════════════════════════════════════════╦═══════╦═════════════════════════════╗
║ Feature / Capability                          ║Support║ Alternative                 ║
╠═══════════════════════════════════════════════╬═══════╬═════════════════════════════╣
║ DaemonSets                                    ║  NO   ║ Sidecar container           ║
║ Privileged containers                         ║  NO   ║ Use EC2 node group          ║
║ HostPort / HostNetwork                        ║  NO   ║ Service ClusterIP or LB     ║
║ GPUs                                          ║  NO   ║ EC2 node group (p3, g4dn)   ║
║ Amazon EBS (volumes)                          ║  NO   ║ Amazon EFS (auto-mount)     ║
║ Public subnets                                ║  NO   ║ Private subnet + NAT GW     ║
║ Spot capacity type                            ║  NO   ║ Managed node group Spot     ║
║ Custom AMI                                    ║  NO   ║ EC2 node group              ║
║ Custom CNI                                    ║  NO   ║ EC2 node group              ║
║ SSH access                                    ║  NO   ║ kubectl exec                ║
║ AWS Outposts / Wavelength / Local Zones       ║  NO   ║ EC2 node group              ║
║ Windows containers                            ║  NO   ║ EC2 Windows node group      ║
║ ARM / Graviton                                ║  NO   ║ EC2 Graviton node group     ║
║ IMDS (instance metadata)                      ║  NO   ║ IRSA for IAM credentials    ║
╠═══════════════════════════════════════════════╬═══════╬═════════════════════════════╣
║ Amazon EFS (static provisioning)              ║  YES  ║ -                           ║
║ ALB / NLB (target type IP)                    ║  YES  ║ - (no instance target mode) ║
║ Security Groups for Pods (SecurityGroupPolicy)║  YES  ║ -                           ║
║ VPA / HPA                                     ║  YES  ║ -                           ║
╚═══════════════════════════════════════════════╩═══════╩═════════════════════════════╝

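[FACT] IMDS is not available for Pods on Fargate. Applications that need IAM credentials must use IRSA (IAM Roles for Service Accounts). Applications that query IMDS for the Region or AZ must receive those values another way, typically as environment variables set in the Pod spec. The downward API exposes only fields of the Pod itself (such as its labels and annotations), so it helps only when the value is already recorded there.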
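On the application side, reading the Region from the environment instead of IMDS might look like the sketch below. The function name `resolve_region` is hypothetical; `AWS_REGION` and `AWS_DEFAULT_REGION` are the variables the Pod spec is assumed to set.

```python
import os


def resolve_region(env=None):
    """Return the AWS Region from environment variables, never from IMDS."""
    env = os.environ if env is None else env
    for var in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        value = env.get(var)
        if value:
            return value
    raise RuntimeError(
        "Region not configured: set AWS_REGION in the Pod spec "
        "(IMDS is unavailable on Fargate)"
    )


print(resolve_region({"AWS_REGION": "us-east-1"}))
```

Failing fast with a clear message avoids the long timeouts an SDK can incur while probing a metadata endpoint that will never answer.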

2.3 Fargate Profile — structure

[FACT] Each Fargate Profile can have up to 5 selectors. Each selector requires namespace (mandatory field) and can have optional labels. A Pod must match at least one selector to be scheduled on Fargate.

# Fargate profile via eksctl ClusterConfig
fargateProfiles:
  - name: fp-apps
    selectors:
      # Selector 1: any pod in the "analytics" namespace with label fargate=true
      - namespace: analytics
        labels:
          fargate: "true"
      # Selector 2: any pod in the "batch" namespace (no label filter)
      - namespace: batch
      # Selector 3: CoreDNS pods in kube-system
      - namespace: kube-system
        labels:
          k8s-app: kube-dns
    # Private subnets for the Fargate pods (required)
    subnets:
      - subnet-0111aaaa
      - subnet-0222bbbb
      - subnet-0333cccc

[FACT] Pods that don't match any Fargate profile selector are left to the default scheduler. If the cluster has EC2 nodes that fit, the Pod is placed there; in a Fargate-only cluster it stays Pending indefinitely.
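The selector rule (namespace is mandatory, labels optional, one matching selector is enough) can be sketched as follows. The function names are illustrative, and the sketch assumes exact string matching only; any wildcard support in selectors is left out.

```python
def matches_selector(namespace, labels, selector):
    """A selector matches when the namespace is equal and all its labels are present."""
    if selector["namespace"] != namespace:
        return False
    wanted = selector.get("labels", {})
    return all(labels.get(k) == v for k, v in wanted.items())


def runs_on_fargate(namespace, labels, selectors):
    """A Pod goes to Fargate if at least one selector of the profile matches."""
    return any(matches_selector(namespace, labels, s) for s in selectors)


# Same selectors as the fp-apps profile in the eksctl example.
fp_apps = [
    {"namespace": "analytics", "labels": {"fargate": "true"}},
    {"namespace": "batch"},
    {"namespace": "kube-system", "labels": {"k8s-app": "kube-dns"}},
]

print(runs_on_fargate("analytics", {"fargate": "true"}, fp_apps))  # True
print(runs_on_fargate("analytics", {"app": "etl"}, fp_apps))       # False
print(runs_on_fargate("batch", {"app": "etl"}, fp_apps))           # True
```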

2.4 Fargate Pod Sizing

[FACT] Fargate provisions a micro-VM with resources equal to the sum of requests from all containers in the Pod, rounded up to the next supported combination. Available combinations: 0.25–16 vCPU and 0.5–120 GB (with minimum proportion rules).
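The rounding can be sketched as a lookup over the supported combinations. Two things here are assumptions not stated in the text above and should be checked against the current AWS documentation: the per-vCPU memory steps in `FARGATE_COMBOS`, and the 256 MB that Fargate adds to each Pod's memory reservation for Kubernetes components. The sketch also treats GiB and GB as interchangeable.

```python
# (vCPU, allowed memory values in GB); assumed from the AWS Fargate sizing table.
FARGATE_COMBOS = [
    (0.25, [0.5, 1, 2]),
    (0.5, list(range(1, 5))),        # 1..4 GB in 1 GB steps
    (1, list(range(2, 9))),          # 2..8 GB
    (2, list(range(4, 17))),         # 4..16 GB
    (4, list(range(8, 31))),         # 8..30 GB
    (8, list(range(16, 61, 4))),     # 16..60 GB in 4 GB steps
    (16, list(range(32, 121, 8))),   # 32..120 GB in 8 GB steps
]
SYSTEM_OVERHEAD_GB = 0.25  # assumed 256 MB reserved for Kubernetes components


def fargate_size(cpu_request, memory_gb_request):
    """Return the smallest (vCPU, GB) combination that covers the Pod's requests."""
    needed_memory = memory_gb_request + SYSTEM_OVERHEAD_GB
    for vcpu, memories in FARGATE_COMBOS:
        if vcpu < cpu_request:
            continue
        for memory in memories:
            if memory >= needed_memory:
                return vcpu, memory
    raise ValueError("requests exceed the largest Fargate combination")


print(fargate_size(1, 2))       # (1, 3)
print(fargate_size(0.25, 0.5))  # (0.25, 1)
print(fargate_size(0.5, 6))     # (1, 7)
```

Under these assumptions a Pod requesting 1 vCPU and 2 GB is billed as 1 vCPU and 3 GB, because the overhead pushes it past the 2 GB step.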

[FACT] Use Vertical Pod Autoscaler (VPA) with mode Auto or Recreate to adjust Fargate pod sizing — VPA recreates the Pod with new resource requests, and Fargate provisions the corresponding VM.


3. Cluster Version Upgrade — Order and Restrictions

3.1 Fundamental rules

[FACT] The EKS cluster upgrade must follow the mandatory order:

1. Control Plane  →  2. Add-ons  →  3. Node Groups  →  4. kubectl
    (EKS-managed)      (VPC CNI,       (managed NG         (1 minor
                       CoreDNS,        rolling update)      version skew)
                       kube-proxy,
                       EBS CSI)

[FACT] The upgrade is only one minor version at a time — it is not possible to jump from 1.32 directly to 1.34. Each step (1.32→1.33, 1.33→1.34) requires a separate upgrade.
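The one-minor-at-a-time rule means a multi-version upgrade is a sequence of separate steps. A minimal sketch that expands a target into that sequence (the function name `upgrade_path` is hypothetical):

```python
def upgrade_path(current, target):
    """List every intermediate version needed to go from current to target."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot upgrade from {current} to {target}")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.32", "1.34"))  # ['1.33', '1.34']
```

Each entry in the returned list corresponds to one full pass through the control plane, add-on and node group steps.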

[FACT] The control plane upgrade is irreversible — it is not possible to downgrade. If you need to revert to a previous version, you must create a new cluster at the desired version and migrate the workloads.

[FACT] Starting from K8s 1.28, the kubelet can be up to 3 minor versions behind the kube-apiserver (upstream skew policy). In practice, with kubelet 1.30 you can upgrade the control plane to 1.31, 1.32, and 1.33 before updating the node group.
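The skew rule can be checked with a few lines of arithmetic on the minor versions. This sketch assumes the upstream policy of 3 minor versions from 1.28 onward and 2 before that; the function name is illustrative.

```python
def kubelet_skew_ok(control_plane, kubelet):
    """True if the kubelet version is within the allowed skew of the API server."""
    cp_minor = int(control_plane.split(".")[1])
    kubelet_minor = int(kubelet.split(".")[1])
    max_skew = 3 if cp_minor >= 28 else 2  # assumed upstream skew policy
    # The kubelet may be older than the API server, never newer.
    return 0 <= cp_minor - kubelet_minor <= max_skew


print(kubelet_skew_ok("1.33", "1.30"))  # True
print(kubelet_skew_ok("1.34", "1.30"))  # False
```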

[FACT] EKS requires up to 5 available IPs in the cluster subnets to create new ENIs during the control plane upgrade. If the subnet is full, the upgrade may fail.
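A pre-flight check for free IPs might look like the sketch below. The helper names are assumptions, and flagging any subnet with fewer than 5 free addresses is a deliberately conservative heuristic rather than the exact rule EKS applies. The subnet dicts follow the shape returned by the EC2 `describe_subnets` call.

```python
def subnets_short_on_ips(subnets, required=5):
    """Return the IDs of subnets with fewer free IPs than the threshold."""
    return [
        s["SubnetId"]
        for s in subnets
        if s["AvailableIpAddressCount"] < required
    ]


def fetch_subnets(subnet_ids):
    """Fetch subnet details. Needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the check above has no dependencies

    return boto3.client("ec2").describe_subnets(SubnetIds=subnet_ids)["Subnets"]


# Sample data in the describe_subnets shape (IDs are made up).
sample = [
    {"SubnetId": "subnet-0111aaaa", "AvailableIpAddressCount": 120},
    {"SubnetId": "subnet-0222bbbb", "AvailableIpAddressCount": 3},
]
print(subnets_short_on_ips(sample))  # ['subnet-0222bbbb']
```

The cluster's subnet IDs can be read from `describe_cluster` under `resourcesVpcConfig.subnetIds` and passed to `fetch_subnets`.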

3.2 Complete upgrade flow

┌─────────────────────────────────────────────────────────────────────┐
│ Upgrade 1.32 → 1.33                                                  │
│                                                                      │
│ Step 1: Check cluster state                                          │
│   aws eks list-insights --cluster-name checkout-prod                 │
│   kubectl get nodes (all must be Ready and on the current version)   │
│                                                                      │
│ Step 2: Upgrade Control Plane                                        │
│   aws eks update-cluster-version --name checkout-prod                │
│   --kubernetes-version 1.33                                          │
│   [~15 min: rolling update of API servers, no downtime]              │
│                                                                      │
│ Step 3: Upgrade Add-ons                                              │
│   aws eks update-addon --cluster-name checkout-prod                  │
│   --addon-name vpc-cni --resolve-conflicts OVERWRITE                 │
│   aws eks update-addon ... --addon-name coredns                      │
│   aws eks update-addon ... --addon-name kube-proxy                   │
│   aws eks update-addon ... --addon-name aws-ebs-csi-driver           │
│                                                                      │
│ Step 4: Upgrade Node Groups                                          │
│   aws eks update-nodegroup-version --cluster-name checkout-prod      │
│   --nodegroup-name ng-app-workers                                    │
│   [rolling update: Setup → Scale Up → Upgrade → Scale Down]         │
│                                                                      │
│ Step 5: Fargate Pods                                                 │
│   kubectl rollout restart deployment/my-app -n analytics             │
│   (existing Fargate pods are not updated automatically)              │
│                                                                      │
│ Step 6: Update kubectl                                               │
│   brew upgrade kubectl  # or equivalent                              │
└─────────────────────────────────────────────────────────────────────┘

3.3 Managed node group update phases

[FACT] When EKS updates a managed node group (AMI update or K8s version update), it executes 4 phases:

┌─────────────────────────────────────────────────────────────────────┐
│ Phase 1: SETUP                                                       │
│   • Creates a new EC2 Launch Template version for the ASG            │
│   • Updates the ASG to use the new LT version                        │
│   • Determines maxUnavailable (default: 1 node, maximum: 100)        │
│                                                                      │
│ Phase 2: SCALE UP                                                    │
│   • Increases the ASG max and desired count                          │
│   • Launches new nodes (new AMI/version) in the same AZs             │
│   • Waits for new nodes to become Ready with EKS labels              │
│   • Old nodes: cordon + "exclude-from-external-load-balancers" label │
│   • Timeout: 15 min per node to boot and join the cluster            │
│                                                                      │
│ Phase 3: UPGRADE (default strategy)                                  │
│   • Picks an old node at random (respects maxUnavailable)            │
│   • Drains pods from the node (eviction, respects PDBs)              │
│   • Waits 60 seconds after cordon                                    │
│   • Sends a termination request to the ASG                           │
│   • Repeats until all nodes use the new LT version                   │
│   • PodEvictionFailure → upgrade fails (after 15 min timeout)        │
│                                                                      │
│ Phase 4: SCALE DOWN                                                  │
│   • Returns the ASG max and desired count to the original values     │
│   • If Cluster Autoscaler is scaling up during this step,            │
│     the workflow exits immediately                                   │
└─────────────────────────────────────────────────────────────────────┘

[FACT] There are two update strategies for Phase 3:
- Default (recommended): launches new nodes first, then terminates old ones — total capacity never drops below the configured value
- Minimal: terminates old nodes first, then launches new ones — capacity drops temporarily; recommended for GPU (to avoid paying for two sets of GPU simultaneously)

[FACT] During a version update, EKS respects Pod Disruption Budgets (PDBs). However, PDBs are not respected during AZRebalance operations or when the desired count is reduced: those actions attempt eviction for up to 15 minutes and then terminate the node regardless.
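The update settings discussed in this section (maxUnavailable and the update strategy) are passed as the `updateConfig` of `update_nodegroup_config`. The sketch below only builds and validates that dict. The helper name is hypothetical, and the `updateStrategy` key with values `DEFAULT` and `MINIMAL` is an assumption that should be verified against the installed boto3 version.

```python
def build_update_config(max_unavailable=None, max_unavailable_pct=None,
                        strategy="DEFAULT"):
    """Build an updateConfig dict; exactly one unavailability setting is allowed."""
    if (max_unavailable is None) == (max_unavailable_pct is None):
        raise ValueError("set exactly one of max_unavailable / max_unavailable_pct")
    if strategy not in ("DEFAULT", "MINIMAL"):
        raise ValueError(f"unknown strategy: {strategy}")
    config = {"updateStrategy": strategy}  # assumed key name, see lead-in
    if max_unavailable is not None:
        if not 1 <= max_unavailable <= 100:
            raise ValueError("max_unavailable must be between 1 and 100")
        config["maxUnavailable"] = max_unavailable
    else:
        if not 1 <= max_unavailable_pct <= 100:
            raise ValueError("max_unavailable_pct must be between 1 and 100")
        config["maxUnavailablePercentage"] = max_unavailable_pct
    return config


# GPU node group: replace nodes in place instead of doubling capacity.
print(build_update_config(max_unavailable=1, strategy="MINIMAL"))
```

The result would be sent as `updateConfig=...` together with `clusterName` and `nodegroupName`.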


4. kubectl drain — Zero-Downtime Node Maintenance

4.1 Difference between cordon and drain

kubectl cordon <node>
  → Marks the node as Unschedulable
  → EXISTING pods keep running
  → No NEW pod is scheduled on the node
  → Used when you need time to prepare the drain

kubectl drain <node>
  → Cordons the node automatically
  → EVICTS all pods (DaemonSet pods are skipped with --ignore-daemonsets)
  → Waits for each evicted pod to terminate; its controller recreates it on another node
  → Leaves the node ready for maintenance or shutdown

4.2 Essential drain flags

[FACT] Flags typically needed for drain in real clusters:

# --ignore-daemonsets     required when DaemonSet pods exist: drain refuses to proceed otherwise
# --delete-emptydir-data  required when pods use emptyDir: drain refuses to delete local data otherwise
# --grace-period=60       overrides each pod's terminationGracePeriodSeconds (Kubernetes default: 30s)
# --timeout=300s          maximum total time for the drain to complete
# --force                 also deletes pods without a controller (they are not recreated)
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s \
  --force

[FACT] If a PDB blocks eviction (e.g., minAvailable=1 with only 1 replica), the drain keeps retrying until --timeout expires, or indefinitely if no timeout is set. Solution: temporarily increase replicas, or adjust the PDB.
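When drains are scripted, building the argument list explicitly keeps the risky flags opt-in. A minimal sketch, with a hypothetical helper name and defaults chosen for the example:

```python
import shlex


def drain_command(node, grace_period=None, timeout="300s", force=False):
    """Build the kubectl drain argv; --force and --grace-period are opt-in."""
    cmd = [
        "kubectl", "drain", node,
        "--ignore-daemonsets",
        "--delete-emptydir-data",
        f"--timeout={timeout}",
    ]
    if grace_period is not None:
        cmd.append(f"--grace-period={grace_period}")
    if force:
        cmd.append("--force")  # deletes unmanaged pods for good
    return cmd


print(shlex.join(drain_command("ip-10-0-2-10.us-east-1.compute.internal")))
```

The list can be handed to `subprocess.run(cmd, check=True)`, which raises if the drain times out on a blocking PDB.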

4.3 PodDisruptionBudget — prerequisite for safe drain

# PDB to guarantee that at least 1 pod stays available during maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb
  namespace: production
spec:
  # minAvailable: minimum number of pods that must stay up during a disruption
  minAvailable: 1
  # OR maxUnavailable: maximum number of pods that may be down during a disruption
  # maxUnavailable: 1
  selector:
    matchLabels:
      app: checkout-api

[FACT] minAvailable and maxUnavailable can be absolute numbers or percentages (e.g., 50%). With minAvailable: 1 and 3 replicas, the drain can evict up to 2 pods at a time (equivalent to maxUnavailable: 2).
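The arithmetic behind a PDB can be sketched as below. The function names are illustrative, and the sketch assumes that percentages are rounded up, as the Kubernetes documentation describes for both fields.

```python
import math


def _resolve(value, total):
    """Turn an int or a percentage string such as '50%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        return math.ceil(total * int(value[:-1]) / 100)
    return int(value)


def allowed_disruptions(replicas, healthy, min_available=None, max_unavailable=None):
    """How many pods may be evicted right now without violating the PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        desired_healthy = _resolve(min_available, replicas)
    else:
        desired_healthy = replicas - _resolve(max_unavailable, replicas)
    return max(0, healthy - desired_healthy)


print(allowed_disruptions(3, 3, min_available=1))      # 2
print(allowed_disruptions(1, 1, min_available=1))      # 0: drain blocks
print(allowed_disruptions(7, 7, min_available="50%"))  # 3
```

A result of 0 is the situation in which a drain waits: no eviction is allowed until another replica becomes healthy.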


5. CDK Python — Node Groups, Fargate Profile, Taints and Labels

from aws_cdk import (
    Stack,
    aws_eks as eks,
    aws_ec2 as ec2,
    aws_iam as iam,
)
from constructs import Construct


class EksNodeGroupsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str,
                 cluster: eks.Cluster, node_role: iam.Role, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # ──────────────────────────────────────────────────────────────
        # Node group 1: Application, On-Demand, no taint (main workload)
        # ──────────────────────────────────────────────────────────────
        ng_app = cluster.add_nodegroup_capacity("NgApp",
            nodegroup_name="ng-app-ondemand",
            instance_types=[
                ec2.InstanceType("m5.xlarge"),
                ec2.InstanceType("m5a.xlarge"),
            ],
            min_size=3,
            desired_size=6,
            max_size=30,
            disk_size=100,
            node_role=node_role,
            capacity_type=eks.CapacityType.ON_DEMAND,
            labels={"tier": "app", "workload": "api"},
            # No taint: accepts any pod that is not steered elsewhere by nodeSelector/affinity
            tags={"project": "checkout", "env": "prod"},
        )

        # ──────────────────────────────────────────────────────────────
        # Node group 2: Batch/Jobs, Spot, with a NoSchedule taint
        # Only pods with the toleration tier=batch:NoSchedule are scheduled here
        # ──────────────────────────────────────────────────────────────
        ng_batch = cluster.add_nodegroup_capacity("NgBatch",
            nodegroup_name="ng-batch-spot",
            instance_types=[
                ec2.InstanceType("c5.xlarge"),
                ec2.InstanceType("c5a.xlarge"),
                ec2.InstanceType("c5n.xlarge"),
                ec2.InstanceType("c4.xlarge"),
                ec2.InstanceType("m5.xlarge"),   # diversify Spot pools
                ec2.InstanceType("m5a.xlarge"),
            ],
            min_size=0,
            desired_size=2,
            max_size=50,
            disk_size=50,
            node_role=node_role,
            capacity_type=eks.CapacityType.SPOT,
            labels={"tier": "batch", "workload": "jobs"},
            taints=[
                eks.TaintSpec(
                    key="tier",
                    value="batch",
                    effect=eks.TaintEffect.NO_SCHEDULE,  # blocks pods without a toleration
                )
            ],
        )

        # ──────────────────────────────────────────────────────────────
        # Node group 3: Infra, On-Demand, PreferNoSchedule taint
        # Monitoring, logging, cluster-autoscaler etc.
        # PreferNoSchedule: pods WITHOUT a toleration prefer other nodes but may still land here
        # ──────────────────────────────────────────────────────────────
        ng_infra = cluster.add_nodegroup_capacity("NgInfra",
            nodegroup_name="ng-infra",
            instance_types=[ec2.InstanceType("m5.large")],
            min_size=2,
            desired_size=2,
            max_size=5,
            node_role=node_role,
            capacity_type=eks.CapacityType.ON_DEMAND,
            labels={"tier": "infra"},
            taints=[
                eks.TaintSpec(
                    key="tier",
                    value="infra",
                    effect=eks.TaintEffect.PREFER_NO_SCHEDULE,
                )
            ],
        )

        # ──────────────────────────────────────────────────────────────
        # Fargate Profile: namespaces analytics and batch-fargate
        # ──────────────────────────────────────────────────────────────
        fargate_profile = cluster.add_fargate_profile("FargateAnalytics",
            fargate_profile_name="fp-analytics",
            selectors=[
                # Analytics: only pods with label fargate=true
                eks.Selector(
                    namespace="analytics",
                    labels={"fargate": "true"},
                ),
                # Batch jobs: any pod in the batch-fargate namespace
                eks.Selector(
                    namespace="batch-fargate",
                ),
            ],
            # Fargate requires private subnets
            subnet_selection=ec2.SubnetSelection(
                subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS,
            ),
        )


class EksFargateWorkloadStack(Stack):
    """
    Example Kubernetes manifests applied via CDK (eks.KubernetesManifest).
    """
    def __init__(self, scope: Construct, construct_id: str,
                 cluster: eks.Cluster, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # ──────────────────────────────────────────────────────────────
        # PodDisruptionBudget for the API workload
        # ──────────────────────────────────────────────────────────────
        cluster.add_manifest("CheckoutApiPDB", {
            "apiVersion": "policy/v1",
            "kind": "PodDisruptionBudget",
            "metadata": {"name": "checkout-api-pdb", "namespace": "production"},
            "spec": {
                "minAvailable": 1,
                "selector": {"matchLabels": {"app": "checkout-api"}},
            },
        })

        # ──────────────────────────────────────────────────────────────
        # API Deployment: nodeSelector for ng-app, no batch toleration
        # ──────────────────────────────────────────────────────────────
        cluster.add_manifest("CheckoutApiDeployment", {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": "checkout-api", "namespace": "production"},
            "spec": {
                "replicas": 3,
                "selector": {"matchLabels": {"app": "checkout-api"}},
                "template": {
                    "metadata": {"labels": {"app": "checkout-api"}},
                    "spec": {
                        # Ensures scheduling only on nodes with tier=app
                        "nodeSelector": {"tier": "app"},
                        # topologySpreadConstraints: spreads replicas across AZs
                        "topologySpreadConstraints": [{
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            "whenUnsatisfiable": "DoNotSchedule",
                            "labelSelector": {"matchLabels": {"app": "checkout-api"}},
                        }],
                        "containers": [{
                            "name": "api",
                            "image": "my-ecr.dkr.ecr.us-east-1.amazonaws.com/checkout-api:latest",
                            "resources": {
                                "requests": {"cpu": "500m", "memory": "512Mi"},
                                "limits":   {"cpu": "2000m", "memory": "2Gi"},
                            },
                            # The default terminationGracePeriodSeconds is 30s; the Pod spec below raises it to 60s
                        }],
                        "terminationGracePeriodSeconds": 60,
                    },
                },
            },
        })

        # ──────────────────────────────────────────────────────────────
        # Batch Job on the Spot node group, with the tier=batch toleration
        # ──────────────────────────────────────────────────────────────
        cluster.add_manifest("BatchJob", {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": "report-generator", "namespace": "batch"},
            "spec": {
                "ttlSecondsAfterFinished": 3600,   # cleans up the Job 1h after it finishes
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "nodeSelector": {"tier": "batch"},
                        "tolerations": [{
                            "key": "tier",
                            "value": "batch",
                            "effect": "NoSchedule",
                            "operator": "Equal",
                        }],
                        "containers": [{
                            "name": "report",
                            "image": "my-ecr.dkr.ecr.us-east-1.amazonaws.com/report-gen:v1",
                            "resources": {
                                "requests": {"cpu": "2000m", "memory": "4Gi"},
                            },
                        }],
                    },
                },
            },
        })

        # ──────────────────────────────────────────────────────────────
        # Pod on Fargate: namespace analytics with label fargate=true
        # IMDS unavailable: no hard-coded AWS_METADATA_SERVICE_TIMEOUT
        # Credentials via IRSA (serviceAccountName + annotation)
        # ──────────────────────────────────────────────────────────────
        cluster.add_manifest("AnalyticsPod", {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": "analytics-processor", "namespace": "analytics"},
            "spec": {
                "replicas": 2,
                "selector": {"matchLabels": {"app": "analytics-processor"}},
                "template": {
                    "metadata": {
                        "labels": {
                            "app": "analytics-processor",
                            "fargate": "true",    # match Fargate profile selector
                        },
                    },
                    "spec": {
                        # serviceAccountName with the IRSA annotation for IAM credentials
                        "serviceAccountName": "analytics-sa",
                        "containers": [{
                            "name": "processor",
                            "image": "my-ecr.dkr.ecr.us-east-1.amazonaws.com/analytics:v2",
                            "resources": {
                                # Fargate rounds up to a supported combination
                                "requests": {"cpu": "1000m", "memory": "2Gi"},
                                "limits":   {"cpu": "1000m", "memory": "2Gi"},
                            },
                            "env": [
                                # AWS_REGION hard-coded: IMDS is unavailable on Fargate
                                {"name": "AWS_REGION", "value": "us-east-1"},
                                {"name": "AWS_DEFAULT_REGION", "value": "us-east-1"},
                            ],
                        }],
                        # Fargate ignores nodeSelector: the pod is scheduled via the profile
                    },
                },
            },
        })

6. Python — Upgrade orchestration script

"""
EKS cluster upgrade script.
Validates preconditions and runs the upgrade in the correct order.
"""
import boto3
import subprocess
import time
import sys
from dataclasses import dataclass

eks = boto3.client("eks")


@dataclass
class ClusterState:
    name: str
    current_version: str
    target_version: str
    status: str
    node_groups: list[dict]
    addons: list[dict]


def get_cluster_state(cluster_name: str, target_version: str) -> ClusterState:
    cluster = eks.describe_cluster(name=cluster_name)["cluster"]
    ng_resp = eks.list_nodegroups(clusterName=cluster_name)
    node_groups = []
    for ng_name in ng_resp["nodegroups"]:
        ng = eks.describe_nodegroup(clusterName=cluster_name, nodegroupName=ng_name)["nodegroup"]
        node_groups.append({
            "name": ng_name,
            "version": ng["version"],
            "status": ng["status"],
        })
    addon_resp = eks.list_addons(clusterName=cluster_name)
    addons = []
    for addon_name in addon_resp["addons"]:
        addon = eks.describe_addon(clusterName=cluster_name, addonName=addon_name)["addon"]
        addons.append({"name": addon_name, "version": addon["addonVersion"], "status": addon["status"]})

    return ClusterState(
        name=cluster_name,
        current_version=cluster["version"],
        target_version=target_version,
        status=cluster["status"],
        node_groups=node_groups,
        addons=addons,
    )


def validate_upgrade_prerequisites(state: ClusterState) -> list[str]:
    """Validate upgrade preconditions. Returns a list of errors."""
    errors = []

    # 1. Check cluster status
    if state.status != "ACTIVE":
        errors.append(f"Cluster is not ACTIVE: {state.status}")

    # 2. Check the version step (only 1 minor version at a time)
    curr_parts = [int(x) for x in state.current_version.split(".")]
    tgt_parts  = [int(x) for x in state.target_version.split(".")]
    if tgt_parts[1] - curr_parts[1] != 1:
        errors.append(
            f"Upgrade must be 1 minor version at a time: {state.current_version} → {state.target_version} "
            f"(difference: {tgt_parts[1] - curr_parts[1]} minor versions)"
        )

    # 3. Check node groups: this script conservatively requires every node group
    #    to match the control plane version (stricter than the kubelet skew policy)
    for ng in state.node_groups:
        if ng["version"] != state.current_version:
            errors.append(
                f"Node group {ng['name']} is on version {ng['version']}, "
                f"but the control plane is on {state.current_version}. "
                "Update the node group before upgrading the control plane."
            )
        if ng["status"] != "ACTIVE":
            errors.append(f"Node group {ng['name']} is not ACTIVE: {ng['status']}")

    return errors


def wait_for_cluster_active(cluster_name: str, timeout_min: int = 30) -> bool:
    """Wait for the cluster to become ACTIVE. Returns True on success."""
    deadline = time.time() + timeout_min * 60
    while time.time() < deadline:
        status = eks.describe_cluster(name=cluster_name)["cluster"]["status"]
        print(f"  Cluster status: {status}")
        if status == "ACTIVE":
            return True
        if status == "FAILED":
            return False
        time.sleep(30)
    return False


def upgrade_control_plane(state: ClusterState) -> str:
    """Start the control plane upgrade. Returns the update_id."""
    print(f"\n[Step 2] Upgrading control plane {state.current_version} → {state.target_version}...")
    response = eks.update_cluster_version(
        name=state.name,
        kubernetesVersion=state.target_version,
    )
    update_id = response["update"]["id"]
    print(f"  Update started: {update_id}")
    return update_id


def wait_for_cluster_update(cluster_name: str, update_id: str, timeout_min: int = 30) -> bool:
    deadline = time.time() + timeout_min * 60
    while time.time() < deadline:
        resp = eks.describe_update(name=cluster_name, updateId=update_id)
        status = resp["update"]["status"]
        print(f"  Update status: {status}")
        if status == "Successful":
            return True
        if status in ("Cancelled", "Failed"):
            print(f"  Update failed: {resp['update'].get('errors', [])}")
            return False
        time.sleep(30)
    return False


def upgrade_addon(cluster_name: str, addon_name: str) -> bool:
    """Update an add-on to the latest version compatible with the cluster."""
    try:
        # Get the cluster's current version
        cluster_version = eks.describe_cluster(name=cluster_name)["cluster"]["version"]

        # List compatible versions and take the first entry
        # (assumes the API returns the newest version first)
        versions_resp = eks.describe_addon_versions(
            kubernetesVersion=cluster_version,
            addonName=addon_name,
        )
        latest = versions_resp["addons"][0]["addonVersions"][0]["addonVersion"]

        print(f"  Updating add-on {addon_name} → {latest}")
        eks.update_addon(
            clusterName=cluster_name,
            addonName=addon_name,
            addonVersion=latest,
            resolveConflicts="OVERWRITE",
        )
        # Wait for the add-on to become ACTIVE (up to 20 x 15s = 5 min)
        for _ in range(20):
            status = eks.describe_addon(clusterName=cluster_name, addonName=addon_name)
            if status["addon"]["status"] == "ACTIVE":
                return True
            # Stop early instead of polling a failed update until the timeout
            if status["addon"]["status"] == "UPDATE_FAILED":
                return False
            time.sleep(15)
        return False
    except Exception as e:
        print(f"  Error updating add-on {addon_name}: {e}")
        return False


def upgrade_nodegroup(cluster_name: str, ng_name: str, target_version: str) -> bool:
    """Start the node group rolling update and wait for it to finish."""
    print(f"\n  Upgrading node group {ng_name} → {target_version}...")
    try:
        resp = eks.update_nodegroup_version(
            clusterName=cluster_name,
            nodegroupName=ng_name,
            version=target_version,
        )
        update_id = resp["update"]["id"]
        # Wait for completion (can take 30-60 min for large node groups)
        deadline = time.time() + 90 * 60   # 90 min timeout
        while time.time() < deadline:
            status = eks.describe_update(
                name=cluster_name,
                updateId=update_id,
                nodegroupName=ng_name,
            )["update"]["status"]
            print(f"    Node group {ng_name} update: {status}")
            if status == "Successful":
                return True
            if status in ("Cancelled", "Failed"):
                return False
            time.sleep(30)
        return False
    except Exception as e:
        print(f"  Error upgrading node group {ng_name}: {e}")
        return False


def run_cluster_upgrade(cluster_name: str, target_version: str) -> None:
    print(f"=== EKS Cluster Upgrade: {cluster_name} → {target_version} ===\n")

    # Step 1: Validate preconditions
    print("[Step 1] Validating preconditions...")
    state = get_cluster_state(cluster_name, target_version)
    errors = validate_upgrade_prerequisites(state)
    if errors:
        print("  ERRORS FOUND, upgrade aborted:")
        for err in errors:
            print(f"  ✗ {err}")
        sys.exit(1)
    print("  ✓ Preconditions OK")

    # Step 2: Upgrade control plane
    update_id = upgrade_control_plane(state)
    if not wait_for_cluster_update(cluster_name, update_id):
        print("  ✗ Control plane upgrade failed")
        sys.exit(1)
    print("  ✓ Control plane upgraded")

    # Step 3: Upgrade add-ons
    print("\n[Step 3] Upgrading add-ons...")
    for addon_name in ["vpc-cni", "coredns", "kube-proxy", "aws-ebs-csi-driver"]:
        if not upgrade_addon(cluster_name, addon_name):
            print(f"  ✗ Failed to upgrade add-on {addon_name}")

    # Step 4: Upgrade node groups
    print("\n[Step 4] Upgrading node groups...")
    state = get_cluster_state(cluster_name, target_version)  # re-read current state
    for ng in state.node_groups:
        if ng["version"] != target_version:
            if not upgrade_nodegroup(cluster_name, ng["name"], target_version):
                print(f"  ✗ Failed to upgrade node group {ng['name']}")
            else:
                print(f"  ✓ Node group {ng['name']} upgraded")

    print("\n=== Upgrade finished ===")
    print("Next step: kubectl rollout restart (Fargate pods are not upgraded automatically)")

7. CLI — Essential Examples

# ─────────────────────────────────────────────────────────────
# MANAGED NODE GROUPS: labels and taints
# ─────────────────────────────────────────────────────────────

# Add/update labels and taints on the node group (without recreating nodes)
aws eks update-nodegroup-config \
  --cluster-name checkout-prod \
  --nodegroup-name ng-batch-spot \
  --labels 'addOrUpdateLabels={tier=batch,updated-at=2026-06-13}' \
  --taints 'addOrUpdateTaints=[{key=tier,value=batch,effect=NO_SCHEDULE}]'

# Remove a label from a node group
aws eks update-nodegroup-config \
  --cluster-name checkout-prod \
  --nodegroup-name ng-batch-spot \
  --labels 'removeLabels=[updated-at]'

# Remove a taint from a node group
aws eks update-nodegroup-config \
  --cluster-name checkout-prod \
  --nodegroup-name ng-batch-spot \
  --taints 'removeTaints=[{key=tier,effect=NO_SCHEDULE}]'

# List the node groups, then show version and status for one of them
aws eks list-nodegroups --cluster-name checkout-prod
aws eks describe-nodegroup \
  --cluster-name checkout-prod \
  --nodegroup-name ng-app-ondemand \
  --query 'nodegroup.{Name:nodegroupName,Version:version,Status:status,Capacity:capacityType,Labels:labels,Taints:taints}'

# ─────────────────────────────────────────────────────────────
# FARGATE PROFILES
# ─────────────────────────────────────────────────────────────

# Create a Fargate profile via the CLI
aws eks create-fargate-profile \
  --cluster-name checkout-prod \
  --fargate-profile-name fp-analytics \
  --pod-execution-role-arn arn:aws:iam::123456789012:role/EKSFargatePodExecutionRole \
  --subnets subnet-0111aaaa subnet-0222bbbb subnet-0333cccc \
  --selectors 'namespace=analytics,labels={fargate=true}' \
              'namespace=batch-fargate'

# List Fargate profiles
aws eks list-fargate-profiles --cluster-name checkout-prod

# Check the status and selectors of a profile
aws eks describe-fargate-profile \
  --cluster-name checkout-prod \
  --fargate-profile-name fp-analytics \
  --query 'fargateProfile.{Status:status,Selectors:selectors}'

# ─────────────────────────────────────────────────────────────
# CLUSTER UPGRADE
# ─────────────────────────────────────────────────────────────

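# Step 0: Check upgrade insights (deprecated APIs, compatibility issues)
aws eks list-insights \
  --cluster-name checkout-prod \
  --filter '{"categories":["UPGRADE_READINESS"]}' \
  --query 'insights[*].{Id:id,Name:name,Status:insightStatus.status,Reason:insightStatus.reason}'

# The recommendation text is only returned per insight, by describe-insight
aws eks describe-insight \
  --cluster-name checkout-prod \
  --id <insight-id> \
  --query 'insight.recommendation'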

# Step 1: Check the current version and the nodes
kubectl version
kubectl get nodes -o wide

# Step 2: Upgrade the control plane (wait ~15 min)
aws eks update-cluster-version \
  --name checkout-prod \
  --kubernetes-version 1.33 \
  --region us-east-1

# Monitor progress
aws eks describe-cluster --name checkout-prod \
  --query 'cluster.{Version:version,Status:status,PlatformVersion:platformVersion}'

# Step 3: Upgrade add-ons
for addon in vpc-cni coredns kube-proxy aws-ebs-csi-driver; do
  echo "Upgrading add-on: $addon"
  LATEST=$(aws eks describe-addon-versions \
    --kubernetes-version 1.33 \
    --addon-name "$addon" \
    --query 'addons[0].addonVersions[0].addonVersion' \
    --output text)
  aws eks update-addon \
    --cluster-name checkout-prod \
    --addon-name "$addon" \
    --addon-version "$LATEST" \
    --resolve-conflicts OVERWRITE
done

# Step 4: Upgrade node groups (one at a time, wait for completion)
aws eks update-nodegroup-version \
  --cluster-name checkout-prod \
  --nodegroup-name ng-app-ondemand \
  --kubernetes-version 1.33   # or omit --kubernetes-version to update only the AMI

# Monitor the node group update
aws eks describe-update \
  --name checkout-prod \
  --nodegroup-name ng-app-ondemand \
  --update-id <update-id> \
  --query 'update.{Status:status,Type:type,Errors:errors}'

# Step 5: Redeploy Fargate pods (they are not upgraded automatically)
kubectl rollout restart deployment/analytics-processor -n analytics

# ─────────────────────────────────────────────────────────────
# kubectl drain: node maintenance
# ─────────────────────────────────────────────────────────────

# List nodes and identify the one to drain
kubectl get nodes -o wide

# Cordon (blocks new pods without evicting existing ones)
kubectl cordon ip-10-0-2-10.us-east-1.compute.internal

# Check the pods on the node
kubectl get pods -A -o wide --field-selector spec.nodeName=ip-10-0-2-10.us-east-1.compute.internal

# Drain (cordon + evict)
kubectl drain ip-10-0-2-10.us-east-1.compute.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s

# Verify that the pods have moved (only DaemonSet pods should remain)
kubectl get pods -A -o wide | grep ip-10-0-2-10

# After maintenance: uncordon so the node accepts pods again
kubectl uncordon ip-10-0-2-10.us-east-1.compute.internal

# Check PDBs that may block the drain
kubectl get pdb -A
kubectl describe pdb checkout-api-pdb -n production

8. Pitfalls

[FACT] Fargate ignores nodeSelector: pods scheduled on Fargate do not respect user-defined nodeSelector. The only scheduling criterion is the Fargate Profile (namespace + labels). A pod that matches no profile is never placed on Fargate; it is left to the default scheduler and stays Pending if the cluster has no EC2 node that can run it.
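The selector rule can be sketched in a few lines of Python. This is a simplified model, not the scheduler's implementation: the function name `matching_profile` is hypothetical, the selectors are copied from the fp-analytics example in section 7, and wildcard selectors and tie-breaking between several matching profiles are deliberately left out.

```python
from typing import Optional


def matching_profile(pod_namespace: str, pod_labels: dict, profiles: dict) -> Optional[str]:
    """Return the name of the first profile with a selector matching the pod, else None."""
    for name, selectors in profiles.items():
        for selector in selectors:
            if selector["namespace"] != pod_namespace:
                continue
            required = selector.get("labels", {})
            # Every label in the selector must be present on the pod with the same value.
            if all(pod_labels.get(k) == v for k, v in required.items()):
                return name
    return None


# Selectors taken from the fp-analytics profile created in section 7.
profiles = {
    "fp-analytics": [
        {"namespace": "analytics", "labels": {"fargate": "true"}},
        {"namespace": "batch-fargate"},
    ]
}

print(matching_profile("analytics", {"fargate": "true", "app": "etl"}, profiles))  # fp-analytics
print(matching_profile("analytics", {"app": "etl"}, profiles))                     # None
print(matching_profile("batch-fargate", {}, profiles))                             # fp-analytics
```

The second call is the case that surprises people: the namespace matches, but the selector also requires the label fargate=true, so the pod does not land on Fargate.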

[FACT] Existing DaemonSets in clusters with Fargate can leave pods Pending: the DaemonSet controller can target a pod at each Fargate node, but DaemonSets are not supported on Fargate. Solution: add nodeAffinity to the DaemonSet so that it runs only on EC2 nodes, by excluding nodes labeled eks.amazonaws.com/compute-type=fargate with the NotIn operator (EC2 nodes do not necessarily carry a compute-type label to select on).
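One way to apply that affinity is to generate a merge patch and feed it to kubectl. The sketch below only builds and prints the patch; the helper name `exclude_fargate_patch` is hypothetical, and it assumes Fargate nodes carry the label eks.amazonaws.com/compute-type=fargate, so it excludes that value instead of requiring a specific label on EC2 nodes.

```python
import json


def exclude_fargate_patch() -> dict:
    """Build a DaemonSet patch whose nodeAffinity keeps pods off Fargate nodes."""
    expression = {
        "key": "eks.amazonaws.com/compute-type",
        "operator": "NotIn",
        "values": ["fargate"],
    }
    node_affinity = {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{"matchExpressions": [expression]}]
        }
    }
    # The affinity lives in the pod template of the DaemonSet.
    return {"spec": {"template": {"spec": {"affinity": {"nodeAffinity": node_affinity}}}}}


patch = exclude_fargate_patch()
print(json.dumps(patch, indent=2))
```

Saved as a script, the output could be passed to `kubectl patch daemonset <name> -n <namespace> --type merge -p "$(python patch.py)"`; a merge patch replaces any nodeSelectorTerms the DaemonSet already has, so check the existing affinity first.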

[FACT] Control plane upgrade can fail quietly if the cluster subnets are full: update-cluster-version itself returns normally, the failure only shows up later in the errors field of describe-update (an error code such as InsufficientFreeAddresses or IpNotAvailable), and the cluster remains on the previous version (automatic rollback). Check available IPs in the cluster subnets before upgrading.
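A pre-upgrade check for this pitfall could look like the sketch below. The threshold of 5 free addresses is an assumption based on the "up to five available IP addresses" guidance in the EKS documentation; the function names are hypothetical. The AWS lookup is kept in a separate function with a lazy import so that the pure check can run without credentials.

```python
MIN_FREE_IPS = 5  # assumption: headroom threshold, adjust to your own policy


def subnets_low_on_ips(subnets: list, min_free: int = MIN_FREE_IPS) -> list:
    """Return the IDs of subnets whose free IP count is below min_free."""
    return [s["SubnetId"] for s in subnets if s["AvailableIpAddressCount"] < min_free]


def fetch_cluster_subnets(cluster_name: str) -> list:
    """Fetch the cluster's subnets from AWS (needs boto3 and credentials)."""
    import boto3  # imported lazily so the check above runs without AWS access

    eks = boto3.client("eks")
    ec2 = boto3.client("ec2")
    cluster = eks.describe_cluster(name=cluster_name)["cluster"]
    subnet_ids = cluster["resourcesVpcConfig"]["subnetIds"]
    return ec2.describe_subnets(SubnetIds=subnet_ids)["Subnets"]


# Offline example with made-up values shaped like describe_subnets output.
sample = [
    {"SubnetId": "subnet-0111aaaa", "AvailableIpAddressCount": 2},
    {"SubnetId": "subnet-0222bbbb", "AvailableIpAddressCount": 118},
]
print(subnets_low_on_ips(sample))  # ['subnet-0111aaaa']
```

Against a real cluster the call would be `subnets_low_on_ips(fetch_cluster_subnets("checkout-prod"))`, and a non-empty result would be a reason to stop before Step 2.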

[FACT] Spot node groups with maxUnavailable=1 and many AZs can launch many nodes temporarily: the Scale Up phase raises the Auto Scaling Group size by up to 2 × the number of AZs (e.g. 3 AZs = up to 6 extra nodes before the old ones are drained). This can hit EC2 instance quota limits.
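The arithmetic is worth running against your own quotas before an upgrade. The sketch below assumes the Scale Up increment is the larger of twice the AZ count and maxUnavailable, which is how the managed node group update behavior is documented as far as I can tell; the function name `max_surge_nodes` is hypothetical and the result is an upper bound, not the exact number launched.

```python
def max_surge_nodes(num_azs: int, max_unavailable: int) -> int:
    """Upper bound on the extra nodes launched during the Scale Up phase."""
    if num_azs < 1 or max_unavailable < 1:
        raise ValueError("num_azs and max_unavailable must be >= 1")
    # Assumption: the increment is the larger of 2 x AZs and maxUnavailable.
    return max(2 * num_azs, max_unavailable)


for azs, unavailable in [(3, 1), (2, 10)]:
    extra = max_surge_nodes(azs, unavailable)
    print(f"{azs} AZs, maxUnavailable={unavailable}: up to {extra} extra nodes")
```

With 3 AZs and maxUnavailable=1 the bound is 6, matching the example above; a large maxUnavailable takes over once it exceeds twice the AZ count.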

[CONSENSUS] PDB with minAvailable=N where N = number of replicas blocks drain indefinitely: if the deployment has 2 replicas and the PDB requires minAvailable=2, no pod can be evicted. Always configure PDB with headroom (minAvailable = total_replicas - 1 or maxUnavailable >= 1).
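Whether a drain can make progress comes down to how many disruptions the PDB currently allows. A minimal sketch of that calculation, assuming integer (not percentage) values and a hypothetical function name `allowed_disruptions`:

```python
from typing import Optional


def allowed_disruptions(
    healthy: int,
    desired: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Number of pods that may be evicted right now under a PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = min_available
    else:
        required_healthy = desired - max_unavailable
    # Evictions are refused once healthy pods would drop below the requirement.
    return max(healthy - required_healthy, 0)


print(allowed_disruptions(healthy=2, desired=2, min_available=2))    # 0: drain blocks
print(allowed_disruptions(healthy=2, desired=2, min_available=1))    # 1
print(allowed_disruptions(healthy=3, desired=3, max_unavailable=1))  # 1
print(allowed_disruptions(healthy=2, desired=3, max_unavailable=1))  # 0: already one pod down
```

The last case shows why headroom alone is not enough: if a replica is already unhealthy, a PDB with maxUnavailable=1 still blocks eviction until it recovers.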

[FACT] Fargate jobs need ttlSecondsAfterFinished: completed Job pods on Fargate keep incurring charges (pod CPU/memory) for as long as the pod object exists, even in Completed or Failed state. Always configure a TTL to avoid zombie pod costs.
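As an illustration, the sketch below builds a Job manifest with the TTL set. The job name, image and the 300-second TTL are placeholders chosen for the example, the namespace reuses batch-fargate from the profile in section 7, and the helper name `fargate_job_manifest` is hypothetical.

```python
import json


def fargate_job_manifest(name: str, namespace: str, image: str, ttl_seconds: int = 300) -> dict:
    """Build a batch/v1 Job that is deleted ttl_seconds after it finishes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Deleting the finished Job also deletes its pods, which ends Fargate billing.
            "ttlSecondsAfterFinished": ttl_seconds,
            "backoffLimit": 2,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": name, "image": image}],
                }
            },
        },
    }


manifest = fargate_job_manifest("etl-nightly", "batch-fargate", "example.com/etl:latest")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the output can be piped straight into `kubectl apply -f -`.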


Reflection Exercise

You need to redesign the compute topology of an EKS cluster with 4 types of workload:

  1. API Gateway (stateless, high traffic, 99.9% SLA) — 10 replicas, no tolerance for interruption
  2. ML Inference (stateless, CPU-intensive, tolerant to 10% interruptions) — 50 replicas
  3. Batch ETL (short jobs, 5-30 min, highly tolerant to interruptions) — 0–200 variable pods
  4. Monitoring Stack (Prometheus, Grafana — stateful, DaemonSets required) — 3 replicas

Answer:

  1. How many managed node groups are needed? What capacity type (OD/Spot) for each? What labels and taints?
  2. Should ML Inference go to Fargate or EC2 node group? Why? (consider: DaemonSets for metrics, pod volume, cost)
  3. Can the Monitoring Stack go to Fargate? Why?
  4. How should PDBs be configured for each workload so that kubectl drain never blocks for more than 5 minutes?
  5. During the cluster upgrade 1.32→1.33, in what exact order should the 4 node groups be updated? Is there any order dependency?

References