luizmachado.dev

Session 017 — ECS: Deploy strategies — rolling update and blue/green with CodeDeploy

Estimated duration: 60 minutes
Prerequisites: session-016-ecs-capacity-providers-autoscaling, session-010-cdk-pipelines-stages-shellsteps


Objective

By the end, you will be able to configure an ECS Service with deployment type CODE_DEPLOY, write the AppSpec for a blue/green deployment, implement validation hooks that test the green task set around the traffic shift, and configure automatic rollback based on CloudWatch Alarms.


Context

[FACT] ECS supports three deployment types: rolling update (default, managed directly by ECS), blue/green with CodeDeploy (CODE_DEPLOY), and external (for integrations with third-party tools). Rolling update is suitable for most services; blue/green is necessary when you need explicit verification before redirecting production traffic, zero-downtime with the possibility of instant rollback, or testing in a production environment before the full traffic shift.

[FACT] The CodeDeploy blue/green model for ECS is fundamentally different from rolling update: instead of gradually replacing tasks within the same target group, CodeDeploy creates a completely separate replacement task set (green), runs tests on it, and then switches the ALB from pointing to the original task set (blue) to the green one. After the stabilization period, the blue is destroyed. Rollback means switching the ALB back to blue — a seconds-long operation, without needing to create new tasks.

[CONSENSUS] Blue/green has higher operational cost than rolling update: it requires more components (two target groups, test listener, CodeDeploy deployment group, additional IAM role), doubles task capacity during deployment, and adds configuration complexity. It is justified for critical production services where the cost of a bad deploy (downtime, slow rollback) is high. For internal or lower-criticality services, rolling update with circuit breaker is generally sufficient.


Key concepts

1. Blue/green architecture in ECS

[FACT] Blue/green deployment for ECS requires the following additional components compared to a rolling update service:

┌─────────────────────────────────────────────────────────────────┐
│  ALB                                                            │
│                                                                 │
│  Listener :443 (production)        Listener :8080 (test)       │
│       │                                   │                     │
│       ▼                                   ▼                     │
│  Target Group BLUE (active)   Target Group GREEN (new)         │
│  [previous-version tasks]     [new-version tasks]              │
└─────────────────────────────────────────────────────────────────┘

Before the traffic shift:
  :443 → TG-BLUE (100% of production traffic)
  :8080 → TG-GREEN (test traffic only)

During the traffic shift (canary):
  :443 → TG-BLUE (90%) + TG-GREEN (10%)
  :8080 → TG-GREEN (100% of test traffic)

After the full traffic shift:
  :443 → TG-GREEN (100%)
  :8080 → N/A (no longer needed)

After stabilization (termination wait):
  The blue task set (the tasks registered in TG-BLUE) is destroyed
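
The phases above can be expressed as a small table of production-listener weights. The following is only an illustrative sketch in plain Python; `production_weights` is a hypothetical helper, not part of any AWS SDK, and it models a single canary step.

```python
from typing import Dict, Tuple


def production_weights(canary_percent: int) -> Dict[str, Tuple[int, int]]:
    """Return (blue %, green %) on the production listener for each phase."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 1 and 99")
    return {
        "before_shift": (100, 0),
        "canary": (100 - canary_percent, canary_percent),
        "after_shift": (0, 100),
    }


weights = production_weights(10)
for phase, (blue, green) in weights.items():
    # Production traffic is always fully accounted for across the two groups.
    assert blue + green == 100
    print(f"{phase}: blue={blue}% green={green}%")
```

Until the blue task set is terminated, returning to the `before_shift` row is only a listener change, which is why rollback is fast.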

[FACT] Required components:

  1. Two ALB Target Groups: blue (active) and green (for the new task set).
  2. Two ALB Listeners: production port (e.g., 443) and test port (e.g., 8080). The test listener is optional but recommended for validation hooks.
  3. CodeDeploy Application + Deployment Group: with computePlatform: ECS.
  4. IAM Role for CodeDeploy: AWSCodeDeployRoleForECS managed policy.
  5. ECS Service with deploymentController: CODE_DEPLOY.

2. The AppSpec file

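[FACT] The AppSpec is a YAML or JSON file that describes what to deploy and how to orchestrate the deployment. For ECS, it has two main blocks: Resources (the task definition and the container that receives traffic) and Hooks (Lambda functions run at lifecycle events):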

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        # ARN of the task definition to deploy (new version)
        TaskDefinition: "arn:aws:ecs:us-east-1:123456789012:task-definition/api-service:42"
        LoadBalancerInfo:
          ContainerName: "api"        # container that receives traffic (portMappings)
          ContainerPort: 3000
        # Optional: override the service's network configuration
        # PlatformVersion: "LATEST"
        # NetworkConfiguration:
        #   AwsvpcConfiguration:
        #     Subnets: ["subnet-abc", "subnet-def"]
        #     SecurityGroups: ["sg-xyz"]
        #     AssignPublicIp: "DISABLED"
        # Optional: override the Capacity Provider strategy
        # CapacityProviderStrategy:
        #   - CapacityProvider: "FARGATE"
        #     Base: 2
        #     Weight: 1

Hooks:
  # For ECS, each hook maps a lifecycle event to a Lambda function name or ARN.
  # (The location/timeout form belongs to EC2/on-premises AppSpec files.)

  # Hook BEFORE the ALB sends any production traffic to the green
  - BeforeAllowTraffic: "arn:aws:lambda:us-east-1:123456789012:function:PreTrafficHook"

  # Hook AFTER the ALB has routed production traffic to the green
  - AfterAllowTraffic: "arn:aws:lambda:us-east-1:123456789012:function:PostTrafficHook"

[FACT] Complete cycle of available hooks for ECS blue/green (in execution order):

1. BeforeInstall
   → before the replacement task set (green) is created
   → rarely used for ECS

2. AfterInstall
   → after the green task set is created and its tasks are RUNNING
   → before any traffic

3. AfterAllowTestTraffic
   → after the TEST listener (:8080) points to the green
   → before the production traffic shift
   → IDEAL PLACE for automated smoke tests

4. BeforeAllowTraffic
   → after the smoke tests, before the production traffic shift
   → final validation

5. AfterAllowTraffic
   → after the production traffic shift is complete
   → metric monitoring, success notifications
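
To make the ordering concrete, here is a minimal sketch that checks a set of configured hooks against the five events listed above and returns them in execution order. `order_hooks` and the function names passed to it are hypothetical; only the event names come from the list.

```python
from typing import Dict, List, Tuple

# Lifecycle events that accept a Lambda hook in an ECS blue/green deployment,
# in the order they run (as listed above).
ECS_HOOK_ORDER = [
    "BeforeInstall",
    "AfterInstall",
    "AfterAllowTestTraffic",
    "BeforeAllowTraffic",
    "AfterAllowTraffic",
]


def order_hooks(hooks: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (event, lambda_name) pairs sorted by execution order.

    Raises ValueError for event names that are not ECS lifecycle hooks.
    """
    unknown = sorted(set(hooks) - set(ECS_HOOK_ORDER))
    if unknown:
        raise ValueError(f"Unsupported ECS lifecycle events: {unknown}")
    return [(event, hooks[event]) for event in ECS_HOOK_ORDER if event in hooks]


configured = {"AfterAllowTraffic": "PostTrafficHook",
              "AfterAllowTestTraffic": "SmokeTestHook"}
for event, fn in order_hooks(configured):
    print(f"{event} -> {fn}")
```

A check like this can run in the pipeline before the AppSpec is submitted, so that a misspelled event name fails early instead of being discovered mid-deployment.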

[FACT] A hook is a Lambda that must call codedeploy:PutLifecycleEventHookExecutionStatus with status: Succeeded or status: Failed. If the Lambda does not call this API before the hook's time limit expires (at most one hour), CodeDeploy treats the hook as failed; the deployment fails and, when automatic rollback is enabled on the deployment group, rolls back.

# Structure of a hook Lambda
import boto3

codedeploy = boto3.client('codedeploy')

def run_smoke_tests():
    # Placeholder: replace with real checks and raise an exception on failure
    pass

def handler(event, context):
    # CodeDeploy passes both identifiers in the invocation payload
    deployment_id = event['DeploymentId']
    hook_execution_id = event['LifecycleEventHookExecutionId']

    try:
        # Run the tests here
        run_smoke_tests()

        # Signal success so CodeDeploy continues the deployment
        codedeploy.put_lifecycle_event_hook_execution_status(
            deploymentId=deployment_id,
            lifecycleEventHookExecutionId=hook_execution_id,
            status='Succeeded'
        )
    except Exception as e:
        print(f"Hook failed: {e}")
        # Signal failure so CodeDeploy stops the deployment
        codedeploy.put_lifecycle_event_hook_execution_status(
            deploymentId=deployment_id,
            lifecycleEventHookExecutionId=hook_execution_id,
            status='Failed'
        )
        raise

3. Traffic shifting strategies

[FACT] CodeDeploy offers predefined and customizable traffic shifting configurations for ECS:

Predefined configurations (ECS):
  CodeDeployDefault.ECSAllAtOnce
    → 100% of traffic goes to the green immediately
    → No validation period with partial traffic
    → Rollback is manual or alarm-driven

  CodeDeployDefault.ECSCanary10Percent5Minutes
    → 10% to the green for 5 minutes, then 100%
    → Lets you watch for errors under real traffic before the full shift

  CodeDeployDefault.ECSCanary10Percent15Minutes
    → 10% for 15 minutes, then 100%

  CodeDeployDefault.ECSLinear10PercentEvery1Minutes
    → +10% every 1 minute until 100% (10 increments)
    → More gradual, wider window for detecting problems

  CodeDeployDefault.ECSLinear10PercentEvery3Minutes
    → +10% every 3 minutes until 100% (roughly 30 minutes in total)
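
The difference between canary and linear shifting is easiest to see as a schedule of (minute, green %) pairs. The sketch below is plain Python with hypothetical helper names, and it assumes the first increment is applied at minute 0; under that assumption the 3-minute linear configuration reaches 100% at minute 27.

```python
from typing import List, Tuple


def canary_schedule(percent: int, interval_min: int) -> List[Tuple[int, int]]:
    """Two steps: `percent` at minute 0, then 100% after `interval_min`."""
    return [(0, percent), (interval_min, 100)]


def linear_schedule(percent: int, interval_min: int) -> List[Tuple[int, int]]:
    """Equal increments of `percent` every `interval_min` until 100%."""
    steps = []
    minute, green = 0, 0
    while green < 100:
        green = min(100, green + percent)
        steps.append((minute, green))
        minute += interval_min
    return steps


print(canary_schedule(10, 5))      # ECSCanary10Percent5Minutes
print(linear_schedule(10, 3)[-1])  # last step of ECSLinear10PercentEvery3Minutes
```

The longer the schedule, the more time alarms have to catch a regression while only part of the traffic is exposed, at the cost of a slower deployment.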

[FACT] Custom traffic shifting configuration (via CDK or CLI):

// CDK: custom deployment config
import { Duration } from 'aws-cdk-lib';
import * as codedeploy from 'aws-cdk-lib/aws-codedeploy';

const deploymentConfig = new codedeploy.EcsDeploymentConfig(this, 'Config', {
  trafficRouting: codedeploy.TrafficRouting.timeBasedCanary({
    interval: Duration.minutes(5),   // wait 5 min after the first increment
    percentage: 20,                   // 20% in the first increment
    // → 20% for 5 min, then 100%
  }),
  // Alternative: linear
  // trafficRouting: codedeploy.TrafficRouting.timeBasedLinear({
  //   interval: Duration.minutes(2),
  //   percentage: 10,  // +10% every 2 minutes
  // }),
});
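
Outside CDK, the same custom configuration can be created through the CodeDeploy API. The sketch below only builds the keyword arguments for boto3's `create_deployment_config`; the helper name and the configuration name `Canary20Percent5Minutes` are illustrative choices, and the actual API call is left out so the block runs without AWS credentials.

```python
def canary_config_kwargs(name: str, percentage: int, interval_minutes: int) -> dict:
    """Build kwargs for codedeploy.create_deployment_config (ECS, time-based canary)."""
    if not 0 < percentage < 100:
        raise ValueError("percentage must be between 1 and 99")
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return {
        "deploymentConfigName": name,
        "computePlatform": "ECS",
        "trafficRoutingConfig": {
            "type": "TimeBasedCanary",
            "timeBasedCanary": {
                "canaryPercentage": percentage,
                "canaryInterval": interval_minutes,  # minutes
            },
        },
    }


kwargs = canary_config_kwargs("Canary20Percent5Minutes", 20, 5)
print(kwargs["trafficRoutingConfig"])
```

With credentials available, these arguments would be passed as `boto3.client("codedeploy").create_deployment_config(**kwargs)`.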

4. Automatic rollback with CloudWatch Alarms

[FACT] CodeDeploy can monitor CloudWatch Alarms during deployment and trigger automatic rollback if any alarm enters the ALARM state. Alarms are associated with the deployment group, not the AppSpec.

Deployment group
│
├── autoRollback:
│   ├── onDeploymentFailure: true     (a hook returned Failed)
│   ├── onAlarmThreshold: true        (a CloudWatch alarm fired)
│   └── alarms:
│       ├── "HighErrorRate"           (e.g., 5xx > 5% for 2 min)
│       └── "HighLatency"            (e.g., P99 > 2s for 2 min)
│
└── terminationWait: 60 minutes       (time before the blue is destroyed)

[FACT] The terminationWait is the period after the full traffic shift during which CodeDeploy waits before destroying the blue task set. During this period, the blue still exists and can be used for instant rollback. The default is 0 minutes (destroys immediately). In production, 30-60 minutes is recommended to have a post-deploy rollback window.
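
At the API level, these settings live on the deployment group. The following sketch builds only the rollback-related fragment of the arguments accepted by boto3's `create_deployment_group` / `update_deployment_group`; the helper name is hypothetical, and other required arguments (application name, service role, ECS service, load balancer info) are deliberately omitted.

```python
from typing import Dict, List


def rollback_settings(alarm_names: List[str], termination_wait_minutes: int) -> Dict:
    """Rollback-related fragment of a CodeDeploy deployment group definition."""
    if termination_wait_minutes < 0:
        raise ValueError("termination_wait_minutes cannot be negative")
    return {
        "autoRollbackConfiguration": {
            "enabled": True,
            # Roll back when the deployment fails or when an alarm stops it.
            "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
        },
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
        "blueGreenDeploymentConfiguration": {
            "terminateBlueInstancesOnDeploymentSuccess": {
                "action": "TERMINATE",
                "terminationWaitTimeInMinutes": termination_wait_minutes,
            },
        },
    }


settings = rollback_settings(["api-HighErrorRate"], 60)
print(settings["blueGreenDeploymentConfiguration"])
```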


Practical example

Complete scenario: A critical api-service with:
- Blue/green via CodeDeploy with canary 10% for 5 minutes.
- AfterAllowTestTraffic hook that validates the green health check via the test listener.
- Automatic rollback if the HighErrorRate alarm triggers.
- Automatic rollback if the hook fails.

Complete CDK

import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as codedeploy from 'aws-cdk-lib/aws-codedeploy';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class BlueGreenStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { vpcName: 'prod-vpc' });
    const cluster = ecs.Cluster.fromClusterAttributes(this, 'Cluster', {
      clusterName: 'prod-cluster', vpc, securityGroups: [],
    });

    // ─── ALB with two target groups and two listeners ──────────────────────

    const alb = new elbv2.ApplicationLoadBalancer(this, 'ALB', {
      vpc, internetFacing: true,
    });

    // Target Group BLUE (active, starts with the tasks of the current version)
    const blueTG = new elbv2.ApplicationTargetGroup(this, 'BlueTG', {
      vpc,
      protocol: elbv2.ApplicationProtocol.HTTP,
      port: 3000,
      targetType: elbv2.TargetType.IP,
      healthCheck: {
        path: '/health',
        interval: Duration.seconds(15),
        healthyThresholdCount: 2,
      },
      deregistrationDelay: Duration.seconds(30),
    });

    // Target Group GREEN (empty, populated by CodeDeploy during the deployment)
    const greenTG = new elbv2.ApplicationTargetGroup(this, 'GreenTG', {
      vpc,
      protocol: elbv2.ApplicationProtocol.HTTP,
      port: 3000,
      targetType: elbv2.TargetType.IP,
      healthCheck: {
        path: '/health',
        interval: Duration.seconds(15),
        healthyThresholdCount: 2,
      },
      deregistrationDelay: Duration.seconds(30),
    });

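    // Production listener (:443) → starts out pointing at BLUE.
    // Port 443 implies HTTPS, which requires a certificate; the ARN below is a placeholder.
    const prodListener = alb.addListener('ProdListener', {
      port: 443,
      certificates: [elbv2.ListenerCertificate.fromArn(
        'arn:aws:acm:us-east-1:123456789012:certificate/REPLACE_ME')],
      defaultTargetGroups: [blueTG],
      open: true,
    });

    // Test listener (:8080) → used by the hooks for validation
    const testListener = alb.addListener('TestListener', {
      port: 8080,
      defaultTargetGroups: [greenTG],
      open: false,  // restricted: the hook Lambda needs a security group rule to reach :8080
    });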

    // ─── ECS Service with deploymentController: CODE_DEPLOY ───────────────

    const taskDef = new ecs.FargateTaskDefinition(this, 'TaskDef', {
      cpu: 512,
      memoryLimitMiB: 1024,
    });
    taskDef.addContainer('api', {
      image: ecs.ContainerImage.fromRegistry('my-org/api:v1'),
      portMappings: [{ containerPort: 3000 }],
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'api' }),
    });

    const service = new ecs.FargateService(this, 'Service', {
      cluster,
      taskDefinition: taskDef,
      desiredCount: 3,
      deploymentController: {
        type: ecs.DeploymentControllerType.CODE_DEPLOY,  // key setting for blue/green
      },
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    });

    // Register the service's default container (api:3000) with the BLUE target group;
    // the green target group is managed by CodeDeploy
    service.attachToApplicationTargetGroup(blueTG);

    // ─── Hook Lambda (AfterAllowTestTraffic) ───────────────────────────────

    const hookFn = new lambda.Function(this, 'TestHook', {
      runtime: lambda.Runtime.PYTHON_3_12,
      handler: 'index.handler',
      timeout: Duration.seconds(300),
      code: lambda.Code.fromInline(`
import boto3
import urllib.request
import json
import os

codedeploy = boto3.client('codedeploy')
ALB_TEST_URL = os.environ.get('ALB_TEST_URL', '')

def handler(event, context):
    deployment_id = event['DeploymentId']
    hook_id = event['LifecycleEventHookExecutionId']

    try:
        # Call the health endpoint through the test listener (:8080)
        req = urllib.request.urlopen(f'{ALB_TEST_URL}/health', timeout=10)
        body = json.loads(req.read())

        if body.get('status') != 'ok':
            raise ValueError(f"Health check failed: {body}")

        print("Smoke test passed. Continuing deployment.")
        codedeploy.put_lifecycle_event_hook_execution_status(
            deploymentId=deployment_id,
            lifecycleEventHookExecutionId=hook_id,
            status='Succeeded'
        )
    except Exception as e:
        print(f"Smoke test FAILED: {e}")
        codedeploy.put_lifecycle_event_hook_execution_status(
            deploymentId=deployment_id,
            lifecycleEventHookExecutionId=hook_id,
            status='Failed'
        )
`),
      environment: {
        ALB_TEST_URL: `http://${alb.loadBalancerDnsName}:8080`,
      },
    });

    // The Lambda needs permission to call codedeploy:PutLifecycleEventHookExecutionStatus
    hookFn.addToRolePolicy(new iam.PolicyStatement({
      actions: ['codedeploy:PutLifecycleEventHookExecutionStatus'],
      resources: ['*'],
    }));

    // ─── CloudWatch Alarms for automatic rollback ──────────────────────────

    // Load-balancer-level metrics (no TargetGroup dimension): an alarm pinned to
    // the blue target group would miss errors coming from the green task set.
    const errorRateAlarm = new cloudwatch.Alarm(this, 'HighErrorRate', {
      metric: new cloudwatch.MathExpression({
        expression: '(errors / requests) * 100',
        usingMetrics: {
          errors: new cloudwatch.Metric({
            namespace: 'AWS/ApplicationELB',
            metricName: 'HTTPCode_Target_5XX_Count',
            dimensionsMap: {
              LoadBalancer: alb.loadBalancerFullName,
            },
            statistic: 'Sum',
          }),
          requests: new cloudwatch.Metric({
            namespace: 'AWS/ApplicationELB',
            metricName: 'RequestCount',
            dimensionsMap: {
              LoadBalancer: alb.loadBalancerFullName,
            },
            statistic: 'Sum',
          }),
        },
        period: Duration.minutes(2),
      }),
      threshold: 5,           // roll back if 5xx > 5%
      evaluationPeriods: 2,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
      alarmName: 'api-HighErrorRate',
    });

    // ─── CodeDeploy Application and Deployment Group ───────────────────────

    const app = new codedeploy.EcsApplication(this, 'App', {
      applicationName: 'api-service',
    });

    const deploymentGroup = new codedeploy.EcsDeploymentGroup(this, 'DG', {
      application: app,
      deploymentGroupName: 'api-service-dg',
      service,
      blueGreenDeploymentConfig: {
        // Listeners and target groups for blue/green
        listener: prodListener,
        testListener,
        blueTargetGroup: blueTG,
        greenTargetGroup: greenTG,
        // Wait 60 min before destroying the blue task set
        terminationWaitTime: Duration.minutes(60),
      },
      deploymentConfig: codedeploy.EcsDeploymentConfig.CANARY_10PERCENT_5MINUTES,
      autoRollback: {
        failedDeployment: true,       // roll back if a hook fails
        deploymentInAlarm: true,      // roll back if an alarm fires
        stoppedDeployment: false,
      },
      alarms: [errorRateAlarm],
    });
  }
}

Generated AppSpec (for pipeline reference)

# appspec.yaml: generated by the pipeline, which replaces <TASK_DEF_ARN>
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "<TASK_DEF_ARN>"
        LoadBalancerInfo:
          ContainerName: "api"
          ContainerPort: 3000

Hooks:
  # For ECS, a hook maps the lifecycle event to a Lambda function name or ARN
  - AfterAllowTestTraffic: "arn:aws:lambda:us-east-1:123456789012:function:BlueGreenStack-TestHook"
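
One way a pipeline step might produce this file is to render it from the freshly registered task definition ARN. The sketch below emits the JSON form of the AppSpec using only the standard library; `build_appspec` is a hypothetical helper, and hooks are written as a mapping from lifecycle event to Lambda function name or ARN, which is the form the CodeDeploy reference documents for ECS.

```python
import json
from typing import Dict, Optional


def build_appspec(task_def_arn: str, container_name: str, container_port: int,
                  hooks: Optional[Dict[str, str]] = None) -> str:
    """Render an ECS AppSpec as a JSON string."""
    appspec = {
        "version": 0.0,
        "Resources": [{
            "TargetService": {
                "Type": "AWS::ECS::Service",
                "Properties": {
                    "TaskDefinition": task_def_arn,
                    "LoadBalancerInfo": {
                        "ContainerName": container_name,
                        "ContainerPort": container_port,
                    },
                },
            }
        }],
    }
    if hooks:
        # One single-key object per lifecycle event.
        appspec["Hooks"] = [{event: fn} for event, fn in hooks.items()]
    return json.dumps(appspec, indent=2)


print(build_appspec(
    "arn:aws:ecs:us-east-1:123456789012:task-definition/api-service:42",
    "api", 3000,
    {"AfterAllowTestTraffic": "BlueGreenStack-TestHook"},
))
```

Generating the file in code rather than with text substitution avoids quoting mistakes when the ARN or function name changes.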

Timeline of a successful deployment

t=0m    CodeDeploy creates the green task set with the new task definition
t=2m    Green tasks pass the green TG health check
t=2m    Listener :8080 points to the green TG (test traffic available)
t=2m    AfterAllowTestTraffic hook is invoked
t=3m    Hook validates /health via :8080 → Succeeded
t=3m    BeforeAllowTraffic (no hook configured, proceeds)
t=3m    Traffic shift: :443 moves to 10% green / 90% blue
t=8m    No alarm fired during the observation period (5 min)
t=8m    Traffic shift: :443 moves to 100% green / 0% blue
t=8m    AfterAllowTraffic hook invoked (if configured)
t=8m    Traffic shift complete; termination wait begins
t=68m   terminationWaitTime elapsed → blue task set destroyed, deployment marked Succeeded

Common pitfalls

Pitfall 1: Hook Lambda without calling PutLifecycleEventHookExecutionStatus

The error: You create a hook Lambda that runs the tests and returns normally (no exception). The Lambda invocation succeeds, but CodeDeploy never receives a confirmation through PutLifecycleEventHookExecutionStatus. Once the hook's time limit expires, CodeDeploy marks the hook as Failed and initiates rollback. The deployment is reverted even though all tests passed.

Why it happens: CodeDeploy does not use the Lambda's exit code to determine hook success/failure. It exclusively waits for the explicit call to the PutLifecycleEventHookExecutionStatus API. Returning from the Lambda without calling this API is indistinguishable from a timeout for CodeDeploy.

How to recognize: In the CodeDeploy console, the deployment shows Lifecycle event hook timed out or Lifecycle event hook failed. The Lambda logs show normal execution without errors. The Lambda was invoked, executed, and returned — but the deployment reverted.

How to avoid: Use a try/finally block to ensure PutLifecycleEventHookExecutionStatus is always called, even in case of exception. Never return from the Lambda without having called this API.
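
A minimal sketch of that pattern follows. The client is passed in as a parameter (in a real Lambda it would be `boto3.client("codedeploy")`), and `run_hook` and `check` are hypothetical names; the point is only that the status call sits in `finally`, so it runs on every path.

```python
def run_hook(event: dict, client, check) -> str:
    """Run `check` and always report the outcome to CodeDeploy.

    `client` must expose put_lifecycle_event_hook_execution_status;
    `check` is any callable that raises on failure.
    """
    status = "Failed"  # pessimistic default, overwritten only on success
    try:
        check()
        status = "Succeeded"
    except Exception as exc:
        print(f"Validation failed: {exc}")
    finally:
        # Reached whether check() returned or raised.
        client.put_lifecycle_event_hook_execution_status(
            deploymentId=event["DeploymentId"],
            lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
            status=status,
        )
    return status
```

Inside a handler this would be used as `return run_hook(event, codedeploy, run_smoke_tests)`. If the status call itself can fail (for example, missing IAM permission), that error still surfaces in the Lambda logs rather than being swallowed.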


Pitfall 2: terminationWait zero — no post-deploy rollback window

The error: The terminationWaitTime is configured as 0 (default). A successful deploy destroys the blue task set immediately after the traffic shift. Thirty minutes later, a latent problem appears in the new version (e.g., a batch job that runs every hour starts failing). To revert, you need to create a new deployment with the previous version — which takes time and may have downtime.

Why it happens: With terminationWaitTime = 0, CodeDeploy destroys the blue task set as soon as the traffic shift is complete. There is no blue to instantly switch back to.

How to avoid: Configure terminationWaitTime of 30-60 minutes for critical services. During this period, a rollback is literally an ALB redirect operation — seconds, without creating new tasks. The cost is maintaining double the tasks for this period, which is acceptable for important services. If you want to save costs, 30 minutes is sufficient for most post-deploy problems to become visible.
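
The trade-off can be reduced to a single comparison. The sketch below is a toy decision helper (the function name and return labels are made up for illustration): it answers which rollback path is still available a given number of minutes after the traffic shift.

```python
def rollback_path(minutes_since_shift: float, termination_wait_minutes: float) -> str:
    """Return which rollback path is available after the full traffic shift."""
    if minutes_since_shift < termination_wait_minutes:
        # The blue task set still exists: rollback is a listener switch.
        return "switch-listener"
    # The blue task set is gone: the previous version must be redeployed.
    return "redeploy-previous"


print(rollback_path(30, 0))   # default wait: blue already destroyed
print(rollback_path(30, 60))  # 60-minute wait: blue still available
```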


Pitfall 3: Alarm configured on the wrong target group causes false positives

The error: You configure the rollback alarm monitoring metrics from the blueTG (original target group). During deployment, CodeDeploy is sending 10% of traffic to the greenTG and 90% to the blueTG. If the blueTG already had a high error rate before the deployment (which motivated the deploy of the new version), the alarm is already active and CodeDeploy immediately reverts the newly started deployment.

Why it happens: The alarm monitors the wrong target group. During canary traffic shift, what you want to monitor are errors in the new target group (green), not the old one (blue).

How to recognize: Deployments that revert immediately upon starting the traffic shift, even when the new version is working correctly. The deployment group logs show Alarm entered ALARM state as the rollback reason.

How to avoid: Configure rollback alarms to monitor the greenTG (or the ALB as a whole, which includes both). Alternatively, monitor business metrics (e.g., conversion rate, P99 latency) that reflect service health regardless of which target group is active.
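
The effect is easy to reproduce with a simplified model of the alarm: an error rate of `100 * errors / requests` per period, firing when the last N periods all exceed the threshold. The sketch below is not how CloudWatch is implemented; in particular, treating a period with no requests as "not breaching" is a simplifying assumption, and the datapoints are invented for illustration.

```python
from typing import List, Optional, Tuple


def error_rate(errors: int, requests: int) -> Optional[float]:
    """5xx percentage for one period, or None when there was no traffic."""
    if requests == 0:
        return None
    return 100 * errors / requests


def alarm_fires(datapoints: List[Tuple[int, int]],
                threshold: float = 5.0, evaluation_periods: int = 2) -> bool:
    """True when the most recent `evaluation_periods` all exceed `threshold`.

    `datapoints` holds (errors, requests) per period, oldest first.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False
    rates = [error_rate(e, r) for e, r in recent]
    return all(rate is not None and rate > threshold for rate in rates)


blue = [(12, 100), (9, 100)]   # old version, already unhealthy
green = [(0, 10), (1, 90)]     # new version under canary traffic
print("alarm on blue fires:", alarm_fires(blue))
print("alarm on green fires:", alarm_fires(green))
```

With these numbers an alarm bound to the blue target group fires even though the green task set is healthy, which is exactly the false positive described above.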


Reflection exercise

You are migrating a payments service from rolling update to blue/green. The service has a characteristic: the new version includes a database schema migration that is compatible with both code versions (the old and new run against the new schema). The deployment typically takes 3 minutes to complete the traffic shift.

How would you structure the hooks? Which lifecycle event (BeforeAllowTraffic, AfterAllowTestTraffic, etc.) would be most appropriate for each type of validation — health check verification, payment flow test via test listener, and post-shift error rate monitoring? What terminationWaitTime would you choose, considering that the database schema rollback is not trivial? And if a high latency alarm triggered 45 minutes after the successful deployment (after the blue has already been destroyed) — what would the rollback procedure be in that case?


Resources for further study

1. CodeDeploy blue/green deployments for Amazon ECS

URL: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-bluegreen.html
What to find: Complete overview of the CODE_DEPLOY deployment type for ECS: required components, deployment flow, differences from rolling update, and limitations (e.g., does not support Capacity Provider with base > 0 for the green task set in all configurations).
Why it's the right source: It is the official entry point for understanding the complete model.

2. AppSpec 'hooks' section for ECS

URL: https://docs.aws.amazon.com/codedeploy/latest/userguide/reference-appspec-file-structure-hooks.html
What to find: Complete reference of all available lifecycle hooks for ECS, execution order, how the hook Lambda is invoked (payload), and the format of the PutLifecycleEventHookExecutionStatus call.
Why it's the right source: It is the technical reference that defines exactly the protocol between CodeDeploy and the hook Lambda.

3. Working with deployment configurations in CodeDeploy

URL: https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html
What to find: Complete list of predefined configurations for ECS (Canary, Linear, AllAtOnce), how to create custom configurations, and how traffic shifting works in each mode.
Why it's the right source: It is the reference for choosing or creating the correct traffic shifting strategy for each risk context.