Session 014 — ECS: Services, service discovery and ALB Target Group integration
Estimated duration: 60 minutes
Prerequisites: session-013-ecs-task-definitions-logging
Objective
By the end, you will be able to create an ECS Service that registers tasks in an ALB Target Group, configure health checks with grace period, use AWS Cloud Map for service-to-service discovery (without going through the load balancer), and understand when to use each approach.
Context
[FACT] A Task Definition describes what to run. An ECS Service describes how to run it: how many replicas to maintain, how to perform rolling updates, where to register tasks to receive traffic, and what to do when a task fails. The Service is ECS's self-healing mechanism — it continuously monitors the number of healthy tasks and creates new ones if the count drops below the desiredCount.
[FACT] ECS supports two types of integration for receiving traffic: ALB/NLB Target Groups for external traffic (and internal traffic going through the load balancer) and AWS Cloud Map for DNS-based service-to-service discovery without a load balancer. They are not mutually exclusive — a service can have both simultaneously.
[CONSENSUS] The choice between ALB and Cloud Map for internal communication is a trade-off between operability and efficiency. The ALB offers TLS termination, path/header-based routing, circuit breaking, and native observability via access logs. Cloud Map is simpler and more efficient for high-frequency service-to-service communication where the load balancer overhead matters and where dynamic discovery of multiple instances is needed without the indirection of a VIP.
Key concepts
1. Anatomy of an ECS Service
ECS Service
│
├── taskDefinition → which revision to deploy
├── desiredCount → how many tasks to keep running
├── launchType → FARGATE | EC2 | EXTERNAL
│
├── deploymentConfiguration
│ ├── minimumHealthyPercent → minimum % of healthy tasks during a deploy
│ ├── maximumPercent → maximum % of tasks during a deploy
│ └── deploymentCircuitBreaker
│ ├── enable: true/false
│ └── rollback: true/false
│
├── loadBalancers[]
│ └── { targetGroupArn, containerName, containerPort }
│
├── serviceRegistries[]
│ └── { registryArn } → Cloud Map service ARN
│
├── healthCheckGracePeriodSeconds → ignores health check failures for N seconds after start
│
└── networkConfiguration
└── awsvpcConfiguration
├── subnets[]
├── securityGroups[]
└── assignPublicIp: ENABLED | DISABLED
[FACT] The Service scheduler is a continuous loop. On each cycle, it:
1. Counts tasks in the RUNNING state that passed the health check.
2. If the count is less than desiredCount, it starts new tasks.
3. If a task fails (exit code, OOM kill, health check failed), it stops the task and starts a new one.
4. During a deploy (new task definition revision), it executes the rolling update respecting minimumHealthyPercent and maximumPercent.
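The cycle above can be sketched as a pure function. This is a simplified model for reasoning, not ECS's actual scheduler; `TaskState` and `reconcile` are illustrative names:

```typescript
// Simplified model of one scheduler cycle (not an AWS SDK API).
interface TaskState {
  id: string;
  status: 'RUNNING' | 'STOPPED';
  healthy: boolean;
}

// Returns how many new tasks to start and which failing tasks to replace.
function reconcile(tasks: TaskState[], desiredCount: number): { start: number; replace: string[] } {
  const healthy = tasks.filter((t) => t.status === 'RUNNING' && t.healthy);
  const failing = tasks
    .filter((t) => t.status === 'RUNNING' && !t.healthy)
    .map((t) => t.id);
  // Failing tasks will be stopped, so their replacements count toward the gap.
  return { start: Math.max(0, desiredCount - healthy.length), replace: failing };
}
```

With desiredCount = 3, one healthy task, one failing task and one stopped task, the model starts 2 tasks and replaces the failing one.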
2. Rolling update: minimumHealthyPercent and maximumPercent
[FACT] The two deployment configuration parameters define the capacity envelope during a rolling update:
desiredCount = 4 tasks
Default values (REPLICA service):
minimumHealthyPercent = 100% → minimum: ceil(4 × 1.0) = 4 healthy tasks
maximumPercent = 200% → maximum: floor(4 × 2.0) = 8 tasks total
Behavior with the defaults:
1. ECS starts 4 new tasks (new revision) → total: 8 tasks
2. Waits for the 4 new tasks to pass the health check
3. Stops the 4 old tasks
4. Result: zero downtime, but momentarily double the capacity
Aggressive values (faster deploy, lower cost):
minimumHealthyPercent = 50% → minimum: ceil(4 × 0.5) = 2 healthy tasks
maximumPercent = 150% → maximum: floor(4 × 1.5) = 6 tasks
Aggressive behavior:
1. Stops 2 old tasks → total: 2 tasks (50% of capacity)
2. Starts 4 new tasks → total: 6 tasks
3. Waits for the new tasks to pass the health check
4. Stops the 2 remaining old tasks
5. Result: temporary 50% capacity reduction during the deploy
[FACT] For services with desiredCount = 1 (common in development), the defaults min=100%, max=200% are the only ones that guarantee zero downtime: ECS starts the new task before stopping the old one. With min=0%, max=100%, the deploy stops the old task first, causing downtime.
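The ceil/floor arithmetic from the worked examples can be captured in a small helper. This is a sketch for reasoning about the capacity envelope, not an AWS API:

```typescript
// Capacity envelope during a rolling update: the minimum healthy count
// rounds up, the maximum total rounds down (as in the worked examples).
function deploymentBounds(desiredCount: number, minHealthyPercent: number, maximumPercent: number) {
  return {
    minHealthy: Math.ceil((desiredCount * minHealthyPercent) / 100),
    maxTotal: Math.floor((desiredCount * maximumPercent) / 100),
  };
}
```

For desiredCount = 4 the defaults give {minHealthy: 4, maxTotal: 8}, and the aggressive settings give {minHealthy: 2, maxTotal: 6}.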
3. Deployment Circuit Breaker
[FACT] The circuit breaker detects when a deploy is continuously failing and, optionally, performs an automatic rollback to the previous revision. It uses a sliding window of failed tasks:
Failure threshold = 0.5 × desiredCount, clamped between 3 and 200
→ minimum: 3 failures (e.g., desiredCount = 4 → threshold = 3)
→ maximum: 200 failures
If the count of failed tasks exceeds the threshold → the circuit opens
→ if rollback=true: ECS reverts to the last successful revision
→ if rollback=false: the deploy stops and the deployment enters the FAILED state
[CONSENSUS] In production, deploymentCircuitBreaker: { enable: true, rollback: true } is considered a best practice. Without a circuit breaker, a deploy with a broken image (that crashes immediately) can loop indefinitely — ECS keeps trying to start tasks that fail, paying for the cost of each attempt and preventing the service from recovering.
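The threshold arithmetic can be sketched as follows. The exact rounding ECS applies is not spelled out here, so treat the `Math.round` as an assumption:

```typescript
// Circuit breaker failure threshold: half the desired count,
// clamped between 3 and 200 (the rounding mode is an assumption).
function failureThreshold(desiredCount: number): number {
  return Math.min(200, Math.max(3, Math.round(desiredCount * 0.5)));
}
```

A 4-task service trips after 3 failures, a 100-task service after 50, and very large services cap out at 200.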
4. Integration with ALB Target Groups
[FACT] The integration between ECS Service and ALB happens via Target Group in IP mode. In Fargate with networkMode: awsvpc, each task has its own private IP. ECS automatically registers and deregisters these IPs in the Target Group as tasks come up and go down.
Internet / VPC
│
▼
┌──────────────┐
│ ALB │ Listener: HTTPS:443, HTTP:80
│ │
│ Listener │──rule: path /* ──▶ Target Group (type: IP)
└──────────────┘ │
│ registers/deregisters automatically
┌─────────┴──────────┐
│ Task 1: 10.0.1.5 │
│ Task 2: 10.0.1.8 │ ← port 3000
│ Task 3: 10.0.1.12 │
└────────────────────┘
[FACT] When removing a task from circulation (during rolling update or scale-in), ECS performs connection draining (also called deregistration delay): it sends the deregistration request to the Target Group, waits for the ALB to finish active connections, and only then stops the container. The default deregistration delay time is 300 seconds — configurable on the Target Group.
Health Check Grace Period is distinct from the Target Group health check:
healthCheckGracePeriodSeconds (on the Service):
→ how long ECS IGNORES ALB health check failures after the task starts
→ prevents tasks still bootstrapping from being killed prematurely
→ value of zero: ECS acts immediately if the ALB marks the task as unhealthy
Health check on the Target Group (on the ALB):
→ interval, threshold, path — defines when the ALB considers a task healthy
→ independent of the ECS grace period
[FACT] A classic mistake: the application takes 45 seconds to initialize (downloading configs, warming up cache). The default healthCheckGracePeriodSeconds is zero. The ALB health check fails during the first 45 seconds. ECS kills the task. The new task also takes 45 seconds. Infinite loop. Solution: healthCheckGracePeriodSeconds should be slightly greater than the container health check's startPeriod + the application's initialization time.
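The rule of thumb at the end of the paragraph, encoded as a tiny helper. The 20-second margin is this text's suggestion, not an AWS default:

```typescript
// healthCheckGracePeriodSeconds should cover the container health check's
// startPeriod plus the app's measured initialization time, with a margin.
function gracePeriodSeconds(startPeriodSec: number, appInitSec: number, marginSec: number = 20): number {
  return startPeriodSec + appInitSec + marginSec;
}
```

For the 45-second app above with a 10-second startPeriod, this yields 75 seconds, which breaks the kill loop.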
5. AWS Cloud Map: DNS service discovery without a load balancer
[FACT] AWS Cloud Map is a resource registration and discovery service. In the ECS context, it allows services to discover each other via DNS without going through a load balancer — the client resolves a DNS name and directly obtains the IPs of running tasks.
Cloud Map components:
Private DNS Namespace: internal.svc
│
├── Service: "api"
│ ├── DNS records: A records → IPs of the api service's tasks
│ └── Health check: synchronized with the ECS health check
│
└── Service: "worker"
├── DNS records: A records → IPs of the worker service's tasks
└── Health check: synchronized with the ECS health check
DNS query: api.internal.svc → [10.0.1.5, 10.0.1.8, 10.0.1.12]
[FACT] When an ECS Service is configured with Cloud Map (serviceRegistries), ECS automatically registers an instance in Cloud Map for each task that passes the health check. When the task stops or fails, ECS removes the registration. The DNS TTL determines how long clients will cache old IPs after a task is removed.
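Because an A-record lookup returns the IPs of all healthy tasks, a client can balance across them itself. A minimal sketch: in practice the IP list would come from `dns.promises.resolve4('worker.internal.svc')` and be refreshed within the TTL, but here it is injected so the selection logic stands alone (`RoundRobin` is an illustrative name):

```typescript
// Client-side round-robin over the IPs returned by a Cloud Map A-record lookup.
class RoundRobin {
  private next = 0;

  constructor(private ips: string[]) {}

  // Returns the next IP, cycling through the list.
  pick(): string {
    if (this.ips.length === 0) throw new Error('no healthy targets');
    const ip = this.ips[this.next % this.ips.length];
    this.next += 1;
    return ip;
  }

  // Call whenever a fresh DNS answer arrives (bounded by the record's TTL).
  refresh(ips: string[]): void {
    this.ips = ips;
    this.next = 0;
  }
}
```

This is the kind of client-side load balancing that Cloud Map enables and an ALB hides behind its VIP.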
[FACT] Cloud Map supports two types of DNS records for ECS:
| Record type | When to use | Returns |
|---|---|---|
| A record | networkMode: awsvpc (Fargate or EC2) | Task's private IP |
| SRV record | networkMode: bridge or host (EC2) | Container IP + port |
In Fargate (always awsvpc), use A records. The SRV record is only needed in EC2 with bridge mode, where the host IP is the same for multiple tasks and the port is dynamic.
6. ALB vs Cloud Map: when to use each approach
| | ALB Target Group | AWS Cloud Map |
|---|---|---|
| Latency | +20-40 ms (extra hop) | No extra hop (DNS only) |
| Cost | ~$18/month + LCUs | ~$1/month + queries |
| TLS termination | Yes (ACM integrated) | No (app manages it) |
| Advanced routing | Yes (path, header, host) | No |
| Native metrics | Request count, latency | No (needs X-Ray) |
| Circuit breaking | Yes (5xx % via listener) | No |
| Multiple services | Yes (rules per path) | One DNS name per service |
| Port discovery | No (fixed port on the TG) | SRV records (bridge mode) |
[CONSENSUS] The rule of thumb adopted by most teams:
- Use ALB for: external traffic ingress, public APIs, any service that needs TLS termination, path-based routing, or when you want request-level metrics at the application layer.
- Use Cloud Map for: high-frequency service-to-service communication within the VPC where ALB latency matters, background services that don't receive external traffic, or when you need the client to know the IPs of all replicas (e.g., for client-side load balancing in gRPC).
- Use both for: services that receive external traffic via ALB AND are called internally by other services with low latency via Cloud Map.
Practical example
Scenario: A system with two services:
- api-service: receives external HTTPS traffic via ALB, exposes /api/*.
- worker-service: background processing, called only by api-service internally via Cloud Map.
Architecture
Internet
│ HTTPS:443
▼
┌───────────────────────────────────┐
│ ALB: api.example.com │
│ Listener rule: /* → Target Group │
└───────────────┬───────────────────┘
│ registers IPs
▼
┌─────────────────────┐
│ api-service (ECS) │──── Cloud Map DNS lookup
│ 3 tasks Fargate │ worker.internal.svc
│ port 3000 │────────────────▶ ┌─────────────────────┐
└─────────────────────┘ │ worker-service (ECS)│
│ 2 tasks Fargate │
│ port 8080 │
└─────────────────────┘
CDK (TypeScript) — complete stack
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import * as servicediscovery from 'aws-cdk-lib/aws-servicediscovery';
import { Construct } from 'constructs';

export class AppStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { vpcName: 'prod-vpc' });

    // ECS cluster
    const cluster = new ecs.Cluster(this, 'Cluster', {
      vpc,
      containerInsights: true, // enables detailed per-container metrics
    });

    // Private DNS namespace for Cloud Map
    const namespace = new servicediscovery.PrivateDnsNamespace(this, 'Namespace', {
      name: 'internal.svc',
      vpc,
    });

    // ─── Worker Service ───────────────────────────────────────────────────
    const workerTaskDef = new ecs.FargateTaskDefinition(this, 'WorkerTaskDef', {
      cpu: 256,
      memoryLimitMiB: 512,
    });
    workerTaskDef.addContainer('worker', {
      image: ecs.ContainerImage.fromRegistry('my-org/worker:latest'),
      portMappings: [{ containerPort: 8080 }],
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'worker' }),
    });

    const workerSg = new ec2.SecurityGroup(this, 'WorkerSg', { vpc });
    const workerService = new ecs.FargateService(this, 'WorkerService', {
      cluster,
      taskDefinition: workerTaskDef,
      desiredCount: 2,
      securityGroups: [workerSg],
      // No load balancer; discovery happens only via Cloud Map
      cloudMapOptions: {
        name: 'worker', // DNS: worker.internal.svc
        cloudMapNamespace: namespace,
        dnsRecordType: servicediscovery.DnsRecordType.A,
        dnsTtl: Duration.seconds(10), // low TTL: dead tasks leave DNS quickly
      },
      circuitBreaker: { rollback: true },
      minHealthyPercent: 50,
      maxHealthyPercent: 200,
    });

    // ─── API Service + ALB ────────────────────────────────────────────────
    const apiTaskDef = new ecs.FargateTaskDefinition(this, 'ApiTaskDef', {
      cpu: 512,
      memoryLimitMiB: 1024,
    });
    apiTaskDef.addContainer('api', {
      image: ecs.ContainerImage.fromRegistry('my-org/api:latest'),
      portMappings: [{ containerPort: 3000 }],
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'api' }),
      environment: {
        // The api service discovers the worker via Cloud Map
        WORKER_URL: 'http://worker.internal.svc:8080',
      },
    });

    // ALB
    const alb = new elbv2.ApplicationLoadBalancer(this, 'ALB', {
      vpc,
      internetFacing: true,
    });
    const listener = alb.addListener('HttpsListener', {
      port: 443,
      // certificates: [cert], // HTTPS requires an ACM certificate; add one before deploying
      open: true,
    });

    const apiSg = new ec2.SecurityGroup(this, 'ApiSg', { vpc });
    const apiService = new ecs.FargateService(this, 'ApiService', {
      cluster,
      taskDefinition: apiTaskDef,
      desiredCount: 3,
      securityGroups: [apiSg],
      circuitBreaker: { rollback: true },
      minHealthyPercent: 100,
      maxHealthyPercent: 200,
      // Grace period: the app takes ~30 s to initialize
      healthCheckGracePeriod: Duration.seconds(60),
    });

    // Register the service in the ALB Target Group
    apiService.registerLoadBalancerTargets({
      containerName: 'api',
      containerPort: 3000,
      newTargetGroupId: 'ApiTG',
      listener: ecs.ListenerConfig.applicationListener(listener, {
        protocol: elbv2.ApplicationProtocol.HTTP,
        healthCheck: {
          path: '/health',
          interval: Duration.seconds(15),
          healthyThresholdCount: 2,
          unhealthyThresholdCount: 3,
          timeout: Duration.seconds(5),
        },
        deregistrationDelay: Duration.seconds(30), // reduced from the 300 s default
      }),
    });

    // The API needs to reach the worker on port 8080
    workerSg.addIngressRule(apiSg, ec2.Port.tcp(8080), 'API to Worker');
  }
}
Common pitfalls
Pitfall 1: healthCheckGracePeriodSeconds zero with a slow-starting application
The mistake: The service has healthCheckGracePeriodSeconds at the default value (zero). The application takes 45 seconds to start (loads configurations, establishes database connections). The ALB health check starts immediately with a 30-second interval. On the first check (30s after start), the app is still not ready → returns 503. ECS marks the task as unhealthy and replaces it. The new task also takes 45 seconds. The service can never come up.
Why it happens: The healthCheckGracePeriodSeconds was designed exactly for this scenario. Without it, ECS treats initial health check failures the same way as failures in an already-stabilized task.
How to recognize it: In the ECS console, the service's Events tab shows service X: task Y is unhealthy in target-group Z repeatedly, with tasks being replaced before reaching 60 seconds of life. In the ALB, access logs show only health check requests returning 5xx.
How to avoid it: Measure your application's initialization time under real conditions (not localhost). Configure healthCheckGracePeriodSeconds = initialization time + 20s margin. For Java applications with JVM warmup or Python with ML model loading, this can be 120-180 seconds.
Pitfall 2: High DNS TTL in Cloud Map causing calls to dead tasks
The mistake: You configure dnsTtl: Duration.seconds(60) in Cloud Map (or use the default, which is 60 seconds). During a rolling update of worker-service, a task is removed from Cloud Map. The api-service still has the dead task's IP in DNS cache for up to 60 seconds and keeps trying to connect to it. Calls fail with connection refused during this period.
Why it happens: DNS is a distributed cache. The TTL defines how long resolvers (and the JVM, and Node.js) keep the value. If the TTL is 60s, a client that just resolved the name can continue using the IP for up to 60s after it's removed from Cloud Map.
How to recognize it: ECONNREFUSED or connection reset errors in api-service lasting 30-60 seconds during worker-service deployments, even with health check configured.
How to avoid it: Use dnsTtl: Duration.seconds(10) for services that do frequent rolling updates. Lower values reduce propagation time but increase DNS query volume (minimal cost in Cloud Map, generally acceptable). Additionally, implement retry with backoff in the client to absorb the transition period.
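The retry-with-backoff suggestion can be sketched like this. `withRetry` is a hypothetical helper; the attempt count and delays are illustrative:

```typescript
// Retries an async call with exponential backoff plus jitter, to absorb the
// window where DNS still points at a task that was just deregistered.
async function withRetry<T>(fn: () => Promise<T>, attempts: number = 4, baseMs: number = 100): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff with jitter: base × 2^i plus up to one extra base.
      const delay = baseMs * 2 ** i + Math.random() * baseMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Wrapping the worker call as `withRetry(() => fetchFromWorker())` means a single ECONNREFUSED during a rolling update is retried against a fresh resolution instead of surfacing to the caller.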
Pitfall 3: High deregistration delay causing deploy slowdowns
The mistake: The Target Group has the default deregistration delay of 300 seconds. During a rolling update of a service with desiredCount=4, ECS needs to stop 4 old tasks. For each task, ECS waits 300 seconds of connection draining before stopping it. A complete rolling update takes: 4 tasks × 300 seconds = 20 minutes, even though the new version has been 100% healthy for 19 minutes.
Why it happens: 300 seconds was the default chosen by AWS for workloads with long-lived connections (WebSockets, streaming). For REST APIs with short connections (< 1 second), it's an excessive value.
How to recognize it: Deploys that take much longer than expected. The ECS console shows tasks in DEREGISTERING state for long periods. The Target Group panel shows targets in draining state.
How to avoid it: Configure deregistrationDelay on the Target Group according to the connection type. For typical REST APIs, 15-30 seconds is sufficient. For WebSockets, keep it at 300s or more. In CDK, pass deregistrationDelay alongside the health check options when registering the Target Group.
Reflection exercise
You are designing a microservices system with 5 services on ECS Fargate:
- gateway: receives external traffic (HTTPS via ALB)
- auth: verifies JWT tokens, called by gateway for every request
- orders: business logic, called by gateway
- inventory: queried by orders to check stock
- notifications: sends emails/SMS, triggered by orders asynchronously via SQS (not called directly)
For which services would you use ALB? For which would you use Cloud Map? Does notifications need either discovery mechanism? What would be the impact of a 60-second DNS TTL on the auth service, given that it's called on every gateway request? How would you configure healthCheckGracePeriodSeconds for each service, knowing that auth uses a lightweight Node.js image (2-second initialization) and orders uses a Java Spring Boot app (40-second initialization)?
Resources for deeper learning
1. Amazon ECS service definition parameters
URL: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service_definition_parameters.html
What to find: Complete reference of all ECS Service parameters: deployment configuration, load balancer integration, service registries, network configuration. This is the base document when you need to understand the exact meaning of a field.
Why it's the right source: It's the official reference — more precise than the console or tutorials.
2. Deployment circuit breaker
URL: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-circuit-breaker.html
What to find: How the circuit breaker detects deploy failures, activation thresholds, rollback behavior, and limitations (only works with rolling update, not with blue/green).
Why it's the right source: Documents the detection algorithm — essential for understanding when the circuit breaker will and won't protect you.
3. Use service discovery to connect Amazon ECS services with DNS names
URL: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-discovery.html
What to find: How to configure Cloud Map with ECS, difference between A and SRV records, health check synchronization between ECS and Cloud Map, and limitations (maximum of 1000 instances per Cloud Map service).
Why it's the right source: It's the official ECS + Cloud Map integration guide, with configuration examples via console, CLI, and CloudFormation.