Session 018 — ECS Observability: FireLens, Container Insights and X-Ray sidecar

May 18, 2026

Estimated duration: 60 minutes
Prerequisites: session-017-ecs-deploy-rolling-blue-green

Objective

By the end, you will be able to configure FireLens as a Fluent Bit log router (sending logs simultaneously to CloudWatch and S3/Kinesis), enable Container Insights for per-task metrics, and add the X-Ray daemon as a sidecar container with the correct task role.

Context

[FACT] The awslogs driver (session 013) is simple and efficient for sending stdout/stderr directly to CloudWatch, but it has limitations: each container goes to exactly one destination, it's not possible to filter or transform logs before sending them, and it doesn't support routing to non-CloudWatch destinations (S3, Elasticsearch, Datadog, Splunk) without additional code. FireLens solves these limitations by introducing a Fluent Bit sidecar as a log router — the main container sends to FireLens, and FireLens distributes to multiple destinations with transformations.

[FACT] Container Insights is CloudWatch's detailed per-container metrics system. Standard ECS metrics (CPU and memory at the service level) exist without additional configuration, but Container Insights adds granularity: metrics per individual task, per container within the task, disk I/O, and network I/O — published in the ECS/ContainerInsights namespace.

[CONSENSUS] The X-Ray daemon as a sidecar is the AWS-recommended pattern for tracing in ECS Fargate. The model is: the application sends trace segments via UDP to localhost:2000, the daemon collects, batches, and sends them to the X-Ray API asynchronously. This decouples the application code from the public network — the application doesn't need outbound HTTPS to X-Ray, only UDP loopback.

Key concepts

1. FireLens: architecture and data flow

[FACT] FireLens is an ECS logging mechanism that uses Fluent Bit (or Fluentd) as a sidecar container to receive and route logs. The flow is:

Main container
  │
  │  logDriver: "awsfirelens"    ← not awslogs!
  │  options: { ... destination ... }
  │
  ▼
FireLens sidecar (Fluent Bit)   ← port 24224 (Fluent protocol)
  │
  ├──▶ CloudWatch Logs           (plugin cloudwatch_logs)
  ├──▶ Amazon Kinesis Firehose   (plugin kinesis_firehose)
  ├──▶ Amazon Kinesis Streams    (plugin kinesis)
  └──▶ Any Fluent Bit-compatible destination
       (S3, Elasticsearch, Datadog, Splunk, etc.)

[FACT] The official FireLens image maintained by AWS already includes the plugins for CloudWatch, Kinesis, and Firehose pre-installed:

Fluent Bit: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
Fluentd: public.ecr.aws/aws-observability/fluentd:latest

Fluent Bit is preferred in most cases for being lighter (C vs Ruby) and having better performance under high load.

[FACT] FireLens configuration happens in two places in the task definition:

FireLens container (the sidecar): type firelens, defines which image to use and optional configurations.
Application containers: use logDriver: awsfirelens with options that specify the destination and optional filters.

Task Definition
│
├── Container: firelens-router    ← type: firelens
│   ├── firelensConfiguration:
│   │   ├── type: "fluentbit"
│   │   └── options:
│   │       ├── enable-ecs-log-metadata: "true"  ← adds task ARN, cluster, etc.
│   │       ├── config-file-type: "file"          ← custom configuration
│   │       └── config-file-value: "/fluent-bit/custom.conf"
│   └── essential: false   ← app keeps running if FireLens fails
│
└── Container: api
    └── logConfiguration:
        ├── logDriver: "awsfirelens"    ← sends to the FireLens sidecar
        └── options:
            ├── Name: "cloudwatch_logs"
            ├── region: "us-east-1"
            ├── log_group_name: "/ecs/api"
            └── log_stream_prefix: "api/"
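
The two-part layout in the tree above can be sketched as a small builder that emits both container definitions. This is only an illustration: the `ContainerDef` interface and the `buildFirelensContainers` helper are hypothetical names, and the shape is trimmed to the fields discussed in this session (real task definitions carry many more).

```typescript
// Hypothetical minimal shape; real container definitions have many more fields.
interface ContainerDef {
  name: string;
  image: string;
  essential: boolean;
  firelensConfiguration?: { type: "fluentbit" | "fluentd"; options?: Record<string, string> };
  logConfiguration?: { logDriver: string; options?: Record<string, string> };
}

// Returns [router, app]: the sidecar declares firelensConfiguration,
// the application container points its log driver at it.
function buildFirelensContainers(
  appName: string,
  appImage: string,
  logGroup: string,
  region: string
): ContainerDef[] {
  const router: ContainerDef = {
    name: "log-router",
    image: "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
    essential: false,
    firelensConfiguration: {
      type: "fluentbit",
      options: { "enable-ecs-log-metadata": "true" },
    },
  };
  const app: ContainerDef = {
    name: appName,
    image: appImage,
    essential: true,
    logConfiguration: {
      logDriver: "awsfirelens",
      options: {
        Name: "cloudwatch_logs",
        region: region,
        log_group_name: logGroup,
        log_stream_prefix: appName + "/",
      },
    },
  };
  return [router, app];
}

// Print the fragment that would go under "containerDefinitions".
console.log(
  JSON.stringify(buildFirelensContainers("api", "my-org/api:latest", "/ecs/api", "us-east-1"), null, 2)
);
```

Generating both halves from one function keeps the log group and stream prefix consistent when several services share the same pattern.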

2. Routing to multiple destinations

[FACT] To send logs to multiple destinations simultaneously with FireLens, you use a custom Fluent Bit configuration. ECS automatically generates the base Fluent Bit configuration for the sidecar (inputs, parsers, ECS metadata filters), and you can add extra outputs via a custom configuration file.

Example: sending to CloudWatch and Kinesis Firehose simultaneously:

/fluent-bit/custom.conf (packaged in the image or in S3):

# This file is INCLUDED in the configuration that ECS generates automatically
# Do not redefine [INPUT] or [FILTER]; ECS already generates them

# Output 1: CloudWatch Logs
# Match api-* selects only logs from the 'api' container
# (comments must sit on their own line; Fluent Bit does not strip inline comments)
[OUTPUT]
    Name              cloudwatch_logs
    Match             api-*
    region            us-east-1
    log_group_name    /ecs/api-service
    log_stream_prefix api/
    auto_create_group true

# Output 2: Kinesis Firehose (for S3/analytics)
[OUTPUT]
    Name              kinesis_firehose
    Match             api-*
    region            us-east-1
    delivery_stream   api-logs-firehose

# Output 3: Discards the FireLens router's own logs (avoids recursion)
[OUTPUT]
    Name              null
    Match             firelens*

[FACT] Match in Fluent Bit operates on the log tag. ECS FireLens automatically generates tags in the format {container-name}-firelens-{task-id}. That's why Match api-* captures all logs from the container named api, regardless of the task ID.
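
To make the tag matching concrete, here is a minimal sketch of wildcard matching against a FireLens-style tag. It is a simplified model, not Fluent Bit's actual matcher: the `matchesTag` helper and the sample tag are hypothetical, and only the `*` wildcard is handled.

```typescript
// Simplified model of a Match pattern: '*' matches any run of characters.
function matchesTag(pattern: string, tag: string): boolean {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except '*'
    .replace(/\*/g, ".*");                 // translate the wildcard
  return new RegExp("^" + escaped + "$").test(tag);
}

// Hypothetical tag for a container named 'api'.
const sampleTag = "api-firelens-0123456789abcdef";

console.log(matchesTag("api-*", sampleTag));     // true
console.log(matchesTag("firelens*", sampleTag)); // false
```

A practical consequence: a container named `api-worker` would also match `api-*`, so container names that share a prefix need more specific patterns.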

[FACT] The enable-ecs-log-metadata: "true" field in the firelensConfiguration options makes FireLens automatically add ECS metadata to each log record:

{
  "log": "2026-06-10T10:30:00Z INFO request completed",
  "container_id": "abc123",
  "container_name": "api",
  "source": "stdout",
  "ecs_cluster": "prod-cluster",
  "ecs_task_arn": "arn:aws:ecs:...",
  "ecs_task_definition": "api-service:42"
}

This eliminates the need to manually correlate logs with tasks — each log line already knows which task it came from.
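
As a rough illustration of what that metadata enables, the sketch below filters records down to a single task. The `FirelensRecord` interface, both helpers, and the ARN used in the usage line are hypothetical; the field names follow the JSON record shown above.

```typescript
// Subset of the fields shown in the record above.
interface FirelensRecord {
  log: string;
  container_name?: string;
  ecs_cluster?: string;
  ecs_task_arn?: string;
  ecs_task_definition?: string;
}

// Task ARNs end in .../<task-id>; returns undefined when no task resource is present.
function taskIdFromArn(arn: string): string | undefined {
  const marker = ":task/";
  const at = arn.indexOf(marker);
  if (at < 0) return undefined;
  const segments = arn.slice(at + marker.length).split("/");
  const last = segments[segments.length - 1];
  return last ? last : undefined;
}

// Keeps only the log lines emitted by the given task.
function linesForTask(records: FirelensRecord[], taskId: string): string[] {
  return records
    .filter((r) => r.ecs_task_arn !== undefined && taskIdFromArn(r.ecs_task_arn) === taskId)
    .map((r) => r.log);
}

// Usage with a made-up ARN:
console.log(taskIdFromArn("arn:aws:ecs:us-east-1:123456789012:task/prod-cluster/0a1b2c3d"));
```

In practice the same filter is usually expressed as a CloudWatch Logs Insights query on `ecs_task_arn`; the code only shows why the field has to be present in every record.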

3. Container Insights: per-container metrics

[FACT] Container Insights publishes metrics in the ECS/ContainerInsights namespace with dimensions that allow drill-down to the individual container level. The metrics available with enhanced Container Insights:

Metric	Dimensions	Description
`CpuUtilized`	ClusterName, ServiceName, TaskId	CPU used per task (CPU units)
`CpuReserved`	ClusterName, ServiceName	Total CPU reserved by the service
`MemoryUtilized`	ClusterName, ServiceName, TaskId	Memory used, in MiB
`MemoryReserved`	ClusterName, ServiceName	Total memory reserved
`NetworkRxBytes`	ClusterName, TaskId	Bytes received by the task
`NetworkTxBytes`	ClusterName, TaskId	Bytes transmitted by the task
`StorageReadBytes`	ClusterName, TaskId	Storage read I/O
`StorageWriteBytes`	ClusterName, TaskId	Storage write I/O
`RunningTaskCount`	ClusterName, ServiceName	Number of running tasks

[FACT] There are two levels of Container Insights:

enabled (default when activated): cluster, service, and task metrics. No individual container level.
enhanced: adds per-container metrics within the task. It must be requested explicitly with value=enhanced.

# Enable enhanced Container Insights on an existing cluster
aws ecs update-cluster-settings \
  --cluster prod-cluster \
  --settings name=containerInsights,value=enhanced

# Check the current setting
aws ecs describe-clusters --clusters prod-cluster \
  --query 'clusters[0].settings'

[FACT] Container Insights has additional cost: metrics are published to CloudWatch and charged per metric-per-month (~$0.30/metric/month). A cluster with 10 services can generate hundreds of metrics — check the estimated cost before enabling at scale.
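
A back-of-the-envelope estimate helps before enabling this at scale. The sketch below multiplies a metric count by the approximate price quoted above; the per-service and per-task metric counts in the usage line are made-up inputs, not figures from AWS, and the function name is hypothetical.

```typescript
// Approximate price from the text (~$0.30 per metric per month), kept in cents
// so the arithmetic stays in integers.
const PRICE_CENTS_PER_METRIC = 30;

// Assumption: every service publishes a fixed set of service-level metrics,
// plus a fixed set of metrics for each of its tasks.
function estimateMonthlyCostUsd(
  services: number,
  tasksPerService: number,
  metricsPerTask: number,
  metricsPerService: number
): number {
  const metricCount = services * (metricsPerService + tasksPerService * metricsPerTask);
  return (metricCount * PRICE_CENTS_PER_METRIC) / 100;
}

// 10 services x (3 service metrics + 3 tasks x 6 task metrics) = 210 metrics
console.log(estimateMonthlyCostUsd(10, 3, 6, 3)); // 63
```

Because task IDs change on every deploy, the number of distinct metric series over a month can exceed the number of tasks running at any one time, so treat the result as a lower bound.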

4. X-Ray daemon as sidecar

[FACT] The X-Ray daemon is a process that receives trace segments (UDP datagrams) on port 2000, buffers them, and sends them in batches to the X-Ray API. Running it as a sidecar has two advantages over having the application call the X-Ray API directly:

Network decoupling: the application sends to localhost:2000 (UDP, no network latency). The daemon manages the connection to the X-Ray API, retries, and batching.
Language independence: the same sidecar serves Java, Python, Node.js, Go — each application SDK only needs to know how to send to localhost:2000.
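
To show what "sending to localhost:2000" amounts to, here is a minimal sketch of the payload an SDK builds for the daemon: a one-line JSON header, a newline, then the segment document. The helper names and the sample segment are hypothetical, and a real SDK adds many more segment fields.

```typescript
// Header line that precedes every segment document sent to the daemon.
const DAEMON_HEADER = '{"format": "json", "version": 1}';

// Random lowercase hex string of the given length (not cryptographically strong).
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Trace ID layout: version, 8 hex digits of epoch seconds, 24 random hex digits.
function newTraceId(nowMs: number = Date.now()): string {
  const epochHex = ("00000000" + Math.floor(nowMs / 1000).toString(16)).slice(-8);
  return "1-" + epochHex + "-" + randomHex(24);
}

interface Segment {
  name: string;
  id: string;         // 16 hex digits
  trace_id: string;
  start_time: number; // epoch seconds
  end_time: number;
}

// Builds the text of a single UDP datagram for the daemon.
function buildDatagram(segment: Segment): string {
  return DAEMON_HEADER + "\n" + JSON.stringify(segment);
}

const now = Date.now() / 1000;
console.log(
  buildDatagram({
    name: "api",
    id: randomHex(16),
    trace_id: newTraceId(),
    start_time: now - 0.25,
    end_time: now,
  })
);
// In Node.js, this string could be sent with dgram.createSocket("udp4").send(...)
// to port 2000 on 127.0.0.1; that step is omitted so the sketch needs no imports.
```

Since UDP is fire-and-forget, the application never blocks on the daemon, which is also why a misconfigured daemon fails silently from the application's point of view.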

[FACT] X-Ray daemon container configuration in the task definition:

{
  "name": "xray-daemon",
  "image": "public.ecr.aws/xray/aws-xray-daemon:3.x",
  "essential": false,
  "cpu": 32,
  "memoryReservation": 256,
  "portMappings": [
    {
      "containerPort": 2000,
      "protocol": "udp"
    }
  ],
  "environment": [
    { "name": "AWS_REGION", "value": "us-east-1" }
  ],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/xray-daemon",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "xray",
      "mode": "non-blocking"
    }
  }
}

[FACT] The Task Role needs permission to send traces to X-Ray:

{
  "Effect": "Allow",
  "Action": [
    "xray:PutTraceSegments",
    "xray:PutTelemetryRecords",
    "xray:GetSamplingRules",
    "xray:GetSamplingTargets",
    "xray:GetSamplingStatisticSummaries"
  ],
  "Resource": "*"
}

Or use the managed policy AWSXRayDaemonWriteAccess.
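
A quick pre-deploy check can catch a Task Role that lacks the write permissions. The sketch below is deliberately naive: `missingXrayActions` and its types are hypothetical, it only looks at Allow statements, and it ignores Deny statements, conditions, and partial wildcards such as `xray:Put*`.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

// The two write actions the daemon needs to deliver data.
const REQUIRED_ACTIONS = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"];

// Returns the required actions that no Allow statement grants.
function missingXrayActions(statements: PolicyStatement[]): string[] {
  const allowed: string[] = [];
  for (const s of statements) {
    if (s.Effect !== "Allow") continue;
    const actions = typeof s.Action === "string" ? [s.Action] : s.Action;
    for (const a of actions) allowed.push(a);
  }
  const covered = (action: string): boolean =>
    allowed.indexOf(action) >= 0 || allowed.indexOf("xray:*") >= 0 || allowed.indexOf("*") >= 0;
  return REQUIRED_ACTIONS.filter((a) => !covered(a));
}

console.log(
  missingXrayActions([{ Effect: "Allow", Action: ["xray:PutTraceSegments"], Resource: "*" }])
); // [ 'xray:PutTelemetryRecords' ]
```

For an authoritative answer, the IAM policy simulator evaluates the full policy set; this check is only a cheap guard for CI.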

[FACT] In the application container, configure the X-Ray SDK to send to the daemon:

# Environment variable read by all X-Ray SDKs
AWS_XRAY_DAEMON_ADDRESS=localhost:2000

Because containers in the same task share a network namespace (awsvpc mode on Fargate), localhost:2000 reaches the X-Ray sidecar without additional network configuration, whether the task is defined in JSON or in CDK.
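
For applications that emit segments without an SDK, the address has to be resolved by hand. The following sketch reads the variable and falls back to 127.0.0.1:2000; `resolveDaemonAddress` is a hypothetical helper that handles only the plain host:port form, whereas the SDKs also accept a combined form with separate tcp: and udp: addresses.

```typescript
interface DaemonAddress {
  host: string;
  port: number;
}

// Parses AWS_XRAY_DAEMON_ADDRESS in its plain "host:port" form.
function resolveDaemonAddress(env: Record<string, string | undefined>): DaemonAddress {
  const raw = env["AWS_XRAY_DAEMON_ADDRESS"] ?? "127.0.0.1:2000";
  const idx = raw.lastIndexOf(":");
  if (idx <= 0) {
    throw new Error("expected host:port, got: " + raw);
  }
  const host = raw.slice(0, idx);
  const port = Number(raw.slice(idx + 1));
  if (Math.floor(port) !== port || port < 1 || port > 65535) {
    throw new Error("invalid port in: " + raw);
  }
  return { host: host, port: port };
}

console.log(resolveDaemonAddress({ AWS_XRAY_DAEMON_ADDRESS: "localhost:2000" }));
console.log(resolveDaemonAddress({})); // falls back to 127.0.0.1:2000
```

Failing fast on a malformed value is a deliberate choice here: a typo in the address would otherwise send every segment to nowhere without any error.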

Practical example

Complete scenario: An api-service with full-stack observability:
- Logs via FireLens: CloudWatch for operations + Kinesis Firehose for analytics (S3).
- Enhanced Container Insights for per-task metrics.
- X-Ray daemon sidecar for distributed tracing.

Complete task definition in JSON

{
  "family": "api-service-observavel",
  "taskRoleArn": "arn:aws:iam::123456789012:role/api-task-role",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecs-execution-role",
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "requiresCompatibilities": ["FARGATE"],
  "containerDefinitions": [
    {
      "name": "log-router",
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
      "essential": false,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "enable-ecs-log-metadata": "true"
        }
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/firelens",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "firelens",
          "mode": "non-blocking"
        }
      }
    },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:3.x",
      "essential": false,
      "cpu": 32,
      "memoryReservation": 256,
      "portMappings": [
        { "containerPort": 2000, "protocol": "udp" }
      ],
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch_logs",
          "region": "us-east-1",
          "log_group_name": "/ecs/api-service",
          "log_stream_prefix": "xray/"
        }
      }
    },
    {
      "name": "api",
      "image": "my-org/api:latest",
      "essential": true,
      "cpu": 512,
      "memoryReservation": 1024,
      "portMappings": [
        { "containerPort": 3000, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "localhost:2000" },
        { "name": "AWS_REGION", "value": "us-east-1" }
      ],
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch_logs",
          "region": "us-east-1",
          "log_group_name": "/ecs/api-service",
          "log_stream_prefix": "api/",
          "auto_create_group": "true"
        }
      },
      "dependsOn": [
        { "containerName": "log-router", "condition": "START" },
        { "containerName": "xray-daemon", "condition": "START" }
      ]
    }
  ]
}

CDK equivalent

import { Stack, StackProps, RemovalPolicy } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

export class ObservableServiceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // ─── Cluster with Container Insights ──────────────────────────────────
    // Placeholder VPC so the example compiles; swap in your existing VPC.
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', {
      vpc,
      // Enables Container Insights; switch to enhanced via CLI or console.
      // Newer aws-cdk-lib releases also expose containerInsightsV2; check your version.
      containerInsights: true,
    });

    // ─── Task Role: X-Ray plus the FireLens destinations ──────────────────
    const taskRole = new iam.Role(this, 'TaskRole', {
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
    });
    taskRole.addManagedPolicy(
      iam.ManagedPolicy.fromAwsManagedPolicyName('AWSXRayDaemonWriteAccess')
    );
    // Fluent Bit output plugins run inside the task and use the Task Role,
    // so the CloudWatch Logs and Firehose permissions belong here.
    taskRole.addToPolicy(new iam.PolicyStatement({
      actions: [
        'logs:CreateLogGroup',
        'logs:CreateLogStream',
        'logs:PutLogEvents',
        'firehose:PutRecordBatch',
      ],
      resources: ['*'],
    }));

    // Execution Role: image pull and the awslogs driver used by the log router itself
    const execRole = new iam.Role(this, 'ExecRole', {
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          'service-role/AmazonECSTaskExecutionRolePolicy'
        ),
      ],
    });

    // ─── Task Definition ──────────────────────────────────────────────────
    const taskDef = new ecs.FargateTaskDefinition(this, 'TaskDef', {
      cpu: 1024,
      memoryLimitMiB: 2048,
      taskRole,
      executionRole: execRole,
    });

    // Log group for operations
    const logGroup = new logs.LogGroup(this, 'AppLogs', {
      logGroupName: '/ecs/api-service',
      retention: logs.RetentionDays.ONE_MONTH,
      removalPolicy: RemovalPolicy.DESTROY,
    });

    // 1. FireLens sidecar
    const fireLensContainer = taskDef.addFirelensLogRouter('log-router', {
      image: ecs.ContainerImage.fromRegistry(
        'public.ecr.aws/aws-observability/aws-for-fluent-bit:stable'
      ),
      firelensConfig: {
        type: ecs.FirelensLogRouterType.FLUENTBIT,
        options: {
          enableECSLogMetadata: true,
        },
      },
      essential: false,
      logging: ecs.LogDrivers.awsLogs({
        streamPrefix: 'firelens',
        logGroup,
        mode: ecs.AwsLogDriverMode.NON_BLOCKING,
      }),
    });

    // 2. X-Ray sidecar
    const xrayContainer = taskDef.addContainer('xray-daemon', {
      image: ecs.ContainerImage.fromRegistry('public.ecr.aws/xray/aws-xray-daemon:3.x'),
      essential: false,
      cpu: 32,
      memoryReservationMiB: 256,
      logging: ecs.LogDrivers.firelens({
        options: {
          Name: 'cloudwatch_logs',
          region: 'us-east-1',
          log_group_name: logGroup.logGroupName,
          log_stream_prefix: 'xray/',
        },
      }),
    });
    xrayContainer.addPortMappings({ containerPort: 2000, protocol: ecs.Protocol.UDP });

    // 3. Main container with FireLens + X-Ray
    const appContainer = taskDef.addContainer('api', {
      image: ecs.ContainerImage.fromRegistry('my-org/api:latest'),
      essential: true,
      cpu: 512,
      memoryReservationMiB: 1024,
      environment: {
        AWS_XRAY_DAEMON_ADDRESS: 'localhost:2000',
        AWS_REGION: 'us-east-1',
      },
      // Logs via FireLens → CloudWatch here; Firehose requires the custom config noted below
      logging: ecs.LogDrivers.firelens({
        options: {
          // Primary output: CloudWatch
          Name: 'cloudwatch_logs',
          region: 'us-east-1',
          log_group_name: logGroup.logGroupName,
          log_stream_prefix: 'api/',
          // For multiple outputs, use a custom config in the FireLens sidecar
        },
      }),
    });
    appContainer.addPortMappings({ containerPort: 3000 });
    appContainer.addContainerDependencies(
      { container: fireLensContainer, condition: ecs.ContainerDependencyCondition.START },
      { container: xrayContainer, condition: ecs.ContainerDependencyCondition.START }
    );
  }
}

Validating observability after deploy

# 1. Check that logs are arriving in CloudWatch
aws logs tail /ecs/api-service --follow

# 2. Check Container Insights metrics
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name CpuUtilized \
  --dimensions Name=ClusterName,Value=prod-cluster \
               Name=ServiceName,Value=api-service \
  --start-time 2026-06-10T10:00:00Z \
  --end-time 2026-06-10T11:00:00Z \
  --period 60 \
  --statistics Average

# 3. Check X-Ray traces
aws xray get-trace-summaries \
  --start-time 2026-06-10T10:00:00 \
  --end-time 2026-06-10T11:00:00 \
  --query 'TraceSummaries[*].{Id:Id,Duration:Duration,HasError:HasError}'

Common pitfalls

Pitfall 1: FireLens with `essential: true` bringing down the task if the router fails

The error: The FireLens container is configured with essential: true. In production, the Fluent Bit sidecar runs out of memory (processes a log spike larger than configured) and is killed by the OOM killer. Since it's essential, the entire task is terminated — including the application container that was healthy. The service starts restarting tasks in a loop.

Why it happens: An essential container takes the whole task down when it exits. Marking the log router essential: false exists for exactly this scenario: if the log router fails, it's preferable to continue serving traffic (with temporarily lost logs) than to bring down the service.

How to recognize: Tasks stopping with StoppedReason: Essential container in task exited and the container that exited is log-router or similar. In the FireLens logs (via CloudWatch — ironic), you can see OOM or crash messages before termination.

How to avoid: Configure the FireLens sidecar with essential: false. Size the FireLens container's memory for your log volume (a memoryReservation of 256 MiB is a reasonable floor; for high log volume, use 512 MiB or more). Configure Mem_Buf_Limit in Fluent Bit to cap the in-memory buffer and prevent OOM.
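
These two rules are easy to enforce before deploy. The sketch below is one possible lint over a parsed task definition; `lintLogRouter` and its types are hypothetical, and the 256 MiB default threshold simply reuses the figure from this pitfall.

```typescript
interface ContainerSummary {
  name: string;
  essential?: boolean;
  memoryReservation?: number; // MiB
  firelensConfiguration?: { type: string };
}

// Returns one warning per rule violated by a FireLens log router container.
function lintLogRouter(containers: ContainerSummary[], minMemoryMiB: number = 256): string[] {
  const warnings: string[] = [];
  for (const c of containers) {
    if (!c.firelensConfiguration) continue; // only the log router is checked
    // ECS assumes a container is essential when the field is omitted.
    if (c.essential !== false) {
      warnings.push(c.name + ": log router is essential; a crash stops the whole task");
    }
    if ((c.memoryReservation ?? 0) < minMemoryMiB) {
      warnings.push(c.name + ": memoryReservation is below " + minMemoryMiB + " MiB");
    }
  }
  return warnings;
}

console.log(
  lintLogRouter([
    { name: "log-router", firelensConfiguration: { type: "fluentbit" } },
    { name: "api", essential: true },
  ])
);
```

Run against the output of `aws ecs describe-task-definition`, a check like this would flag a router that silently inherited the essential default.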

Pitfall 2: X-Ray without permissions in the Task Role, but with `essential: false` masking the error

The error: The X-Ray daemon is configured with essential: false and without the xray:PutTraceSegments permission in the Task Role. The daemon starts, tries to send traces, receives AccessDenied, and displays errors in its own logs — but since it's essential: false, the task doesn't crash. The developer doesn't notice that tracing is broken for days or weeks, until someone needs the trace data to debug a production problem.

Why it happens: essential: false decouples the daemon's health from the task's health. Silent errors in the sidecar don't propagate to the service's health checks.

How to recognize: In the X-Ray console, zero traces for the service (or gaps in traces). In the X-Ray daemon logs (/ecs/xray-daemon), messages like Send failed with StatusCode: 403 or AccessDeniedException.

How to avoid: Add a CloudWatch alarm monitoring xray:PutTraceSegments via CloudTrail or monitor the TracesReceivedCount metric in X-Ray. Explicitly test tracing after each deploy: aws xray get-trace-summaries should return recent traces.
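
The post-deploy test can be automated by evaluating the JSON that `aws xray get-trace-summaries` returns. This is a minimal sketch: `checkTracing` is a hypothetical helper, and the interfaces cover only the response fields used here.

```typescript
interface TraceSummary {
  Id: string;
  Duration?: number;
  HasError?: boolean;
}

interface TraceSummariesResponse {
  TraceSummaries?: TraceSummary[];
}

// Fails when the query window contains no traces at all.
function checkTracing(response: TraceSummariesResponse): { ok: boolean; message: string } {
  const summaries = response.TraceSummaries ?? [];
  if (summaries.length === 0) {
    return { ok: false, message: "no traces in window; check the daemon logs and the Task Role" };
  }
  const withErrors = summaries.filter((s) => s.HasError === true).length;
  return { ok: true, message: summaries.length + " traces, " + withErrors + " with errors" };
}

console.log(checkTracing({ TraceSummaries: [] }));
console.log(checkTracing({ TraceSummaries: [{ Id: "1-6553f100-000000000000000000000000", HasError: false }] }));
```

A pipeline step could pipe the CLI output into this check and fail the deploy when `ok` is false, provided some traffic has reached the service inside the queried window.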

Pitfall 3: `enable-ecs-log-metadata: false` losing correlation context

The error: The FireLens configuration sets enable-ecs-log-metadata to false. Logs arrive at CloudWatch, but without the ecs_cluster, ecs_task_arn, and ecs_task_definition fields. When an error occurs, you have the log with the error message, but you don't know which specific task it came from, which makes correlation with metrics, traces, and ECS events difficult.

Why it happens: ECS injects this metadata by default, so it disappears only when the option is explicitly set to false. With the option off, FireLens doesn't inject ECS context metadata into the logs.

How to avoid: Always include "enable-ecs-log-metadata": "true" in the firelensConfiguration options. This is especially critical in multi-task environments where multiple replicas generate logs in the same log group — without ecs_task_arn, it's impossible to filter logs from a specific task to correlate with an incident.

Reflection exercise

You are designing the observability strategy for an e-commerce system with 8 microservices on ECS Fargate. Each service generates an average of 50 MB of logs per hour during normal operation, reaching 500 MB/hour during peaks. The requirements are:

Logs must be retained for 30 days for operations (CloudWatch) and for 1 year for compliance (S3).
Distributed traces must correlate requests across the 8 services.
Alerts must fire if any service's error rate exceeds 1% for 5 minutes.
Observability cost must be justifiable.

How would you structure FireLens to meet requirements 1 and 4 simultaneously (CloudWatch for 30 days + S3 for 1 year via Firehose, without duplicating ingestion cost)? For requirement 2, is the X-Ray daemon alone sufficient or would you need something additional to correlate traces between services that call each other via Cloud Map? And for requirement 3, would you use Container Insights metrics, ALB metrics, or FireLens logs as the data source for the alarms?

Resources for deeper learning

1. Send Amazon ECS logs to an AWS service or AWS Partner (FireLens)

URL: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_firelens.html
What to find: Overview of FireLens, how to configure the sidecar container in the task definition, the firelensConfiguration options, and links to configuration examples for each supported destination.
Why it's the right source: It's the official entry point — includes the architecture diagram and the list of official images maintained by AWS.

2. Amazon ECS Container Insights metrics

URL: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html
What to find: Complete table of all metrics published by Container Insights for ECS, the available dimensions (ClusterName, ServiceName, TaskId, ContainerName), and the difference between standard and enhanced metrics.
Why it's the right source: It's the reference for which metrics exist and with which dimensions — essential for creating alarms and dashboards.

3. Running the X-Ray daemon on Amazon ECS

URL: https://docs.aws.amazon.com/xray/latest/devguide/xray-daemon-ecs.html
What to find: Complete task definition examples with the X-Ray daemon as a sidecar, the IAM permissions needed in the Task Role, environment variables to configure the application SDK, and how to verify that traces are arriving.
Why it's the right source: It's the canonical X-Ray documentation for ECS — includes task definition examples you can use directly.