luizmachado.dev

Amazon ECS Managed Daemons: independent agent control in your cluster

If you've ever operated ECS clusters with EC2, you know the pain: to run a monitoring or logging agent on every instance, you had to embed it in the application task definition as a sidecar, bake it into the AMI, or use the DAEMON scheduling strategy from ECS Services. Any agent update meant coordinating with application teams, modifying task definitions and redeploying services. When you manage hundreds of services, this becomes an operational nightmare.

AWS launched Amazon ECS Managed Daemons to solve exactly this. It's a new capability within ECS Managed Instances that lets you deploy and manage infrastructure agents (monitoring, logging, tracing, security) completely independently from applications.

The problem before

Before Managed Daemons, the options for running agents on ECS instances were:

  • Sidecar in the task definition: each task carried the agent along. Updating the agent = updating the task definition = redeploying the application. Plus, if you have 50 tasks running on the same instance, that's 50 copies of the same agent consuming resources
  • Bake into the AMI: works, but updating the agent means creating a new AMI, updating the launch template and doing a rolling replacement of instances
  • DAEMON scheduling strategy: the best option until now, but still coupled to the ECS Service concept, with no ordering guarantees (the daemon might not be ready when the application comes up)

None of these options gave platform teams independent control over agents. There was always coupling with the application lifecycle.
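To put rough numbers on the sidecar overhead from the list above, here is a back-of-the-envelope sketch. The 128 MiB per agent is an assumed figure chosen for illustration, not a measurement of any real agent.

```shell
#!/bin/sh
# Rough memory overhead of running the agent as a sidecar vs. once per instance.
# AGENT_MEM_MIB is a hypothetical per-agent footprint, used only for illustration.
AGENT_MEM_MIB=128

# sidecar_overhead_mib TASKS -> memory used when every task carries its own agent
sidecar_overhead_mib() {
  echo $(( $1 * AGENT_MEM_MIB ))
}

# daemon_overhead_mib -> memory used when one agent serves the whole instance
daemon_overhead_mib() {
  echo "$AGENT_MEM_MIB"
}

echo "sidecar: $(sidecar_overhead_mib 50) MiB"   # 50 tasks on one instance
echo "daemon:  $(daemon_overhead_mib) MiB"
```

Under that assumption, 50 sidecars cost 6400 MiB on the instance, while a single shared agent costs 128 MiB.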

How Managed Daemons works

The concept is simple: you create a daemon task definition (separate from regular task definitions), then create a daemon associated with the cluster and one or more Managed Instances capacity providers. ECS ensures exactly one copy of the daemon runs on each EC2 instance provisioned by those capacity providers.

Guaranteed lifecycle ordering

The most important point: ECS guarantees execution order. The daemon task starts BEFORE application tasks and drains LAST. This means when your application starts processing requests, the logging/tracing/monitoring agent is already operational. No collection gaps.

If the daemon task stops or becomes unhealthy, ECS automatically drains and replaces the entire instance. This auto-repair guarantees consistent coverage without manual intervention.
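The ordering guarantee can be pictured with a toy model. This sketch only prints the order described above; it does not call ECS, and the task names `web` and `worker` are hypothetical.

```shell
#!/bin/sh
# Toy model of the lifecycle ordering: daemon first on startup, daemon last on drain.
# Task names are hypothetical; nothing here talks to ECS.

# startup_order APP... -> the daemon, then the application tasks
startup_order() {
  echo "daemon"
  for app in "$@"; do
    echo "$app"
  done
}

# drain_order APP... -> the application tasks, then the daemon
drain_order() {
  for app in "$@"; do
    echo "$app"
  done
  echo "daemon"
}

echo "startup:"; startup_order web worker
echo "drain:";   drain_order web worker
```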

Daemon task definition

It's a new type of task definition, separate from application task definitions, with its own parameters and validation. It supports:

  • Privileged containers (required for security and monitoring agents that need host access)
  • Additional Linux capabilities
  • Host filesystem path mounts
  • daemon_bridge network mode, which enables communication with application tasks while maintaining network isolation

These capabilities are essential for agents that need visibility into host-level metrics, processes and system calls.

Deploy and rolling updates

When you update a daemon to a new task definition revision, ECS performs an automatic rolling deployment:

  1. Drains a configurable percentage of instances simultaneously (default: 25%)
  2. Provisions new instances with the updated daemon
  3. Starts the daemon first on the new instance
  4. Migrates application tasks to the new instance
  5. Terminates old instances

This "start before stop" approach ensures no gaps in daemon coverage during updates. Your logging and monitoring agents remain operational throughout the entire process.
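To get a feel for how the drain percentage translates into batches, here is a small sizing sketch. It assumes the percentage is rounded up and that at least one instance is drained per batch; the exact rounding rule ECS applies is not stated in this article.

```shell
#!/bin/sh
# Batch sizing for a rolling daemon update.
# Assumption: the percentage is rounded up and at least one instance is drained.

# batch_size TOTAL_INSTANCES DRAIN_PERCENT -> instances drained at the same time
batch_size() {
  size=$(( ($1 * $2 + 99) / 100 ))   # integer ceiling of TOTAL * PERCENT / 100
  if [ "$size" -lt 1 ]; then
    size=1
  fi
  echo "$size"
}

# rounds TOTAL_INSTANCES DRAIN_PERCENT -> batches needed to cover every instance
rounds() {
  b=$(batch_size "$1" "$2")
  echo $(( ($1 + b - 1) / b ))
}

echo "20 instances at 25%: $(batch_size 20 25) per batch, $(rounds 20 25) rounds"
echo "10 instances at 25%: $(batch_size 10 25) per batch, $(rounds 10 25) rounds"
```

A lower percentage means more rounds and a slower rollout, but fewer instances are in flight at any moment.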

ECS also provides built-in circuit breaker protection. You can configure:

  • Bake time: how long ECS waits after updating all instances before considering the deployment complete. During this period, it monitors CloudWatch alarms and automatically rolls back if any alarm triggers
  • CloudWatch alarms: alarms that ECS monitors during deployment to decide whether to roll back
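The rollback decision during bake time can be sketched as a simple check over alarm states. This is an illustration of the behavior described above, not ECS's implementation; it assumes that only the `ALARM` state triggers a rollback and that `INSUFFICIENT_DATA` does not.

```shell
#!/bin/sh
# Sketch of the rollback decision during bake time.
# Input: states of the monitored CloudWatch alarms (OK, ALARM, INSUFFICIENT_DATA).
# Assumption: only the ALARM state triggers a rollback.

# bake_decision STATE... -> prints ROLLBACK if any alarm fired, otherwise COMPLETE
bake_decision() {
  for state in "$@"; do
    if [ "$state" = "ALARM" ]; then
      echo "ROLLBACK"
      return 0
    fi
  done
  echo "COMPLETE"
}

bake_decision OK OK INSUFFICIENT_DATA
bake_decision OK ALARM
```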

Practical example: CloudWatch Agent as a daemon

Let's see how to configure the CloudWatch Agent as a managed daemon.

Via Console

  1. In the ECS console, go to Daemon task definitions in the side menu
  2. Create a new daemon task definition with:
     - Image: public.ecr.aws/cloudwatch-agent/cloudwatch-agent:latest
     - CPU: 1 vCPU, Memory: 0.5 GB
     - Task execution role: ecsTaskExecutionRole
  3. In the cluster, go to the Daemons tab and click Create daemon
  4. Select the daemon task definition you created and the Managed Instances capacity provider
  5. Configure drain percentage and alarms as needed

Via CLI

Create the daemon task definition first (that step is not shown here), then create the daemon:

    # Create the daemon
    aws ecs create-daemon --cli-input-json '{
      "clusterArn": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
      "daemonName": "cloudwatch-agent",
      "daemonTaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:daemon-task-definition/cw-agent:1",
      "capacityProviderArns": [
        "arn:aws:ecs:us-east-1:123456789012:capacity-provider/my-cp"
      ]
    }'
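If you create daemons for several clusters, hand-editing ARNs gets error-prone. A minimal sketch that renders the same input JSON from variables is shown here. The field names are the ones from the command above; the variable names and the helper `render_create_daemon_json` are made up for this example.

```shell
#!/bin/sh
# Render the create-daemon input from variables instead of hand-editing ARNs.
# Account ID, region and names are the placeholder values used in this article.
REGION="us-east-1"
ACCOUNT_ID="123456789012"
CLUSTER="my-cluster"
DAEMON_NAME="cloudwatch-agent"
TASK_DEF="cw-agent:1"
CAPACITY_PROVIDER="my-cp"

# render_create_daemon_json -> prints the JSON document on stdout
render_create_daemon_json() {
  prefix="arn:aws:ecs:${REGION}:${ACCOUNT_ID}"
  cat <<EOF
{
  "clusterArn": "${prefix}:cluster/${CLUSTER}",
  "daemonName": "${DAEMON_NAME}",
  "daemonTaskDefinitionArn": "${prefix}:daemon-task-definition/${TASK_DEF}",
  "capacityProviderArns": [
    "${prefix}:capacity-provider/${CAPACITY_PROVIDER}"
  ]
}
EOF
}

# Dry run: print the document so it can be reviewed before anything is created.
render_create_daemon_json
```

To use it, redirect the output to a file such as `create-daemon.json` and pass it with `--cli-input-json file://create-daemon.json`, the AWS CLI's usual way of reading input JSON from disk.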

Check the status:

    # List daemons
    aws ecs list-daemons --cluster-arn arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster

    # Daemon details
    aws ecs describe-daemons --daemon-arn arn:aws:ecs:us-east-1:123456789012:daemon/my-cluster/cloudwatch-agent
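The daemon ARN in the example above appears to follow a `daemon/<cluster>/<name>` shape. Assuming that pattern holds in general (it is inferred from this article's example only), a small hypothetical helper can build it from its parts:

```shell
#!/bin/sh
# daemon_arn REGION ACCOUNT_ID CLUSTER DAEMON_NAME
# Assumption: daemon ARNs follow the daemon/<cluster>/<name> shape seen in the example above.
daemon_arn() {
  echo "arn:aws:ecs:$1:$2:daemon/$3/$4"
}

ARN=$(daemon_arn us-east-1 123456789012 my-cluster cloudwatch-agent)
# Dry run: print the command rather than executing it.
echo "aws ecs describe-daemons --daemon-arn $ARN"
```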

Update to a new version:

    aws ecs update-daemon \
      --daemon-arn arn:aws:ecs:us-east-1:123456789012:daemon/my-cluster/cloudwatch-agent \
      --daemon-task-definition-arn arn:aws:ecs:us-east-1:123456789012:daemon-task-definition/cw-agent:2
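When updates are scripted, the next revision ARN can be derived from the current one instead of being typed by hand. The helper `next_revision_arn` is hypothetical, and the sketch assumes the next revision has already been registered. It prints the command rather than running it.

```shell
#!/bin/sh
# next_revision_arn ARN -> same ARN with the trailing revision number incremented
next_revision_arn() {
  base=${1%:*}    # everything before the last colon
  rev=${1##*:}    # the revision number after the last colon
  echo "${base}:$(( rev + 1 ))"
}

DAEMON_ARN="arn:aws:ecs:us-east-1:123456789012:daemon/my-cluster/cloudwatch-agent"
CURRENT_DEF="arn:aws:ecs:us-east-1:123456789012:daemon-task-definition/cw-agent:1"
NEXT_DEF=$(next_revision_arn "$CURRENT_DEF")

# Dry run: print the update command for review.
echo "aws ecs update-daemon --daemon-arn $DAEMON_ARN --daemon-task-definition-arn $NEXT_DEF"
```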

Use cases

Observability

The most obvious case. Deploy CloudWatch Agent, Datadog Agent, New Relic Agent or any other metrics/logs/traces collector as a daemon: one agent per instance, shared across all tasks, with an independent lifecycle.

Security

Security agents like Falco, Sysdig or EDR agents that need privileged host access. Support for privileged containers and Linux capabilities makes this viable without workarounds.

Networking

Service meshes and proxies that need to run on every instance. Envoy, Consul Connect or custom networking agents.

Compliance

Audit and compliance agents that must be present on 100% of instances, with guarantees that they're running before any workload.

Managed Daemons vs DAEMON scheduling strategy

| Aspect | Managed Daemons | DAEMON Strategy |
|---|---|---|
| Platform | ECS Managed Instances | EC2 launch type |
| Ordering | Daemon starts before apps | No ordering guarantee |
| Auto-repair | Replaces instance if daemon fails | Tries to restart the service |
| Rolling update | Start before stop, no gaps | May have gaps during update |
| Circuit breaker | Built-in with bake time and alarms | Basic |
| Task definition | Separate type (daemon task def) | Regular task definition |
| Isolation | Independent lifecycle | Coupled to ECS Service |

Limitations

  • Works only with ECS Managed Instances, not with Fargate or the traditional EC2 launch type
  • The daemon doesn't provision instances on its own. Instances are only created when application tasks need capacity
  • Each UpdateDaemon must include all desired settings. Omitted settings revert to defaults (no merge with previous configuration)
  • Not a limitation, but worth noting: it is available in all AWS regions at no additional cost (you only pay for the compute consumed by daemon tasks)

Conclusion

ECS Managed Daemons solves a real separation of concerns problem in container clusters. Platform teams gain independent control over infrastructure agents without depending on application teams for deploys or updates. The ordering guarantee (daemon before app) and auto-repair eliminate entire classes of operational issues.

If you use ECS with EC2 and need to run agents on every instance, it's worth migrating to Managed Instances + Managed Daemons. The operational experience is significantly better than any previous alternative.