luizmachado.dev


Session 036: CloudWatch Custom Metrics with EMF (Embedded Metric Format)

Dependencies: session-018-ecs-observabilidade-firelens-xray, session-024-lambda-observabilidade-xray-insights


Objective

By the end of this session, you will be able to emit custom metrics from a Lambda function using EMF (by printing structured JSON to stdout, with no separate API call), create a CloudWatch dashboard that plots these metrics, explain why EMF is preferable to put-metric-data in terms of latency and cost, and use AWS Lambda Powertools to simplify the emission.


Context

[FACT] The Embedded Metric Format (EMF) is a CloudWatch JSON specification that instructs the CloudWatch Logs service to extract custom metrics automatically from structured log events. Instead of making a separate API call to CloudWatch, the function writes a specially formatted JSON document to stdout. The Lambda Logs Agent captures it and forwards it to CloudWatch Logs, which extracts the metrics asynchronously.

[CONSENSUS] EMF is the canonical approach for custom metrics in Lambda for three reasons: (1) it does not block function execution (asynchronous via logs); (2) it eliminates the cost of PutMetricData per API call; (3) the EMF JSON also remains available as a log event in CloudWatch Logs Insights for debugging.


Core concepts

1. EMF document structure

[FACT] A valid EMF document is a JSON object with the mandatory _aws metadata key, plus the metric values and dimension values as root-level fields:

{
  "_aws": {
    "Timestamp": 1574109732004,
    "CloudWatchMetrics": [
      {
        "Namespace": "MyApp/Payments",
        "Dimensions": [["service", "environment"]],
        "Metrics": [
          { "Name": "ProcessingLatency", "Unit": "Milliseconds", "StorageResolution": 60 },
          { "Name": "SuccessfulPayments",  "Unit": "Count",        "StorageResolution": 60 }
        ]
      }
    ]
  },
  "service":             "payment-processor",
  "environment":         "production",
  "ProcessingLatency":   45.3,
  "SuccessfulPayments":  1,
  "orderId":             "ord-abc-123"
}

[FACT] EMF structural rules:
- _aws.Timestamp: epoch in milliseconds (required)
- _aws.CloudWatchMetrics: array of MetricDirective (required)
- Each MetricDirective must have Namespace, Dimensions (array of arrays of strings), and Metrics (array of MetricDefinition)
- Maximum of 100 metrics per EMF document
- Maximum of 30 dimensions per DimensionSet
- Metric values: number or array of numbers (maximum 100 elements)
- Extra fields beyond metrics/dimensions (like orderId above) are preserved in the log but do not become metrics — they serve as context for Logs Insights
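
The rules above can be checked mechanically before a document is printed. The following is a minimal sketch, not an official validator: the function name validate_emf and its error strings are hypothetical, and it covers only the rules listed in this section (it also assumes every dimension key must have a root-level value, as in the example document).

```python
from typing import Any

MAX_METRICS = 100      # metric definitions allowed, as listed above
MAX_DIMENSIONS = 30    # dimension keys per DimensionSet
MAX_VALUES = 100       # elements when a metric value is an array


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_emf(doc: dict) -> list[str]:
    """Return a list of rule violations; an empty list means all checks passed."""
    errors: list[str] = []
    aws = doc.get("_aws")
    if not isinstance(aws, dict):
        return ["missing _aws metadata object"]
    if not isinstance(aws.get("Timestamp"), int):
        errors.append("_aws.Timestamp must be an integer (epoch milliseconds)")
    directives = aws.get("CloudWatchMetrics")
    if not isinstance(directives, list):
        return errors + ["_aws.CloudWatchMetrics must be an array"]
    for directive in directives:
        if not directive.get("Namespace"):
            errors.append("MetricDirective is missing Namespace")
        for dimension_set in directive.get("Dimensions", []):
            if len(dimension_set) > MAX_DIMENSIONS:
                errors.append("DimensionSet has more than 30 dimensions")
            for key in dimension_set:
                if key not in doc:
                    errors.append(f"dimension '{key}' has no root-level value")
        definitions = directive.get("Metrics", [])
        if len(definitions) > MAX_METRICS:
            errors.append("more than 100 metric definitions")
        for definition in definitions:
            name = definition.get("Name")
            value = doc.get(name)
            if _is_number(value):
                continue
            if (isinstance(value, list) and len(value) <= MAX_VALUES
                    and all(_is_number(v) for v in value)):
                continue
            errors.append(
                f"metric '{name}' needs a number or an array of up to 100 numbers"
            )
    return errors
```

Calling validate_emf() on the example document from this section should return an empty list; removing the ProcessingLatency root field should produce one error for that metric.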

[FACT] Valid units: Seconds, Microseconds, Milliseconds, Bytes, Kilobytes, Megabytes, Gigabytes, Terabytes, Bits, Kilobits, Megabits, Gigabits, Terabits, Percent, Count, Bytes/Second, Kilobytes/Second, Megabytes/Second, Gigabytes/Second, Terabytes/Second, Bits/Second, Kilobits/Second, Megabits/Second, Gigabits/Second, Terabits/Second, Count/Second, None


2. EMF vs. PutMetricData: cost and latency comparison

[FACT] There are two ingestion mechanisms for custom metrics in CloudWatch:

                      EMF (via CloudWatch Logs)        PutMetricData API
──────────────────────────────────────────────────────────────────────────────
Execution             Asynchronous: print() returns    Synchronous: blocks until
                      immediately                      the API responds
Latency in function   None added                       Network latency
                                                       (typically 10-50 ms)
Write cost            Log ingestion cost               $0.01 per 1,000 API calls
                      ($0.50/GB in us-east-1)          (regardless of volume)
Metric cost           Same: $0.30/metric/month         Same: $0.30/metric/month
                      (first 10,000)                   (first 10,000)
Required permission   logs:PutLogEvents                cloudwatch:PutMetricData
Context data          Full log event available         Metric data only
                      in Logs Insights
Batch limit           100 metrics per EMF blob         1,000 metrics per
                                                       PutMetricData call
                                                       (previously 20)

[CONSENSUS] For Lambda functions with high invocation rates, EMF is financially superior because you don't pay per API call — you only pay for log ingestion (which would happen anyway). The break-even point is low: any function with more than ~1,000 invocations/day typically saves money with EMF.
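
The comparison can be made concrete with a little arithmetic. This is a minimal sketch using the list prices quoted in the table above (us-east-1). The 500-byte document size, one PutMetricData call per invocation, and a decimal gigabyte are assumptions chosen for illustration, and free tiers are ignored.

```python
PUT_METRIC_DATA_USD_PER_CALL = 0.01 / 1_000   # $0.01 per 1,000 API calls
LOG_INGESTION_USD_PER_GB = 0.50               # CloudWatch Logs ingestion
BYTES_PER_GB = 1_000_000_000                  # assumption: decimal GB


def put_metric_data_cost(invocations: int, calls_per_invocation: int = 1) -> float:
    """Write cost in USD when every invocation calls PutMetricData."""
    return invocations * calls_per_invocation * PUT_METRIC_DATA_USD_PER_CALL


def emf_cost(invocations: int, emf_bytes: int = 500) -> float:
    """Extra log-ingestion cost in USD when every invocation prints one EMF document."""
    return invocations * emf_bytes / BYTES_PER_GB * LOG_INGESTION_USD_PER_GB


def break_even_bytes() -> float:
    """EMF document size at which one document costs as much as one API call."""
    return PUT_METRIC_DATA_USD_PER_CALL / LOG_INGESTION_USD_PER_GB * BYTES_PER_GB


if __name__ == "__main__":
    monthly = 30_000_000  # hypothetical: 1M invocations/day for 30 days
    print(f"PutMetricData: ${put_metric_data_cost(monthly):.2f}")
    print(f"EMF:           ${emf_cost(monthly):.2f}")
    print(f"Break-even document size: {break_even_bytes():.0f} bytes")
```

Under these assumptions, 30 million invocations per month cost roughly $300 in PutMetricData calls versus roughly $7.50 in additional log ingestion, and the two write costs only meet when a single EMF document approaches 20 KB. The per-metric storage charge is the same in both cases.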

[OPINION — AWS Well-Architected Serverless Lens] EMF is the recommended approach for custom metrics in serverless workloads as it eliminates synchronous calls that increase the billable duration of the function.


3. High-resolution metrics

[FACT] The StorageResolution field in MetricDefinition defines storage granularity:

StorageResolution = 60  →  Standard resolution: CloudWatch stores data at
                            1-minute granularity
                            (default, lowest cost)

StorageResolution = 1   →  High resolution: CloudWatch stores data at
                            1-second granularity
                            (useful for real-time anomaly detection,
                            sub-minute alerts)

[FACT] High-resolution metrics carry the same metric charge as standard-resolution metrics ($0.30/metric/month), but consume more internal storage. Alarms on high-resolution metrics can use a period of 10 or 30 seconds.
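
To see where the setting lives in practice, here is a minimal sketch that builds a single-metric EMF line with a chosen StorageResolution. The function name build_emf and its parameters are hypothetical; only the values 1 and 60 are accepted, matching the two resolutions described above.

```python
import json
import time

STANDARD_RESOLUTION = 60  # 1-minute granularity (default)
HIGH_RESOLUTION = 1       # 1-second granularity


def build_emf(namespace: str, service: str, name: str, unit: str, value: float,
              storage_resolution: int = STANDARD_RESOLUTION) -> str:
    """Return one EMF document as a single-line JSON string."""
    if storage_resolution not in (STANDARD_RESOLUTION, HIGH_RESOLUTION):
        raise ValueError("StorageResolution must be 1 or 60")
    doc = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch in milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["service"]],
                "Metrics": [{
                    "Name": name,
                    "Unit": unit,
                    "StorageResolution": storage_resolution,
                }],
            }],
        },
        "service": service,
        name: value,
    }
    return json.dumps(doc)


if __name__ == "__main__":
    print(build_emf("MyApp/Payments", "payment-processor",
                    "ProcessingLatency", "Milliseconds", 45.3, HIGH_RESOLUTION))
```

The resolution is set per metric definition, so one document can mix a high-resolution latency metric with standard-resolution counters.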


4. Unique metric = name + namespace + dimensions

[FACT] CloudWatch defines a unique metric by the combination of: metric_name + namespace + {dimension_key: dimension_value, ...}. This distinction is critical for understanding cost:

Metric A:  Namespace="MyApp", Name="Latency", {service="payment"}
Metric B:  Namespace="MyApp", Name="Latency", {service="checkout"}
→ These are 2 distinct metrics, billed separately.

High-cardinality trap:
Namespace="MyApp", Name="Latency", {requestId="abc-123"}
Namespace="MyApp", Name="Latency", {requestId="def-456"}
→ Each unique requestId creates a new metric!
   1M requests/day = potentially 1M new metrics/day
   = explosive cost (at the first-tier rate, $0.30 × 1,000,000 = $300,000/month)

[FACT] High-cardinality values (requestId, userId, sessionId) should never be used as dimensions. Emit them as metadata fields in the EMF document instead: they stay in the log, searchable via Logs Insights, at no metric cost.
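
The identity rule can be expressed as a set of (namespace, name, dimensions) tuples. The sketch below counts how many distinct metrics a stream of emissions would create. The function name count_unique_metrics and the sample data are hypothetical, and the code only models the identity rule, not CloudWatch billing.

```python
from typing import Iterable

Emission = tuple[str, str, dict[str, str]]  # (namespace, metric_name, dimensions)


def count_unique_metrics(emissions: Iterable[Emission]) -> int:
    """Count distinct metric identities: namespace + name + dimension key/value pairs."""
    identities = {
        (namespace, name, frozenset(dimensions.items()))
        for namespace, name, dimensions in emissions
    }
    return len(identities)


if __name__ == "__main__":
    # Low cardinality: the dimension only ever takes two values
    by_service = [
        ("MyApp", "Latency", {"service": "payment" if i % 2 else "checkout"})
        for i in range(1_000)
    ]
    # High cardinality: a fresh requestId on every emission
    by_request = [
        ("MyApp", "Latency", {"requestId": f"req-{i}"})
        for i in range(1_000)
    ]
    print(count_unique_metrics(by_service))
    print(count_unique_metrics(by_request))
```

The same 1,000 emissions yield 2 metrics when keyed by service and 1,000 metrics when keyed by requestId, which is the trap described above.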


5. AWS Lambda Powertools — Metrics utility

[FACT] The aws_lambda_powertools.metrics module is an abstraction over EMF that validates the schema, serializes the JSON, flushes at the end of the handler through the @log_metrics decorator, and provides add_metadata() so that high-cardinality values can stay out of dimensions.

Metrics behavior (singleton shared across modules):
  - Accumulates metrics in memory during the invocation
  - Flushes automatically at the end of the handler (via @log_metrics)
  - Flushes automatically on reaching 100 metrics (EMF limit)
  - Validates units, namespace, and dimensions at emission time

EphemeralMetrics (non-singleton):
  - Isolated instance that shares no state
  - Useful for multi-tenant workloads or metrics with entirely different dimensions
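
The accumulate-then-flush behavior can be illustrated without the library. The following is a simplified stand-in, not the Powertools implementation: the class MetricBuffer and its methods are hypothetical, and it models only in-memory accumulation, an explicit flush, and the automatic flush at 100 entries.

```python
import json
import time
from typing import Callable


class MetricBuffer:
    """Toy model of a metrics buffer that flushes EMF documents."""

    MAX_METRICS = 100  # EMF limit described above

    def __init__(self, namespace: str, service: str,
                 emit: Callable[[str], None] = print) -> None:
        self.namespace = namespace
        self.service = service
        self.emit = emit
        self._metrics: dict[str, tuple[str, list[float]]] = {}

    def add_metric(self, name: str, unit: str, value: float) -> None:
        if name in self._metrics:
            self._metrics[name][1].append(value)
        else:
            self._metrics[name] = (unit, [value])
        # Flush early when either the metric count or a value array hits the limit
        if (len(self._metrics) >= self.MAX_METRICS
                or len(self._metrics[name][1]) >= self.MAX_METRICS):
            self.flush()

    def flush(self) -> None:
        if not self._metrics:
            return  # nothing buffered, nothing to print
        doc: dict = {
            "_aws": {
                "Timestamp": int(time.time() * 1000),
                "CloudWatchMetrics": [{
                    "Namespace": self.namespace,
                    "Dimensions": [["service"]],
                    "Metrics": [{"Name": n, "Unit": u}
                                for n, (u, _) in self._metrics.items()],
                }],
            },
            "service": self.service,
        }
        for name, (_, values) in self._metrics.items():
            doc[name] = values if len(values) > 1 else values[0]
        self.emit(json.dumps(doc))
        self._metrics.clear()
```

Passing a list's append method as emit makes the behavior easy to observe: adding 100 distinct metrics produces one document without any explicit flush() call, and a later flush() emits whatever was added afterwards.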

Practical example

Scenario: Payments API — business and operational metrics

CDK Python — Lambda + Powertools Layer + Dashboard

from aws_cdk import (
    Stack, Duration, RemovalPolicy,
    aws_lambda as lambda_,
    aws_logs as logs,
    aws_cloudwatch as cw,
    aws_cloudwatch_actions as cw_actions,
    aws_sns as sns,
)
from constructs import Construct


class PaymentMetricsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # ── Lambda Powertools Layer (official ARN per region/version) ───
        # See: https://docs.aws.amazon.com/powertools/python/latest/
        powertools_layer = lambda_.LayerVersion.from_layer_version_arn(
            self, "PowertoolsLayer",
            layer_version_arn=(
                f"arn:aws:lambda:{self.region}:017000801446:"
                "layer:AWSLambdaPowertoolsPythonV3-python312-x86_64:31"
            ),
        )

        # ── Lambda function ───────────────────────────────────────────
        payment_fn = lambda_.Function(
            self, "PaymentProcessor",
            function_name="payment-processor",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="handler.lambda_handler",
            code=lambda_.Code.from_asset("lambda/payment"),
            timeout=Duration.seconds(30),
            memory_size=256,
            layers=[powertools_layer],
            environment={
                "POWERTOOLS_SERVICE_NAME":      "payment-processor",
                "POWERTOOLS_METRICS_NAMESPACE": "MyApp/Payments",
                "ENVIRONMENT":                  "production",
                "LOG_LEVEL":                    "INFO",
            },
            log_retention=logs.RetentionDays.ONE_WEEK,
        )

        # ── CloudWatch dashboard ──────────────────────────────────────
        dashboard = cw.Dashboard(
            self, "PaymentDashboard",
            dashboard_name="payment-metrics",
        )

        # Widget 1: Success vs failure rate
        dashboard.add_widgets(
            cw.GraphWidget(
                title="Payment Success vs Failure Rate",
                width=12,
                left=[
                    cw.Metric(
                        namespace="MyApp/Payments",
                        metric_name="SuccessfulPayments",
                        dimensions_map={"service": "payment-processor", "environment": "production"},
                        statistic="Sum",
                        period=Duration.minutes(1),
                        color="#2ca02c",
                    ),
                    cw.Metric(
                        namespace="MyApp/Payments",
                        metric_name="FailedPayments",
                        dimensions_map={"service": "payment-processor", "environment": "production"},
                        statistic="Sum",
                        period=Duration.minutes(1),
                        color="#d62728",
                    ),
                ],
            ),
            # Widget 2: Processing latency (p50, p95, p99)
            cw.GraphWidget(
                title="Processing Latency (ms)",
                width=12,
                left=[
                    cw.Metric(
                        namespace="MyApp/Payments",
                        metric_name="ProcessingLatency",
                        dimensions_map={"service": "payment-processor", "environment": "production"},
                        statistic="p50",
                        period=Duration.minutes(1),
                        label="p50",
                        color="#1f77b4",
                    ),
                    cw.Metric(
                        namespace="MyApp/Payments",
                        metric_name="ProcessingLatency",
                        dimensions_map={"service": "payment-processor", "environment": "production"},
                        statistic="p95",
                        period=Duration.minutes(1),
                        label="p95",
                        color="#ff7f0e",
                    ),
                    cw.Metric(
                        namespace="MyApp/Payments",
                        metric_name="ProcessingLatency",
                        dimensions_map={"service": "payment-processor", "environment": "production"},
                        statistic="p99",
                        period=Duration.minutes(1),
                        label="p99",
                        color="#d62728",
                    ),
                ],
            ),
        )

        # Widget 3: Total value processed
        dashboard.add_widgets(
            cw.GraphWidget(
                title="Payment Value Processed (USD)",
                width=12,
                left=[
                    cw.Metric(
                        namespace="MyApp/Payments",
                        metric_name="PaymentValueUSD",
                        dimensions_map={"service": "payment-processor", "environment": "production"},
                        statistic="Sum",
                        period=Duration.minutes(5),
                    ),
                ],
            ),
            # Widget 4: Cold starts
            cw.GraphWidget(
                title="Cold Starts",
                width=12,
                left=[
                    cw.Metric(
                        namespace="MyApp/Payments",
                        metric_name="ColdStart",
                        dimensions_map={
                            "function_name": "payment-processor",
                            "service": "payment-processor",
                        },
                        statistic="Sum",
                        period=Duration.minutes(5),
                    ),
                ],
            ),
        )

        # ── Alarm on failures ─────────────────────────────────────────
        alarm_topic = sns.Topic(self, "PaymentAlarmTopic")

        cw.Alarm(
            self, "HighFailureRateAlarm",
            alarm_name="payment-high-failure-rate",
            metric=cw.Metric(
                namespace="MyApp/Payments",
                metric_name="FailedPayments",
                dimensions_map={"service": "payment-processor", "environment": "production"},
                statistic="Sum",
                period=Duration.minutes(5),
            ),
            threshold=10,
            evaluation_periods=2,
            comparison_operator=cw.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
            treat_missing_data=cw.TreatMissingData.NOT_BREACHING,
        ).add_alarm_action(cw_actions.SnsAction(alarm_topic))

Lambda handler — EMF via Powertools and manual EMF

# lambda/payment/handler.py
import json
import os
import time
from typing import Any

from aws_lambda_powertools import Logger, Metrics
from aws_lambda_powertools.metrics import MetricUnit, MetricResolution
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialized globally (shared singleton)
# Namespace and service come from env vars:
#   POWERTOOLS_METRICS_NAMESPACE="MyApp/Payments"
#   POWERTOOLS_SERVICE_NAME="payment-processor"
logger = Logger()
metrics = Metrics()

# Additional dimension shared by all metrics
ENVIRONMENT = os.environ.get("ENVIRONMENT", "development")
metrics.set_default_dimensions(environment=ENVIRONMENT)


@metrics.log_metrics(capture_cold_start_metric=True)
@logger.inject_lambda_context
def lambda_handler(event: dict, context: LambdaContext) -> dict:
    """
    @metrics.log_metrics:
      - Serializes and flushes the EMF blob to stdout at the end of the handler
      - Captures cold start in a separate EMF blob (function_name dimension)
      - If an exception is raised, the flush still runs
    """
    start_time = time.perf_counter()

    order_id   = event.get("orderId", "unknown")
    amount_usd = float(event.get("amountUSD", 0))
    currency   = event.get("currency", "USD")

    try:
        # Simulates processing
        result = _process_payment(order_id, amount_usd, currency)

        # ── Business metrics ─────────────────────────────────────────
        metrics.add_metric(
            name="SuccessfulPayments",
            unit=MetricUnit.Count,
            value=1,
        )
        metrics.add_metric(
            name="PaymentValueUSD",
            unit=MetricUnit.Count,      # CloudWatch has no "Currency" unit
            value=amount_usd,
        )

        # ── High-cardinality metadata (in the log, NOT a dimension) ──
        # orderId goes into the log event but does NOT create a new metric per invocation
        metrics.add_metadata(key="orderId",   value=order_id)
        metrics.add_metadata(key="currency",  value=currency)
        metrics.add_metadata(key="processor", value=result.get("processor"))

        return {"statusCode": 200, "body": json.dumps(result)}

    except ValueError as e:
        # Order validation error
        metrics.add_metric(name="FailedPayments", unit=MetricUnit.Count, value=1)
        metrics.add_metadata(key="errorType",    value="ValidationError")
        metrics.add_metadata(key="errorMessage", value=str(e))
        metrics.add_metadata(key="orderId",      value=order_id)
        logger.error("Payment validation failed", extra={"orderId": order_id, "error": str(e)})
        return {"statusCode": 400, "body": json.dumps({"error": str(e)})}

    except Exception as e:
        metrics.add_metric(name="FailedPayments", unit=MetricUnit.Count, value=1)
        metrics.add_metadata(key="errorType",    value=type(e).__name__)
        metrics.add_metadata(key="orderId",      value=order_id)
        logger.exception("Unexpected payment error", extra={"orderId": order_id})
        return {"statusCode": 500, "body": json.dumps({"error": "Internal error"})}

    finally:
        # Latency is always emitted, regardless of success or failure
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        metrics.add_metric(
            name="ProcessingLatency",
            unit=MetricUnit.Milliseconds,
            value=elapsed_ms,
            resolution=MetricResolution.High,  # StorageResolution=1: sub-minute
        )


def _process_payment(order_id: str, amount: float, currency: str) -> dict:
    if amount <= 0:
        raise ValueError(f"Invalid amount: {amount}")
    if currency not in ("USD", "EUR", "BRL"):
        raise ValueError(f"Unsupported currency: {currency}")
    time.sleep(0.02)  # simulates an external call
    return {"orderId": order_id, "status": "approved", "processor": "stripe"}

Manual EMF without Powertools (to understand the mechanism)

import json
import time

def emit_emf_manually(metric_name: str, value: float, unit: str,
                      namespace: str, dimensions: dict) -> None:
    """
    Emits an EMF metric via print(), with no external dependency.
    The Lambda Logs Agent captures stdout and sends it to CloudWatch Logs.
    CloudWatch Logs extracts the metric automatically.
    """
    emf_document = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),   # epoch in milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions.keys())],  # array of arrays
                    "Metrics": [
                        {"Name": metric_name, "Unit": unit, "StorageResolution": 60}
                    ],
                }
            ],
        },
        metric_name: value,
        **dimensions,   # dimensions at the root level
    }
    # One line per EMF document (no internal line breaks)
    print(json.dumps(emf_document))

CLI: emit a metric manually, inspect metrics, and create an alarm

# 1. Emit a metric via PutMetricData (for comparison with EMF)
aws cloudwatch put-metric-data \
  --namespace "MyApp/Payments" \
  --metric-name "ManualTestMetric" \
  --value 42 \
  --unit Count \
  --dimensions service=payment-processor,environment=production

# 2. Check that the EMF metrics were created
aws cloudwatch list-metrics \
  --namespace "MyApp/Payments" \
  --query 'Metrics[*].{Name:MetricName,Dims:Dimensions}'

# 3. Get latency percentiles (last 15 minutes)
#    Percentiles go in --extended-statistics; a single call accepts either
#    --statistics or --extended-statistics, not both.
aws cloudwatch get-metric-statistics \
  --namespace "MyApp/Payments" \
  --metric-name "ProcessingLatency" \
  --dimensions Name=service,Value=payment-processor \
               Name=environment,Value=production \
  --start-time "$(date -u -d '15 minutes ago' '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -v-15M '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 60 \
  --extended-statistics p50 p95 p99

# 4. Create an alarm on a high error rate
aws cloudwatch put-metric-alarm \
  --alarm-name "payment-high-failure-rate" \
  --alarm-description "More than 10 payment failures in 5 minutes" \
  --namespace "MyApp/Payments" \
  --metric-name "FailedPayments" \
  --dimensions Name=service,Value=payment-processor Name=environment,Value=production \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching

# 5. Inspect the EMF JSON generated by the Lambda in the logs
LOG_GROUP="/aws/lambda/payment-processor"
LATEST_STREAM=$(aws logs describe-log-streams \
  --log-group-name "$LOG_GROUP" \
  --order-by LastEventTime \
  --descending \
  --limit 1 \
  --query 'logStreams[0].logStreamName' \
  --output text)

aws logs get-log-events \
  --log-group-name "$LOG_GROUP" \
  --log-stream-name "$LATEST_STREAM" \
  --limit 20 \
  --query 'events[*].message' \
  --output text | grep -A5 '"_aws"' | head -50

# 6. Query metrics and orderId via Logs Insights
# (metadata fields remain searchable even though they are not dimensions)
aws logs start-query \
  --log-group-name "/aws/lambda/payment-processor" \
  --start-time "$(date -u -d '1 hour ago' +%s 2>/dev/null || date -u -v-1H +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    fields @timestamp, orderId, ProcessingLatency, errorType
    | filter ispresent(FailedPayments)
    | sort @timestamp desc
    | limit 20
  '

Common pitfalls

1. High-cardinality dimensions = cost explosion
[FACT] Each unique combination of (namespace + metric_name + dimensions) is a distinct metric charged at $0.30/month (for the first 10,000). Using requestId, userId, or sessionId as a dimension with 1M unique values/month generates 1M new metrics, which is $300,000/month at that first-tier rate; volume tiers lower the per-metric price, but the bill remains enormous. Use add_metadata() for high-cardinality data: it stays in the log and doesn't create metrics.

2. Multi-line EMF is not processed
[FACT] The EMF document must be a single JSON object on a single line in stdout. If the JSON is pretty-printed (with line breaks), CloudWatch Logs does not recognize the format and silently discards the metric extraction. Powertools guarantees this; in manual mode, use json.dumps(doc) without indent.
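
A quick way to see the difference is to compare the two serializations side by side. This is a minimal sketch; the helper name to_emf_line is hypothetical, and the compact separators are a choice made here to keep the log line small, not a requirement of the format.

```python
import json


def to_emf_line(doc: dict) -> str:
    """Serialize an EMF document as one line, suitable for a single print()."""
    # json.dumps without indent never emits raw line breaks; newlines inside
    # string values are escaped as the two characters backslash and n.
    return json.dumps(doc, separators=(",", ":"))


if __name__ == "__main__":
    doc = {
        "_aws": {
            "Timestamp": 1574109732004,
            "CloudWatchMetrics": [{
                "Namespace": "MyApp/Payments",
                "Dimensions": [["service"]],
                "Metrics": [{"Name": "SuccessfulPayments", "Unit": "Count"}],
            }],
        },
        "service": "payment-processor",
        "SuccessfulPayments": 1,
        "note": "line one\nline two",  # embedded newline stays escaped
    }
    single = to_emf_line(doc)
    pretty = json.dumps(doc, indent=2)
    print("\n" in single)  # False: safe to print as one log event
    print("\n" in pretty)  # True: would be split across several log events
```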

3. Metrics or dimensions defined outside the handler (global scope)
[CONSENSUS] Calls such as add_metric() or add_dimension() placed at the global scope of the module run only once, at import time during a cold start. Global state persists between invocations on the same instance, but the module is not re-executed, so once the first flush clears the buffer, warm invocations emit without those values. Powertools warns about this. Permanent dimensions should be configured via set_default_dimensions(), which persist across flushes.

4. Missing flush on exception without the decorator
[FACT] If you use metrics.flush_metrics() manually (without the @log_metrics decorator), an uncaught exception in an intermediate block can leave metrics unemitted. Always use try/finally or use the decorator, which guarantees flush even on exception.
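
The try/finally pattern can be shown without any library. In this minimal sketch the function run_with_flush and its emit parameter are hypothetical stand-ins for a manual flush: the EMF line is written in the finally block, so it is emitted whether the work succeeds or raises.

```python
import json
import time
from typing import Callable


def run_with_flush(work: Callable[[], None],
                   emit: Callable[[str], None] = print) -> None:
    """Run work() and always emit a FailedPayments metric, even on exception."""
    failed = 0
    try:
        work()
    except Exception:
        failed = 1
        raise  # the caller still sees the original error
    finally:
        # Runs on success and on failure, before the exception propagates
        doc = {
            "_aws": {
                "Timestamp": int(time.time() * 1000),
                "CloudWatchMetrics": [{
                    "Namespace": "MyApp/Payments",
                    "Dimensions": [["service"]],
                    "Metrics": [{"Name": "FailedPayments", "Unit": "Count"}],
                }],
            },
            "service": "payment-processor",
            "FailedPayments": failed,
        }
        emit(json.dumps(doc))
```

Emitting a zero on success is a design choice in this sketch: it keeps the metric continuous, which makes missing-data handling in alarms less of a concern.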

5. StorageResolution=1 (high-resolution) without real need
[CONSENSUS] High-resolution metrics allow alarms with 10- or 30-second periods and don't cost more as metrics; the extra storage cost is marginal. The problem is on the alarm side: alarms on high-resolution metrics are evaluated more often and are billed at a higher rate than standard-resolution alarms. Use high resolution only when the SLA truly needs sub-minute granularity.

6. Namespace with high granularity creates visibility silos
[CONSENSUS] Using overly specific namespaces (e.g., MyApp/Payments/OrderType/Subscription) fragments metrics into silos that are hard to correlate on the dashboard. Use one namespace per service/application and use dimensions to segment — dimensions are filterable in the console and in queries.


Reflection exercise

You have a Lambda function that processes events from an SQS queue. For each message, you want to monitor:
- MessageProcessed (Count) — by message type (messageType: order, refund, notification)
- ProcessingDuration (Milliseconds) — processing latency
- The original messageId for correlation with logs

  1. How would you structure the dimensions for MessageProcessed considering there are 3 message types (order, refund, notification)? How many distinct metrics does this create in CloudWatch? What would be the problem if you used messageId as a dimension?

  2. Write the Python code snippet with Powertools that emits MessageProcessed and ProcessingDuration with the messageType dimension, keeping messageId as metadata (not a dimension). Use @metrics.log_metrics.

  3. The EMF blob generated by Powertools is also a log event in CloudWatch Logs. Write a Logs Insights query that lists the last 10 messageId of messages of type refund with ProcessingDuration > 500ms.


Resources for further study

  • [FACT] EMF Specification (JSON structure, limits, schema): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html
  • [FACT] Embedding metrics within logs (overview): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html
  • [FACT] Powertools for AWS Lambda (Python) — Metrics: https://docs.aws.amazon.com/powertools/python/latest/core/metrics/
  • [FACT] CloudWatch Pricing (custom metrics, ingestion): https://aws.amazon.com/cloudwatch/pricing/