luizmachado.dev


Session 039 — X-Ray: groups, annotations, sampling rules and cross-service integration

Dependencies: session-038-cloudwatch-composite-alarms-anomaly


Objective

By the end of this session, you will be able to create X-Ray Groups to filter traces by custom annotation, configure sampling rules to increase sampling on critical endpoints without inflating cost, navigate a distributed trace that traverses ALB → Lambda → DynamoDB identifying where time was spent, and distinguish annotations from metadata for use in filter expressions.


Context

[FACT] AWS X-Ray is a distributed tracing service that collects segment data sent by instrumented applications and groups them into traces by TraceId. Each request generates a trace that traverses all participating services, allowing you to identify where latency occurs and where errors originate.

[FACT] X-Ray integrates with ALB, API Gateway, Lambda, ECS, EC2, and AWS SDK clients (DynamoDB, SQS, SNS, etc.). The depth of integration varies: an ALB only adds the tracing header to incoming requests and does not send segments of its own, while Lambda tracing is enabled through configuration, without modifying code. For other compute services, you need to instrument the code with the X-Ray SDK.


Key concepts

1. Anatomy of a trace: segment, subsegment, inferred segment

[FACT] The X-Ray data hierarchy:

Trace (TraceId)
│  Groups all segments belonging to the same request
│  Retained for 30 days
│
├── Segment (sent by each instrumented service)
│     ├── name, trace_id, id (segment_id)
│     ├── start_time, end_time
│     ├── http request/response
│     ├── annotations (indexed for filter expressions)
│     ├── metadata (not indexed, any type)
│     ├── error / fault / throttle flags
│     └── subsegments
│           ├── AWS SDK calls (DynamoDB, S3, SQS...)
│           ├── HTTP downstream calls
│           └── custom subsegments (arbitrary code)
│
└── Inferred segment (generated by X-Ray for services without the SDK)
      Created from the upstream subsegment that called the service
      E.g.: DynamoDB sends no segment → X-Ray infers one from the
          subsegment of the Lambda that made the DynamoDB call

[FACT] Segment documents have a maximum size of 64 KB. Data beyond this limit is truncated.

Tracing header — cross-service propagation:
[FACT] The first service that receives the request adds the X-Amzn-Trace-Id header with Root (TraceId), Parent (segmentId of the current service), and Sampled (0 or 1). All downstream services propagate this header, maintaining the original TraceId.

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1
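
To make the propagation rule concrete, here is a minimal sketch in plain Python that parses the header and builds the value a service would forward downstream. The function names parse_trace_header and child_header and the segment ID a1b2c3d4e5f60718 are hypothetical; in practice the X-Ray SDK or the managed service does this for you.

```python
from typing import Dict


def parse_trace_header(header: str) -> Dict[str, str]:
    """Split an X-Amzn-Trace-Id value into its key=value fields."""
    fields: Dict[str, str] = {}
    for part in header.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        key, value = part.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def child_header(incoming: str, own_segment_id: str) -> str:
    """Build the header to send downstream: same Root and Sampled, new Parent."""
    fields = parse_trace_header(incoming)
    fields["Parent"] = own_segment_id  # the caller becomes the parent of the next hop
    order = ["Root", "Parent", "Sampled"]
    return ";".join(f"{k}={fields[k]}" for k in order if k in fields)


header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
parsed = parse_trace_header(header)
print(parsed["Root"], parsed["Sampled"])
print(child_header(header, "a1b2c3d4e5f60718"))
```

Root and Sampled are carried through unchanged, which is why every hop ends up in the same trace and follows the same sampling decision.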

2. Sampling rules: default rate and per-endpoint configuration

[FACT] Default rule: first request per second + 5% of additional requests. This rule is conservative for cost control.

[FACT] Sampling rules are evaluated in priority order (lower number = higher priority, from 1 to 9999). The default rule has priority 10000 and is always evaluated last.
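
The priority ordering can be sketched as a first-match lookup. This is a simplified illustration, not the SDK's implementation: the Rule class and first_match function are hypothetical, only two of the matching fields are modeled, and Python's fnmatchcase is used as a stand-in for X-Ray's wildcard matching (it also accepts [seq] patterns, which X-Ray does not define).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class Rule:
    name: str
    priority: int
    http_method: str = "*"
    url_path: str = "*"


def first_match(rules: List[Rule], method: str, path: str) -> Optional[Rule]:
    """Return the matching rule with the lowest priority number, if any."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if fnmatchcase(method, rule.http_method) and fnmatchcase(path, rule.url_path):
            return rule
    return None


rules = [
    Rule("Default", 10000),  # matches everything, evaluated last
    Rule("payment-high-value-endpoints", 100, "POST", "/api/v*/payments*"),
    Rule("health-check-low-sample", 50, "GET", "/health*"),
]

for method, path in [("POST", "/api/v1/payments"), ("GET", "/healthz"), ("GET", "/api/v1/orders")]:
    print(method, path, "->", first_match(rules, method, path).name)
```

A request that matches no custom rule falls through to the default rule, which is why the default sits at the end of the order.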

Structure of a sampling rule:
─────────────────────────────────────────────────────────────
RuleName:        Identifying name (unique)
Priority:        1–9999 (lower = higher priority)
ReservoirSize:   Requests per second that are always sampled (reservoir)
FixedRate:       Fraction of requests sampled beyond the reservoir (0.0–1.0)
ServiceName:     Filter by service name (wildcard * allowed)
ServiceType:     Filter by type: "AWS::Lambda::Function", "AWS::ECS::Container", etc.
Host:            Filter by host
HTTPMethod:      GET, POST, PUT, DELETE, * (wildcard)
URLPath:         Filter by path (wildcard * allowed)
ResourceARN:     Filter by resource ARN

Sampling cost:
  ReservoirSize = 10, FixedRate = 0.05 means:
  - First 10 req/s: ALL sampled
  - Beyond that: 5% sampled
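
The arithmetic behind a reservoir plus a fixed rate can be written as a small function. This is a back-of-the-envelope sketch: expected_sampled_per_second is a hypothetical helper, it assumes steady traffic, and it treats the reservoir as a single pool for the whole service.

```python
def expected_sampled_per_second(rps: float, reservoir_size: int, fixed_rate: float) -> float:
    """Expected traces/s: the reservoir first, then a fraction of the remainder."""
    from_reservoir = min(rps, reservoir_size)
    beyond_reservoir = max(0.0, rps - reservoir_size)
    return from_reservoir + beyond_reservoir * fixed_rate


# ReservoirSize = 10, FixedRate = 0.05 at 100 req/s: 10 + 90 * 0.05
print(expected_sampled_per_second(100, 10, 0.05))

# Below the reservoir size, every request is sampled
print(expected_sampled_per_second(4, 10, 0.05))
```

Because the fixed rate applies only to traffic beyond the reservoir, raising ReservoirSize guarantees coverage at low volume while FixedRate controls cost at high volume.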

[FACT] Sampling rules are centrally managed in X-Ray (not per instance or function). The X-Ray SDK fetches the rules in the background, refreshes its sampling targets periodically (about every 10 seconds), and applies the rules locally, so a sampling decision adds no per-request call to X-Ray.


3. Annotations vs. Metadata

[FACT] Two types of custom data can be added to segments and subsegments:

                Annotations                 Metadata
────────────────────────────────────────────────────────────────
Indexing        YES: indexed for            NO: not searchable
                filter expressions          via filter expressions
Types           Boolean, Number, String     Any JSON type
                                            (objects, arrays, etc.)
Typical use     userId, orderId, endpoint,  Full payload,
                app version, feature flag   configuration, debugging
                                            data
Limit           50 annotations per trace    No documented hard limit
                                            (bounded by the 64 KB segment size)
Visibility      X-Ray console,              X-Ray console,
                GetTraceSummaries, filters  BatchGetTraces (segment docs)
SDK             put_annotation(key, value)  put_metadata(key, value,
                                            namespace="default")
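
One way to apply the type rule in the table is to route each field by its Python type before calling the SDK. The helper split_trace_data below is a hypothetical sketch, not part of any SDK: scalars become annotation candidates, everything else (including None) becomes metadata.

```python
from typing import Any, Dict, Tuple

# Types that X-Ray accepts as annotation values: Boolean, Number, String
ANNOTATION_TYPES = (bool, int, float, str)


def split_trace_data(fields: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate scalar fields (annotations) from structured ones (metadata)."""
    annotations: Dict[str, Any] = {}
    metadata: Dict[str, Any] = {}
    for key, value in fields.items():
        if isinstance(value, ANNOTATION_TYPES):
            annotations[key] = value
        else:
            metadata[key] = value
    return annotations, metadata


annotations, metadata = split_trace_data({
    "orderId": "ORD-123",
    "isPremiumUser": True,
    "retryCount": 2,
    "requestPayload": {"amount": 299.90, "items": ["sku-1", "sku-2"]},
})
print(sorted(annotations))
print(sorted(metadata))
```

Each entry in annotations would then go through put_annotation(key, value) and each entry in metadata through put_metadata(key, value, namespace). Keep the 50-annotation limit in mind when passing large dictionaries.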

4. Filter expressions: full syntax

[FACT] Filter expressions are used both in the console (trace search) and in Group definitions. Syntax: keyword operator value, combined with AND / OR.

Keyword type        Examples
────────────────────────────────────────────────────────────────
Boolean             ok, error, fault, throttle, partial
                    (used without an operator: "fault" = trace with a 5XX)
                    or with one: "ok = false"

Number              responsetime > 2
                    duration >= 5 AND duration <= 8
                    http.status != 200
                    http.status = 429

String              http.url CONTAINS "/api/payments"
                    http.method = "POST"
                    http.url BEGINSWITH "https://api."
                    user CONTAINS ""      (field exists check)
                    name = "payment-service"

Annotation          annotation[userId] = "u-abc123"
                    annotation[orderId] CONTAINS "ORD-"
                    annotation[isPremium] = true
                    annotation[retryCount] > 2
                    !annotation[userId]   (annotation not present)

Complex — service   service("payment-lambda") { fault }
                    service() { fault }          (any service)
                    service(id(name: "payment", type: "AWS::Lambda::Function")) { error }

Complex — edge      edge("alb", "payment-lambda") { error }

Group               group.name = "payment-errors" AND user = "alice"
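
When filter expressions are assembled in code (for example before calling get-trace-summaries), the quoting rules from the table matter: strings are double-quoted, booleans are written as true/false, and numbers are bare. The helpers annotation_filter and all_of below are hypothetical and only cover the annotation[key] form; they reject values containing a double quote rather than guessing at an escape syntax.

```python
from typing import Union

Scalar = Union[bool, int, float, str]


def annotation_filter(key: str, operator: str, value: Scalar) -> str:
    """Render one annotation clause, quoting the value according to its type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        rendered = "true" if value else "false"
    elif isinstance(value, (int, float)):
        rendered = str(value)
    else:
        if '"' in value:
            raise ValueError("double quotes are not supported in this sketch")
        rendered = f'"{value}"'
    return f"annotation[{key}] {operator} {rendered}"


def all_of(*clauses: str) -> str:
    """Combine clauses with AND."""
    return " AND ".join(clauses)


expr = all_of(
    annotation_filter("retryCount", ">", 2),
    annotation_filter("orderId", "CONTAINS", "ORD-"),
)
print(expr)
```

The resulting string can be passed unchanged as the --filter-expression argument or as a Group's filter expression.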

5. X-Ray Groups: service graph and metrics by context

[FACT] A Group is a collection of traces defined by a filter expression. When created, X-Ray:
1. Compares incoming traces against the filter expression when storing them
2. Generates a separate service graph for traces matching the group
3. Publishes CloudWatch metrics (namespace AWS/X-Ray) for the group every minute: ApproximateTraceCount, Throttle, Fault, Error

[FACT] Groups are charged per retrieved trace that matches the filter expression — not for group creation.

[FACT] Updating a Group's filter expression does not affect already-stored traces — only future traces. To avoid mixed data from old and new expressions, delete the group and recreate it.
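
The delete-and-recreate step can be scripted. The sketch below assumes a boto3 X-Ray client is passed in and that the group already exists; recreate_group is a hypothetical helper name, while delete_group and create_group are the client operations it relies on.

```python
from typing import Any


def recreate_group(xray_client: Any, group_name: str, filter_expression: str) -> str:
    """Delete an existing group and create it again with a new filter expression.

    Returns the ARN of the newly created group.
    """
    # Assumption: the group exists; delete_group raises if it does not.
    xray_client.delete_group(GroupName=group_name)
    response = xray_client.create_group(
        GroupName=group_name,
        FilterExpression=filter_expression,
    )
    return response["Group"]["GroupARN"]
```

Typical usage would be recreate_group(boto3.client("xray"), "payment-errors", new_expression). Note that the group's CloudWatch metrics and service graph start fresh from the moment of recreation.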


Practical example

Scenario: Payments API — trace ALB → Lambda → DynamoDB

Flow: ALB → Lambda payment-processor → DynamoDB orders-table

CDK Python — Enable X-Ray + Sampling Rule + Group

from aws_cdk import (
    Stack, Duration,
    aws_lambda as lambda_,
    aws_xray as xray,
    aws_iam as iam,
)
from constructs import Construct


class XRayObservabilityStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # ── Lambda with X-Ray Active Tracing ──────────────────────────
        payment_fn = lambda_.Function(
            self, "PaymentProcessor",
            function_name="payment-processor",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="handler.lambda_handler",
            code=lambda_.Code.from_asset("lambda/payment"),
            timeout=Duration.seconds(30),
            memory_size=256,
            # ACTIVE = Lambda records a trace for every sampled invocation
            # PASS_THROUGH = Lambda follows the upstream tracing header and makes no sampling decision of its own
            tracing=lambda_.Tracing.ACTIVE,
            environment={
                "POWERTOOLS_SERVICE_NAME": "payment-processor",
            },
        )

        # Permission for the Lambda to send traces to X-Ray
        payment_fn.add_to_role_policy(
            iam.PolicyStatement(
                actions=[
                    "xray:PutTraceSegments",
                    "xray:PutTelemetryRecords",
                    "xray:GetSamplingRules",
                    "xray:GetSamplingTargets",
                ],
                resources=["*"],
            )
        )

        # ── Sampling Rule: higher sampling on /payments (critical endpoint) ─
        payment_sampling_rule = xray.CfnSamplingRule(
            self, "PaymentSamplingRule",
            sampling_rule=xray.CfnSamplingRule.SamplingRuleProperty(
                rule_name="payment-high-value-endpoints",
                priority=100,               # High priority (lower number)
                reservoir_size=50,          # 50 req/s always sampled
                fixed_rate=0.10,            # + 10% of additional requests
                service_name="payment-processor",
                service_type="AWS::Lambda::Function",
                http_method="POST",
                url_path="/api/v*/payments*",
                host="*",
                resource_arn="*",
                version=1,
            ),
        )

        # Low-sampling rule for health checks (unnecessary noise)
        health_check_rule = xray.CfnSamplingRule(
            self, "HealthCheckSamplingRule",
            sampling_rule=xray.CfnSamplingRule.SamplingRuleProperty(
                rule_name="health-check-low-sample",
                priority=50,               # Higher priority than the payment rule
                reservoir_size=1,          # 1 req/s, just to confirm it works
                fixed_rate=0.0,            # 0% additional
                service_name="*",
                service_type="*",
                http_method="GET",
                url_path="/health*",
                host="*",
                resource_arn="*",
                version=1,
            ),
        )

        # ── X-Ray Group: traces with payment failures ─────────────────
        payment_errors_group = xray.CfnGroup(
            self, "PaymentErrorsGroup",
            group_name="payment-errors",
            filter_expression=(
                'service("payment-processor") { fault OR error } '
                'AND annotation[endpoint] BEGINSWITH "/api/v"'
            ),
            insights_configuration=xray.CfnGroup.InsightsConfigurationProperty(
                insights_enabled=True,
                notifications_enabled=True,  # EventBridge (CloudWatch Events) notifications for insights
            ),
        )

        # Group for slow traces (latency > 2s)
        slow_payments_group = xray.CfnGroup(
            self, "SlowPaymentsGroup",
            group_name="slow-payments",
            filter_expression=(
                'service("payment-processor") { responsetime > 2 } '
                'AND annotation[isPremiumUser] = true'
            ),
            insights_configuration=xray.CfnGroup.InsightsConfigurationProperty(
                insights_enabled=True,
            ),
        )

Python — Lambda handler with X-Ray SDK (Powertools Tracer)

# lambda/payment/handler.py
import json

import boto3

from aws_lambda_powertools import Logger, Tracer
from aws_lambda_powertools.utilities.typing import LambdaContext

# Tracer wraps the X-Ray SDK; the service name comes from the POWERTOOLS_SERVICE_NAME env var
tracer = Tracer()
logger = Logger()

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
orders_table = dynamodb.Table("orders-table")


@tracer.capture_lambda_handler   # opens the "## lambda_handler" subsegment; flushes automatically
@logger.inject_lambda_context
def lambda_handler(event: dict, context: LambdaContext) -> dict:
    order_id   = event.get("orderId", "unknown")
    user_id    = event.get("userId", "anonymous")
    amount     = float(event.get("amount", 0))
    endpoint   = event.get("requestContext", {}).get("path", "/unknown")
    is_premium = event.get("isPremium", False)

    # ── Annotations: indexed for filter expressions and Groups ────────
    # Maximum of 50 per trace
    tracer.put_annotation(key="orderId",       value=order_id)
    tracer.put_annotation(key="userId",        value=user_id)
    tracer.put_annotation(key="endpoint",      value=endpoint)
    tracer.put_annotation(key="isPremiumUser", value=is_premium)

    # ── Metadata: not indexed, for detailed debugging ─────────────────
    # Useful for large or structured data
    tracer.put_metadata(
        key="requestPayload",
        value={"orderId": order_id, "amount": amount, "isPremium": is_premium},
        namespace="payment",   # namespace organizes metadata by domain
    )

    try:
        result = _process_payment(order_id, user_id, amount)

        # Success annotation for filter expressions
        tracer.put_annotation(key="paymentStatus", value="approved")
        tracer.put_annotation(key="processorUsed", value=result.get("processor"))

        return {"statusCode": 200, "body": json.dumps(result)}

    except Exception as e:
        tracer.put_annotation(key="paymentStatus", value="failed")
        tracer.put_annotation(key="errorType",     value=type(e).__name__)
        raise


@tracer.capture_method   # automatically creates a subsegment for this method
def _process_payment(order_id: str, user_id: str, amount: float) -> dict:
    """
    @tracer.capture_method creates a subsegment named after the method.
    It appears in the trace timeline as "## _process_payment".
    """
    # Simulated validation inside a custom subsegment
    with tracer.provider.in_subsegment("## validate_payment") as subsegment:
        subsegment.put_annotation("validationStep", "amount_check")
        if amount <= 0:
            raise ValueError(f"Invalid amount: {amount}")
        subsegment.put_annotation("validationResult", "passed")

    # DynamoDB call: the X-Ray SDK instruments boto3 automatically
    # The AWS SDK call appears as a "DynamoDB" subsegment in the trace
    _record_order(order_id, user_id, amount)

    return {"orderId": order_id, "status": "approved", "processor": "stripe"}


@tracer.capture_method
def _record_order(order_id: str, user_id: str, amount: float) -> None:
    """
    DynamoDB write: boto3, already instrumented by the X-Ray SDK,
    appears as a "DynamoDB PutItem" subsegment in the trace.
    """
    orders_table.put_item(
        Item={
            "PK": f"ORDER#{order_id}",
            "SK": "METADATA",
            "userId": user_id,
            "amount": str(amount),
            "status": "approved",
        }
    )

CLI — Configure sampling, create groups, query traces

# 1. Create a sampling rule for the payments endpoint
aws xray create-sampling-rule \
  --sampling-rule '{
    "RuleName": "payment-high-value-endpoints",
    "Priority": 100,
    "FixedRate": 0.10,
    "ReservoirSize": 50,
    "ServiceName": "payment-processor",
    "ServiceType": "AWS::Lambda::Function",
    "Host": "*",
    "HTTPMethod": "POST",
    "URLPath": "/api/v*/payments*",
    "ResourceARN": "*",
    "Version": 1
  }'

# 2. Create a low-sampling rule for health checks
aws xray create-sampling-rule \
  --sampling-rule '{
    "RuleName": "health-check-low-sample",
    "Priority": 50,
    "FixedRate": 0.0,
    "ReservoirSize": 1,
    "ServiceName": "*",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "GET",
    "URLPath": "/health*",
    "ResourceARN": "*",
    "Version": 1
  }'

# 3. List active sampling rules (check priorities)
aws xray get-sampling-rules \
  --query 'SamplingRuleRecords[*].SamplingRule.{Name:RuleName,Priority:Priority,Reservoir:ReservoirSize,Rate:FixedRate,Method:HTTPMethod,Path:URLPath}'

# 4. Create a Group for traces with payment failures
aws xray create-group \
  --group-name "payment-errors" \
  --filter-expression 'service("payment-processor") { fault OR error } AND annotation[endpoint] BEGINSWITH "/api/v"' \
  --insights-configuration 'InsightsEnabled=true,NotificationsEnabled=true'

# 5. Create a Group for slow traces from premium users
aws xray create-group \
  --group-name "slow-payments" \
  --filter-expression 'service("payment-processor") { responsetime > 2 } AND annotation[isPremiumUser] = true'

# 6. List existing groups
aws xray get-groups \
  --query 'Groups[*].{Name:GroupName,Filter:FilterExpression,ARN:GroupARN}'

# 7. Query traces by filter expression (last 30 min)
START=$(date -u -d '30 minutes ago' +%s 2>/dev/null || date -u -v-30M +%s)
END=$(date -u +%s)

aws xray get-trace-summaries \
  --start-time $START \
  --end-time $END \
  --filter-expression 'annotation[paymentStatus] = "failed" AND fault' \
  --query 'TraceSummaries[*].{Id:Id,Duration:Duration,Fault:HasFault,Error:HasError}'

# 8. Fetch a full trace by ID (replace the example ID with your own)
aws xray batch-get-traces \
  --trace-ids "1-5759e988-bd862e3fe1be46a994272793" \
  --query 'Traces[0].Segments[*].{Id:Id,Document:Document}'

# 9. View the service graph of the payment-errors Group
aws xray get-service-graph \
  --start-time $START \
  --end-time $END \
  --group-name "payment-errors" \
  --query 'Services[*].{Name:Name,Type:Type,Edges:Edges[*].ReferenceId}'

# 10. CloudWatch metrics generated automatically by the Group
aws cloudwatch get-metric-statistics \
  --namespace "AWS/X-Ray" \
  --metric-name "ErrorRate" \
  --dimensions Name=GroupName,Value=payment-errors \
  --start-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 300 \
  --statistics Average

# 11. Update the filter expression of an existing Group
aws xray update-group \
  --group-name "payment-errors" \
  --filter-expression 'service("payment-processor") { fault OR error } AND annotation[endpoint] BEGINSWITH "/api/v" AND duration > 1'

Navigating a distributed trace: ALB → Lambda → DynamoDB

Trace ID: 1-5759e988-bd862e3fe1be46a994272793

Timeline (total: 1.85s):
├── [ALB]  0ms →  12ms  : ALB receives the request and routes it to Lambda
│
├── [Lambda service]  12ms → 48ms : Lambda service segment (cold start)
│   └── @initDuration: 450ms (cold start reported separately, parallel to the invocation)
│
├── [Lambda function]  48ms → 1850ms : payment-processor segment
│   ├── ## lambda_handler (subsegment)  48ms → 1850ms
│   │   ├── ## _process_payment         52ms → 1820ms
│   │   │   ├── ## validate_payment    52ms →   55ms  : 3ms (local validation)
│   │   │   └── ## _record_order       55ms → 1820ms : 1765ms  ← BOTTLENECK!
│   │   │       └── [DynamoDB] PutItem  56ms → 1819ms : 1763ms
│   │   │             ← Inferred segment (DynamoDB sends no segment)
│   │   │             ← 1.76s for a 500 B PutItem → throttling?
│   │
│   Annotations:  orderId=ORD-123, userId=u-abc, endpoint=/api/v1/payments
│                 paymentStatus=approved, processorUsed=stripe
│   Metadata:     requestPayload={orderId: ..., amount: 299.90}
│
Diagnosis: 95% of the time is spent in DynamoDB PutItem
Action: check ConsumedWriteCapacityUnits and WriteThrottleEvents
      in CloudWatch for the orders-table table
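
The same reading of the timeline can be automated once the segment document is available as a dictionary (the Document field returned by batch-get-traces is a JSON string, so it would first go through json.loads). The sketch below uses a hand-written document that mirrors the timeline, with times as offsets in seconds instead of the epoch timestamps found in real documents; walk and slowest_leaf are hypothetical helper names.

```python
from typing import Any, Dict, Iterator, Tuple


def walk(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a segment and every nested subsegment, depth first."""
    yield node
    for child in node.get("subsegments", []):
        yield from walk(child)


def slowest_leaf(segment: Dict[str, Any]) -> Tuple[str, float]:
    """Return (name, seconds) of the longest node that has no children."""
    leaves = [n for n in walk(segment) if not n.get("subsegments")]
    worst = max(leaves, key=lambda n: n["end_time"] - n["start_time"])
    return worst["name"], worst["end_time"] - worst["start_time"]


# Hand-written document mirroring the timeline (offsets in seconds).
segment = {
    "name": "payment-processor", "start_time": 0.048, "end_time": 1.850,
    "subsegments": [{
        "name": "## lambda_handler", "start_time": 0.048, "end_time": 1.850,
        "subsegments": [{
            "name": "## _process_payment", "start_time": 0.052, "end_time": 1.820,
            "subsegments": [
                {"name": "## validate_payment", "start_time": 0.052, "end_time": 0.055},
                {"name": "## _record_order", "start_time": 0.055, "end_time": 1.820,
                 "subsegments": [
                     {"name": "DynamoDB", "start_time": 0.056, "end_time": 1.819},
                 ]},
            ],
        }],
    }],
}

name, seconds = slowest_leaf(segment)
print(f"{name}: {seconds:.3f}s ({seconds / 1.85:.0%} of the 1.85s trace)")
# prints: DynamoDB: 1.763s (95% of the 1.85s trace)
```

Looking at leaves rather than at every node avoids blaming a parent such as ## _record_order for time that was actually spent in the call it wraps.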

Common pitfalls

1. Annotations don't appear in filter expressions if added in the wrong scope
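[FACT] In Lambda, the function segment is created by the Lambda service and cannot be modified by the SDK, so tracer.put_annotation() writes to whichever subsegment is active at call time (## lambda_handler at the handler level). An annotation written inside a nested custom subsegment is shown in that subsegment's detail rather than alongside the handler-level annotations, which makes it easy to miss when reading a trace. Keep the keys you filter on (userId, orderId, endpoint) at the handler level, and avoid calling put_annotation() where no subsegment is active (for example at module import time), because there is nothing to attach it to.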

2. Updating a Group's filter expression is not retroactive
[FACT] Already-stored traces are not re-evaluated when a Group's filter expression is updated. The Group's service graph may show data from both expressions for up to 30 days. For clean data, delete the Group and recreate it.

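3. Sampling rules are applied locally by the SDK, so changes lag
[FACT] The X-Ray SDK caches sampling rules and refreshes them in the background instead of calling X-Ray on every request. Sampling targets (the reservoir quota) are refreshed about every 10 seconds; in the open-source SDKs the rule definitions themselves are polled less often, every few minutes. Changes to rules are therefore not applied instantly. In Lambda functions with frequent cold starts, the SDK may need to fetch the rules on every new instance.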

4. Sampled=0 in the propagated header → trace ignored in all downstream
[FACT] When the upstream service decides not to sample (Sampled=0), that header is propagated to all downstream services. Even if a downstream sampling rule had a high rate, the trace is not recorded because the decision is made once and propagated. This is intentional — it ensures the trace is either complete or doesn't exist.

5. Lambda cold start: initialization time is recorded separately from the invocation
[FACT] Lambda cold start time is recorded apart from the handler's execution time: in the trace timeline it appears as its own Initialization entry, and in the REPORT log line (and CloudWatch Logs Insights) as @initDuration, separate from @duration. Confusing the cold start with invocation latency can lead to incorrect diagnoses, so read the invocation time and the initialization time separately.

6. DynamoDB does not send a segment — only inferred segments
[FACT] DynamoDB does not send segments to X-Ray. X-Ray creates an "inferred segment" from the AWS SDK subsegment recorded by the client (Lambda). This inferred segment shows latency from the client's perspective, so it includes network latency. X-Ray alone cannot tell you whether the time was spent on the network or inside DynamoDB.


Reflection exercise

You have an application with the following flow: API Gateway → Lambda checkout → (SQS + DynamoDB). The checkout Lambda puts a message on SQS and writes metadata to DynamoDB. Another Lambda order-processor consumes from SQS.

  1. How does X-Ray propagate the TraceId between the checkout Lambda and the order-processor Lambda? Does SQS automatically preserve the X-Amzn-Trace-Id header? What happens if order-processor is not instrumented?

  2. You want to create a Group called checkout-failures that captures traces where: the endpoint is /api/checkout, there is a fault, and the user is premium type (annotation[isPremiumUser] = true). Write the correct filter expression.

  3. The checkout Lambda receives 2,000 req/s. The default rule (1 req/s + 5%) samples ~101 traces/s. You want to sample at least 200 req/s from /api/checkout with no more than 10% additional. What ReservoirSize and FixedRate values would you use? What is the maximum traces/s rate in this configuration?


Resources for further study

  • [FACT] AWS X-Ray concepts (segments, subsegments, sampling, annotations): https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html
  • [FACT] Filter expressions — full syntax: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-filters.html
  • [FACT] Configuring groups: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-groups.html
  • [FACT] Configuring sampling rules: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html
  • [FACT] Powertools for AWS Lambda — Tracer: https://docs.aws.amazon.com/powertools/python/latest/core/tracer/