Session 039 — X-Ray: groups, annotations, sampling rules and cross-service integration
Dependencies: session-038-cloudwatch-composite-alarms-anomaly
Objective
By the end of this session, you will be able to create X-Ray Groups to filter traces by custom annotation, configure sampling rules to increase sampling on critical endpoints without inflating cost, navigate a distributed trace that traverses ALB → Lambda → DynamoDB identifying where time was spent, and distinguish annotations from metadata for use in filter expressions.
Context
[FACT] AWS X-Ray is a distributed tracing service that collects segment data sent by instrumented applications and groups them into traces by TraceId. Each request generates a trace that traverses all participating services, allowing you to identify where latency occurs and where errors originate.
[FACT] X-Ray integrates with ALB, API Gateway, Lambda, ECS, EC2, and AWS SDK clients (DynamoDB, SQS, SNS, etc.). For Lambda, active tracing is enabled through configuration, without modifying code. ALB only adds and propagates the X-Amzn-Trace-Id header; it does not send segments itself. To record subsegments for downstream calls or custom data, you need to instrument the code with the X-Ray SDK.
Key concepts
1. Anatomy of a trace: segment, subsegment, inferred segment
[FACT] The X-Ray data hierarchy:
Trace (TraceId)
│   Groups all segments from the same request
│   Retained for 30 days
│
├── Segment (sent by each instrumented service)
│   ├── name, trace_id, id (segment_id)
│   ├── start_time, end_time
│   ├── http request/response
│   ├── annotations (indexed for filter expressions)
│   ├── metadata (not indexed, any type)
│   ├── error / fault / throttle flags
│   └── subsegments
│       ├── AWS SDK calls (DynamoDB, S3, SQS...)
│       ├── HTTP downstream calls
│       └── custom subsegments (arbitrary code)
│
└── Inferred segment (generated by X-Ray for services that do not send segments)
    Created from the upstream subsegment that called the service
    Example: DynamoDB does not send a segment, so X-Ray infers one from
    the Lambda subsegment that made the DynamoDB call
[FACT] Segment documents have a maximum size of 64 KB. Data beyond this limit is truncated.
Tracing header — cross-service propagation:
[FACT] The first service that receives the request adds the X-Amzn-Trace-Id header with Root (TraceId), Parent (segmentId of the current service), and Sampled (0 or 1). All downstream services propagate this header, maintaining the original TraceId.
X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1
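To make the header layout concrete, here is a minimal sketch that splits the example header into its fields and decodes the start time embedded in the trace ID (the second part of the ID is the request's epoch time in hexadecimal). The function names parse_trace_header and trace_start_time are hypothetical helpers written for this illustration, not part of the X-Ray SDK.

```python
from datetime import datetime, timezone


def parse_trace_header(header: str) -> dict[str, str]:
    """Split 'Root=...;Parent=...;Sampled=1' into a dict of its fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed parts
            fields[key] = value
    return fields


def trace_start_time(trace_id: str) -> datetime:
    """Decode the epoch timestamp in a trace ID shaped like 1-<8 hex>-<24 hex>."""
    _version, epoch_hex, _unique = trace_id.split("-")
    return datetime.fromtimestamp(int(epoch_hex, 16), tz=timezone.utc)


header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
fields = parse_trace_header(header)
is_sampled = fields.get("Sampled") == "1"
started_at = trace_start_time(fields["Root"])
```

In real code the SDK handles this parsing; the sketch is only meant to show which part of the header carries the TraceId, the parent segment, and the sampling decision.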
2. Sampling rules: default rate and per-endpoint configuration
[FACT] Default rule: first request per second + 5% of additional requests. This rule is conservative for cost control.
[FACT] Sampling rules are evaluated in priority order (lower number = higher priority, from 1 to 9999). The default rule has priority 10000 and is always evaluated last.
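The priority ordering can be sketched as a first-match lookup. This is a simplified model, not the X-Ray implementation: the Rule class and pick_rule function are hypothetical, only three of the matching fields are modeled, and Python's fnmatchcase is used as an approximation of X-Ray's wildcard matching. The rule names are borrowed from the practical example in this session.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int
    http_method: str = "*"
    url_path: str = "*"
    service_name: str = "*"


def pick_rule(rules, service_name, http_method, url_path):
    """Return the first rule, in ascending priority order, whose patterns all match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (
            fnmatchcase(service_name, rule.service_name)
            and fnmatchcase(http_method, rule.http_method)
            and fnmatchcase(url_path, rule.url_path)
        ):
            return rule
    raise LookupError("no rule matched; a catch-all default rule should be present")


RULES = [
    Rule("payment-high-value-endpoints", 100, "POST", "/api/v*/payments*", "payment-processor"),
    Rule("health-check-low-sample", 50, "GET", "/health*"),
    Rule("Default", 10000),  # catch-all, always evaluated last
]

health = pick_rule(RULES, "payment-processor", "GET", "/health")
payment = pick_rule(RULES, "payment-processor", "POST", "/api/v1/payments")
other = pick_rule(RULES, "payment-processor", "GET", "/api/v1/orders")
```

Here the health-check rule wins for GET /health because 50 sorts before 100, and a request that matches no custom rule falls through to the default.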
Structure of a sampling rule:
─────────────────────────────────────────────────────────────
RuleName:      Unique identifying name
Priority:      1–9999 (lower = higher priority)
ReservoirSize: Requests per second always sampled (reservoir)
FixedRate:     Fraction of requests sampled beyond the reservoir (0.0–1.0)
ServiceName:   Filter by service name (wildcard * allowed)
ServiceType:   Filter by type: "AWS::Lambda::Function", "AWS::ECS::Container", etc.
Host:          Filter by host
HTTPMethod:    GET, POST, PUT, DELETE, * (wildcard)
URLPath:       Filter by path (wildcard * allowed)
ResourceARN:   Filter by resource ARN
Sampling cost:
ReservoirSize = 10, FixedRate = 0.05 means:
- First 10 req/s: ALL sampled
- Beyond that: 5% sampled
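The reservoir-plus-rate arithmetic can be written as a small estimate. This is a back-of-the-envelope sketch assuming a steady request rate and a reservoir that is fully used each second; expected_sampled_per_second is a hypothetical helper name.

```python
def expected_sampled_per_second(
    request_rate: float, reservoir_size: int, fixed_rate: float
) -> float:
    """Estimate traces sampled per second for one rule at a steady request rate."""
    reservoir_hits = min(request_rate, reservoir_size)   # always sampled
    overflow = max(request_rate - reservoir_size, 0.0)   # beyond the reservoir
    return reservoir_hits + overflow * fixed_rate


# Default rule (1 req/s + 5%) at 2,000 req/s
default_rule = expected_sampled_per_second(2000, reservoir_size=1, fixed_rate=0.05)
# The ReservoirSize = 10, FixedRate = 0.05 example at the same load
example_rule = expected_sampled_per_second(2000, reservoir_size=10, fixed_rate=0.05)
# Below the reservoir size, every request is sampled
quiet_period = expected_sampled_per_second(4, reservoir_size=10, fixed_rate=0.05)
```

At high request rates the FixedRate term dominates the total, so it is the main cost lever; the reservoir mostly guarantees coverage during quiet periods.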
[FACT] Sampling rules are centrally managed in X-Ray (not per instance or function). The X-Ray SDK fetches the rules in the background (the open-source SDKs refresh the rule set about every 5 minutes and sampling targets about every 10 seconds) and applies them locally, with no additional latency per sampling decision.
3. Annotations vs. Metadata
[FACT] Two types of custom data can be added to segments and subsegments:
               Annotations                          Metadata
────────────────────────────────────────────────────────────────────────────────────────
Indexing       YES: indexed for                     NO: not searchable
               filter expressions                   via filter expressions
Types          Boolean, Number, String              Any JSON type
                                                    (objects, arrays, etc.)
Typical use    userId, orderId, endpoint,           Full payload, configuration,
               app version, feature flag            debugging data
Limit          50 annotations per trace             No documented hard limit
                                                    (bounded by the 64 KB segment size)
Visibility     X-Ray console, GetTraceSummaries,    X-Ray console, BatchGetTraces
               filter expressions                   (full segment documents)
SDK            put_annotation(key, value)           put_metadata(key, value,
                                                    namespace="default")
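One way to apply the distinction in code is to route each custom field by its type. The sketch below is an illustration, not SDK behavior: split_custom_data is a hypothetical helper, and the key pattern (letters, digits, underscore) is an assumption about which annotation keys work reliably in filter expressions.

```python
import re

ANNOTATION_TYPES = (bool, int, float, str)
# Assumption: keys limited to letters, digits, and underscore are safe to filter on.
ANNOTATION_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def split_custom_data(data: dict) -> tuple[dict, dict]:
    """Route scalar values with filter-safe keys to annotations, the rest to metadata."""
    annotations, metadata = {}, {}
    for key, value in data.items():
        if isinstance(value, ANNOTATION_TYPES) and ANNOTATION_KEY.match(key):
            annotations[key] = value
        else:
            metadata[key] = value
    return annotations, metadata


annotations, metadata = split_custom_data({
    "orderId": "ORD-123",
    "isPremiumUser": True,
    "retryCount": 3,
    "requestPayload": {"amount": 299.90, "items": ["sku-1"]},
    "debug-notes": "key contains a hyphen, so it is kept as metadata",
})
```

The resulting dictionaries could then be passed one entry at a time to put_annotation and put_metadata.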
4. Filter expressions: full syntax
[FACT] Filter expressions are used both in the console (trace search) and in Group definitions. Syntax: keyword operator value, combined with AND / OR.
Keyword type        Examples
────────────────────────────────────────────────────────────────
Boolean             ok, error, fault, throttle, partial
                    (used without an operator: "fault" = trace with a 5XX)
                    or with one: "ok = false"
Number              responsetime > 2
                    duration >= 5 AND duration <= 8
                    http.status != 200
                    http.status = 429
String              http.url CONTAINS "/api/payments"
                    http.method = "POST"
                    http.url BEGINSWITH "https://api."
                    user CONTAINS "" (field exists check)
                    name = "payment-service"
Annotation          annotation[userId] = "u-abc123"
                    annotation[orderId] CONTAINS "ORD-"
                    annotation[isPremium] = true
                    annotation[retryCount] > 2
                    !annotation[userId] (annotation not present)
Complex (service)   service("payment-lambda") { fault }
                    service() { fault } (any service)
                    service(id(name: "payment", type: "AWS::Lambda::Function")) { error }
Complex (edge)      edge("alb", "payment-lambda") { error }
Group               group.name = "payment-errors" AND user = "alice"
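When filter expressions are assembled in code (for example before calling GetTraceSummaries), a small builder keeps the literal syntax consistent: booleans and numbers unquoted, strings double-quoted. This is a sketch with hypothetical helper names (annotation_clause, all_of); the JSON-style escaping of quotes inside strings is an assumption, not documented X-Ray behavior.

```python
import json


def annotation_clause(key: str, value, operator: str = "=") -> str:
    """Render one annotation comparison with the literal form that fits its type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        literal = "true" if value else "false"
    elif isinstance(value, (int, float)):
        literal = str(value)
    else:
        literal = json.dumps(str(value))  # double-quoted string literal
    return f"annotation[{key}] {operator} {literal}"


def all_of(*clauses: str) -> str:
    """Join clauses with AND."""
    return " AND ".join(clauses)


expression = all_of(
    'service("payment-processor") { fault }',
    annotation_clause("isPremiumUser", True),
    annotation_clause("retryCount", 2, operator=">"),
    annotation_clause("orderId", "ORD-", operator="CONTAINS"),
)
```

The builder only formats text; X-Ray still validates the expression when it is used in a query or a Group definition.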
5. X-Ray Groups: service graph and metrics by context
[FACT] A Group is a collection of traces defined by a filter expression. When created, X-Ray:
1. Compares incoming traces against the filter expression when storing them
2. Generates a separate service graph for traces matching the group
3. Publishes CloudWatch metrics (namespace AWS/X-Ray) for the group every minute: ApproximateTraceCount, Throttle, Fault, Error
[FACT] Groups are charged per retrieved trace that matches the filter expression — not for group creation.
[FACT] Updating a Group's filter expression does not affect already-stored traces — only future traces. To avoid mixed data from old and new expressions, delete the group and recreate it.
Practical example
Scenario: Payments API — trace ALB → Lambda → DynamoDB
Flow: ALB → Lambda payment-processor → DynamoDB orders-table
CDK Python — Enable X-Ray + Sampling Rule + Group
from aws_cdk import (
    Stack, Duration,
    aws_lambda as lambda_,
    aws_xray as xray,
    aws_iam as iam,
)
from constructs import Construct


class XRayObservabilityStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # ── Lambda with X-Ray active tracing ──────────────────────────
        payment_fn = lambda_.Function(
            self, "PaymentProcessor",
            function_name="payment-processor",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="handler.lambda_handler",
            code=lambda_.Code.from_asset("lambda/payment"),
            timeout=Duration.seconds(30),
            memory_size=256,
            # ACTIVE: X-Ray records every sampled invocation
            # PASS_THROUGH: the tracing header is propagated, but Lambda
            # does not start traces on its own
            tracing=lambda_.Tracing.ACTIVE,
            environment={
                "POWERTOOLS_SERVICE_NAME": "payment-processor",
            },
        )

        # Permission for the Lambda to send traces to X-Ray and read sampling
        # rules (CDK already grants the two Put* actions when tracing is
        # enabled; they are listed here to make the requirement explicit)
        payment_fn.add_to_role_policy(
            iam.PolicyStatement(
                actions=[
                    "xray:PutTraceSegments",
                    "xray:PutTelemetryRecords",
                    "xray:GetSamplingRules",
                    "xray:GetSamplingTargets",
                ],
                resources=["*"],
            )
        )

        # ── Sampling rule: more sampling on /payments (critical endpoint) ──
        payment_sampling_rule = xray.CfnSamplingRule(
            self, "PaymentSamplingRule",
            sampling_rule=xray.CfnSamplingRule.SamplingRuleProperty(
                rule_name="payment-high-value-endpoints",
                priority=100,        # high priority (lower number)
                reservoir_size=50,   # 50 req/s always sampled
                fixed_rate=0.10,     # plus 10% of the remaining requests
                service_name="payment-processor",
                service_type="AWS::Lambda::Function",
                http_method="POST",
                url_path="/api/v*/payments*",
                host="*",
                resource_arn="*",
                version=1,
            ),
        )

        # Low-sampling rule for health checks (unnecessary noise)
        health_check_rule = xray.CfnSamplingRule(
            self, "HealthCheckSamplingRule",
            sampling_rule=xray.CfnSamplingRule.SamplingRuleProperty(
                rule_name="health-check-low-sample",
                priority=50,         # evaluated before the payment rule
                reservoir_size=1,    # 1 req/s, just to confirm it works
                fixed_rate=0.0,      # 0% beyond the reservoir
                service_name="*",
                service_type="*",
                http_method="GET",
                url_path="/health*",
                host="*",
                resource_arn="*",
                version=1,
            ),
        )

        # ── X-Ray Group: payment traces with faults or errors ─────────
        payment_errors_group = xray.CfnGroup(
            self, "PaymentErrorsGroup",
            group_name="payment-errors",
            filter_expression=(
                'service("payment-processor") { fault OR error } '
                'AND annotation[endpoint] BEGINSWITH "/api/v"'
            ),
            insights_configuration=xray.CfnGroup.InsightsConfigurationProperty(
                insights_enabled=True,
                notifications_enabled=True,  # insight notifications via EventBridge
            ),
        )

        # Group for slow traces (latency > 2s)
        slow_payments_group = xray.CfnGroup(
            self, "SlowPaymentsGroup",
            group_name="slow-payments",
            filter_expression=(
                'service("payment-processor") { responsetime > 2 } '
                'AND annotation[isPremiumUser] = true'
            ),
            insights_configuration=xray.CfnGroup.InsightsConfigurationProperty(
                insights_enabled=True,
            ),
        )
Python — Lambda handler with X-Ray SDK (Powertools Tracer)
# lambda/payment/handler.py
import json

import boto3
from aws_lambda_powertools import Logger, Tracer
from aws_lambda_powertools.utilities.typing import LambdaContext

# Tracer wraps the X-Ray SDK; the service name comes from the
# POWERTOOLS_SERVICE_NAME environment variable
tracer = Tracer()
logger = Logger()
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
orders_table = dynamodb.Table("orders-table")


@tracer.capture_lambda_handler  # creates the "## lambda_handler" subsegment
@logger.inject_lambda_context
def lambda_handler(event: dict, context: LambdaContext) -> dict:
    order_id = event.get("orderId", "unknown")
    user_id = event.get("userId", "anonymous")
    amount = float(event.get("amount", 0))
    # ALB events carry the request path at the top level of the event
    endpoint = event.get("path", "/unknown")
    is_premium = event.get("isPremium", False)

    # ── Annotations: indexed for filter expressions and Groups ────────
    # Maximum of 50 per trace
    tracer.put_annotation(key="orderId", value=order_id)
    tracer.put_annotation(key="userId", value=user_id)
    tracer.put_annotation(key="endpoint", value=endpoint)
    tracer.put_annotation(key="isPremiumUser", value=is_premium)

    # ── Metadata: not indexed, for detailed debugging ─────────────────
    # Useful for large or structured data
    tracer.put_metadata(
        key="requestPayload",
        value={"orderId": order_id, "amount": amount, "isPremium": is_premium},
        namespace="payment",  # namespace groups metadata by domain
    )

    try:
        result = _process_payment(order_id, user_id, amount)
        # Success annotations for filter expressions
        tracer.put_annotation(key="paymentStatus", value="approved")
        tracer.put_annotation(key="processorUsed", value=result.get("processor"))
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception as e:
        tracer.put_annotation(key="paymentStatus", value="failed")
        tracer.put_annotation(key="errorType", value=type(e).__name__)
        raise


@tracer.capture_method  # creates a subsegment for this method automatically
def _process_payment(order_id: str, user_id: str, amount: float) -> dict:
    """
    @tracer.capture_method creates a subsegment named after the method.
    It appears in the trace timeline as "## _process_payment".
    """
    # Simulated validation inside a custom subsegment
    with tracer.provider.in_subsegment("## validate_payment") as subsegment:
        subsegment.put_annotation("validationStep", "amount_check")
        if amount <= 0:
            raise ValueError(f"Invalid amount: {amount}")
        subsegment.put_annotation("validationResult", "passed")

    # DynamoDB call: Tracer patches boto3 automatically, so the AWS SDK
    # call appears as a "DynamoDB" subsegment in the trace
    _record_order(order_id, user_id, amount)
    return {"orderId": order_id, "status": "approved", "processor": "stripe"}


@tracer.capture_method
def _record_order(order_id: str, user_id: str, amount: float) -> None:
    """
    Write to DynamoDB. Because boto3 is already instrumented by the X-Ray SDK,
    the call appears as a "DynamoDB PutItem" subsegment in the trace.
    """
    orders_table.put_item(
        Item={
            "PK": f"ORDER#{order_id}",
            "SK": "METADATA",
            "userId": user_id,
            "amount": str(amount),
            "status": "approved",
        }
    )
CLI — Configure sampling, create groups, query traces
# 1. Create a sampling rule for the payments endpoint
aws xray create-sampling-rule \
--sampling-rule '{
"RuleName": "payment-high-value-endpoints",
"Priority": 100,
"FixedRate": 0.10,
"ReservoirSize": 50,
"ServiceName": "payment-processor",
"ServiceType": "AWS::Lambda::Function",
"Host": "*",
"HTTPMethod": "POST",
"URLPath": "/api/v*/payments*",
"ResourceARN": "*",
"Version": 1
}'
# 2. Create a low-sampling rule for health checks (priority 50 is evaluated before the payments rule)
aws xray create-sampling-rule \
--sampling-rule '{
"RuleName": "health-check-low-sample",
"Priority": 50,
"FixedRate": 0.0,
"ReservoirSize": 1,
"ServiceName": "*",
"ServiceType": "*",
"Host": "*",
"HTTPMethod": "GET",
"URLPath": "/health*",
"ResourceARN": "*",
"Version": 1
}'
# 3. List active sampling rules (check priorities)
aws xray get-sampling-rules \
--query 'SamplingRuleRecords[*].SamplingRule.{Name:RuleName,Priority:Priority,Reservoir:ReservoirSize,Rate:FixedRate,Method:HTTPMethod,Path:URLPath}'
# 4. Create a Group for payment traces with faults or errors
aws xray create-group \
--group-name "payment-errors" \
--filter-expression 'service("payment-processor") { fault OR error } AND annotation[endpoint] BEGINSWITH "/api/v"' \
--insights-configuration 'InsightsEnabled=true,NotificationsEnabled=true'
# 5. Create a Group for slow traces from premium users
aws xray create-group \
--group-name "slow-payments" \
--filter-expression 'service("payment-processor") { responsetime > 2 } AND annotation[isPremiumUser] = true'
# 6. List existing groups
aws xray get-groups \
--query 'Groups[*].{Name:GroupName,Filter:FilterExpression,ARN:GroupARN}'
# 7. Query traces by filter expression (last 30 minutes)
START=$(date -u -d '30 minutes ago' +%s 2>/dev/null || date -u -v-30M +%s)
END=$(date -u +%s)
aws xray get-trace-summaries \
--start-time $START \
--end-time $END \
--filter-expression 'annotation[paymentStatus] = "failed" AND fault' \
--query 'TraceSummaries[*].{Id:Id,Duration:Duration,Fault:HasFault,Error:HasError}'
# 8. Fetch a full trace by ID (replace the example trace ID with your own)
aws xray batch-get-traces \
--trace-ids "1-5759e988-bd862e3fe1be46a994272793" \
--query 'Traces[0].Segments[*].{Id:Id,Document:Document}'
# 9. View the service graph for the payment-errors Group
aws xray get-service-graph \
--start-time $START \
--end-time $END \
--group-name "payment-errors" \
--query 'Services[*].{Name:Name,Type:Type,Edges:Edges[*].ReferenceId}'
# 10. CloudWatch metrics published automatically for the Group
aws cloudwatch get-metric-statistics \
--namespace "AWS/X-Ray" \
--metric-name "ApproximateTraceCount" \
--dimensions Name=GroupName,Value=payment-errors \
--start-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ')" \
--end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
--period 300 \
--statistics Sum
# 11. Update the filter expression of an existing Group
aws xray update-group \
--group-name "payment-errors" \
--filter-expression 'service("payment-processor") { fault OR error } AND annotation[endpoint] BEGINSWITH "/api/v" AND duration > 1'
Navigating a distributed trace: ALB → Lambda → DynamoDB
Trace ID: 1-5759e988-bd862e3fe1be46a994272793
Timeline (total: 1.85s):
├── [ALB] 0ms → 12ms : receives the request and routes it to Lambda
│       (ALB adds the X-Amzn-Trace-Id header; it does not emit a segment of its own)
│
├── [Lambda service] 12ms → 48ms : Lambda service segment (cold start)
│   └── @initDuration: 450ms (cold start, reported separately from the invocation)
│
├── [Lambda function] 48ms → 1850ms : payment-processor segment
│   ├── ## lambda_handler (subsegment) 48ms → 1850ms
│   │   ├── ## _process_payment 52ms → 1820ms
│   │   │   ├── ## validate_payment 52ms → 55ms : 3ms (local validation)
│   │   │   └── ## _record_order 55ms → 1820ms : 1765ms ← BOTTLENECK!
│   │   │       └── [DynamoDB] PutItem 56ms → 1819ms : 1763ms
│   │   │           ← Inferred segment (DynamoDB does not send a segment)
│   │   │           ← 1.76s for a 500 B PutItem: throttling?
│   │
│   Annotations: orderId=ORD-123, userId=u-abc, endpoint=/api/v1/payments
│                paymentStatus=approved, processorUsed=stripe
│   Metadata: requestPayload={orderId: ..., amount: 299.90}
│
Diagnosis: 95% of the time is spent in DynamoDB PutItem
Action: check ConsumedWriteCapacityUnits and WriteThrottleEvents
        in CloudWatch for the orders-table table
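The diagnosis can be reproduced by computing each subsegment's self time, meaning its duration minus the time spent in its direct children. The sketch below hard-codes the timings from the timeline and assumes sibling subsegments do not overlap; self_times is a hypothetical helper, not an X-Ray API.

```python
TOTAL_MS = 1850

# name: (parent, start_ms, end_ms), copied from the timeline
SPANS = {
    "## lambda_handler": (None, 48, 1850),
    "## _process_payment": ("## lambda_handler", 52, 1820),
    "## validate_payment": ("## _process_payment", 52, 55),
    "## _record_order": ("## _process_payment", 55, 1820),
    "DynamoDB PutItem": ("## _record_order", 56, 1819),
}


def self_times(spans: dict) -> dict:
    """Duration of each span minus the duration of its direct children."""
    result = {name: end - start for name, (_parent, start, end) in spans.items()}
    for _name, (parent, start, end) in spans.items():
        if parent is not None:
            result[parent] -= end - start
    return result


times = self_times(SPANS)
bottleneck = max(times, key=times.get)
share = times[bottleneck] / TOTAL_MS
```

Self time separates a wrapper such as ## _record_order, which spends almost all of its 1765 ms waiting, from the call that actually consumes the time.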
Common pitfalls
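1. Annotations don't match filter expressions as expected
[FACT] X-Ray indexes annotations from both segments and subsegments, up to 50 per trace, so annotation[key] can match a value set in a subsegment. In Lambda, the function segment is created by the service and cannot be modified; tracer.put_annotation() called at the handler level writes to the ## lambda_handler subsegment, and that value is still searchable. When an annotation does not show up in filter expression results or GetTraceSummaries, check that the key is spelled identically in code and in the expression, that the value was not stored as metadata, and that the trace has not exceeded the 50-annotation limit.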
2. Updating a Group's filter expression is not retroactive
[FACT] Already-stored traces are not re-evaluated when a Group's filter expression is updated. The Group's service graph may show data from both expressions for up to 30 days. For clean data, delete the Group and recreate it.
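3. Sampling rules are applied locally by the SDK, so changes are not instant
[FACT] The X-Ray SDK caches sampling rules and refreshes them in the background instead of calling X-Ray on every request. In the open-source SDKs the rule set is refreshed about every 5 minutes, while sampling targets (reservoir quotas) are refreshed about every 10 seconds. A rule change can therefore take a few minutes to reach every instance. In Lambda functions with frequent cold starts, the SDK may need to fetch the rules on every new instance.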
4. Sampled=0 in the propagated header → trace ignored in all downstream
[FACT] When the upstream service decides not to sample (Sampled=0), that header is propagated to all downstream services. Even if a downstream sampling rule had a high rate, the trace is not recorded because the decision is made once and propagated. This is intentional — it ensures the trace is either complete or doesn't exist.
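The propagation behavior can be sketched as a decision function for a downstream service. This is a simplified model under stated assumptions: should_record is a hypothetical helper, the reservoir is ignored, and the local rule is reduced to a single fixed rate that only applies when no upstream decision arrives.

```python
import random
from typing import Callable, Optional


def should_record(
    incoming_header: Optional[str],
    local_fixed_rate: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Honor an upstream Sampled flag; decide locally only when none was sent."""
    fields = {}
    for part in (incoming_header or "").split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    upstream = fields.get("Sampled")
    if upstream in ("0", "1"):
        return upstream == "1"          # the decision was already made upstream
    return rng() < local_fixed_rate     # first hop: apply the local rate


# Upstream said no: a 100% local rate does not bring the trace back
skipped = should_record("Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=0", 1.0)
# Upstream said yes: a 0% local rate does not drop it
kept = should_record("Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=1", 0.0)
# No header: this service is the first hop and decides for everyone downstream
first_hop = should_record(None, 0.10, rng=lambda: 0.05)
```

The practical consequence is that sampling rules for a critical endpoint need to match at the entry point of the request, because that is where the decision is made.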
5. Lambda cold start: initialization is recorded separately from the invocation
[FACT] Lambda cold start time is recorded in an Initialization subsegment, separate from the Invocation subsegment, under the AWS::Lambda::Function segment. The same duration is reported as @initDuration in CloudWatch Logs Insights. Confusing the cold start with invocation latency can lead to incorrect diagnoses: keep @duration (invocation) and @initDuration (initialization) apart.
6. DynamoDB does not send a segment — only inferred segments
[FACT] DynamoDB does not natively instrument X-Ray. X-Ray creates an "inferred segment" from the AWS SDK subsegment in the client (Lambda). This inferred segment shows latency from the client's perspective — it includes network latency. It is not possible to distinguish whether latency occurred in the network or within DynamoDB through X-Ray.
Reflection exercise
You have an application with the following flow: API Gateway → Lambda checkout → (SQS + DynamoDB). The checkout Lambda puts a message on SQS and writes metadata to DynamoDB. Another Lambda order-processor consumes from SQS.
- How does X-Ray propagate the TraceId between the checkout Lambda and the order-processor Lambda? Does SQS automatically preserve the X-Amzn-Trace-Id header? What happens if order-processor is not instrumented?
- You want to create a Group called checkout-failures that captures traces where: the endpoint is /api/checkout, there is a fault, and the user is premium type (annotation[isPremiumUser] = true). Write the correct filter expression.
- The checkout Lambda has 2,000 req/s. The default rule (1 req/s + 5%) samples ~101 traces/s. You want to sample at least 200 req/s from /api/checkout with no more than 10% additional. What ReservoirSize and FixedRate values would you use? What is the maximum traces/s rate in this configuration?
Resources for further study
- [FACT] AWS X-Ray concepts (segments, subsegments, sampling, annotations): https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html
- [FACT] Filter expressions — full syntax: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-filters.html
- [FACT] Configuring groups: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-groups.html
- [FACT] Configuring sampling rules: https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html
- [FACT] Powertools for AWS Lambda — Tracer: https://docs.aws.amazon.com/powertools/python/latest/core/tracer/