Session 024 — Lambda Observability: structured logging, X-Ray and Lambda Insights
Estimated duration: 60 minutes
Prerequisites: session-023-stepfunctions-parallel-map-error
Objective
By the end, you will be able to emit structured logs (JSON) from a Lambda with correlation fields (requestId, userId), enable X-Ray active tracing and add custom subsegments, and enable Lambda Insights to see duration, error, and init time metrics per function.
Context
[FACT] Observability in distributed systems relies on the three classic pillars: logs (discrete event records), metrics (numeric time series), and traces (request tracing across services). In Lambda, each invocation is ephemeral and potentially distributed across hundreds of simultaneous worker instances — which makes correlation between these three pillars especially critical.
[CONSENSUS] The biggest observability problem in Lambda is not lack of data, but lack of correlation. CloudWatch already captures native metrics and logs by default. What differentiates an observable system from a monitored system is the ability to, given a request ID or a traceId, quickly find the function's log, the complete X-Ray trace, the performance metrics of the specific worker, and the errors that occurred. Structured logging, X-Ray, and Lambda Insights are the three tools that allow building this correlation systematically in Lambda.
[FACT] Starting in 2023, Lambda began natively supporting JSON format for system logs (messages that the Lambda service itself emits — such as START, END, REPORT), in addition to application logs. This simplifies log ingestion in CloudWatch Logs Insights without the need for custom parsers.
Key concepts
1. Structured Logging — logs as objects, not strings
[FACT] The default log format in Lambda is plain text. When the application uses print() or console.log(), CloudWatch receives a text line that needs to be parsed with regex or glob expressions to extract fields. Structured logging replaces strings with JSON objects, making each field directly queryable.
Unstructured log (hard to query):
───────────────────────────────────────────────────────────────
[INFO] 2026-06-24T10:15:32Z - Order P001 processed for user U42 in 245ms
Structured JSON log (CloudWatch Logs Insights auto-discovers the fields):
───────────────────────────────────────────────────────────────
{
  "timestamp": "2026-06-24T10:15:32.410Z",
  "level": "INFO",
  "message": "Order processed",
  "requestId": "abc123-def456",
  "traceId": "1-66795-abc...",
  "pedido_id": "P001",
  "usuario_id": "U42",
  "duracao_ms": 245,
  "service": "pedidos",
  "version": "2.1.0"
}
[FACT] CloudWatch Logs Insights automatically detects fields in JSON lines without any configuration. Once logs are in JSON, queries like the one below work directly:
# Find all errors for a specific user (set the query time range to the last hour)
fields @timestamp, level, message, pedido_id, duracao_ms
| filter level = "ERROR" and usuario_id = "U42"
| sort @timestamp desc
| limit 50
Required correlation fields
[CONSENSUS] The practice adopted by most production teams is to include at least four correlation fields in each log:
| Field | Source | Use |
|---|---|---|
| requestId | context.aws_request_id | Correlate all logs from one invocation |
| traceId | os.environ["_X_AMZN_TRACE_ID"] | Correlate with the X-Ray trace |
| service | Constant defined in the function | Filter logs by service in aggregated log groups |
| cold_start | Module-level initialization variable | Identify invocations that included the Init phase |
Enabling native JSON in Lambda (function log format)
[FACT] Since 2023, it is possible to configure the Lambda function so that system messages (START, END, REPORT) are also emitted in JSON. This is separate from the application log format:
# CDK: native log format and log levels for the function
from aws_cdk import aws_lambda as lambda_
from aws_cdk import aws_logs as logs
fn = lambda_.Function(
    self, "MinhaFuncao",
    # ...
    logging_format=lambda_.LoggingFormat.JSON,  # system logs in JSON
    system_log_level=lambda_.SystemLogLevel.INFO,
    application_log_level=lambda_.ApplicationLogLevel.INFO,
    log_retention=logs.RetentionDays.ONE_WEEK,
)
[FACT] With LoggingFormat.JSON, the REPORT record becomes:
{
  "time": "2026-06-24T10:15:32.660Z",
  "type": "platform.report",
  "record": {
    "requestId": "abc123",
    "metrics": {
      "durationMs": 245.12,
      "billedDurationMs": 246,
      "memorySizeMB": 256,
      "maxMemoryUsedMB": 89,
      "initDurationMs": 312.5
    }
  }
}
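Because the REPORT record is plain JSON, it can also be consumed outside CloudWatch, for example in a log-forwarding script. The sketch below is illustrative only: the helper name extract_report_metrics is hypothetical, and it relies solely on the type, record, requestId, and metrics keys shown in the record above.

```python
import json

def extract_report_metrics(line: str):
    """Return (request_id, metrics) for a platform.report line, else None."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # plain-text or truncated line
    if not isinstance(event, dict) or event.get("type") != "platform.report":
        return None
    record = event.get("record", {})
    return record.get("requestId"), record.get("metrics", {})

# Sample line mirroring the REPORT record shown above
sample = json.dumps({
    "type": "platform.report",
    "record": {
        "requestId": "abc123",
        "metrics": {
            "durationMs": 245.12,
            "billedDurationMs": 246,
            "memorySizeMB": 256,
            "maxMemoryUsedMB": 89,
            "initDurationMs": 312.5,
        },
    },
})

request_id, metrics = extract_report_metrics(sample)
# initDurationMs is only present when the invocation included an Init phase
is_cold_start = "initDurationMs" in metrics
memory_pct = round(100 * metrics["maxMemoryUsedMB"] / metrics["memorySizeMB"], 1)
```

Returning None for non-REPORT lines lets the same function be mapped over a mixed stream of system and application log lines.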
Structured logging with pure Python (without Powertools)
import json
import logging
import os

# Attributes every LogRecord carries; anything else was passed via extra={}
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))

# Formatter that renders each record as a single-line JSON object
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "requestId": getattr(record, "requestId", None),
            "traceId": os.environ.get("_X_AMZN_TRACE_ID"),
            "service": "pedidos",
        }
        # Extra fields passed via extra={}
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS and not key.startswith("_"):
                log_entry[key] = value
        # default=str keeps non-serializable extras from raising
        return json.dumps(log_entry, default=str)

# Configure the root logger to emit JSON
logger = logging.getLogger()
logger.setLevel(logging.INFO)
if logger.handlers:
    logger.handlers[0].setFormatter(JsonFormatter())

# Module-level flag used to detect cold starts
COLD_START = True

def handler(event, context):
    global COLD_START
    cold = COLD_START
    COLD_START = False
    # Enrich every log of this invocation with requestId
    extra = {"requestId": context.aws_request_id, "cold_start": cold}
    logger.info("Invocation started", extra={**extra, "evento_tipo": event.get("type")})
    try:
        # processar() is the business logic, defined elsewhere
        resultado = processar(event, extra)
        logger.info("Invocation finished", extra={**extra, "resultado": resultado["status"]})
        return resultado
    except Exception as e:
        logger.error("Invocation failed", extra={**extra, "error_type": type(e).__name__, "error_msg": str(e)})
        raise
[CONSENSUS] Using Lambda Powertools Logger (session-021) eliminates the need to implement this boilerplate manually. The @logger.inject_lambda_context decorator automatically injects the request ID (as function_request_id), cold_start, and other Lambda context fields into all logs of the invocation, and the Logger adds xray_trace_id when tracing is enabled.
2. X-Ray Active Tracing — distributed tracing in Lambda
[FACT] AWS X-Ray is AWS's distributed tracing service. In Lambda, tracing works through an X-Ray daemon that Lambda runs alongside the function and that receives data over UDP on port 2000, at the address exposed in the AWS_XRAY_DAEMON_ADDRESS environment variable. The X-Ray SDK sends segments to this daemon, which forwards them to the X-Ray service.
Anatomy of a trace in Lambda
[FACT] With active tracing enabled, Lambda automatically creates two segments per invocation:
Trace (X-Amzn-Trace-Id: Root=1-...;Sampled=1)
├── Segment 1: "Lambda" (service)
│   └── Represents the Lambda service receiving the invocation
│       Includes: cold start time, queuing time
│
└── Segment 2: "minhaFuncao" (function)
    ├── Subsegment: Initialization (cold starts only)
    │   └── Init phase time (module loading)
    ├── Subsegment: Invocation
    │   └── Handler execution time
    │       └── [your custom subsegments here]
    └── Subsegment: Overhead
        └── Checkpoint/extension time
[FACT] The environment variable _X_AMZN_TRACE_ID contains the trace ID of the current invocation in the format Root=1-<timestamp>-<hex>;Parent=<parentId>;Sampled=<0|1>. This string must be propagated in downstream calls (HTTP headers, SQS messages, etc.) to maintain trace continuity.
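To log the trace ID as its own field, or to forward it downstream, the header string has to be split into its parts. A minimal sketch, assuming only the Root/Parent/Sampled layout described above; parse_trace_header is a hypothetical helper, not part of the X-Ray SDK, and the sample header value is illustrative.

```python
def parse_trace_header(header: str) -> dict:
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of its parts."""
    parts = {}
    for item in header.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # skip empty or malformed fragments
            parts[key.strip()] = value.strip()
    return parts

# Illustrative value; inside Lambda it would come from
# os.environ.get("_X_AMZN_TRACE_ID", "")
header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"

fields = parse_trace_header(header)
trace_id = fields["Root"]                # value to log as traceId
sampled = fields.get("Sampled") == "1"   # whether this request is being traced
```

Logging only the Root value keeps the traceId field equal to the ID shown in the X-Ray console, which makes copy-and-paste lookups straightforward.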
Enabling active tracing
# CDK
fn = lambda_.Function(
    self, "MinhaFuncao",
    # ...
    tracing=lambda_.Tracing.ACTIVE,  # or PASS_THROUGH to inherit from upstream
)
# CDK automatically adds xray:PutTraceSegments and xray:PutTelemetryRecords
# to the function's execution role
# CLI
aws lambda update-function-configuration \
    --function-name MinhaFuncao \
    --tracing-config Mode=Active
Custom subsegments
[FACT] The X-Ray SDK allows creating subsegments for any operation within the handler — database calls, external APIs, heavy processing:
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Automatically patches boto3, requests, httplib, pymongo, etc.
patch_all()

# validar_pedido() and tabela (a DynamoDB Table resource) are defined elsewhere

def handler(event, context):
    pedido_id = event["pedido_id"]
    # Manual subsegment with a context manager
    with xray_recorder.in_subsegment("validar-pedido") as subseg:
        subseg.put_annotation("pedido_id", pedido_id)  # indexed, filterable
        subseg.put_annotation("valor", event["valor"])
        subseg.put_metadata("evento_completo", event)  # not indexed, only stored
        resultado = validar_pedido(pedido_id)
    # Decorator on internal functions
    resultado_bd = salvar_no_banco(pedido_id, resultado)
    return {"status": "ok"}

@xray_recorder.capture("salvar-no-banco")
def salvar_no_banco(pedido_id, dados):
    # boto3 is already patched, so the DynamoDB calls show up as
    # automatic subsegments inside "salvar-no-banco"
    tabela.put_item(Item={"id": pedido_id, **dados})
    return True
Annotations vs Metadata
[FACT] The distinction between put_annotation and put_metadata is critical for X-Ray usage:
Annotations                              Metadata
───────────────────────────────────────  ───────────────────────────────────────
Types: string, number, boolean           Types: any JSON-serializable value
Indexed by X-Ray                         NOT indexed
Usable in filter expressions             Visible only in the trace details
Limit: 50 annotations per trace          Limit: 64 KB per segment
Use: grouping, filters, alerts           Use: debugging, context data
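The type rule in the comparison above can be applied mechanically when deciding what to send as an annotation and what to send as metadata. The following is a sketch under that assumption; split_for_xray is a hypothetical helper, and it does not enforce the 50-annotation cap.

```python
def split_for_xray(fields: dict):
    """Separate values eligible as annotations (str, number, bool) from the rest."""
    annotations, metadata = {}, {}
    for key, value in fields.items():
        if isinstance(value, (str, int, float, bool)):
            annotations[key] = value   # indexed, usable in filter expressions
        else:
            metadata[key] = value      # stored only, visible in trace details
    return annotations, metadata

event_fields = {
    "pedido_id": "P001",
    "valor": 99.9,
    "vip": True,
    "itens": [{"sku": "A1", "qtd": 2}],
}
annotations, metadata = split_for_xray(event_fields)
```

Inside a subsegment, each entry of annotations would then go through put_annotation and each entry of metadata through put_metadata.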
[FACT] Filter expressions in the X-Ray console use annotations:
# Find all traces with an error for a specific order
annotation.pedido_id = "P001" AND error = true
# Slow traces (>2s) for a specific service
annotation.service = "pedidos" AND responsetime > 2
Sampling rules
[FACT] By default, X-Ray traces the first request each second (the reservoir) plus 5% of any additional requests. In high-volume production, this is essential for cost control. Custom rules can be configured:
aws xray create-sampling-rule --cli-input-json '{
  "SamplingRule": {
    "RuleName": "PedidosAltoValor",
    "Priority": 1,
    "FixedRate": 1.0,
    "ReservoirSize": 5,
    "ServiceName": "pedidos",
    "ServiceType": "AWS::Lambda::Function",
    "Host": "*",
    "HTTPMethod": "*",
    "URLPath": "*",
    "ResourceARN": "*",
    "Version": 1,
    "Attributes": { "pedido_valor": "alto" }
  }
}'
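To see what a sampling rule means for trace volume, the reservoir-plus-rate logic can be estimated with simple arithmetic: the reservoir covers the first requests each second, and the fixed rate applies to whatever exceeds it. This is a rough sketch that assumes a steady request rate and a single matching rule; expected_sampled_per_second is a hypothetical helper.

```python
def expected_sampled_per_second(requests_per_second: float,
                                reservoir: int = 1,
                                fixed_rate: float = 0.05) -> float:
    """Estimate traced requests per second for a reservoir + fixed-rate rule."""
    from_reservoir = min(requests_per_second, reservoir)
    beyond_reservoir = max(0.0, requests_per_second - reservoir)
    return from_reservoir + fixed_rate * beyond_reservoir

# X-Ray default rule (reservoir 1, rate 5%) at 1000 req/s
default_estimate = expected_sampled_per_second(1000)

# Rule with ReservoirSize 5 and FixedRate 1.0, as in PedidosAltoValor: everything is traced
full_estimate = expected_sampled_per_second(200, reservoir=5, fixed_rate=1.0)
```

Under these assumptions the default rule traces roughly 51 of 1000 requests per second, while a FixedRate of 1.0 traces every matching request, which is worth weighing against X-Ray cost before applying it broadly.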
3. Lambda Insights — per-invocation system metrics
[FACT] Lambda Insights is implemented as an external Lambda extension, distributed as an AWS-managed Lambda layer. When enabled, the extension collects system metrics for each invocation and writes them to CloudWatch Logs in the /aws/lambda-insights log group using EMF (Embedded Metric Format), which CloudWatch parses to create time-series metrics.
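For readers unfamiliar with EMF, the sketch below builds one EMF-shaped log line by hand to show how a single JSON document carries both the metric definitions (under _aws) and the metric values. It is not the extension's actual output: the namespace Demo/Pedidos, the dimension, and the values are made up for illustration, and build_emf_record is a hypothetical helper.

```python
import json
import time

def build_emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build an EMF record. metrics maps name -> (value, unit)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set
                "Metrics": [
                    {"Name": name, "Unit": unit}
                    for name, (_, unit) in metrics.items()
                ],
            }],
        },
    }
    record.update(dimensions)  # dimension values live at the top level
    record.update({name: value for name, (value, _) in metrics.items()})
    return record

record = build_emf_record(
    namespace="Demo/Pedidos",  # hypothetical namespace
    dimensions={"function_name": "MinhaFuncao"},
    metrics={
        "duration": (245, "Milliseconds"),
        "memory_utilization": (35, "Percent"),
    },
    timestamp_ms=1700000000000,
)
line = json.dumps(record)  # one line on stdout would be one EMF log event
```

Because the values sit at the top level of the JSON, the same log event remains queryable field by field in CloudWatch Logs Insights.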
Collected metrics
[FACT] Lambda Insights collects the following metrics per invocation:
Performance metrics:
┌─────────────────────────┬──────────────────────────────────────────────────┐
│ Metric                  │ Description                                      │
├─────────────────────────┼──────────────────────────────────────────────────┤
│ duration                │ Invocation duration in ms                        │
│ billed_duration         │ Billed duration (rounded up to 1 ms)             │
│ init_duration           │ Init phase time (cold starts only)               │
│ memory_utilization      │ % of configured memory used                      │
│ used_memory_max         │ Peak memory usage in MB                          │
│ cpu_total_time          │ Total CPU time in ms                             │
├─────────────────────────┼──────────────────────────────────────────────────┤
│ I/O metrics:            │                                                  │
│ rx_bytes                │ Bytes received over the network                  │
│ tx_bytes                │ Bytes sent over the network                      │
│ disk_used               │ /tmp usage in MB                                 │
│ disk_total              │ Total /tmp space in MB                           │
├─────────────────────────┼──────────────────────────────────────────────────┤
│ Diagnostics:            │                                                  │
│ cold_start              │ 1 if it was a cold start, 0 otherwise            │
│ out_of_memory           │ 1 if the function exceeded its memory            │
│ timeout                 │ 1 if the function hit its timeout                │
│ errors                  │ 1 if there was an unhandled error                │
└─────────────────────────┴──────────────────────────────────────────────────┘
Enabling Lambda Insights via CDK
# CDK
fn = lambda_.Function(
    self, "MinhaFuncao",
    # ...
    tracing=lambda_.Tracing.ACTIVE,
    insights_version=lambda_.LambdaInsightsVersion.VERSION_1_0_229_0,
    # CDK automatically adds:
    # - The managed layer arn:aws:lambda:<region>:580247275435:layer:LambdaInsightsExtension:...
    # - The CloudWatchLambdaInsightsExecutionRolePolicy policy to the execution role
)
[FACT] The layer ARN changes per region. LambdaInsightsVersion.VERSION_1_0_229_0 is one of the versions CDK exposes, and newer ones may exist; check docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Lambda-Insights.html for the versions available in your region.
Dashboard and Log Insights
[FACT] The CloudWatch console automatically creates a dashboard at /LambdaInsights when Lambda Insights is enabled. The metrics are available in the LambdaInsights namespace.
# CloudWatch Logs Insights: slowest invocations that had a cold start
# Log group: /aws/lambda-insights
fields @timestamp, function_name, duration, init_duration, memory_utilization, cold_start
| filter cold_start = 1
| sort duration desc
| limit 20
# Correlate application logs with Insights metrics for one invocation
# Log groups: select both /aws/lambda/MinhaFuncao and /aws/lambda-insights
# Correlation is a filter on the same request ID in both groups
# (assumes the Insights events expose it as request_id)
fields @timestamp, @log, @message
| filter requestId = "abc123-def456" or request_id = "abc123-def456"
| sort @timestamp asc
4. Correlation between the three pillars
[FACT] The field that unites logs, traces, and metrics in Lambda is the requestId (also called aws_request_id in the Python context object). The correlation flow is:
Invocation received
             │
             ▼
Lambda service generates requestId ────────────────────────────────────┐
             │                                                         │
             ▼                                                         ▼
┌──────────────────────────┐ ┌──────────────────────────┐ ┌──────────────────────────┐
│ LOGS                     │ │ X-RAY                    │ │ LAMBDA INSIGHTS          │
│                          │ │                          │ │                          │
│ Structured log with      │ │ Trace ID assigned to     │ │ EMF metrics emitted      │
│ "requestId": "abc"       │ │ the invocation           │ │ with requestId and       │
│ "traceId": "1-..."       │ │                          │ │ function_name            │
│ "cold_start": true       │ │ Function segment         │ │                          │
│ "usuario_id": "U42"      │ │ carries the requestId    │ │ init_duration: 312ms     │
│                          │ │                          │ │ memory_utilization: 35%  │
└────────────┬─────────────┘ └────────────┬─────────────┘ └────────────┬─────────────┘
             │                            │                            │
             └────────────────────────────┴────────────────────────────┘
                        requestId as the correlation key
CloudWatch console → ServiceLens: joins logs + traces in a single view
[FACT] CloudWatch ServiceLens (tab in the CloudWatch console) automatically consumes the correlation between logs and X-Ray traces when:
1. The function has active tracing enabled.
2. The logs include the @xrayTraceId field (Lambda Powertools injects this automatically; without Powertools, use os.environ["_X_AMZN_TRACE_ID"]).
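The correlation described in this section boils down to grouping records from different sources under the same request ID. A minimal offline sketch, assuming application logs carry requestId (as in the structured log examples) and Insights events carry request_id; both the field name on the Insights side and the correlate helper are assumptions for illustration.

```python
from collections import defaultdict

def correlate(app_logs, insights_events):
    """Group application log entries and Insights events by request ID."""
    by_request = defaultdict(lambda: {"logs": [], "insights": None})
    for entry in app_logs:
        by_request[entry["requestId"]]["logs"].append(entry)
    for event in insights_events:
        by_request[event["request_id"]]["insights"] = event
    return dict(by_request)

app_logs = [
    {"requestId": "abc", "level": "INFO", "message": "Invocation started"},
    {"requestId": "abc", "level": "ERROR", "message": "Validation error"},
    {"requestId": "xyz", "level": "INFO", "message": "Invocation started"},
]
insights_events = [
    {"request_id": "abc", "init_duration": 312, "memory_utilization": 35},
]

view = correlate(app_logs, insights_events)
failed = view["abc"]  # both log lines plus the system metrics of that invocation
```

The same grouping key works for traces as long as each trace record also exposes the request ID.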
Practical example
Scenario: Order processing function with complete structured logging, X-Ray with custom subsegments, and Lambda Insights.
Python handler with all three pillars
import json
import logging
import os
import time
from datetime import datetime, timezone

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Automatically patches boto3 clients for X-Ray
patch_all()

# ── Structured Logger ──────────────────────────────────────────────────────────
class StructuredLogger:
    def __init__(self, service_name: str, level: str = "INFO"):
        self.service = service_name
        self.level = getattr(logging, level)
        self._base_fields: dict = {}

    def set_invocation_context(self, request_id: str, cold_start: bool):
        self._base_fields = {
            "requestId": request_id,
            "cold_start": cold_start,
            "traceId": os.environ.get("_X_AMZN_TRACE_ID", ""),
            "service": self.service,
        }

    def _emit(self, level: str, message: str, **kwargs):
        # UTC timestamp with real milliseconds, e.g. 2026-06-24T10:15:32.410Z
        now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        entry = {
            "timestamp": now.replace("+00:00", "Z"),
            "level": level,
            "message": message,
            **self._base_fields,
            **kwargs,
        }
        # Lambda captures stdout; default=str avoids failures on non-JSON values
        print(json.dumps(entry, default=str))

    def info(self, msg: str, **kwargs): self._emit("INFO", msg, **kwargs)
    def warn(self, msg: str, **kwargs): self._emit("WARN", msg, **kwargs)
    def error(self, msg: str, **kwargs): self._emit("ERROR", msg, **kwargs)
    def debug(self, msg: str, **kwargs):
        if self.level <= logging.DEBUG:
            self._emit("DEBUG", msg, **kwargs)

logger = StructuredLogger("pedidos")

# ── Clients (initialized outside the handler = reused on warm starts) ─────────
dynamodb = boto3.resource("dynamodb")
tabela = dynamodb.Table(os.environ["TABELA_PEDIDOS"])

# Detects cold starts
_COLD_START = True

# ── Handler ───────────────────────────────────────────────────────────────────
def handler(event, context):
    global _COLD_START
    cold = _COLD_START
    _COLD_START = False
    logger.set_invocation_context(context.aws_request_id, cold)
    logger.info("Invocation started", pedido_id=event.get("pedido_id"))
    inicio = time.time()
    try:
        resultado = processar_pedido(event, context.aws_request_id)
        duracao_ms = int((time.time() - inicio) * 1000)
        logger.info(
            "Order processed",
            pedido_id=event["pedido_id"],
            status=resultado["status"],
            duracao_ms=duracao_ms,
        )
        return resultado
    except ValueError as e:
        logger.error(
            "Validation error",
            pedido_id=event.get("pedido_id"),
            error_type="ValueError",
            error_msg=str(e),
        )
        raise
    except Exception as e:
        logger.error(
            "Unexpected error",
            pedido_id=event.get("pedido_id"),
            error_type=type(e).__name__,
            error_msg=str(e),
        )
        raise

def processar_pedido(event: dict, request_id: str) -> dict:
    pedido_id = event["pedido_id"]
    # ── Subsegment: validation ────────────────────────────────────────────────
    with xray_recorder.in_subsegment("validar-pedido") as seg:
        seg.put_annotation("pedido_id", pedido_id)
        seg.put_annotation("valor", event.get("valor", 0))
        seg.put_metadata("evento_completo", event, namespace="pedidos")
        if not pedido_id or not isinstance(event.get("valor"), (int, float)):
            raise ValueError("Invalid order: required fields are missing")
        if event["valor"] <= 0:
            raise ValueError(f"Order value must be positive: {event['valor']}")
    # ── Subsegment: persistence ───────────────────────────────────────────────
    with xray_recorder.in_subsegment("persistir-pedido") as seg:
        seg.put_annotation("pedido_id", pedido_id)
        # boto3 is patched, so the DynamoDB call shows up as a nested subsegment
        tabela.put_item(Item={
            "pedido_id": pedido_id,
            "valor": str(event["valor"]),
            "status": "PROCESSADO",
            # Store the Lambda request ID (not the X-Ray segment ID) for correlation
            "request_id": request_id,
        })
    return {"status": "PROCESSADO", "pedido_id": pedido_id}
CDK — function with all three pillars enabled
from aws_cdk import (
    Stack, Duration, RemovalPolicy,
    aws_lambda as lambda_,
    aws_logs as logs,
    aws_iam as iam,
)

class PedidosObservabilidadeStack(Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        # Layer with aws-xray-sdk (built via Docker for Linux compatibility)
        xray_layer = lambda_.LayerVersion(
            self, "XRayLayer",
            code=lambda_.Code.from_asset(
                "layers/xray",
                bundling={
                    "image": lambda_.Runtime.PYTHON_3_12.bundling_image,
                    "command": [
                        "bash", "-c",
                        "pip install aws-xray-sdk -t /asset-output/python"
                    ],
                }
            ),
            compatible_runtimes=[lambda_.Runtime.PYTHON_3_12],
            description="aws-xray-sdk for custom instrumentation",
        )
        fn = lambda_.Function(
            self, "ProcessarPedido",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="handler.handler",
            code=lambda_.Code.from_asset("src/pedidos"),
            memory_size=256,
            timeout=Duration.seconds(30),
            layers=[xray_layer],
            environment={
                "TABELA_PEDIDOS": "pedidos",
                "POWERTOOLS_SERVICE_NAME": "pedidos",
            },
            # Pillar 1: native structured logging
            logging_format=lambda_.LoggingFormat.JSON,
            system_log_level=lambda_.SystemLogLevel.INFO,
            application_log_level=lambda_.ApplicationLogLevel.INFO,
            log_retention=logs.RetentionDays.ONE_WEEK,
            # Pillar 2: X-Ray active tracing
            tracing=lambda_.Tracing.ACTIVE,
            # Pillar 3: Lambda Insights
            insights_version=lambda_.LambdaInsightsVersion.VERSION_1_0_229_0,
        )
        # Additional permission for DynamoDB (X-Ray is already added by CDK)
        fn.add_to_role_policy(iam.PolicyStatement(
            actions=["dynamodb:PutItem", "dynamodb:GetItem"],
            resources=["arn:aws:dynamodb:*:*:table/pedidos"],
        ))
CloudWatch Logs Insights queries for diagnostics
# 1. Errors grouped by type (set the query time range to the last 3 hours)
fields @timestamp, message, error_type, pedido_id
| filter level = "ERROR"
| stats count(*) as total by error_type
| sort total desc
# 2. p95 and p99 latency per hour (using the duracao_ms field from the log)
fields @timestamp, duracao_ms
| filter ispresent(duracao_ms)
| stats
    pct(duracao_ms, 95) as p95,
    pct(duracao_ms, 99) as p99,
    avg(duracao_ms) as media
  by bin(1h)
# 3. Cold starts and their requestIds (to cross-reference with X-Ray)
fields @timestamp, requestId, cold_start, message
| filter cold_start = true
| sort @timestamp desc
| limit 100
# 4. In the /aws/lambda-insights log group: functions with memory > 80%
fields @timestamp, function_name, memory_utilization, duration, cold_start
| filter memory_utilization > 80
| sort memory_utilization desc
| limit 50
Common pitfalls
Pitfall 1 — print() with a JSON object is not the same as true structured logging
The mistake: The developer calls print(json.dumps({"level": "INFO", "message": "ok"})) and assumes that is all structured logging requires. CloudWatch Logs Insights does parse the line as JSON, but the object carries no timestamp or correlation fields of its own, and if the serialized JSON spans several lines, CloudWatch may store it as multiple log events.
Why it happens: CloudWatch Logs captures each line (\n) as a separate event. json.dumps produces multi-line output only when indent is set (pretty-printing); in that case the event is fragmented. Newlines inside string values are escaped and do not split the event.
How to avoid:
- Call json.dumps(obj) without indent; the default output is always a single line.
- Optionally pass separators=(',', ':') to drop the spaces after commas and colons. This makes the line shorter, but it is not what keeps the JSON on one line.
- For the timestamp, rely on the @timestamp field that CloudWatch adds automatically — it's not necessary to include a timestamp in the JSON (but including one doesn't hurt and facilitates correlation).
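The single-line behavior is easy to check locally with the standard library alone. The snippet below is a small demonstration, not Lambda-specific code; the entry dict is an arbitrary example.

```python
import json

# A message containing a real newline character
entry = {"level": "ERROR", "message": "line one\nline two"}

default_line = json.dumps(entry)                         # default settings
compact_line = json.dumps(entry, separators=(",", ":"))  # no spaces after , and :
pretty = json.dumps(entry, indent=2)                     # pretty-printed

# The newline inside the string is escaped as \n, so both the default and
# compact forms stay on one physical line; only indent adds real line breaks.
single_line_default = "\n" not in default_line
single_line_compact = "\n" not in compact_line
multi_line_pretty = "\n" in pretty
```

In other words, the risk of fragmented events comes from pretty-printing, not from the content of the fields.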
Pitfall 2 — X-Ray SDK's patch_all() outside the handler causes errors in test environments
The mistake: The developer calls patch_all() in the module's global scope. In unit tests without the X-Ray daemon running, the SDK tries to register the trace and fails with SegmentNotFoundException: cannot find the current segment/subsegment.
Why it happens: patch_all() monkeypatches boto3 clients globally. In tests, there is no active X-Ray context — the daemon is not running and there is no open segment.
How to avoid:
- Configure the SDK to ignore errors when there is no context: xray_recorder.configure(context_missing='LOG_ERROR') (default in Lambda) or 'IGNORE_ERROR'.
- In tests, configure via environment variable: AWS_XRAY_CONTEXT_MISSING=LOG_ERROR.
- CDK/Lambda already configures this automatically when tracing is enabled, but local test environments may not have this variable.
from aws_xray_sdk.core import xray_recorder, patch_all
xray_recorder.configure(context_missing='IGNORE_ERROR')
patch_all()
Pitfall 3 — Lambda Insights without CloudWatchLambdaInsightsExecutionRolePolicy permission causes the extension to fail silently
The mistake: Lambda Insights is enabled (layer added), but metrics don't appear in /aws/lambda-insights. The function runs normally, but no data arrives.
Why it happens: The Lambda Insights extension needs permission to write to the /aws/lambda-insights log group: logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents. Without these permissions, the extension cannot deliver its data, and because it runs outside the handler, the failure doesn't surface as an error in the main invocation.
How to avoid:
- With CDK: insights_version=lambda_.LambdaInsightsVersion.* adds the managed policy automatically.
- Manually: add CloudWatchLambdaInsightsExecutionRolePolicy (AWS managed) to the function's execution role.
- To verify: check whether events arrive in the /aws/lambda-insights log group, or set LAMBDA_INSIGHTS_LOG_LEVEL=info in the environment variables to get more detailed extension logs.
# CDK does this automatically, but if you need to do it manually:
fn.role.add_managed_policy(
    iam.ManagedPolicy.from_aws_managed_policy_name(
        "CloudWatchLambdaInsightsExecutionRolePolicy"
    )
)
Reflection exercise
You have a Lambda function that processes payments and is receiving user complaints that "some payments don't process." The system has no observability configured beyond Lambda's default logs (plain text, no correlation). You need to propose an observability solution that allows, given a transaction ID reported by the user, finding in less than 5 minutes: (a) the complete log of the invocation that processed that transaction, (b) whether there was a retry or cold start, (c) which external calls (database, payment API) were made and which one was slowest, and (d) whether the problem is systemic (affects X% of transactions) or isolated.
Question: Which fields would you include in the structured logs? How would you configure X-Ray to trace the payment API call (which is external HTTP, not AWS)? What CloudWatch Logs Insights query would you use to identify whether the problem is systemic? Where would Lambda Insights help (or not help) in this diagnosis?
Resources for further study
- Monitor function performance with Amazon CloudWatch Lambda Insights
  URL: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-insights.html
  Official guide for enabling Lambda Insights with layer ARNs per region, step-by-step via console/CDK/CLI, and complete list of collected metrics. Includes how to interpret the /LambdaInsights dashboard.
- Visualize Lambda function invocations using AWS X-Ray
  URL: https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html
  Explains the native X-Ray integration with Lambda: how segments are created automatically, how to enable active tracing, and how to use the X-Ray Python SDK inside Lambda functions. Includes examples of subsegments and sampling configuration.
- Configuring JSON and plain text log formats
  URL: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-cloudwatchlogs-logformat.html
  Documentation of the new native JSON format for system logs (START, END, REPORT). Describes the fields emitted in each event type, how to configure via console/CLI/CDK, and how to use with CloudWatch Logs Insights.