Session 037 — CloudWatch Logs Insights: query syntax and derived fields
Dependencies: session-036-cloudwatch-custom-metrics-emf
Objective
By the end of this session, you will be able to write Logs Insights queries to aggregate errors by endpoint, extract fields from structured JSON logs with parse, create timeseries visualizations, use auto-discovered fields (@timestamp, @message, @requestId), and save reusable queries in the console.
Context
[FACT] CloudWatch Logs Insights is an interactive log analytics service that accepts its own pipe-oriented query language. You can query multiple log groups simultaneously, visualize results as timeseries or tables, and save queries for reuse. Results are generated in seconds to minutes depending on volume.
[FACT] CloudWatch Logs Insights is priced per GB of data scanned (us-east-1: $0.005/GB). The more precise the time filter and log group selection, the lower the cost. Queries have no request cost — only scanning cost.
Key concepts
1. Auto-discovered fields
[FACT] Logs Insights automatically discovers and indexes certain fields:
Campo Origem / Conteúdo
────────────────────────────────────────────────────────────────
@timestamp Timestamp do log event (sempre disponível)
@message Texto completo do log event
@logStream Nome do log stream
@log Identificador do log group (account-id:log-group-name)
@requestId Presente em logs Lambda REPORT e START/END
@duration Duração de invocação Lambda (em ms) nos REPORT events
@billedDuration Duração faturável Lambda (ms)
@memorySize Memória configurada (bytes)
@maxMemoryUsed Memória máxima usada (bytes)
@initDuration Duração do cold start (ms) — apenas em cold starts
@type Tipo de event Lambda: "START", "END", "REPORT"
[FACT] For logs in JSON format, Logs Insights automatically discovers first-level fields as searchable fields. A log {"level": "ERROR", "endpoint": "/pay", "latency": 250} has level, endpoint, and latency available directly in queries without needing parse.
2. Main language commands
[FACT] The language is pipe-based: each command receives the results from the previous one. Commands are case-insensitive but field names are case-sensitive.
Comando Propósito
────────────────────────────────────────────────────────────────
fields Seleciona campos a exibir; cria campos derivados
filter Filtra eventos (WHERE equivalente)
stats Agrega dados — count, avg, sum, min, max, percentile
sort Ordena resultados (deve vir após o último stats)
limit Limita número de linhas retornadas
parse Extrai subcampos de um campo de texto (glob, regex, logfmt)
dedup Remove duplicatas por campo(s)
display Formata a exibição sem afetar filtros (alias de fields pós-stats)
pattern Agrupa mensagens similares (ML-based clustering)
diff Compara métricas com período anterior
3. fields — projection and derived fields
[FACT] The fields command selects which fields to display and can create new fields with expressions:
-- Seleção simples
fields @timestamp, @message, level, endpoint
-- Campos derivados com operadores aritméticos
fields @timestamp, @duration / 1000 as durationSeconds
-- Campos derivados condicionais com if()
fields @timestamp,
if(statusCode >= 400, "error", "ok") as requestStatus
-- coalesce: primeiro não-nulo
fields @timestamp,
coalesce(httpMethod, method, "UNKNOWN") as verb
4. filter — event filtering
[FACT] Supports comparison operators (=, !=, <, <=, >, >=), like (substring), not like, in [...], ispresent(), not ispresent():
-- Filtro simples
filter level = "ERROR"
-- Like (substring, case-sensitive)
filter @message like "TimeoutException"
-- Regex com ~
filter @message =~ /TimeoutException|ConnectionReset/
-- Múltiplas condições
filter statusCode >= 400 and endpoint like "/api/payments"
-- Verificar existência de campo
filter ispresent(errorCode)
-- IN para lista de valores
filter statusCode in [400, 401, 403, 404]
-- Combinando com NOT
filter not (statusCode = 200 or statusCode = 201)
5. stats — aggregations
[FACT] Functions available in stats:
Função Descrição
────────────────────────────────────────────────────────────────
count(*) Total de eventos no grupo
count(field) Total de eventos onde field não é nulo
count_distinct(f) Contagem de valores únicos
sum(field) Soma
avg(field) Média aritmética
min(field) Mínimo
max(field) Máximo
pct(field, n) Percentil (ex: pct(@duration, 95) = p95)
stddev(field) Desvio padrão
earliest(field) Valor mais antigo no grupo
latest(field) Valor mais recente no grupo
[FACT] bin(period) groups by time window — fundamental for timeseries:
-- Erros por endpoint por minuto
filter level = "ERROR"
| stats count(*) as errorCount by endpoint, bin(1m)
| sort bin desc
-- Períodos válidos: 1s, 10s, 30s, 1m, 5m, 10m, 15m, 30m, 1h, 6h, 1d
6. parse — extracting fields from unstructured text
[FACT] Three modes of parse:
Glob (wildcard *):
-- Log: "user=alice, method:GET, latency := 45"
parse @message "user=*, method:*, latency := *"
as user, method, latency
| stats avg(latency) by method
Regex with named groups:
-- Log: "[ERROR] POST /api/pay - 503 - 1250ms"
parse @message /\[(?<logLevel>[A-Z]+)\] (?<httpMethod>[A-Z]+) (?<path>[^ ]+) - (?<statusCode>\d+) - (?<durationMs>\d+)ms/
| filter logLevel = "ERROR"
| stats count(*) by path, statusCode
For structured JSON logs — parse is NOT necessary:
-- Log JSON: {"level":"ERROR","endpoint":"/pay","latencyMs":1250,"userId":"u-123"}
-- Logs Insights já descobre os campos automaticamente
fields @timestamp, endpoint, latencyMs, level
| filter level = "ERROR"
| stats avg(latencyMs) as avgLatency by endpoint
Practical example
Scenario: analyzing payment API logs
Logs are emitted in structured JSON by the Lambda Powertools Logger:
{
"@timestamp": "2024-01-15T10:23:45.123Z",
"level": "INFO",
"service": "payment-processor",
"message": "Payment processed",
"endpoint": "/api/v1/payments",
"httpMethod": "POST",
"statusCode": 200,
"latencyMs": 45,
"userId": "u-abc123",
"orderId": "ord-xyz789",
"cold_start": false,
"xray_trace_id": "1-abc123-def456"
}
Query 1: Error rate by endpoint in the last 60 minutes
fields @timestamp, endpoint, statusCode, level
| filter level in ["ERROR", "CRITICAL"] or statusCode >= 400
| stats
count(*) as totalErrors,
count_distinct(userId) as uniqueUsersAffected,
max(latencyMs) as maxLatency
by endpoint, statusCode
| sort totalErrors desc
| limit 20
Query 2: Latency percentiles by endpoint (timeseries, 5 min)
filter ispresent(latencyMs) and ispresent(endpoint)
| stats
avg(latencyMs) as avgLatency,
pct(latencyMs, 50) as p50,
pct(latencyMs, 95) as p95,
pct(latencyMs, 99) as p99,
count(*) as requestCount
by endpoint, bin(5m)
| sort bin desc
Query 3: Lambda cold starts — identify and measure impact
-- Usa campos especiais do Lambda REPORT
filter @type = "REPORT"
| stats
count(*) as totalInvocations,
sum(if(ispresent(@initDuration), 1, 0)) as coldStartCount,
avg(@initDuration) as avgColdStartMs,
max(@initDuration) as maxColdStartMs,
avg(@duration) as avgDurationMs,
avg(@maxMemoryUsed / 1000 / 1000) as avgMemoryMB,
max(@memorySize / 1000 / 1000) - max(@maxMemoryUsed / 1000 / 1000)
as overProvisionedMB
by bin(1h)
| sort bin desc
Query 4: Trace a user's journey by orderId
fields @timestamp, level, message, endpoint, statusCode, latencyMs, orderId
| filter orderId = "ord-xyz789"
| sort @timestamp asc
Query 5: Top 10 errors grouped by message
filter level = "ERROR"
| stats count(*) as occurrences by message
| sort occurrences desc
| limit 10
Query 6: parse on unstructured logs — Lambda REPORT latency
-- Lambda REPORT logs: "REPORT RequestId: abc Duration: 123.45 ms Billed Duration: 200 ms ..."
filter @type = "REPORT"
| parse @message "Duration: * ms" as rawDuration
| stats
avg(rawDuration) as avgDuration,
pct(rawDuration, 99) as p99Duration,
max(rawDuration) as maxDuration
by bin(5m)
| sort bin desc
Query 7: dedup — last error occurrence per user
fields @timestamp, userId, message, endpoint, statusCode
| filter level = "ERROR" and ispresent(userId)
| sort @timestamp desc
| dedup userId
| limit 50
Query 8: Slow requests with retry deduplication
fields @timestamp, @requestId, endpoint, latencyMs, statusCode, userId
| filter latencyMs > 1000
| sort @timestamp desc
| dedup @requestId
| limit 20
CLI — Execute queries via AWS CLI
# 1. Iniciar query assíncrona
QUERY_ID=$(aws logs start-query \
--log-group-names "/aws/lambda/payment-processor" \
--start-time "$(date -u -d '1 hour ago' +%s 2>/dev/null || date -u -v-1H +%s)" \
--end-time "$(date -u +%s)" \
--query-string '
filter level = "ERROR" or statusCode >= 400
| stats count(*) as errorCount by endpoint, statusCode
| sort errorCount desc
| limit 10
' \
--query 'queryId' \
--output text)
echo "Query ID: $QUERY_ID"
# 2. Aguardar conclusão e obter resultados
sleep 5
aws logs get-query-results \
--query-id "$QUERY_ID" \
--query '{Status: status, Results: results}'
# 3. Query em múltiplos log groups simultaneamente
QUERY_ID2=$(aws logs start-query \
--log-group-names \
"/aws/lambda/payment-processor" \
"/aws/lambda/order-service" \
"/aws/lambda/notification-service" \
--start-time "$(date -u -d '1 hour ago' +%s 2>/dev/null || date -u -v-1H +%s)" \
--end-time "$(date -u +%s)" \
--query-string '
filter level = "ERROR"
| stats count(*) as errors by @log, bin(5m)
| sort bin desc
' \
--query 'queryId' \
--output text)
# 4. Listar queries salvas no console
aws logs describe-query-definitions \
--query 'queryDefinitions[*].{Name:name,QueryString:queryString}'
# 5. Salvar uma query reutilizável
aws logs put-query-definition \
--name "payment-error-rate-by-endpoint" \
--log-group-names "/aws/lambda/payment-processor" \
--query-string '
filter level = "ERROR" or statusCode >= 400
| stats count(*) as errors, count(*) / 1 as errorRate by endpoint, bin(5m)
| sort bin desc
'
# 6. Verificar custo aproximado de scaneamento
# CloudWatch Logs → Insights → Histórico de queries no console
# Ou via CloudWatch Metrics:
aws cloudwatch get-metric-statistics \
--namespace AWS/Logs \
--metric-name DataScannedInBytes \
--start-time "$(date -u -d '1 day ago' '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -v-1d '+%Y-%m-%dT%H:%M:%SZ')" \
--end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
--period 86400 \
--statistics Sum \
--query 'Datapoints[0].Sum'
CDK Dashboard — Widget with Logs Insights query
from aws_cdk import (
Stack, Duration,
aws_cloudwatch as cw,
)
from constructs import Construct
class LogsInsightsDashboardStack(Stack):
def __init__(self, scope: Construct, construct_id: str, **kwargs):
super().__init__(scope, construct_id, **kwargs)
dashboard = cw.Dashboard(
self, "ApiDashboard",
dashboard_name="api-insights",
)
# Widget de Logs Insights — timeseries de erros por endpoint
dashboard.add_widgets(
cw.LogQueryWidget(
title="Error Rate by Endpoint (5m)",
log_group_names=["/aws/lambda/payment-processor"],
view=cw.LogQueryVisualizationType.LINE,
width=24,
height=6,
query_lines=[
"filter level = 'ERROR' or statusCode >= 400",
"| stats count(*) as errors by endpoint, bin(5m)",
"| sort bin desc",
],
),
)
dashboard.add_widgets(
# Widget de tabela — top erros
cw.LogQueryWidget(
title="Top Error Messages (last 1h)",
log_group_names=["/aws/lambda/payment-processor"],
view=cw.LogQueryVisualizationType.TABLE,
width=12,
height=6,
query_lines=[
"filter level = 'ERROR'",
"| stats count(*) as occurrences by message",
"| sort occurrences desc",
"| limit 10",
],
),
# Widget de barras — latência p99 por endpoint
cw.LogQueryWidget(
title="p99 Latency by Endpoint",
log_group_names=["/aws/lambda/payment-processor"],
view=cw.LogQueryVisualizationType.BAR,
width=12,
height=6,
query_lines=[
"filter ispresent(latencyMs)",
"| stats pct(latencyMs, 99) as p99 by endpoint",
"| sort p99 desc",
"| limit 10",
],
),
)
Common pitfalls
1. Cost per GB scanned — queries on large log groups
[FACT] Logs Insights charges $0.005/GB scanned (us-east-1). A 100 GB log group with 30-day retention can cost $0.50 per query if the time range is not restricted. Always use the smallest time range that answers your question. For frequent historical analyses, consider exporting logs to S3 and using Athena (cheaper for large volumes).
2. parse on JSON logs is unnecessary and slow
[FACT] Logs Insights already automatically indexes first-level fields from JSON logs. Using parse @message "\"latency\": *" as latency on a JSON log is redundant and uses more CPU. For JSON logs, use the fields directly.
3. sort and limit must come after the last stats
[FACT] The language requires that sort and limit appear after the last stats command. Placing sort before stats generates a syntax error or incorrect results.
4. Fields with special characters require backticks
[FACT] Fields containing hyphens, dots, or starting with a number need backticks. Example: a log with "detail-type" must be referenced as `detail-type` in the query. Fields like requestParameters.userName (nested) are accessed with dot-notation directly.
5. Queries on multiple log groups return the @log field
[FACT] When querying multiple log groups, the @log field identifies which log group each event came from — account-id:log-group-name. Without filtering by @log, you may inadvertently mix events from different services in the same aggregation.
6. count_distinct is approximate for high cardinality
[FACT] For fields with many unique values (e.g., userId), count_distinct uses HyperLogLog internally and may have an estimation error of ~2-3%. For exact counts on high-cardinality fields, export the data to Athena.
Reflection exercise
You have a log group /aws/lambda/checkout-service with structured JSON logs following the Powertools Logger pattern. The relevant fields are: level, endpoint, statusCode, latencyMs, userId, cartId, errorCode.
Write queries for:
-
Availability SLA by endpoint: Return, for each endpoint, the total requests, total errors (statusCode >= 400 or level = ERROR), and success percentage — grouped by 1-hour windows over the last 24 hours. Sort by error rate descending.
-
Affected user diagnosis: Given a specific
userId(u-suspicious-123), list in chronological order all error events from the last 6 hours, showing@timestamp,endpoint,statusCode,errorCode, andcartId. -
Parse on legacy log: The legacy service emits text-format logs:
"[2024-01-15 10:23:45] WARN /checkout/confirm - user=u-abc123 cart=cart-456 error=STOCK_DEPLETED". Write a query withparse(glob or regex) that extractswarnUser,warnCart, andwarnError, and aggregates bywarnError.
Resources for further study
- [FACT] CloudWatch Logs Insights query syntax: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
- [FACT] Sample queries (Lambda, VPC Flow, CloudTrail, API Gateway): https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html
- [FACT] Supported logs and discovered fields: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData-discoverable-fields.html
- [FACT] Functions reference (boolean, datetime, numeric): https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-operations-functions.html