luizmachado.dev

PT EN

Session 037 — CloudWatch Logs Insights: query syntax and derived fields

Dependencies: session-036-cloudwatch-custom-metrics-emf


Objective

By the end of this session, you will be able to write Logs Insights queries to aggregate errors by endpoint, extract fields from structured JSON logs with parse, create timeseries visualizations, use auto-discovered fields (@timestamp, @message, @requestId), and save reusable queries in the console.


Context

[FACT] CloudWatch Logs Insights is an interactive log analytics service that accepts its own pipe-oriented query language. You can query multiple log groups simultaneously, visualize results as timeseries or tables, and save queries for reuse. Results are generated in seconds to minutes depending on volume.

[FACT] CloudWatch Logs Insights is priced per GB of data scanned (us-east-1: $0.005/GB). The more precise the time filter and log group selection, the lower the cost. Queries have no request cost — only scanning cost.


Key concepts

1. Auto-discovered fields

[FACT] Logs Insights automatically discovers and indexes certain fields:

Campo           Origem / Conteúdo
────────────────────────────────────────────────────────────────
@timestamp      Timestamp do log event (sempre disponível)
@message        Texto completo do log event
@logStream      Nome do log stream
@log            Identificador do log group (account-id:log-group-name)
@requestId      Presente em logs Lambda REPORT e START/END
@duration       Duração de invocação Lambda (em ms) nos REPORT events
@billedDuration Duração faturável Lambda (ms)
@memorySize     Memória configurada (bytes)
@maxMemoryUsed  Memória máxima usada (bytes)
@initDuration   Duração do cold start (ms) — apenas em cold starts
@type           Tipo de event Lambda: "START", "END", "REPORT"

[FACT] For logs in JSON format, Logs Insights automatically discovers first-level fields as searchable fields. A log {"level": "ERROR", "endpoint": "/pay", "latency": 250} has level, endpoint, and latency available directly in queries without needing parse.


2. Main language commands

[FACT] The language is pipe-based: each command receives the results from the previous one. Commands are case-insensitive but field names are case-sensitive.

Comando         Propósito
────────────────────────────────────────────────────────────────
fields          Seleciona campos a exibir; cria campos derivados
filter          Filtra eventos (WHERE equivalente)
stats           Agrega dados — count, avg, sum, min, max, percentile
sort            Ordena resultados (deve vir após o último stats)
limit           Limita número de linhas retornadas
parse           Extrai subcampos de um campo de texto (glob, regex, logfmt)
dedup           Remove duplicatas por campo(s)
display         Formata a exibição sem afetar filtros (alias de fields pós-stats)
pattern         Agrupa mensagens similares (ML-based clustering)
diff            Compara métricas com período anterior

3. fields — projection and derived fields

[FACT] The fields command selects which fields to display and can create new fields with expressions:

-- Seleção simples
fields @timestamp, @message, level, endpoint

-- Campos derivados com operadores aritméticos
fields @timestamp, @duration / 1000 as durationSeconds

-- Campos derivados condicionais com if()
fields @timestamp,
       if(statusCode >= 400, "error", "ok") as requestStatus

-- coalesce: primeiro não-nulo
fields @timestamp,
       coalesce(httpMethod, method, "UNKNOWN") as verb

4. filter — event filtering

[FACT] Supports comparison operators (=, !=, <, <=, >, >=), like (substring), not like, in [...], ispresent(), not ispresent():

-- Filtro simples
filter level = "ERROR"

-- Like (substring, case-sensitive)
filter @message like "TimeoutException"

-- Regex com ~
filter @message =~ /TimeoutException|ConnectionReset/

-- Múltiplas condições
filter statusCode >= 400 and endpoint like "/api/payments"

-- Verificar existência de campo
filter ispresent(errorCode)

-- IN para lista de valores
filter statusCode in [400, 401, 403, 404]

-- Combinando com NOT
filter not (statusCode = 200 or statusCode = 201)

5. stats — aggregations

[FACT] Functions available in stats:

Função              Descrição
────────────────────────────────────────────────────────────────
count(*)            Total de eventos no grupo
count(field)        Total de eventos onde field não é nulo
count_distinct(f)   Contagem de valores únicos
sum(field)          Soma
avg(field)          Média aritmética
min(field)          Mínimo
max(field)          Máximo
pct(field, n)       Percentil (ex: pct(@duration, 95) = p95)
stddev(field)       Desvio padrão
earliest(field)     Valor mais antigo no grupo
latest(field)       Valor mais recente no grupo

[FACT] bin(period) groups by time window — fundamental for timeseries:

-- Erros por endpoint por minuto
filter level = "ERROR"
| stats count(*) as errorCount by endpoint, bin(1m)
| sort bin desc

-- Períodos válidos: 1s, 10s, 30s, 1m, 5m, 10m, 15m, 30m, 1h, 6h, 1d

6. parse — extracting fields from unstructured text

[FACT] Three modes of parse:

Glob (wildcard *):

-- Log: "user=alice, method:GET, latency := 45"
parse @message "user=*, method:*, latency := *"
    as user, method, latency
| stats avg(latency) by method

Regex with named groups:

-- Log: "[ERROR] POST /api/pay - 503 - 1250ms"
parse @message /\[(?<logLevel>[A-Z]+)\] (?<httpMethod>[A-Z]+) (?<path>[^ ]+) - (?<statusCode>\d+) - (?<durationMs>\d+)ms/
| filter logLevel = "ERROR"
| stats count(*) by path, statusCode

For structured JSON logs — parse is NOT necessary:

-- Log JSON: {"level":"ERROR","endpoint":"/pay","latencyMs":1250,"userId":"u-123"}
-- Logs Insights já descobre os campos automaticamente

fields @timestamp, endpoint, latencyMs, level
| filter level = "ERROR"
| stats avg(latencyMs) as avgLatency by endpoint

Practical example

Scenario: analyzing payment API logs

Logs are emitted in structured JSON by the Lambda Powertools Logger:

{
  "@timestamp": "2024-01-15T10:23:45.123Z",
  "level": "INFO",
  "service": "payment-processor",
  "message": "Payment processed",
  "endpoint": "/api/v1/payments",
  "httpMethod": "POST",
  "statusCode": 200,
  "latencyMs": 45,
  "userId": "u-abc123",
  "orderId": "ord-xyz789",
  "cold_start": false,
  "xray_trace_id": "1-abc123-def456"
}

Query 1: Error rate by endpoint in the last 60 minutes

fields @timestamp, endpoint, statusCode, level
| filter level in ["ERROR", "CRITICAL"] or statusCode >= 400
| stats
    count(*) as totalErrors,
    count_distinct(userId) as uniqueUsersAffected,
    max(latencyMs) as maxLatency
  by endpoint, statusCode
| sort totalErrors desc
| limit 20

Query 2: Latency percentiles by endpoint (timeseries, 5 min)

filter ispresent(latencyMs) and ispresent(endpoint)
| stats
    avg(latencyMs) as avgLatency,
    pct(latencyMs, 50) as p50,
    pct(latencyMs, 95) as p95,
    pct(latencyMs, 99) as p99,
    count(*) as requestCount
  by endpoint, bin(5m)
| sort bin desc

Query 3: Lambda cold starts — identify and measure impact

-- Usa campos especiais do Lambda REPORT
filter @type = "REPORT"
| stats
    count(*) as totalInvocations,
    sum(if(ispresent(@initDuration), 1, 0)) as coldStartCount,
    avg(@initDuration) as avgColdStartMs,
    max(@initDuration) as maxColdStartMs,
    avg(@duration) as avgDurationMs,
    avg(@maxMemoryUsed / 1000 / 1000) as avgMemoryMB,
    max(@memorySize / 1000 / 1000) - max(@maxMemoryUsed / 1000 / 1000)
        as overProvisionedMB
  by bin(1h)
| sort bin desc

Query 4: Trace a user's journey by orderId

fields @timestamp, level, message, endpoint, statusCode, latencyMs, orderId
| filter orderId = "ord-xyz789"
| sort @timestamp asc

Query 5: Top 10 errors grouped by message

filter level = "ERROR"
| stats count(*) as occurrences by message
| sort occurrences desc
| limit 10

Query 6: parse on unstructured logs — Lambda REPORT latency

-- Lambda REPORT logs: "REPORT RequestId: abc Duration: 123.45 ms Billed Duration: 200 ms ..."
filter @type = "REPORT"
| parse @message "Duration: * ms" as rawDuration
| stats
    avg(rawDuration) as avgDuration,
    pct(rawDuration, 99) as p99Duration,
    max(rawDuration) as maxDuration
  by bin(5m)
| sort bin desc

Query 7: dedup — last error occurrence per user

fields @timestamp, userId, message, endpoint, statusCode
| filter level = "ERROR" and ispresent(userId)
| sort @timestamp desc
| dedup userId
| limit 50

Query 8: Slow requests with retry deduplication

fields @timestamp, @requestId, endpoint, latencyMs, statusCode, userId
| filter latencyMs > 1000
| sort @timestamp desc
| dedup @requestId
| limit 20

CLI — Execute queries via AWS CLI

# 1. Iniciar query assíncrona
QUERY_ID=$(aws logs start-query \
  --log-group-names "/aws/lambda/payment-processor" \
  --start-time "$(date -u -d '1 hour ago' +%s 2>/dev/null || date -u -v-1H +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter level = "ERROR" or statusCode >= 400
    | stats count(*) as errorCount by endpoint, statusCode
    | sort errorCount desc
    | limit 10
  ' \
  --query 'queryId' \
  --output text)

echo "Query ID: $QUERY_ID"

# 2. Aguardar conclusão e obter resultados
sleep 5
aws logs get-query-results \
  --query-id "$QUERY_ID" \
  --query '{Status: status, Results: results}'

# 3. Query em múltiplos log groups simultaneamente
QUERY_ID2=$(aws logs start-query \
  --log-group-names \
      "/aws/lambda/payment-processor" \
      "/aws/lambda/order-service" \
      "/aws/lambda/notification-service" \
  --start-time "$(date -u -d '1 hour ago' +%s 2>/dev/null || date -u -v-1H +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string '
    filter level = "ERROR"
    | stats count(*) as errors by @log, bin(5m)
    | sort bin desc
  ' \
  --query 'queryId' \
  --output text)

# 4. Listar queries salvas no console
aws logs describe-query-definitions \
  --query 'queryDefinitions[*].{Name:name,QueryString:queryString}'

# 5. Salvar uma query reutilizável
aws logs put-query-definition \
  --name "payment-error-rate-by-endpoint" \
  --log-group-names "/aws/lambda/payment-processor" \
  --query-string '
    filter level = "ERROR" or statusCode >= 400
    | stats count(*) as errors, count(*) / 1 as errorRate by endpoint, bin(5m)
    | sort bin desc
  '

# 6. Verificar custo aproximado de scaneamento
# CloudWatch Logs → Insights → Histórico de queries no console
# Ou via CloudWatch Metrics:
aws cloudwatch get-metric-statistics \
  --namespace AWS/Logs \
  --metric-name DataScannedInBytes \
  --start-time "$(date -u -d '1 day ago' '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -v-1d '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 86400 \
  --statistics Sum \
  --query 'Datapoints[0].Sum'

CDK Dashboard — Widget with Logs Insights query

from aws_cdk import (
    Stack, Duration,
    aws_cloudwatch as cw,
)
from constructs import Construct


class LogsInsightsDashboardStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        dashboard = cw.Dashboard(
            self, "ApiDashboard",
            dashboard_name="api-insights",
        )

        # Widget de Logs Insights — timeseries de erros por endpoint
        dashboard.add_widgets(
            cw.LogQueryWidget(
                title="Error Rate by Endpoint (5m)",
                log_group_names=["/aws/lambda/payment-processor"],
                view=cw.LogQueryVisualizationType.LINE,
                width=24,
                height=6,
                query_lines=[
                    "filter level = 'ERROR' or statusCode >= 400",
                    "| stats count(*) as errors by endpoint, bin(5m)",
                    "| sort bin desc",
                ],
            ),
        )

        dashboard.add_widgets(
            # Widget de tabela — top erros
            cw.LogQueryWidget(
                title="Top Error Messages (last 1h)",
                log_group_names=["/aws/lambda/payment-processor"],
                view=cw.LogQueryVisualizationType.TABLE,
                width=12,
                height=6,
                query_lines=[
                    "filter level = 'ERROR'",
                    "| stats count(*) as occurrences by message",
                    "| sort occurrences desc",
                    "| limit 10",
                ],
            ),
            # Widget de barras — latência p99 por endpoint
            cw.LogQueryWidget(
                title="p99 Latency by Endpoint",
                log_group_names=["/aws/lambda/payment-processor"],
                view=cw.LogQueryVisualizationType.BAR,
                width=12,
                height=6,
                query_lines=[
                    "filter ispresent(latencyMs)",
                    "| stats pct(latencyMs, 99) as p99 by endpoint",
                    "| sort p99 desc",
                    "| limit 10",
                ],
            ),
        )

Common pitfalls

1. Cost per GB scanned — queries on large log groups
[FACT] Logs Insights charges $0.005/GB scanned (us-east-1). A 100 GB log group with 30-day retention can cost $0.50 per query if the time range is not restricted. Always use the smallest time range that answers your question. For frequent historical analyses, consider exporting logs to S3 and using Athena (cheaper for large volumes).

2. parse on JSON logs is unnecessary and slow
[FACT] Logs Insights already automatically indexes first-level fields from JSON logs. Using parse @message "\"latency\": *" as latency on a JSON log is redundant and uses more CPU. For JSON logs, use the fields directly.

3. sort and limit must come after the last stats
[FACT] The language requires that sort and limit appear after the last stats command. Placing sort before stats generates a syntax error or incorrect results.

4. Fields with special characters require backticks
[FACT] Fields containing hyphens, dots, or starting with a number need backticks. Example: a log with "detail-type" must be referenced as `detail-type` in the query. Fields like requestParameters.userName (nested) are accessed with dot-notation directly.

5. Queries on multiple log groups return the @log field
[FACT] When querying multiple log groups, the @log field identifies which log group each event came from — account-id:log-group-name. Without filtering by @log, you may inadvertently mix events from different services in the same aggregation.

6. count_distinct is approximate for high cardinality
[FACT] For fields with many unique values (e.g., userId), count_distinct uses HyperLogLog internally and may have an estimation error of ~2-3%. For exact counts on high-cardinality fields, export the data to Athena.


Reflection exercise

You have a log group /aws/lambda/checkout-service with structured JSON logs following the Powertools Logger pattern. The relevant fields are: level, endpoint, statusCode, latencyMs, userId, cartId, errorCode.

Write queries for:

  1. Availability SLA by endpoint: Return, for each endpoint, the total requests, total errors (statusCode >= 400 or level = ERROR), and success percentage — grouped by 1-hour windows over the last 24 hours. Sort by error rate descending.

  2. Affected user diagnosis: Given a specific userId (u-suspicious-123), list in chronological order all error events from the last 6 hours, showing @timestamp, endpoint, statusCode, errorCode, and cartId.

  3. Parse on legacy log: The legacy service emits text-format logs: "[2024-01-15 10:23:45] WARN /checkout/confirm - user=u-abc123 cart=cart-456 error=STOCK_DEPLETED". Write a query with parse (glob or regex) that extracts warnUser, warnCart, and warnError, and aggregates by warnError.


Resources for further study

  • [FACT] CloudWatch Logs Insights query syntax: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
  • [FACT] Sample queries (Lambda, VPC Flow, CloudTrail, API Gateway): https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html
  • [FACT] Supported logs and discovered fields: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData-discoverable-fields.html
  • [FACT] Functions reference (boolean, datetime, numeric): https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-operations-functions.html