Session 033 — DAX: architecture, use cases and when NOT to use

June 2, 2026

Dependencies: session-032-dynamodb-streams-lambda-cdc

Objective

By the end of this session, you will be able to identify the read patterns where DAX delivers real gains (repeated reads of the same item, read-heavy workloads), the cases where DAX does not help or is harmful (writes, strongly consistent reads, complex queries, cold data), calculate the cost-benefit of a DAX cluster versus the cost of the reads it replaces, and provision a DAX cluster via CDK Python.

Context

[FACT] Amazon DynamoDB Accelerator (DAX) is a fully managed in-memory cache, compatible with the DynamoDB API, designed to reduce read latency from single-digit milliseconds to microseconds — without changing the application code beyond swapping the client.

[CONSENSUS] DAX is not a generic cache: it understands the DynamoDB data model, distinguishes item cache from query cache, and implements write-through to maintain consistency between cache and table. This specialization is both its advantage and its limitation — it does not replace Redis/ElastiCache for use cases that require advanced data structures.

Key concepts

1. DAX cluster architecture

[FACT] A DAX cluster runs inside a VPC. It consists of a primary node (read + write) and up to 10 replica nodes (read-only). The application uses the DAX Client — a drop-in replacement for the standard DynamoDB client — that points to the cluster endpoint.

Application (EC2 / Lambda in the VPC)
        │
        ▼ DAX Client (endpoint: mycluster.abc123.dax-clusters.us-east-1.amazonaws.com)
┌───────────────────────────────────┐
│           DAX Cluster (VPC)       │
│  ┌─────────────┐  ┌────────────┐  │
│  │   Primary   │  │ Replica 1  │  │
│  │ read/write  │  │ read-only  │  │
│  └──────┬──────┘  └────────────┘  │
│         │  async replication      │
│  ┌──────▼──────┐  ┌────────────┐  │
│  │  Replica 2  │  │ Replica 3  │  │
│  └─────────────┘  └────────────┘  │
└───────────────────┬───────────────┘
                    │ cache miss / write-through
                    ▼
             DynamoDB Table

[FACT] The DAX Client performs intelligent load balancing across nodes — it routes reads to replicas and writes to the primary. The application does not need to know which node is being used.

[FACT] DAX only supports data operations — not table management. CreateTable, UpdateTable, DeleteTable, ListTables etc. must go directly to the standard DynamoDB client.

2. Item cache vs. Query cache

[FACT] DAX maintains two completely independent caches:

┌─────────────────────────────────────────────────────────────┐
│                        DAX                                  │
│                                                             │
│  ┌──────────────────────┐   ┌───────────────────────────┐   │
│  │      Item Cache      │   │        Query Cache        │   │
│  │                      │   │                           │   │
│  │ GetItem/BatchGetItem │   │ Query / Scan              │   │
│  │                      │   │                           │   │
│  │ Key: PK (+ SK)       │   │ Key: request parameters   │   │
│  │ Default TTL: 5 min   │   │ (KeyCondition, FilterExpr,│   │
│  │ Eviction: LRU        │   │ Limit, etc.)              │   │
│  │                      │   │ Default TTL: 5 min        │   │
│  │                      │   │ Eviction: LRU             │   │
│  └──────────────────────┘   └───────────────────────────┘   │
│                                                             │
│  Item cache writes do NOT invalidate the query cache        │
└─────────────────────────────────────────────────────────────┘

[FACT] Item cache TTL default: 5 minutes (configurable at cluster creation — cannot be changed afterward without recreating the cluster).
[FACT] Query cache TTL default: 5 minutes (equally configurable, independent of the item cache TTL).
[FACT] TTL = 0 for item cache: items are only removed by LRU eviction or by write-through. TTL = 0 for query cache: no Query/Scan results are stored.

Critical implication — inconsistency between caches:
[FACT] If an item is updated (write-through updates the item cache), the query cache that contains that item in a result set is NOT invalidated. The next Query may return the old result set from the query cache until it expires by TTL.
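
The stale-result effect can be reproduced with a toy model. The sketch below is not how DAX is implemented; `ToyDax`, `query_prefix`, and the `now` parameter are hypothetical names used only to show two independent caches, where a write refreshes the item cache while a previously cached result set survives until its TTL expires.

```python
import time
from typing import Optional


class ToyDax:
    """Toy model of two independent caches; not the real DAX implementation."""

    def __init__(self, table: dict, query_ttl: float = 300.0):
        self.table = table          # stands in for the DynamoDB table
        self.item_cache = {}        # key -> item
        self.query_cache = {}       # query parameter -> (expires_at, result set)
        self.query_ttl = query_ttl

    def put_item(self, key: str, item: dict) -> None:
        self.table[key] = item      # write-through: table first
        self.item_cache[key] = item # then the item cache; query cache untouched

    def get_item(self, key: str) -> dict:
        if key not in self.item_cache:      # cache miss: read the table
            self.item_cache[key] = self.table[key]
        return self.item_cache[key]

    def query_prefix(self, prefix: str, now: Optional[float] = None) -> list:
        now = time.monotonic() if now is None else now
        cached = self.query_cache.get(prefix)
        if cached and cached[0] > now:
            return cached[1]                # served from the query cache
        result = [dict(v) for k, v in sorted(self.table.items())
                  if k.startswith(prefix)]
        self.query_cache[prefix] = (now + self.query_ttl, result)
        return result


dax = ToyDax({"PRODUCT#1": {"price": 10}}, query_ttl=300.0)
first = dax.query_prefix("PRODUCT#", now=0.0)    # caches the result set
dax.put_item("PRODUCT#1", {"price": 12})         # refreshes the item cache only
stale = dax.query_prefix("PRODUCT#", now=60.0)   # old result set, TTL not reached
fresh = dax.query_prefix("PRODUCT#", now=301.0)  # TTL expired: re-read the table
print(dax.get_item("PRODUCT#1"), stale, fresh)
```

In this model `get_item` already returns the new price while the query at `now=60.0` still returns the old one, which is the inconsistency window described above.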

3. Write-through: what happens on each operation

[FACT] For write operations (PutItem, UpdateItem, DeleteItem, BatchWriteItem), the flow is:

Application
    │
    ▼ PutItem(item)
DAX Client
    │
    ├──▶ 1. Writes to DynamoDB (synchronously)
    │         │
    │         ├── Success → 2. Updates the item in the item cache
    │         │                    │
    │         │                    └──▶ Returns success to the application
    │         │
    │         └── Failure (throttle, etc.) → does NOT write to the cache
    │                                         Returns the error to the application
    │
    └── (query cache is NOT invalidated)

[FACT] If DynamoDB fails (including throttling), the item is not written to the cache. This prevents cache poisoning with data that was not persisted.
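
A minimal sketch of that ordering, assuming a hypothetical `write_through` helper and a `ThrottledError` class that stands in for a DynamoDB throttling error: the cache is touched only after the table write returns, so a failed write leaves the cached value as it was.

```python
class ThrottledError(Exception):
    """Stands in for a DynamoDB throttling error (hypothetical name)."""


def write_through(table_put, item_cache: dict, key: str, item: dict) -> None:
    """Persist first; update the cache only if the table write succeeded."""
    table_put(key, item)     # raises on failure, so the next line is skipped
    item_cache[key] = item


store, cache = {}, {}


def ok_put(key, item):
    store[key] = item


def throttled_put(key, item):
    raise ThrottledError("write was throttled")


write_through(ok_put, cache, "PRODUCT#1", {"price": 10})
try:
    write_through(throttled_put, cache, "PRODUCT#1", {"price": 99})
except ThrottledError:
    pass  # the caller sees the error; the cache was never updated

print(cache)
```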

4. Behavior with strongly consistent reads and transactions

[FACT] DAX does not serve strongly consistent reads from the cache:

Operation                       DAX behavior
─────────────────────────────────────────────────────────────
GetItem (eventually consistent) Cache hit → returned from the cache
GetItem (strongly consistent)   Passed through to DynamoDB;
                                result is NOT stored in the cache
Query (eventually consistent)   Cache hit → returned from the cache
Query (strongly consistent)     Passed through to DynamoDB;
                                result is NOT stored in the cache
TransactGetItems                Passed through to DynamoDB;
                                result is NOT stored in the cache
TransactWriteItems              Passed through to DynamoDB; DAX then
                                refreshes the item cache in the background

[FACT] According to the AWS documentation on DAX consistency, TransactWriteItems is written through DAX: once DynamoDB accepts the transaction, DAX calls TransactGetItems in the background for each item to repopulate the item cache, which consumes additional read capacity. Until that refresh completes, the item cache may briefly hold the previous version of the items, and cached Query results are not invalidated.
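
One way to act on this table is to decide per call which client to use. The sketch below is an assumption about application design, not part of any AWS SDK: `choose_client` is a hypothetical helper that sends cache-friendly operations to DAX and sends strongly consistent reads and transactions to the standard DynamoDB client, where the cache adds nothing. Scan and control-plane calls are deliberately left out and raise an error.

```python
READS = {"GetItem", "BatchGetItem", "Query"}
WRITES = {"PutItem", "UpdateItem", "DeleteItem", "BatchWriteItem"}
TRANSACTIONS = {"TransactGetItems", "TransactWriteItems"}


def choose_client(operation: str, consistent_read: bool = False) -> str:
    """Return "dax" or "dynamodb" for a data-plane operation."""
    if operation in TRANSACTIONS:
        return "dynamodb"   # design choice: keep transactions off the cluster
    if operation in READS:
        # Strongly consistent reads are never served from the cache.
        return "dynamodb" if consistent_read else "dax"
    if operation in WRITES:
        return "dax"        # write-through keeps the item cache current
    raise ValueError(f"not routed by this sketch: {operation}")


print(choose_client("GetItem"))
print(choose_client("GetItem", consistent_read=True))
print(choose_client("TransactWriteItems"))
```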

5. When to use and when NOT to use DAX

[FACT — AWS Docs] Scenarios suitable for DAX:

✅ Use DAX when:
─────────────────────────────────────────────────────────────
Read-heavy (10:1 or more)   High read/write ratio → high hit rate
Hot data                    The same items are read repeatedly
Microsecond latency         Sub-1 ms SLA for reads
Traffic bursts              DAX absorbs spikes; DynamoDB scales behind it
> 3,000 RCU on one hot key  DAX offloads reads beyond the per-partition limit
Eventually consistent OK    Tolerance for slightly stale data

[FACT — AWS Docs] Scenarios unsuitable for DAX:

❌ Do NOT use DAX when:
─────────────────────────────────────────────────────────────
Write-heavy                 Writes use more DAX resources than reads;
                            the caching benefit is minimal
Cold data / long tail       Low hit rate → cost without benefit
Strongly consistent reads   DAX passes them to DynamoDB; cluster resources are wasted
TransactGetItems/WriteItems Transactions gain little or nothing from the cache
Bulk scans (ETL)            Full-table scans should go directly to DynamoDB
Compliance (e.g., SOC)      DAX does not hold every accreditation DynamoDB has
Highly variable data        Practically every access is a cache miss

[OPINION — AWS Docs] For cases that need caching but with advanced structures (lists, sorted sets, pub/sub, Lua scripting), AWS itself recommends Amazon ElastiCache (Redis OSS) as an alternative to DAX.
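
The "cost without benefit" row can be made concrete with a simplified break-even model. The sketch below assumes that every cache hit removes one billed DynamoDB read and that the cluster cost is fixed; it ignores write traffic, cache-miss overhead, and the latency benefit itself. The function names and the dollar inputs are hypothetical illustrations, not AWS prices.

```python
def break_even_hit_rate(dax_monthly_cost: float, read_monthly_cost: float) -> float:
    """Hit rate at which avoided read charges equal the cluster cost.

    A result above 1.0 means DAX cannot pay for itself on read savings alone.
    """
    if read_monthly_cost <= 0:
        raise ValueError("read_monthly_cost must be positive")
    return dax_monthly_cost / read_monthly_cost


def monthly_saving(hit_rate: float, dax_monthly_cost: float,
                   read_monthly_cost: float) -> float:
    """Net saving (negative = net cost) for a given cache hit rate."""
    return hit_rate * read_monthly_cost - dax_monthly_cost


# Illustrative inputs only: a $1,000/month cluster in front of $4,000/month of reads.
print(break_even_hit_rate(1000.0, 4000.0))
print(monthly_saving(0.10, 1000.0, 4000.0))  # cold data: the cluster costs more than it saves
```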

6. Node types and sizing

[FACT] DAX offers instance families dedicated to DynamoDB acceleration:

Node type        vCPU  Memory   Typical use
──────────────────────────────────────────────────────
dax.r4.large        2   15 GB   Dev/test, light workloads
dax.r4.xlarge       4   30 GB   Small/medium production
dax.r4.4xlarge     16  122 GB   High scale
dax.r5.large        2   16 GB   Current generation, replaces r4
dax.r5.xlarge       4   32 GB   Standard production
dax.r5.4xlarge     16  128 GB   High scale, current generation
dax.r5.24xlarge    96  768 GB   Extreme workloads

[FACT] The node's memory determines how many items fit in the cache. Correct sizing requires estimating the working set: average size of hot items × number of unique hot items, compared against the memory of a single node (every node holds a full copy of the cache).
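
A back-of-the-envelope version of that estimate, as a sketch: `working_set_gib` and `fits_in_node` are hypothetical helpers, and the `usable_fraction` of 0.8 is an assumption standing in for memory that is not available to cached items, not a documented DAX figure.

```python
def working_set_gib(avg_item_bytes: int, hot_items: int) -> float:
    """Estimated size of the hot working set in GiB."""
    return avg_item_bytes * hot_items / 2**30


def fits_in_node(node_memory_gib: float, avg_item_bytes: int, hot_items: int,
                 usable_fraction: float = 0.8) -> bool:
    """True if the working set fits in the assumed usable share of node memory."""
    return working_set_gib(avg_item_bytes, hot_items) <= node_memory_gib * usable_fraction


# Example: one million hot items of 2 KiB each against a 16 GB node.
print(working_set_gib(2048, 1_000_000))
print(fits_in_node(16, 2048, 1_000_000))
```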

[CONSENSUS] For high availability in production: minimum 3 nodes (1 primary + 2 replicas), distributed across 3 different AZs. With 1 node, any node failure takes DAX offline; reads keep working only if the application itself falls back to the standard DynamoDB client, and then at DynamoDB latency.
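
A fallback path can be sketched at the application level. The helper below is a hypothetical wrapper, not a feature of the DAX client: it takes two zero-argument callables (for example, closures around `get_item` on a DAX table and on a plain DynamoDB table) and tries the second only if the first raises. Catching a broad `Exception` keeps the sketch short; real code would narrow it to connection and timeout errors.

```python
from typing import Callable, Optional


def read_with_fallback(dax_read: Callable[[], Optional[dict]],
                       ddb_read: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Try the DAX read first; on any error, retry against DynamoDB directly."""
    try:
        return dax_read()
    except Exception:       # assumption: narrow this in production code
        return ddb_read()   # slower, but keeps reads available


def broken_dax_read():
    raise ConnectionError("DAX cluster unreachable")


def direct_read():
    return {"PK": "PRODUCT#1", "price": 10}


print(read_with_fallback(broken_dax_read, direct_read))
```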

Practical example

Scenario: e-commerce product catalog — repeated reads on popular items

Table: products-table (pay-per-request)
Pattern: 95% of reads on ~5% of products (popular products)
SLA: p99 < 1ms for GetItem

CDK Python — DAX Cluster + table

from aws_cdk import (
    Stack, Duration, RemovalPolicy,
    aws_dynamodb as dynamodb,
    aws_dax as dax,
    aws_ec2 as ec2,
    aws_iam as iam,
)
from constructs import Construct


class DaxProductsCatalogStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

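        # ── VPC with private subnets ──────────────────────────────────
        # The default VPC has only public subnets, so vpc.private_subnets
        # would be empty there. Create a VPC spanning up to 3 AZs instead
        # (an environment-agnostic stack is limited to 2 AZs).
        vpc = ec2.Vpc(self, "VPC", max_azs=3)

        # ── DynamoDB table ────────────────────────────────────────────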
        products_table = dynamodb.Table(
            self, "ProductsTable",
            table_name="products-table",
            partition_key=dynamodb.Attribute(
                name="PK", type=dynamodb.AttributeType.STRING
            ),
            sort_key=dynamodb.Attribute(
                name="SK", type=dynamodb.AttributeType.STRING
            ),
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
            removal_policy=RemovalPolicy.DESTROY,
        )

        # ── IAM role for the DAX cluster ──────────────────────────────
        dax_role = iam.Role(
            self, "DaxRole",
            assumed_by=iam.ServicePrincipal("dax.amazonaws.com"),
        )
        products_table.grant_read_write_data(dax_role)

        # ── Security group for the DAX cluster ────────────────────────
        dax_sg = ec2.SecurityGroup(
            self, "DaxSG",
            vpc=vpc,
            description="DAX cluster SG",
            allow_all_outbound=True,
        )
        # DAX listens on 8111 without TLS and on 9111 with TLS. The cluster
        # below enables TLS, so open 9111 to clients inside the VPC.
        dax_sg.add_ingress_rule(
            peer=ec2.Peer.ipv4(vpc.vpc_cidr_block),
            connection=ec2.Port.tcp(9111),
            description="DAX TLS access from VPC",
        )

        # ── Subnet group for the cluster ──────────────────────────────
        private_subnet_ids = [
            subnet.subnet_id
            for subnet in vpc.private_subnets
        ]

        dax_subnet_group = dax.CfnSubnetGroup(
            self, "DaxSubnetGroup",
            subnet_group_name="dax-products-subnet-group",
            subnet_ids=private_subnet_ids,
            description="Subnets for DAX cluster",
        )

        # ── Parameter group (custom TTLs) ─────────────────────────────
        # item cache TTL: 10 min (products change rarely)
        # query cache TTL: 2 min (listings change more often)
        dax_param_group = dax.CfnParameterGroup(
            self, "DaxParamGroup",
            parameter_group_name="dax-products-params",
            parameter_name_values={
                "record-ttl-millis":    "600000",  # 10 min
                "query-ttl-millis":     "120000",  # 2 min
            },
        )

        # ── DAX cluster (3 nodes, multi-AZ HA) ────────────────────────
        dax_cluster = dax.CfnCluster(
            self, "DaxCluster",
            cluster_name="products-dax-cluster",
            node_type="dax.r5.large",
            replication_factor=3,          # 1 primary + 2 replicas
            iam_role_arn=dax_role.role_arn,
            subnet_group_name=dax_subnet_group.subnet_group_name,
            security_group_ids=[dax_sg.security_group_id],
            parameter_group_name=dax_param_group.parameter_group_name,
            # Server-side encryption at rest
            sse_specification=dax.CfnCluster.SSESpecificationProperty(
                sse_enabled=True,
            ),
            # Encryption in transit (TLS)
            cluster_endpoint_encryption_type="TLS",
            # Maintenance window
            preferred_maintenance_window="sun:05:00-sun:06:00",
        )
        dax_cluster.add_dependency(dax_subnet_group)
        dax_cluster.add_dependency(dax_param_group)

Python — Using the DAX Client (boto3 + amazon-dax-client)

# pip install amazon-dax-client
from datetime import datetime, timezone
from decimal import Decimal

import amazondax
import boto3

# ── Initialization: DAX client for ordinary reads/writes ─────────────
# "daxs://" selects TLS; the host below is a placeholder for the
# cluster discovery endpoint.
DAX_ENDPOINT = "daxs://products-dax-cluster.abc123.dax-clusters.us-east-1.amazonaws.com"

dax_client = amazondax.AmazonDaxClient.resource(
    endpoint_url=DAX_ENDPOINT,
    region_name="us-east-1",
)
dax_table = dax_client.Table("products-table")

# ── Direct DynamoDB client (strongly consistent reads + transactions) ─
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
ddb_table = dynamodb.Table("products-table")


def get_product(product_id: str) -> dict | None:
    """
    Eventually consistent read: goes to DAX (item cache).
    Cache hits return in microseconds once the cache is warm.
    """
    response = dax_table.get_item(
        Key={"PK": f"PRODUCT#{product_id}", "SK": "METADATA"}
    )
    return response.get("Item")


def get_product_consistent(product_id: str) -> dict | None:
    """
    Strongly consistent read: BYPASSES DAX and goes to DynamoDB.
    Use only when the latest committed value is required.
    NOTE: the result is NOT stored in the DAX cache.
    """
    response = ddb_table.get_item(
        Key={"PK": f"PRODUCT#{product_id}", "SK": "METADATA"},
        ConsistentRead=True,
    )
    return response.get("Item")


def update_product_price(product_id: str, new_price: float) -> None:
    """
    Write-through: DAX writes to DynamoDB first,
    then updates the item cache.
    Query cache entries whose result sets include this product are NOT invalidated.
    """
    dax_table.update_item(
        Key={"PK": f"PRODUCT#{product_id}", "SK": "METADATA"},
        UpdateExpression="SET price = :p, updatedAt = :ts",
        ExpressionAttributeValues={
            # boto3 rejects float; Decimal stores the price as a number
            ":p": Decimal(str(new_price)),
            ":ts": datetime.now(timezone.utc).isoformat(),
        },
    )


def list_products_by_category(category: str, limit: int = 20) -> list:
    """
    Query: served from the DAX query cache.
    High hit rate for popular categories.
    WARNING: results can be stale by up to query-ttl-millis.
    Assumes a GSI named CategoryIndex keyed on "category";
    the CDK stack above does not define it.
    """
    response = dax_table.query(
        IndexName="CategoryIndex",
        KeyConditionExpression="category = :cat",
        ExpressionAttributeValues={":cat": category},
        Limit=limit,
    )
    return response.get("Items", [])


def bulk_scan_for_etl() -> list:
    """
    Full scan for ETL: uses DynamoDB DIRECTLY.
    Skipping DAX avoids polluting the cache with cold data
    and does not waste cluster resources.
    """
    items = []
    last_key = None
    while True:
        kwargs = {"Limit": 1000}
        if last_key:
            kwargs["ExclusiveStartKey"] = last_key
        response = ddb_table.scan(**kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break
    return items

CLI — Provision cluster and verify metrics

# 1. Create the subnet group
aws dax create-subnet-group \
  --subnet-group-name dax-products-subnet-group \
  --subnet-ids subnet-abc123 subnet-def456 subnet-ghi789

# 2. Create a parameter group with custom TTLs
aws dax create-parameter-group \
  --parameter-group-name dax-products-params

aws dax update-parameter-group \
  --parameter-group-name dax-products-params \
  --parameter-name-values \
      "ParameterName=record-ttl-millis,ParameterValue=600000" \
      "ParameterName=query-ttl-millis,ParameterValue=120000"

# 3. Create the cluster (3 nodes, TLS, SSE)
aws dax create-cluster \
  --cluster-name products-dax-cluster \
  --node-type dax.r5.large \
  --replication-factor 3 \
  --iam-role-arn arn:aws:iam::123456789012:role/DaxRole \
  --subnet-group-name dax-products-subnet-group \
  --security-group-ids sg-abc123 \
  --parameter-group-name dax-products-params \
  --sse-specification Enabled=true \
  --cluster-endpoint-encryption-type TLS

# 4. Check the cluster status
aws dax describe-clusters \
  --cluster-names products-dax-cluster \
  --query 'Clusters[0].{Status:Status, Endpoint:ClusterDiscoveryEndpoint, Nodes:Nodes[*].{Id:NodeId,Status:NodeStatus,AZ:AvailabilityZone}}'

# 5. Monitor the cache hit ratio (key DAX health metric)
#    DAX metrics are keyed by the ClusterId dimension.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DAX \
  --metric-name ItemCacheHits \
  --dimensions Name=ClusterId,Value=products-dax-cluster \
  --start-time "$(date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 300 \
  --statistics Sum

aws cloudwatch get-metric-statistics \
  --namespace AWS/DAX \
  --metric-name ItemCacheMisses \
  --dimensions Name=ClusterId,Value=products-dax-cluster \
  --start-time "$(date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 300 \
  --statistics Sum

# 6. Compute the hit rate manually:
#    HitRate = ItemCacheHits / (ItemCacheHits + ItemCacheMisses)
#    Healthy target: > 80% for read-heavy workloads

# 7. Check for throttling on the cluster
aws cloudwatch get-metric-statistics \
  --namespace AWS/DAX \
  --metric-name ThrottledRequestCount \
  --dimensions Name=ClusterId,Value=products-dax-cluster \
  --start-time "$(date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 300 \
  --statistics Sum
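
The hit-rate formula from step 6 can be applied to the output of the two metric queries. The sketch below assumes each input is shaped like the `Datapoints` list that `get-metric-statistics` returns when called with `--statistics Sum` (a list of objects with a `Sum` field); `item_cache_hit_rate` is a hypothetical helper and the sample numbers are made up.

```python
from typing import Optional


def item_cache_hit_rate(hit_datapoints: list, miss_datapoints: list) -> Optional[float]:
    """HitRate = hits / (hits + misses), summed over all periods.

    Returns None when there was no read traffic in the window.
    """
    hits = sum(dp["Sum"] for dp in hit_datapoints)
    misses = sum(dp["Sum"] for dp in miss_datapoints)
    total = hits + misses
    return hits / total if total else None


# Made-up sample: two 5-minute periods of ItemCacheHits and ItemCacheMisses.
hits = [{"Sum": 900.0}, {"Sum": 700.0}]
misses = [{"Sum": 100.0}, {"Sum": 300.0}]
print(item_cache_hit_rate(hits, misses))
```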

Common pitfalls

1. Query cache is not invalidated by writes
[FACT] Write-through updates the item cache of the changed item. But if a previous Query cached a result set that contained that item, that result set remains in the query cache until it expires by TTL. Applications that require immediate consistency between a write and a Query result should go directly to DynamoDB or use a very low TTL on the query cache.

2. TransactWriteItems updates the item cache asynchronously
[FACT] DAX passes TransactWriteItems to DynamoDB and then repopulates the item cache by calling TransactGetItems in the background for each item in the transaction, which consumes additional read capacity. Until that refresh lands, DAX may serve the previous version of the affected items, and cached Query results stay stale until their TTL expires. If this is unacceptable, use a strongly consistent read after the transaction (directly through the DynamoDB client).

3. Using DAX Client for ETL scans
[CONSENSUS] Full table scans via DAX waste cluster resources (they pollute the query cache with data that will not be reused) and increase latency for other requests. Always use the standard DynamoDB client for batch/ETL operations.

4. Cluster in a single AZ without replicas
[CONSENSUS] A cluster with replication-factor=1 (only the primary node) offers zero HA. Any maintenance or failure makes the cluster unavailable. The DAX Client does not fall back to DynamoDB on its own, so the application must route reads to the standard DynamoDB client during the outage and accept elevated latency until the cluster recovers. In production: minimum 3 nodes across 3 AZs.

5. Item cache TTL too high for frequently changing data
[CONSENSUS] The 5-minute default is reasonable for many cases, but data with a high update rate (e.g., real-time inventory, flash sale prices) requires a lower TTL or direct use of DynamoDB. The cost of staleness must be weighed against the latency benefit.

6. DAX Client does not support all DynamoDB operations
[FACT] Table control operations (CreateTable, DescribeTable, UpdateTable, etc.) are not supported by the DAX Client. The application needs to maintain two clients: DAX Client for data operations and the direct DynamoDB Client for control plane operations.

Reflection exercise

You are evaluating whether to add DAX to an e-commerce application with the following characteristics:

500,000 GetItem/s at peak (popular products: top 1,000 SKUs account for 80% of reads)
10,000 UpdateItem/s (real-time inventory updates)
SLA: p99 < 5ms for catalog reads, p99 < 50ms for inventory reads
Inventory must reflect changes within 1 second (to avoid oversell)

Answer:

For which of the two operations (catalog vs inventory) is DAX more suitable and why? What determines this trade-off?
The inventory SLA requires that reads reflect changes within 1 second. Could you use DAX for these reads? If yes, how would you configure the TTL? If not, what would be the alternative?
Considering that the DAX cluster (3x dax.r5.xlarge) costs approximately $0.65/hour per node (total ~$1.95/hour or ~$1,400/month), and that serving the same reads without DAX would cost approximately $5,400/month in pay-per-request mode (take both figures as given for this exercise), what would be the minimum break-even cache hit rate for DAX to be financially justifiable?

Resources for further study

[FACT] DAX: How it works (item cache, query cache, write-through): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.concepts.html
[FACT] Evaluating DAX suitability: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/evaluate-dax-suitability.html
[FACT] DAX and DynamoDB consistency models: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.consistency.html
[FACT] DAX cluster sizing guide: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.sizing-guide.html
[FACT] DAX metrics and dimensions (CloudWatch): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/dax-metrics-dimensions-dax.html