luizmachado.dev

PT EN

Session 029 — DynamoDB: access patterns first and generic PK/SK

Estimated duration: 60 minutes
Prerequisites: session-020-lambda-event-source-mappings


Objective

By the end, you will be able to list all access patterns of an application before writing
any schema, design generic PK and SK to support multiple entity types
(e.g.: PK = USER#123, SK = PROFILE), and implement GetItem by PK and Query by SK prefix in
code.


Context

[FACT] DynamoDB is a NoSQL key-value and document database, fully managed by AWS, that provides
single-digit millisecond latencies at any scale. It was originally described in the paper
"Dynamo: Amazon's Highly Available Key-value Store" (2007) and became one of the most used
services on AWS for applications that need predictable throughput without the operational costs
of a relational database.

[FACT] The fundamental difference between modeling for SQL and modeling for DynamoDB is not
technical — it's philosophical. In SQL, you normalize the data first and write the queries later
("queries follow the data"). In DynamoDB, it's the opposite: you document the queries (access
patterns) first and design the schema to serve exactly those queries ("data follows the queries").
Starting with the schema in DynamoDB almost always results in designs that require expensive Scans
or multiple queries where one would suffice.

[OPINION] Rick Houlihan (former AWS, principal DynamoDB evangelist) articulates this as: "If you
don't know your access patterns before you design your DynamoDB table, you're going to have a bad
time." — this position is consensus in the DynamoDB community, documented in numerous re:Invent
talks and in The DynamoDB Book by Alex DeBrie.


Main concepts

1. The "access patterns first" paradigm — why queries before schema

In a relational database, the design process is: entities → normalization → tables → indexes.
Queries are written afterwards, and the query optimizer (query planner) decides how to execute
them efficiently using indexes, joins, and dynamic execution plans.

DynamoDB has no query planner. Each read operation must specify exactly how to reach the data:
via primary key (GetItem), via partition with a sort key filter (Query), or via a full table scan
(Scan). Scan is the only operation that "finds" data without knowing where it is — and it's
prohibitively expensive on large tables.

[FACT] The correct process for DynamoDB is:

Passo 1: Identificar entidades do domínio
         (ex: User, Order, Product, Review, Session)

Passo 2: Identificar os relacionamentos entre entidades
         (ex: User tem muitos Orders; Order tem muitos OrderItems)

Passo 3: Documentar TODOS os access patterns necessários
         (ex: "buscar usuário por ID", "listar pedidos de um usuário",
          "buscar pedidos de um usuário por data", "buscar itens de um pedido")

Passo 4: SOMENTE ENTÃO projetar PK, SK e índices secundários
         para atender cada access pattern com uma única operação

[FACT] An access pattern documents: who is reading, what is being read, and which
filters/orderings
are needed. A table typically has 10–30 access patterns. Each pattern becomes
a DynamoDB operation — GetItem, Query, or (rarely) Scan.

Example access pattern list for e-commerce:

# Access Pattern DynamoDB Operation
AP1 Get user by ID GetItem
AP2 List all orders for a user Query
AP3 List orders for a user in the last 30 days Query (BETWEEN on SK)
AP4 Get specific order by ID GetItem
AP5 List items of an order Query (begins_with on SK)
AP6 Get product by ID GetItem
AP7 List reviews for a product Query
AP8 Get specific review from a user on a product GetItem

This list must be complete and stable before starting the design. Adding a new access pattern
after creating the table usually requires creating a new secondary index (Global Secondary
Index — GSI), which has storage and write throughput costs.

2. Partition key and sort key — how DynamoDB stores and accesses data

[FACT] DynamoDB has two types of primary key:

Chave simples (simple primary key):
  Apenas partition key (PK)
  → identifica unicamente um item
  → só suporta GetItem (busca por valor exato)

Chave composta (composite primary key):
  Partition key (PK) + Sort key (SK)
  → PK + SK juntos identificam unicamente um item
  → múltiplos itens podem ter o mesmo PK, com SKs diferentes
  → suporta GetItem (PK + SK exatos) e Query (PK exato + condição no SK)

[FACT] DynamoDB uses the partition key value as input to an internal hash function to determine
which physical partition the item will be stored in. Items with the same PK value are stored
in the same partition, sorted by the SK value. This grouping is called an item collection.

Partição física para PK = "USER#123":
  ┌──────────────┬──────────────────────────────────────────┐
  │ PK           │ SK                   │ outros atributos  │
  ├──────────────┼──────────────────────┼───────────────────┤
  │ USER#123     │ ORDER#2024-01-15#001  │ total: 150.00    │
  │ USER#123     │ ORDER#2024-01-20#002  │ total: 89.90     │
  │ USER#123     │ ORDER#2024-03-05#003  │ total: 230.00    │
  │ USER#123     │ PROFILE              │ name: João Silva  │
  │ USER#123     │ SESSION#abc123       │ expires: ...      │
  └──────────────┴──────────────────────┴───────────────────┘
  ↑ todos os itens com PK="USER#123" ficam juntos, ordenados por SK

[FACT] The Query operation allows fetching all items from a partition (fixed PK) and
optionally filtering by SK using range conditions. Since items are stored sorted by SK, these
operations are O(log n) in the number of items in the partition — efficient and predictable.

[FACT] The sort key has a type — String, Number, or Binary. For Strings, the ordering is
lexicographic (UTF-8). This means "2024-01-15" sorts correctly as a date if it's in ISO
8601 format (year-month-day), but "15/01/2024" (day/month/year) does not sort correctly as a
date. Using timestamps in ISO 8601 format or Unix epoch in the SK is a canonical practice.

3. Generic PK and SK — the naming pattern and entity prefixes

[FACT] In single-table designs, it's standard practice to use generic names PK and SK for
the primary key attributes, instead of semantic names like userId or orderId. This is because
the same PK column will store values of completely different types (USER#123, ORDER#456,
PRODUCT#789) — a semantic name would be misleading.

Entity prefixes serve to:

1. Distinguir tipos diferentes de entidade na mesma tabela
   USER#123 vs ORDER#123  →  evita colisões de ID entre tipos

2. Permitir Query por tipo usando begins_with
   SK begins_with "ORDER#"  →  retorna apenas pedidos do usuário

3. Documentar o intent do schema para outros desenvolvedores
   PK = "USER#abc123"  →  imediatamente legível o que esse item representa

4. Prevenir bugs de "tipo errado"
   Se GetItem retorna um item com PK="ORDER#456" mas você esperava USER,
   o prefixo deixa o erro óbvio

Prefix convention:

Entidade    PK value          SK value
─────────────────────────────────────────────────────────────
User        USER#<userId>     PROFILE
Order       USER#<userId>     ORDER#<orderId>
OrderItem   ORDER#<orderId>   ITEM#<productId>
Product     PRODUCT#<pid>     METADATA
Review      PRODUCT#<pid>     REVIEW#<userId>
Session     USER#<userId>     SESSION#<sessionId>

[FACT] The # separator is conventional but arbitrary — you can use any character that doesn't
appear in the IDs. The # is widely used because it's easy to read and doesn't cause problems
in URLs or JSON.

[FACT] When PK and SK have the same value (e.g.: PK = "USER#123", SK = "USER#123"), the item
is called the root item of the partition. This pattern is used when you want to ensure that
a "header" item always exists for each entity, even when other items in the collection (orders,
sessions) haven't been created yet.

4. Item collections — grouping and data locality

[FACT] An item collection is the set of all items that share the same partition key value.
It's the fundamental unit of "data locality" in DynamoDB — all items in a collection reside in
the same physical partition and can be retrieved with a single Query operation.

Item collection para PK = "USER#123":
  → PROFILE (dados do usuário)
  → ORDER#2024-01-15#001 (pedido de janeiro)
  → ORDER#2024-01-20#002 (pedido de janeiro)
  → ORDER#2024-03-05#003 (pedido de março)
  → SESSION#abc123 (sessão ativa)

Custo de recuperar perfil + todos os pedidos:
  SQL: JOIN entre tabela users e orders → 2 queries ou 1 com JOIN
  DynamoDB: 1 Query com PK="USER#123" → 1 operação, 0.5 RCU (se < 4KB)

[FACT] The size limit of an item collection is 10 GB per partition key value (only for tables
with LSI — Local Secondary Index). For tables without LSI, there is no size limit per partition
key.

[FACT] An individual item in DynamoDB has a limit of 400 KB. Very large attributes (e.g.:
images, PDFs, blobs) should be stored in S3, with DynamoDB storing only the reference (URL or S3
key).

5. GetItem and Query operations — implementation and semantics

[FACT] GetItem is DynamoDB's most efficient operation — O(1), always reads exactly 1 item
given PK + SK. Never does a Scan. Cost: 0.5 RCU per 4KB in eventual consistency or 1 RCU in
strong consistency.

[FACT] Query requires an exact PK and optionally a condition on the SK. It supports the
following operators on the sort key:

=           → valor exato
<, <=       → menor (ou igual) a um valor
>, >=       → maior (ou igual) a um valor
BETWEEN x AND y → valor entre x e y (inclusivo)
begins_with(SK, "prefix") → SK começa com o prefixo

[FACT] Query returns results sorted by SK in ascending order by default (lexicographic for
String, numeric for Number). For descending order, use ScanIndexForward=False.

[FACT] A single Query call returns at most 1 MB of data. If the collection has more than
1 MB, the response includes LastEvaluatedKey — a pagination token. To fetch all results, you
need to make paginated calls until LastEvaluatedKey is null.

FilterExpression vs KeyConditionExpression:

[FACT] This distinction is critical for understanding cost in DynamoDB:

KeyConditionExpression:
  → aplicado ANTES de ler os dados
  → usa PK e SK para identificar quais itens ler
  → você paga apenas pelos dados que correspondem à condição de chave

FilterExpression:
  → aplicado DEPOIS de ler os dados (no lado do DynamoDB, mas após consumir RCUs)
  → filtra atributos não-chave
  → você PAGA pelos dados lidos, mesmo os filtrados de volta
  → útil para refinar resultados, mas NÃO reduz custo de leitura

Example: Query with PK = "USER#123" returns 100 items (500 KB). If a FilterExpression filters
out 90 of them, you paid for 500 KB but received only 50 KB of results. To avoid this, the filter
should always be on the SK (via begins_with or BETWEEN).


Practical example

Scenario: blog platform with users, posts, and comments. Identified access patterns:

AP1: Buscar usuário por userId
AP2: Buscar post por postId
AP3: Listar todos os posts de um usuário (mais recente primeiro)
AP4: Listar posts de um usuário no último mês
AP5: Listar comentários de um post
AP6: Buscar comentário específico (postId + userId)
AP7: Buscar perfil + últimos 5 posts do usuário (transação)

Designed schema:

Tabela: blog-single-table
Chave primária composta: PK (String) + SK (String)

PK value             SK value                    Representa
──────────────────────────────────────────────────────────────────────
USER#u123            PROFILE                     Perfil do usuário u123
USER#u123            POST#2024-03-15T10:00:00    Post de 15/03 do u123
USER#u123            POST#2024-03-20T14:30:00    Post de 20/03 do u123
USER#u123            POST#2024-04-01T09:15:00    Post de 01/04 do u123
POST#p456            METADATA                    Metadados do post p456
POST#p456            COMMENT#u789                Comentário do user u789
POST#p456            COMMENT#u321                Comentário do user u321

How each access pattern is served:

AP1 (buscar usuário):
  GetItem(PK="USER#u123", SK="PROFILE")

AP2 (buscar post):
  GetItem(PK="POST#p456", SK="METADATA")

AP3 (listar posts de usuário, mais recente primeiro):
  Query(PK="USER#u123",
        KeyConditionExpression="PK = :pk AND begins_with(SK, :prefix)",
        ExpressionAttributeValues={":pk": "USER#u123", ":prefix": "POST#"},
        ScanIndexForward=False)   ← ordem decrescente (mais recente primeiro)

AP4 (posts do último mês — abril 2024):
  Query(PK="USER#u123",
        KeyConditionExpression="PK = :pk AND SK BETWEEN :start AND :end",
        ExpressionAttributeValues={
          ":pk": "USER#u123",
          ":start": "POST#2024-04-01",
          ":end": "POST#2024-04-30T23:59:59"
        })

AP5 (comentários de um post):
  Query(PK="POST#p456",
        KeyConditionExpression="PK = :pk AND begins_with(SK, :prefix)",
        ExpressionAttributeValues={":pk": "POST#p456", ":prefix": "COMMENT#"})

AP6 (comentário específico):
  GetItem(PK="POST#p456", SK="COMMENT#u789")

Python implementation with boto3

import boto3
from boto3.dynamodb.conditions import Key
from datetime import datetime, timedelta, timezone

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("blog-single-table")


# AP1: Buscar usuário por ID (GetItem — O(1), sem Scan)
def get_user(user_id: str) -> dict | None:
    response = table.get_item(
        Key={
            "PK": f"USER#{user_id}",
            "SK": "PROFILE",
        },
        ConsistentRead=False,  # eventual consistency (0.5 RCU) — adequado para leitura de perfil
    )
    return response.get("Item")


# AP3: Listar todos os posts de um usuário, do mais recente para o mais antigo
def list_user_posts(user_id: str, limit: int = 20) -> list[dict]:
    response = table.query(
        KeyConditionExpression=Key("PK").eq(f"USER#{user_id}") & Key("SK").begins_with("POST#"),
        ScanIndexForward=False,   # ordem decrescente → mais recente primeiro
        Limit=limit,
    )
    return response.get("Items", [])


# AP4: Posts de um usuário dentro de um range de datas
def list_user_posts_in_range(user_id: str, start: datetime, end: datetime) -> list[dict]:
    # ISO 8601 com timezone UTC para ordenação lexicográfica correta
    start_sk = "POST#" + start.strftime("%Y-%m-%dT%H:%M:%S")
    end_sk = "POST#" + end.strftime("%Y-%m-%dT%H:%M:%S")

    response = table.query(
        KeyConditionExpression=(
            Key("PK").eq(f"USER#{user_id}") &
            Key("SK").between(start_sk, end_sk)
        ),
        ScanIndexForward=False,
    )
    return response.get("Items", [])


# AP5: Comentários de um post com paginação
def list_post_comments(post_id: str) -> list[dict]:
    all_items = []
    last_key = None

    while True:
        kwargs = {
            "KeyConditionExpression": (
                Key("PK").eq(f"POST#{post_id}") &
                Key("SK").begins_with("COMMENT#")
            ),
        }
        if last_key:
            kwargs["ExclusiveStartKey"] = last_key

        response = table.query(**kwargs)
        all_items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")

        if not last_key:
            break

    return all_items


# Criar um usuário (PutItem)
def create_user(user_id: str, name: str, email: str) -> None:
    table.put_item(
        Item={
            "PK": f"USER#{user_id}",
            "SK": "PROFILE",
            "entity_type": "USER",         # atributo auxiliar para clareza
            "user_id": user_id,
            "name": name,
            "email": email,
            "created_at": datetime.now(timezone.utc).isoformat(),
        },
        ConditionExpression="attribute_not_exists(PK)",  # evita sobrescrever usuário existente
    )


# Criar um post (SK inclui timestamp para ordenação cronológica)
def create_post(user_id: str, post_id: str, title: str, body: str) -> None:
    now = datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%S")

    # Armazena na partição do usuário (para AP3/AP4)
    table.put_item(
        Item={
            "PK": f"USER#{user_id}",
            "SK": f"POST#{timestamp}",
            "entity_type": "POST",
            "post_id": post_id,
            "title": title,
            "body": body,
            "created_at": now.isoformat(),
        }
    )

    # Armazena também na partição do post (para AP2 e AP5)
    table.put_item(
        Item={
            "PK": f"POST#{post_id}",
            "SK": "METADATA",
            "entity_type": "POST",
            "post_id": post_id,
            "user_id": user_id,
            "title": title,
            "created_at": now.isoformat(),
        }
    )

CLI — basic queries

# AP1: GetItem — buscar usuário
aws dynamodb get-item \
  --table-name blog-single-table \
  --key '{"PK": {"S": "USER#u123"}, "SK": {"S": "PROFILE"}}' \
  --consistent-read

# AP3: Query — posts do usuário, mais recente primeiro
aws dynamodb query \
  --table-name blog-single-table \
  --key-condition-expression "PK = :pk AND begins_with(SK, :prefix)" \
  --expression-attribute-values '{":pk":{"S":"USER#u123"}, ":prefix":{"S":"POST#"}}' \
  --scan-index-forward false \
  --limit 10

# AP4: Query — posts entre duas datas
aws dynamodb query \
  --table-name blog-single-table \
  --key-condition-expression "PK = :pk AND SK BETWEEN :start AND :end" \
  --expression-attribute-values '{
    ":pk":{"S":"USER#u123"},
    ":start":{"S":"POST#2024-04-01"},
    ":end":{"S":"POST#2024-04-30T23:59:59"}
  }'

# AP5: Query — comentários de um post
aws dynamodb query \
  --table-name blog-single-table \
  --key-condition-expression "PK = :pk AND begins_with(SK, :prefix)" \
  --expression-attribute-values '{":pk":{"S":"POST#p456"}, ":prefix":{"S":"COMMENT#"}}'

Common pitfalls

Pitfall 1: using FilterExpression as a substitute for KeyConditionExpression

One of the most frequent confusions: developers create a schema without a discriminating SK and
then use FilterExpression to filter by entity type. For example: table with PK = userId and
an attribute type = "ORDER" or type = "PROFILE", then query with
FilterExpression="type = ORDER". DynamoDB reads all items in the partition, charges for all
of them, and discards those that don't pass the filter. In a partition with 10,000 items, you pay
for 10,000 reads and receive 500. The fix is to use begins_with(SK, "ORDER#") in the
KeyConditionExpression — DynamoDB then only reads items with the correct SK, and you pay only
for what you receive.

Pitfall 2: timestamps in SK with inconsistent timezone

If part of the code stores timestamps in UTC and another part stores in local time, the
lexicographic ordering breaks: "POST#2024-01-15T10:00:00-03:00" and
"POST#2024-01-15T13:00:00Z" are the same instant, but lexicographically different, and will
appear separated in a Query with BETWEEN. The rule is: always UTC, always ISO 8601, always
with consistent precision
. Using
datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S") in Python ensures uniformity.
Alternatively, store Unix timestamps as Number — Key("SK").between(1705316400, 1705320000)
works correctly with the Number type.

Pitfall 3: item too large from storing embedded collections in the item

The 400 KB per item limit is generous for most cases, but it's easy to hit when storing growing
arrays as attributes of an item: comments embedded in a post, state history embedded in an order,
tags embedded in a product. The correct solution is to model each element as a separate item
in the table (using a discriminating SK), not as a list attribute on the parent item. Separate
items = unlimited growth, granular reads, updates without rewriting the entire item. List
attributes = convenient for small, fixed lists (e.g.: a list of 3-5 tags), problematic for lists
that grow indefinitely.


Reflection exercise

You are designing the DynamoDB schema for a customer support system. The domain entities are:
Customer, Ticket (support ticket), Message (message within a ticket), and Agent (support agent
who handles tickets).

The access patterns identified by the team are:
- Get customer data by ID
- List all open tickets for a customer
- List all messages in a ticket, in chronological order
- Get ticket by ID (for any customer)
- List tickets assigned to a specific agent today
- Get the most recent ticket for a customer (for the support dashboard)

Describe how you would organize the table: what would be the PK and SK values for each entity
type, how the choice of timestamp in the SK solves the requirement of "list in chronological
order" and "get the most recent", and which of the access patterns above cannot be served
with just PK + SK (requires a secondary index) and why. Also explain why "list tickets assigned
to an agent today" is different from the other patterns in terms of modeling.


Resources for further study

1. Data Modeling foundations in DynamoDB
URL: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/data-modeling-foundations.html
What you'll find: structured comparison between single table design and multiple table design,
with trade-offs of each approach (backup, encryption, streams, GraphQL, cost) and guidance on
when to use each one.
Why it's the right source: primary AWS documentation with the current position (2024+) on single
vs multi-table, which shifted slightly from previous years when AWS was more dogmatic about
single-table.

2. First steps for modeling relational data in DynamoDB
URL: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-modeling-nosql.html
What you'll find: the process of identifying access patterns with a concrete example of 15
patterns for an order system, and the mental framework for "query first, schema last".
Why it's the right source: the access patterns table in this document is the canonical AWS example
for the DynamoDB modeling process.

3. Best practices for using sort keys to organize data in DynamoDB
URL: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
What you'll find: composite sort key pattern for hierarchical data
([country]#[region]#[state]#[city]), version control pattern with sort key prefixes
(v0_ for current version, v1_... for history).
Why it's the right source: primary documentation with the two most cited sort key patterns in
DynamoDB literature.

4. Key condition expressions for the Query operation in DynamoDB
URL: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.KeyConditionExpressions.html
What you'll find: complete list of valid operators on the sort key (=, <, <=, >,
>=, BETWEEN, begins_with), with CLI examples for each one.
Why it's the right source: direct reference for implementing each access pattern; avoids the
confusion between operators available in KeyCondition vs FilterExpression.

5. The DynamoDB Book — Alex DeBrie
URL: https://www.dynamodbbook.com
What you'll find: the most comprehensive technical book on DynamoDB modeling, covering all design
patterns (adjacency list, GSI overloading, sparse indexes, write sharding) with complete examples
from real applications.
Why it's the right source: [OPINION] Alex DeBrie is considered by the community as the most
accessible and practical reference on DynamoDB — more didactic than the official documentation,
and regularly updated. It's not free, but it's widely cited in AWS re:Invent talks.