Session 041 — FinOps: Spot Instances, EC2 Fleet and Interruption Handling

June 10, 2026

Prerequisite: session-040 (Savings Plans and Reserved Instances)

Session Objectives

Understand the pricing model and risk of Spot Instances
Distinguish modern APIs (CreateFleet / Auto Scaling) from legacy ones (RequestSpotFleet — do not use)
Master allocation strategies: price-capacity-optimized, capacity-optimized, diversified
Implement complete interruption handling: rebalance recommendation + 2-min notice via IMDS and EventBridge
Apply instance type diversification and attribute-based selection
Use Spot Placement Scores for region/AZ selection

1. Economic Model and Risk

[FACT] Spot Instances offer discounts of up to 90% compared to On-Demand pricing, providing access to AWS's idle EC2 capacity. AWS can interrupt a Spot Instance with a 2-minute notice when it needs the capacity back.

[FACT] A Spot capacity pool is the set of idle instances with the same instance type and the same Availability Zone. The Spot price is defined per pool and varies with supply and demand.

Interruption reasons (AWS docs):
┌─────────────────────────────────────────────────────────────────┐
│  CAPACITY:   AWS needs the capacity back (the main cause)       │
│  PRICE:      Spot price rose above your maxPrice                │
│  CONSTRAINT: launch group / AZ group can no longer be satisfied │
└─────────────────────────────────────────────────────────────────┘

Interruption behavior (configurable):
  terminate:  default; the instance is terminated
  stop:       EBS volumes are preserved; the instance can restart when capacity returns
  hibernate:  RAM is saved to EBS; NO 2-min warning (hibernation starts immediately)

[FACT] Workloads suitable for Spot: stateless, fault-tolerant, flexible — big data, containerized, CI/CD, stateless web servers, HPC, rendering. Workloads not suitable: inflexible, stateful, tightly-coupled between nodes, intolerant of any period of partial capacity.

[CONSENSUS] Attempting to failover from Spot to On-Demand in response to interruptions can inadvertently cause more interruptions on other Spot Instances. AWS explicitly discourages this pattern.

2. APIs: What to Use and What to Avoid

[FACT] The official documentation (updated 2026) explicitly classifies Spot APIs as:

╔══════════════════════════════════╦══════════════════════════════════════════╗
║ API                              ║ Recommendation                           ║
╠══════════════════════════════════╬══════════════════════════════════════════╣
║ CreateAutoScalingGroup           ║ ✅ YES: managed lifecycle, scaling        ║
║ CreateFleet (instant mode)       ║ ✅ YES: no auto scaling needed            ║
║ RunInstances                     ║ ⚠️  LIMITED: single instance type only    ║
║ RequestSpotFleet                 ║ ❌ NO: legacy API, no investment          ║
║ RequestSpotInstances             ║ ❌ NO: legacy API, no investment          ║
╚══════════════════════════════════╩══════════════════════════════════════════╝

[FACT] RequestSpotFleet and RequestSpotInstances are explicitly marked as legacy APIs with "no planned investment" in AWS documentation. New workloads must use Auto Scaling Groups or EC2 Fleet.

3. Allocation Strategies

[FACT] When using multiple capacity pools (EC2 Fleet or Auto Scaling), the allocation strategy determines from which pools instances will be launched.

3.1 price-capacity-optimized (recommended)

[FACT] Identified as "best choice for most Spot workloads" by AWS documentation. The fleet identifies pools with the highest available capacity and, among those, selects the ones with the lowest price. Result: lower interruption rate + good cost.

Selection: pools with the most available capacity → lowest price among those
Use for:   stateless containers, microservices, web apps, data/analytics, batch
CLI:       --spot-options AllocationStrategy=price-capacity-optimized          (EC2 Fleet)
           "AllocationStrategy": "priceCapacityOptimized" in the request config (legacy Spot Fleet)
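
As a rough illustration of the selection order described above, the sketch below ranks a handful of made-up pools: it shortlists the pools with the most spare capacity and then orders that shortlist by price. The `Pool` fields, the capacity scores, the prices, and the `keep_top` cutoff are all assumptions for illustration; AWS does not publish the real algorithm or per-pool capacity figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    instance_type: str
    az: str
    capacity_score: int   # hypothetical: higher means more spare capacity
    spot_price: float     # USD per hour, made-up values


def pick_pools(pools: list[Pool], keep_top: int = 3) -> list[Pool]:
    """Toy model: shortlist by capacity first, then order by price."""
    by_capacity = sorted(pools, key=lambda p: p.capacity_score, reverse=True)
    shortlist = by_capacity[:keep_top]
    return sorted(shortlist, key=lambda p: p.spot_price)


pools = [
    Pool("c5.xlarge", "us-east-1a", capacity_score=9, spot_price=0.070),
    Pool("m5.xlarge", "us-east-1b", capacity_score=8, spot_price=0.065),
    Pool("c5a.xlarge", "us-east-1a", capacity_score=7, spot_price=0.060),
    Pool("c4.xlarge", "us-east-1c", capacity_score=2, spot_price=0.040),
]
best = pick_pools(pools)[0]
print(best.instance_type, best.az)
```

In this toy data the cheapest pool overall (c4.xlarge) never makes the shortlist because its capacity score is low, which is the behavior that separates this strategy from lowest-price.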

3.2 capacity-optimized

[FACT] Focuses exclusively on maximum capacity availability, without considering price. Useful when the cost of reprocessing after an interruption is very high (long CI, rendering, HPC with hours of computation). A variant, capacity-optimized-prioritized, lets you set a priority order for instance types, which the fleet honors on a best-effort basis.

3.3 diversified

[FACT] Distributes instances across all configured pools equally. If 10 pools are configured and target=100, it launches 10 instances in each. Protects against mass interruption of a single pool (only 10% affected).

When to use: large fleets or fleets that run for a long time
Limitation:  does not launch into pools where the Spot price ≥ the On-Demand price
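
The even split can be worked through in a few lines. This is a minimal sketch of the arithmetic only; the pool names are hypothetical, and how a real fleet places the remainder when the target does not divide evenly is an assumption here.

```python
def diversified_split(target: int, pools: list[str]) -> dict[str, int]:
    """Spread `target` instances as evenly as possible across `pools`."""
    if not pools:
        raise ValueError("at least one pool is required")
    base, remainder = divmod(target, len(pools))
    # Assumption: the first `remainder` pools take one extra instance each.
    return {pool: base + (1 if i < remainder else 0) for i, pool in enumerate(pools)}


pools = [f"pool-{i}" for i in range(10)]   # hypothetical pool names
split = diversified_split(100, pools)
# Share of the fleet lost if the single largest pool is reclaimed at once.
worst_case_loss = max(split.values()) / 100
print(split["pool-0"], f"{worst_case_loss:.0%}")
```

With 10 pools and a target of 100, each pool holds 10 instances, so losing any one pool affects 10% of the fleet.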

3.4 lowest-price (NOT recommended)

[FACT] Highest interruption risk: the strategy considers only price and ignores capacity availability, so it concentrates instances in the cheapest pools regardless of how much spare capacity they have. AWS explicitly documents: "We don't recommend the lowest-price allocation strategy."

4. Instance Type Diversification

[CONSENSUS] AWS's golden rule is to be flexible across at least 10 instance types per workload, in addition to enabling all available AZs in the VPC.

Diversification strategy for a processing workload:
  c family (compute): c5.xlarge, c5a.xlarge, c5n.xlarge, c4.xlarge
  m family (general): m5.xlarge, m5a.xlarge, m5n.xlarge, m4.xlarge
  r family (memory):  r5.large, r5a.large  (if the workload accepts them)

Principle: if vertically flexible → include larger instances (more vCPUs)
           if horizontal scaling only → include older generations (lower On-Demand demand)
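
To make the effect of diversification concrete, the sketch below counts the capacity pools that the ten instance types above open up across three Availability Zones. The AZ names are assumptions, and the count is an upper bound because not every instance type is offered in every AZ.

```python
from itertools import product

instance_types = [
    "c5.xlarge", "c5a.xlarge", "c5n.xlarge", "c4.xlarge",
    "m5.xlarge", "m5a.xlarge", "m5n.xlarge", "m4.xlarge",
    "r5.large", "r5a.large",
]
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]  # assumed AZ names

# One capacity pool per (instance type, AZ) pair.
pools = list(product(instance_types, azs))


def meets_rule_of_thumb(types: list[str], minimum: int = 10) -> bool:
    """Check the 'at least 10 instance types' guideline."""
    return len(set(types)) >= minimum


print(len(pools), meets_rule_of_thumb(instance_types))
```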

4.1 Attribute-based Instance Type Selection (ABITS)

[FACT] Instead of listing specific types, you specify attributes (min/max vCPUs, memory, architecture, etc.) and AWS automatically selects all compatible types. Ensures use of new instance types as they are launched.

# CDK Python: attribute-based selection via CfnAutoScalingGroup
# (see section 7 for the full ASG example)
instance_requirements = autoscaling.CfnAutoScalingGroup.InstanceRequirementsProperty(
    v_cpu_count=autoscaling.CfnAutoScalingGroup.VCpuCountRequestProperty(min=2, max=8),
    memory_mi_b=autoscaling.CfnAutoScalingGroup.MemoryMiBRequestProperty(min=4096, max=16384),
    cpu_manufacturers=["intel", "amd"],
    instance_generations=["current"],
)
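
What attribute-based selection does can be imitated locally. The sketch below filters a small hand-written catalog with the same vCPU and memory bounds as the snippet above; the catalog and the `select_by_attributes` helper are illustrative assumptions, while the real service evaluates every instance type available in the region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    vcpus: int
    memory_mib: int


# Small illustrative catalog, not an exhaustive list.
CATALOG = [
    InstanceSpec("t3.micro", 2, 1024),
    InstanceSpec("c5.large", 2, 4096),
    InstanceSpec("c5.xlarge", 4, 8192),
    InstanceSpec("m5.xlarge", 4, 16384),
    InstanceSpec("r5.xlarge", 4, 32768),
    InstanceSpec("c5.4xlarge", 16, 32768),
]


def select_by_attributes(catalog, vcpu_min, vcpu_max, mem_min, mem_max) -> list[str]:
    """Return the names of all types whose specs fall inside the bounds."""
    return [
        spec.name
        for spec in catalog
        if vcpu_min <= spec.vcpus <= vcpu_max
        and mem_min <= spec.memory_mib <= mem_max
    ]


matches = select_by_attributes(CATALOG, 2, 8, 4096, 16384)
print(matches)
```

Adding a new entry to the catalog that fits the bounds would make it eligible without touching the selection rule, which is the maintenance benefit the attribute-based approach is after.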

5. Spot Placement Scores

[FACT] The Spot Placement Score (1–10) indicates the likelihood of successfully provisioning the requested Spot capacity in a specific region or AZ. Score 10 = highly likely to succeed. It is a point-in-time recommendation — it does not guarantee capacity nor predict future interruption rates.

# Get the placement score for 100 vCPUs across multiple regions
aws ec2 get-spot-placement-scores \
  --target-capacity 100 \
  --target-capacity-unit-type vcpu \
  --no-single-availability-zone \
  --instance-requirements-with-metadata '{
    "ArchitectureTypes": ["x86_64"],
    "VirtualizationTypes": ["hvm"],
    "InstanceRequirements": {
      "VCpuCount": {"Min": 2, "Max": 4},
      "MemoryMiB": {"Min": 4096}
    }
  }'
# Returns: Region, Score (1-10), AvailabilityZoneId (if requested)
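
Once the scores come back, picking a region is a sort and a threshold. The sketch below works on a dictionary shaped like the command's JSON output, assumed here to carry `Region` and `Score` keys under `SpotPlacementScores`; the sample values and the `min_score` cutoff are made up.

```python
def rank_regions(response: dict, min_score: int = 1) -> list[tuple[str, int]]:
    """Order regions by score, highest first, dropping those below min_score."""
    scores = response.get("SpotPlacementScores", [])
    ranked = sorted(scores, key=lambda s: s["Score"], reverse=True)
    return [(s["Region"], s["Score"]) for s in ranked if s["Score"] >= min_score]


# Made-up sample shaped like the API response.
sample = {
    "SpotPlacementScores": [
        {"Region": "us-east-1", "Score": 7},
        {"Region": "eu-west-1", "Score": 9},
        {"Region": "ap-south-1", "Score": 3},
    ]
}
ranked = rank_regions(sample, min_score=5)
print(ranked)
```

Because the score is a point-in-time signal, a ranking like this is best recomputed right before each launch rather than cached.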

6. Interruption Handling — Two Signals

6.1 Interruption Temporal Architecture

Timeline of an interruption:

 T-?min          T-2min                    T
  │               │                         │
  ▼               ▼                         ▼
 [Rebalance      [Interruption Notice    [Instance
  Recommendation  appears in IMDS and     interrupted]
  IMDS + EB]      as EventBridge event]

  ← best-effort → ←  2-min notice*  →

* except hibernation: it starts immediately, with no 2-min warning

[FACT] The Rebalance Recommendation can arrive before the 2-minute notice, giving more time for proactive action. However, it is emitted on a best-effort basis — it may arrive at the same time as the 2-minute notice, or not arrive at all.

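[FACT] The Interruption Notice (2-minute warning) is issued two minutes before terminate and stop actions. For hibernate, hibernation starts immediately without a 2-minute advance warning. AWS notes that, in rare cases, an instance can be interrupted before the warning becomes available, so the application should also survive an abrupt stop.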

6.2 Monitoring via IMDS (Instance Metadata Service)

[FACT] Both signals are available as items in IMDSv2. AWS recommends polling every 5 seconds.

Rebalance Recommendation endpoint:
  GET http://169.254.169.254/latest/meta-data/events/recommendations/rebalance
  Response when present: {"noticeTime": "2024-01-15T14:22:00Z"}
  Response when absent:  HTTP 404

Interruption Notice endpoint:
  GET http://169.254.169.254/latest/meta-data/spot/instance-action
  Response when present: {"action": "terminate", "time": "2024-01-15T14:25:00Z"}
                         {"action": "stop",      "time": "2024-01-15T14:25:00Z"}
  Response when absent:  HTTP 404
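
The 404-versus-JSON contract above can be handled with a small parser plus an IMDSv2 fetch. This is a minimal sketch using only the standard library; the function names are illustrative, and `fetch_instance_action` only works when run on an EC2 instance, so it is defined here but not called.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(status: int, body: str) -> Optional[dict]:
    """Return the interruption notice, or None when no notice is pending."""
    if status == 404:
        return None
    if status != 200:
        raise RuntimeError(f"unexpected IMDS status {status}")
    notice = json.loads(body)
    return {"action": notice["action"], "time": notice["time"]}


def fetch_instance_action(timeout: float = 1.0) -> Optional[dict]:
    """Query IMDSv2 once: get a session token, then read spot/instance-action."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        # urllib raises on 404, which here simply means "no notice yet".
        return parse_instance_action(err.code, "")


print(parse_instance_action(404, ""))
print(parse_instance_action(200, '{"action": "terminate", "time": "2024-01-15T14:25:00Z"}'))
```

A polling loop would call `fetch_instance_action()` every 5 seconds and start the drain as soon as it returns something other than `None`.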

6.3 Monitoring via EventBridge

[FACT] Both signals are emitted as events in EventBridge in the instance's account/region. The detail-type field lets you filter them:

// Event: Rebalance Recommendation
{
  "detail-type": "EC2 Instance Rebalance Recommendation",
  "source": "aws.ec2",
  "detail": { "instance-id": "i-0abcdef1234567890" }
}

// Event: Interruption Warning (2 min)
{
  "detail-type": "EC2 Spot Instance Interruption Warning",
  "source": "aws.ec2",
  "detail": {
    "instance-id": "i-0abcdef1234567890",
    "instance-action": "terminate"   // or "stop" or "hibernate"
  }
}
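
A handler that receives both event types has to branch on detail-type, because only the interruption warning carries an instance-action field. The sketch below shows one way to do that; the function name and the returned tuple shape are illustrative choices, not part of any AWS SDK.

```python
def classify_spot_event(event: dict) -> tuple[str, str, str]:
    """Return (signal, instance_id, action) for the two Spot-related events."""
    detail_type = event.get("detail-type", "")
    detail = event.get("detail", {})
    instance_id = detail["instance-id"]
    if detail_type == "EC2 Spot Instance Interruption Warning":
        return ("interruption", instance_id, detail["instance-action"])
    if detail_type == "EC2 Instance Rebalance Recommendation":
        # Rebalance events have no instance-action field.
        return ("rebalance", instance_id, "none")
    raise ValueError(f"unhandled detail-type: {detail_type!r}")


warning = {
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "detail": {"instance-id": "i-0abcdef1234567890", "instance-action": "terminate"},
}
rebalance = {
    "detail-type": "EC2 Instance Rebalance Recommendation",
    "source": "aws.ec2",
    "detail": {"instance-id": "i-0abcdef1234567890"},
}
print(classify_spot_event(warning))
print(classify_spot_event(rebalance))
```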

6.4 Capacity Rebalancing (Auto Scaling / EC2 Fleet)

[FACT] Auto Scaling Groups and EC2 Fleet have a native Capacity Rebalancing feature: when they receive the Rebalance Recommendation signal, they automatically launch replacement instances before terminating the at-risk instances. This maintains the target capacity during transitions.

7. CDK Python — Auto Scaling Group with Spot (Mixed Instances)

from aws_cdk import (
    Stack, Duration, aws_ec2 as ec2,
    aws_autoscaling as autoscaling,
    aws_iam as iam, aws_sns as sns,
    aws_events as events, aws_events_targets as targets,
    aws_lambda as _lambda,
)
from constructs import Construct

class SpotFleetStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "VPC",
            max_azs=3,  # spread across up to 3 AZs for diversification
            nat_gateways=1,
        )

        # Security Group for the worker instances
        sg = ec2.SecurityGroup(self, "WorkerSG", vpc=vpc, allow_all_outbound=True)

        # IAM Role for the instances
        role = iam.Role(self, "WorkerRole",
            assumed_by=iam.ServicePrincipal("ec2.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name(
                    "AmazonSSMManagedInstanceCore"
                ),
            ],
        )

        # Launch Template with user data for graceful shutdown
        user_data = ec2.UserData.for_linux()
        user_data.add_commands(
            "#!/bin/bash",
            "yum install -y aws-cli jq",
            # Installs and starts the interruption-handling daemon.
            # NOTE: DRAIN_QUEUE_URL is not defined anywhere in this example; export it
            # before starting the handler and grant the role sqs:SendMessage.
            "cat > /usr/local/bin/spot-handler.sh << 'EOF'",
            "#!/bin/bash",
            "while true; do",
            "  TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \\",
            "    -H 'X-aws-ec2-metadata-token-ttl-seconds: 21600')",
            "  ACTION=$(curl -s -H \"X-aws-ec2-metadata-token: $TOKEN\" \\",
            "    http://169.254.169.254/latest/meta-data/spot/instance-action 2>/dev/null)",
            "  if echo \"$ACTION\" | grep -q 'terminate\\|stop'; then",
            "    echo \"INTERRUPTION NOTICE: $ACTION\"",
            "    systemctl stop worker.service",
            # Fetch the instance ID first: nesting $(curl ...) with escaped quotes
            # inside the JSON string breaks the header quoting.
            "    INSTANCE_ID=$(curl -s -H \"X-aws-ec2-metadata-token: $TOKEN\" \\",
            "      http://169.254.169.254/latest/meta-data/instance-id)",
            "    aws sqs send-message --queue-url \"$DRAIN_QUEUE_URL\" \\",
            "      --message-body \"{\\\"instance_id\\\":\\\"$INSTANCE_ID\\\"}\"",
            "    break",
            "  fi",
            "  REBALANCE=$(curl -s -H \"X-aws-ec2-metadata-token: $TOKEN\" \\",
            "    http://169.254.169.254/latest/meta-data/events/recommendations/rebalance 2>/dev/null)",
            "  if echo \"$REBALANCE\" | grep -q 'noticeTime'; then",
            "    echo \"REBALANCE RECOMMENDATION: $REBALANCE\"",
            "    # Stop accepting new jobs but let running ones finish",
            "    touch /var/run/spot-draining",
            "  fi",
            "  sleep 5",
            "done",
            "EOF",
            "chmod +x /usr/local/bin/spot-handler.sh",
            "nohup /usr/local/bin/spot-handler.sh &",
        )

        launch_template = ec2.LaunchTemplate(self, "WorkerLT",
            machine_image=ec2.MachineImage.latest_amazon_linux2(),
            role=role,
            security_group=sg,
            user_data=user_data,
            # Do NOT set instance_type here when using a mixed instances policy
        )

        # Auto Scaling Group with a mixed instances policy
        # Uses CfnAutoScalingGroup (L1) for access to the full MixedInstancesPolicy
        asg = autoscaling.CfnAutoScalingGroup(self, "WorkerASG",
            min_size="2",
            max_size="20",
            desired_capacity="6",
            vpc_zone_identifier=vpc.select_subnets(
                subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS
            ).subnet_ids,
            # Capacity Rebalancing: launches a replacement before terminating the at-risk instance
            capacity_rebalance=True,
            mixed_instances_policy=autoscaling.CfnAutoScalingGroup.MixedInstancesPolicyProperty(
                launch_template=autoscaling.CfnAutoScalingGroup.LaunchTemplateProperty(
                    launch_template_specification=autoscaling.CfnAutoScalingGroup.LaunchTemplateSpecificationProperty(
                        launch_template_id=launch_template.launch_template_id,
                        version=launch_template.latest_version_number,
                    ),
                    overrides=[
                        # Diversification: ≥ 10 instance types
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="c5.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="c5a.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="c5n.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="c4.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="m5.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="m5a.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="m5n.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="m4.xlarge"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="r5.large"),
                        autoscaling.CfnAutoScalingGroup.LaunchTemplateOverridesProperty(
                            instance_type="r5a.large"),
                    ],
                ),
                instances_distribution=autoscaling.CfnAutoScalingGroup.InstancesDistributionProperty(
                    # 0% On-Demand base, 100% Spot
                    on_demand_base_capacity=0,
                    on_demand_percentage_above_base_capacity=0,
                    spot_allocation_strategy="price-capacity-optimized",  # RECOMMENDED
                ),
            ),
        )

        # --- EventBridge: Lambda for Interruption Warning and Rebalance Recommendation ---
        interrupt_fn = _lambda.Function(self, "SpotInterruptHandler",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="index.handler",
            code=_lambda.Code.from_inline("""
import boto3

asg = boto3.client('autoscaling')

def handler(event, context):
    instance_id = event['detail']['instance-id']
    # Rebalance Recommendation events have no 'instance-action' field
    action = event['detail'].get('instance-action')
    if action is None:
        # Capacity Rebalancing on the ASG already launches the replacement
        print(f"REBALANCE RECOMMENDATION: {instance_id}")
        return {"statusCode": 200, "instanceId": instance_id, "action": "rebalance"}

    print(f"INTERRUPTION WARNING: {instance_id} will be {action}")

    # Detach from the ASG gracefully (lets the ASG launch a replacement)
    try:
        response = asg.describe_auto_scaling_instances(InstanceIds=[instance_id])
        if response['AutoScalingInstances']:
            asg_name = response['AutoScalingInstances'][0]['AutoScalingGroupName']
            asg.detach_instances(
                InstanceIds=[instance_id],
                AutoScalingGroupName=asg_name,
                ShouldDecrementDesiredCapacity=False,  # ASG provisions a replacement
            )
            print(f"Detached {instance_id} from ASG {asg_name}")
    except Exception as e:
        print(f"ASG detach failed (may be OK): {e}")

    return {"statusCode": 200, "instanceId": instance_id, "action": action}
"""),
        )
        interrupt_fn.add_to_role_policy(iam.PolicyStatement(
            actions=["autoscaling:DescribeAutoScalingInstances",
                     "autoscaling:DetachInstances",
                     "ec2:DescribeInstances"],
            resources=["*"],
        ))

        # EventBridge rule: Interruption Warning → Lambda
        events.Rule(self, "SpotInterruptRule",
            event_pattern=events.EventPattern(
                source=["aws.ec2"],
                detail_type=["EC2 Spot Instance Interruption Warning"],
            ),
            targets=[targets.LambdaFunction(interrupt_fn)],
        )

        # EventBridge rule: Rebalance Recommendation → Lambda (same function or a separate one)
        events.Rule(self, "SpotRebalanceRule",
            event_pattern=events.EventPattern(
                source=["aws.ec2"],
                detail_type=["EC2 Instance Rebalance Recommendation"],
            ),
            targets=[targets.LambdaFunction(interrupt_fn)],
        )

8. Python — Spot Price History and Savings Analysis

import boto3
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev
from dataclasses import dataclass
from typing import Optional

ec2 = boto3.client("ec2", region_name="us-east-1")

@dataclass
class SpotPoolAnalysis:
    instance_type: str
    az: str
    current_price: float
    avg_price_7d: float
    price_stdev: float
    on_demand_price: float
    discount_pct: float
    volatility_coefficient: float  # stdev / mean: relative price variation


def analyze_spot_pools(
    instance_types: list[str],
    on_demand_prices: dict[str, float],  # {instance_type: price}
    lookback_days: int = 7,
) -> list[SpotPoolAnalysis]:
    """
    Analyzes Spot price history to find the cheapest pools with the most stable prices.
    Note: a low, stable price suggests steady availability but does not guarantee it;
    price history is not a direct measure of interruption risk.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)

    results = []
    for instance_type in instance_types:
        paginator = ec2.get_paginator("describe_spot_price_history")
        pages = paginator.paginate(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=start,
            EndTime=end,
        )

        # Group (timestamp, price) samples by AZ
        samples_by_az: dict[str, list[tuple[datetime, float]]] = {}
        for page in pages:
            for entry in page["SpotPriceHistory"]:
                az = entry["AvailabilityZone"]
                price = float(entry["SpotPrice"])
                samples_by_az.setdefault(az, []).append((entry["Timestamp"], price))

        od_price = on_demand_prices.get(instance_type, 0.0)

        for az, samples in samples_by_az.items():
            if len(samples) < 2:
                continue
            # Sort by timestamp instead of relying on the API's response order
            samples.sort(key=lambda s: s[0])
            prices = [price for _, price in samples]
            avg = mean(prices)
            sd = stdev(prices)
            cv = sd / avg if avg > 0 else 0  # coefficient of variation

            results.append(SpotPoolAnalysis(
                instance_type=instance_type,
                az=az,
                current_price=prices[-1],  # most recent sample
                avg_price_7d=avg,
                price_stdev=sd,
                on_demand_price=od_price,
                discount_pct=((od_price - avg) / od_price * 100) if od_price > 0 else 0,
                volatility_coefficient=cv,
            ))

    # Sort: lowest volatility first, then lowest price
    return sorted(results, key=lambda r: (r.volatility_coefficient, r.avg_price_7d))


def print_pool_ranking(pools: list[SpotPoolAnalysis], top_n: int = 5):
    print(f"\n{'Type':15} {'AZ':20} {'Current':8} {'Avg7d':8} {'Discount':9} {'Volatility':13}")
    print("-" * 75)
    for p in pools[:top_n]:
        print(
            f"{p.instance_type:15} {p.az:20} "
            f"${p.current_price:.4f} ${p.avg_price_7d:.4f} "
            f"{p.discount_pct:6.1f}%    CV={p.volatility_coefficient:.3f}"
        )


# Usage example
if __name__ == "__main__":
    instance_types = [
        "c5.xlarge", "c5a.xlarge", "c5n.xlarge",
        "m5.xlarge", "m5a.xlarge",
    ]
    # On-Demand prices for us-east-1 (reference values; verify at aws.amazon.com/ec2/pricing)
    od_prices = {
        "c5.xlarge":  0.170,
        "c5a.xlarge": 0.154,
        "c5n.xlarge": 0.216,
        "m5.xlarge":  0.192,
        "m5a.xlarge": 0.172,
    }

    pools = analyze_spot_pools(instance_types, od_prices, lookback_days=7)
    print_pool_ranking(pools, top_n=10)

    # Identify pools with a discount > 50% and low volatility (CV < 0.10)
    prime_pools = [p for p in pools if p.discount_pct > 50 and p.volatility_coefficient < 0.10]
    print(f"\n'Prime' pools (>50% discount, CV<0.10): {len(prime_pools)}")
    for p in prime_pools:
        print(f"  → {p.instance_type} @ {p.az}: ${p.avg_price_7d:.4f}/h ({p.discount_pct:.1f}% off)")

9. CLI — Essential Examples

# 1. Spot price history: last 24h for specific types (GNU date syntax)
aws ec2 describe-spot-price-history \
  --instance-types c5.xlarge m5.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'SpotPriceHistory[*].{Type:InstanceType,AZ:AvailabilityZone,Price:SpotPrice,Time:Timestamp}' \
  --output table

# 2. Spot Placement Score: find the best region for 100 vCPUs
aws ec2 get-spot-placement-scores \
  --target-capacity 100 \
  --target-capacity-unit-type vcpu \
  --no-single-availability-zone \
  --instance-requirements-with-metadata '{
    "ArchitectureTypes": ["x86_64"],
    "VirtualizationTypes": ["hvm"],
    "InstanceRequirements": {
      "VCpuCount": {"Min": 2, "Max": 8},
      "MemoryMiB": {"Min": 4096, "Max": 32768},
      "InstanceGenerations": ["current"]
    }
  }' \
  --query 'SpotPlacementScores | sort_by(@, &Score) | reverse(@)' \
  --output table

# 3. Create an EC2 Fleet (instant mode) with price-capacity-optimized
aws ec2 create-fleet \
  --type instant \
  --target-capacity-specification TotalTargetCapacity=10,DefaultTargetCapacityType=spot \
  --spot-options AllocationStrategy=price-capacity-optimized,MaxTotalPrice=1.50 \
  --launch-template-configs '[
    {
      "LaunchTemplateSpecification": {
        "LaunchTemplateId": "lt-0abcdef1234567890",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "c5.xlarge",  "SubnetId": "subnet-aaa"},
        {"InstanceType": "c5a.xlarge", "SubnetId": "subnet-aaa"},
        {"InstanceType": "m5.xlarge",  "SubnetId": "subnet-bbb"},
        {"InstanceType": "m5a.xlarge", "SubnetId": "subnet-bbb"},
        {"InstanceType": "c5.xlarge",  "SubnetId": "subnet-ccc"},
        {"InstanceType": "m5.xlarge",  "SubnetId": "subnet-ccc"}
      ]
    }
  ]' \
  --query 'FleetId'

# 4. Describe the fleet and its instances
aws ec2 describe-fleets \
  --fleet-ids fleet-0abc1234def56789 \
  --query 'Fleets[0].{State:FleetState,Capacity:TargetCapacitySpecification}'

aws ec2 describe-fleet-instances \
  --fleet-id fleet-0abc1234def56789

# 5. ASG: list the instances in the mixed fleet (type, lifecycle state, health)
aws autoscaling describe-auto-scaling-instances \
  --query 'AutoScalingInstances[?AutoScalingGroupName==`WorkerASG`].[InstanceId,InstanceType,LifecycleState,HealthStatus]' \
  --output table

# 6. Delete the EC2 Fleet and terminate its instances
aws ec2 delete-fleets \
  --fleet-ids fleet-0abc1234def56789 \
  --terminate-instances

# 7. Simulate an interruption with AWS FIS (Fault Injection Service)
#    Useful for testing graceful shutdown in staging environments
aws fis create-experiment-template \
  --description "Test Spot interruption handling" \
  --actions '{"InterruptSpot": {
    "actionId": "aws:ec2:send-spot-instance-interruptions",
    "parameters": {"durationBeforeInterruption": "PT2M"},
    "targets": {"SpotInstances": "targetInstances"}
  }}' \
  --targets '{"targetInstances": {
    "resourceType": "aws:ec2:spot-instance",
    "resourceArns": ["arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123"],
    "selectionMode": "ALL"
  }}' \
  --role-arn arn:aws:iam::123456789012:role/FISRole \
  --stop-conditions '[{"source":"none"}]'

10. Billing for Interrupted Instances

[FACT] According to official AWS documentation:

╔═══════════════════════════════════════════════════════════════════╗
║ Who interrupted        │ Charge for partial hours                 ║
╠═══════════════════════════════════════════════════════════════════╣
║ AWS (Spot interruption)│ First hour: NOT charged                  ║
║                        │ After the first hour: CHARGED for usage  ║
╠═══════════════════════════════════════════════════════════════════╣
║ You (manual terminate) │ Partial hour: CHARGED as usual           ║
╚═══════════════════════════════════════════════════════════════════╝

Example (per-second billing, e.g. Amazon Linux):
         instance runs 2h 23min, AWS interrupts → pay for 2h 23min
         instance runs 0h 47min, AWS interrupts → pay $0 (first hour)
         instance runs 1h 23min, you terminate  → pay for 1h 23min (prorated)
Hourly-billed operating systems (RHEL, SUSE): after an AWS interruption you pay
only for the full hours used (2h in the first example); the partial hour is waived.

[FACT] For Spot Instances with stop and hibernate behavior: the instance stops accruing compute charges when stopped, but EBS and Elastic IPs continue to be charged.
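
The billing rules can be turned into a small calculator. The sketch below encodes one reading of the AWS billing table, which should be verified against the current documentation: no charge when AWS interrupts in the first hour, per-second charging otherwise for most Linux and Windows instances, and whole-hour rounding for hourly-billed operating systems such as RHEL and SUSE. The function name and parameters are illustrative, and the per-second minimum charge is ignored.

```python
def billable_seconds(runtime_s: int, interrupted_by_aws: bool, per_second: bool = True) -> int:
    """Seconds of compute billed for a Spot Instance under the assumed rules."""
    hour = 3600
    if interrupted_by_aws and runtime_s < hour:
        return 0  # interrupted by AWS in the first hour: no charge
    if per_second:
        return runtime_s  # charged for the seconds actually used
    full_hours = runtime_s // hour
    if interrupted_by_aws:
        return full_hours * hour  # hourly billing: interrupted partial hour is waived
    # Hourly billing, user-terminated: the partial hour rounds up to a full hour.
    return -(-runtime_s // hour) * hour


two_h_23 = 2 * 3600 + 23 * 60
print(billable_seconds(two_h_23, interrupted_by_aws=True))                    # per-second OS
print(billable_seconds(two_h_23, interrupted_by_aws=True, per_second=False))  # hourly-billed OS
print(billable_seconds(47 * 60, interrupted_by_aws=True))
print(billable_seconds(3600 + 23 * 60, interrupted_by_aws=False))
```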

11. Diagram: Interruption Handling Pipeline

Complete interruption flow in production:

 AWS decides        IMDS/EventBridge        Application
 to interrupt
     │
     ├──[early]──► EC2 Instance        ──► ASG proactively launches
     │             Rebalance Rec.           a replacement instance
     │             (best-effort)
     │
     ├──[T-2min]─► EC2 Spot Instance   ──► Lambda EventBridge handler:
     │             Interruption Warning      • Detach from ASG
     │             (2-min notice*)           • Notify SQS/SNS
     │                                       • Log context
     │
     ├──[T-2min]─► IMDS polls (5s)     ──► spot-handler.sh:
     │             /spot/instance-action     • systemctl stop worker
     │                                       • flush pending work
     │                                       • upload logs to S3
     │
     └──[T=0]────► Instance terminates ──► ASG keeps desired capacity
                   (or stops/hibernates)    with a new healthy instance

* Except hibernate: no 2-min warning (hibernates immediately)

12. Pitfalls

[FACT] RequestSpotFleet is legacy: new projects must use CreateFleet or Auto Scaling Groups. AWS documentation marks RequestSpotFleet as "no planned investment" — there is no guarantee of future support.

[FACT] lowest-price has the highest interruption risk: the strategy focuses exclusively on price, without considering availability, so it concentrates instances in the cheapest pools even when they have little spare capacity. Use price-capacity-optimized.

[FACT] Misconfigured maxPrice increases interruptions: specifying maxPrice causes more interruptions than not specifying it. If the Spot price exceeds your limit, the instance is interrupted. Most workloads should not set maxPrice.

[FACT] Hibernate has no 2-min warning: hibernation starts immediately upon receiving the signal — there is no 2-minute window for graceful shutdown. Use terminate or stop if you need time for cleanup.

[CONSENSUS] Spot Instances are not suitable for primary databases: stateful, tightly coupled, and intolerant of interruption. Use On-Demand or Reserved Instances for databases.

[FACT] Capacity Rebalancing can cause temporary scale-out: when launching a replacement before terminating the at-risk instance, the ASG temporarily goes above the desired capacity. This is expected and is not a bug.

[CONSENSUS] Spot + Savings Plans do not stack: Savings Plans discount On-Demand usage (EC2, Fargate, Lambda) and do not apply to Spot usage. For Spot EC2, the discount is already built into the Spot price.

[UNCERTAIN] Interruption rate per pool: the Spot Instance Advisor shows historical interruption frequencies in ranges (<5%, 5-10%, 10-15%, 15-20%, >20%). These ranges are indicative and do not guarantee future behavior.

13. When NOT to Use Spot Instances

[FACT] AWS documentation is explicit — Spot Instances are not suitable for:

Do NOT use Spot for:
  ✗ Inflexible workloads (require an exact instance type)
  ✗ Stateful workloads without checkpointing
  ✗ Applications tightly coupled between nodes (HPC MPI without checkpointing)
  ✗ Primary databases (stateful, intolerant of interruption)
  ✗ Workloads that cannot tolerate any period without full capacity
  ✗ Applications that depend on failover to On-Demand*

USE Spot for:
  ✓ Big data / batch ETL
  ✓ Stateless containers (ECS, EKS)
  ✓ CI/CD runners
  ✓ Stateless web servers (behind an ALB)
  ✓ HPC with checkpointing
  ✓ Rendering / encoding
  ✓ ML training (with checkpointing to S3)

* Spot→On-Demand failover can cause more interruptions on other Spot Instances

14. Visual Summary

                    SPOT INSTANCES: CONCEPT MAP

  Price                                   Interruption
  ─────                                   ────────────
  Up to 90% discount                      AWS warns ~2 min ahead
  Varies per pool (type + AZ)             Rebalance Rec: early warning
  History: describe-spot-price-history    Poll IMDS every 5s
  Placement Score: 1-10 per region/AZ     EventBridge: 2 event types
                  │                                     │
                  ▼                                     ▼
         Diversification                     Handling Actions
         ───────────────                     ────────────────
         ≥ 10 instance types                 Graceful shutdown
         All AZs enabled                     Checkpoint to S3/DynamoDB
         ABITS: attributes, not fixed types  Detach from ASG
                  │                          Capacity Rebalancing
                  ▼
         Allocation Strategy
         ───────────────────
         price-capacity-optimized ← RECOMMENDED
         capacity-optimized       ← high interruption cost
         diversified              ← large/long-running fleets
         lowest-price             ← DO NOT USE
                  │
                  ▼
         Modern APIs
         ───────────
         CreateAutoScalingGroup   ← with lifecycle
         CreateFleet (instant)    ← no auto scaling
         RequestSpotFleet         ← LEGACY, do not use
         RequestSpotInstances     ← LEGACY, do not use

Reflection Exercise

An ML team uses an Auto Scaling Group with the lowest-price strategy and a single instance type (p3.2xlarge only). Training takes 6 hours and has no checkpointing. The interruption rate is high (>20% of jobs are interrupted before completion), causing high reprocessing costs.

Identify all design problems and propose a complete architectural solution, considering:

Which allocation strategy should be used and why?
How should instance type diversification be done for GPUs?
What should change in the training job design to tolerate interruptions?
How does Capacity Rebalancing help (or not) in this specific scenario?
Is there any situation where Spot would be unsuitable even with all the above improvements?

References

[FACT] Best practices for Amazon EC2 Spot — docs.aws.amazon.com (updated 2026)
[FACT] Spot Instance interruption notices — docs.aws.amazon.com
[FACT] EC2 instance rebalance recommendations — docs.aws.amazon.com
[FACT] Allocation strategies for EC2 Fleet or Spot Fleet — docs.aws.amazon.com
[FACT] Prepare for Spot Instance interruptions — docs.aws.amazon.com
[FACT] Spot placement score — docs.aws.amazon.com