Cloud Provider Outage Playbook: Steps for Engineering Teams When AWS or Cloudflare Goes Down
A concise, battle-tested runbook for SREs to mitigate CDN or cloud provider outages—DNS, failover, traffic rerouting, comms, and postmortems.
Outages strike without warning, and when they hit a cloud provider or CDN, engineering teams face pressure to restore service, protect SLAs, and keep stakeholders calm. This playbook gives SREs and DevOps a concise, battle-tested sequence of actions to mitigate, fail over, and communicate during an AWS or Cloudflare disruption in 2026.
Why this matters now (2026 context)
Late 2025 and early 2026 saw accelerated adoption of multi-cloud orchestration, edge compute, and multi-CDN tooling. Simultaneously, customers demand stronger SLAs and real-time transparency. That technical and business landscape raises the bar: teams must expect provider outages as operational realities and have automated, validated runbooks that reduce human error under duress.
Principle: Speed and clarity beat perfect fixes while an incident is live. Aim to contain and route traffic first, then restore full functionality and investigate.
The Inverted Pyramid: What to do first (0–30 minutes)
Start here. These are the non-negotiable steps that protect customers and buy you time to plan deeper remediation.
1. Confirm the scope and impact
- Check internal alerts and key dashboards (SLO/SLA dashboards, latency/error rates, synthetic checks).
- Correlate with external monitoring (DownDetector, provider status pages) and your own real-user monitoring (RUM).
- Identify affected surfaces: control plane (API consoles), data plane (routing, CDN cache), or both.
2. Open a tactical incident channel
- Create a single source of truth: incident bridge (Zoom/Jitsi), Slack channel, or Opsgenie incident room. Assign an Incident Commander (IC).
- Record the timeline and the names/roles of participants (IC, comms lead, network lead, DNS lead, engineering lead).
3. Set and communicate initial status
- Publish a brief initial message within 10–15 minutes to internal stakeholders and your external status page: what’s affected, user impact, ETA for next update.
- Use templated messages (see templates below) to avoid delay and ensure consistent language about SLAs and mitigation steps.
Mitigation & Failover Playbook (30–120 minutes)
These are tactical steps for traffic rerouting, DNS mitigation, and rapid degradation strategies. Follow them in priority order and document every change.
4. DNS mitigation checklist
DNS is your fastest lever to move traffic. But changes propagate according to TTL; be deliberate.
- Keep TTLs low proactively: If you already run short TTLs (30–60s) in production, you can redirect quickly. If TTLs are long, communicate that DNS-based failover will be slow and plan alternative actions.
- Switch to secondary authoritative DNS: If your primary DNS provider is affected, promote your registered secondary DNS provider. Ensure zone files are synchronized via automation (Terraform, Pulumi, or zone-file GitOps).
- Use failover records: For AWS Route 53, use failover routing policies and health checks. For Cloudflare, use Load Balancing with active-passive pools or API-driven pool switching.
- Consider DNS-based traffic shaping: For multi-region deployments, weighted records can shift traffic gradually to healthy regions to avoid spikes.
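If you use weighted records for gradual shifts (the item above), a single change batch can move most traffic onto a healthy region while keeping a small share on the impaired one. This is a minimal sketch, not a drop-in command: the hosted zone ID, set identifiers, and IP addresses are placeholders, and it assumes the records already exist as a weighted set.
<code># Shift ~80% of traffic to the healthy region; weights are relative (0-255).
aws route53 change-resource-record-sets --hosted-zone-id Z12345 --change-batch '{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "www.example.com", "Type": "A",
      "SetIdentifier": "primary-region", "Weight": 20, "TTL": 60,
      "ResourceRecords": [{"Value": "198.51.100.10"}]}},
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "www.example.com", "Type": "A",
      "SetIdentifier": "secondary-region", "Weight": 80, "TTL": 60,
      "ResourceRecords": [{"Value": "203.0.113.20"}]}}
  ]
}'</code>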
5. Traffic rerouting & multi-cloud failover
- Activate secondary provider or region: If Cloudflare’s CDN is down, enable an alternate CDN or origin-serving strategy (S3/MinIO static fallback or direct origin URLs). If AWS is down, shift to another cloud region or cloud vendor for critical services.
- Use BGP or CDN-level failover where possible: For self-hosted edge or colo, prepare BGP announcements with RPKI-validated prefixes and pre-approved routing policies. Use network automation to announce/withdraw prefixes safely.
- Service mesh / ingress-level failover: For microservices, leverage your service mesh (Istio, Linkerd) or ingress controller to route around failing regions or services (a weighted-routing sketch follows this list). Use circuit breakers and bulkheads to limit blast radius.
- Rate-limit and degrade gracefully: If full failover isn’t possible, apply strict rate limits and feature gating to prioritize core flows (logins, payments, API health checks) and disable non-essential features.
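For the service mesh option above, one pattern is a weighted VirtualService change that drains a failing backend without touching DNS. The sketch below assumes Istio is installed with sidecar injection; the checkout-primary and checkout-fallback Services and the prod namespace are hypothetical names.
<code># Drain the impaired backend by shifting all mesh traffic to a fallback Service.
# Service and namespace names are hypothetical; adjust to your cluster.
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-failover
  namespace: prod
spec:
  hosts:
    - checkout.prod.svc.cluster.local
  http:
    - route:
        - destination:
            host: checkout-primary.prod.svc.cluster.local
          weight: 0      # impaired backend: no traffic
        - destination:
            host: checkout-fallback.prod.svc.cluster.local
          weight: 100    # healthy backend: all traffic
EOF
</code>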
6. CDN-specific actions (Cloudflare / other CDNs)
- Inspect CDN control plane status: If the CDN control plane is partially available, use the control APIs to reroute pools, disable problematic rules, or purge caches selectively (a purge example follows this list).
- Origin fallback: If edge caching can’t serve content, switch to origin-first policies or enable a direct-to-origin bypass while protecting origins with signed URLs and WAF rules.
- Mitigate WAF or edge rule failures: Disable or roll back recently deployed edge rules that may be causing a spike in errors.
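When the control plane responds, a selective purge (first item above) clears stale or error-poisoned objects without triggering a full-site origin stampede. A minimal sketch, assuming a scoped API token; the zone ID and URLs are placeholders.
<code># Purge specific assets instead of everything to protect the origin.
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"files": ["https://www.example.com/index.html", "https://www.example.com/assets/app.js"]}'</code>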
7. AWS-specific actions
- Route 53 failover: Use Route 53’s health checks and failover routing to shift traffic to secondary regions/records; confirm TTLs are low before switching.
- Use S3 static hosting or pre-warmed instances: If dynamic services are down, expose static UI assets hosted in S3 or pre-warmed EC2 instances in a healthy region to preserve read-only access.
- RDS read-replicas and global databases: Promote read replicas only if fully tested; ensure replication lag is acceptable and failover doesn’t violate transactional consistency SLAs.
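Two commands that tend to come up in this step, as a hedged sketch: the health check ID and instance identifier are placeholders, and promotion is one-way, so only run it once the caveats above are satisfied.
<code># See what Route 53 health checkers currently report before trusting failover records.
aws route53 get-health-check-status --health-check-id abcdef12-3456-7890-abcd-ef1234567890

# Inspect replication status, then promote the replica to a standalone primary (irreversible).
aws rds describe-db-instances --db-instance-identifier app-db-replica \
  --query 'DBInstances[0].StatusInfos'
aws rds promote-read-replica --db-instance-identifier app-db-replica</code>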
Command & API Examples (actionable snippets)
Use these as templates; validate in a staging account before relying on them during an incident.
Route 53: change record set to point to secondary IP
<code>aws route53 change-resource-record-sets --hosted-zone-id Z12345 --change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "www.example.com",
"Type": "A",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.10"}]
}
}]
}'</code>
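The hosted zone ID and IP above are placeholders. After the change, confirm Route 53 has propagated it to its name servers and that public resolvers return the new answer:
<code># change-resource-record-sets returns ChangeInfo.Id; poll until status is INSYNC.
aws route53 get-change --id /change/C2682N5HXP0BZ4   # placeholder change ID

# Spot-check what public resolvers now serve for the record.
dig +short www.example.com @1.1.1.1
dig +short www.example.com @8.8.8.8</code>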
Cloudflare: disable a load balancer pool to force failover (API)
<code># Pools are account-level resources; disabling the active pool makes the load
# balancer fail over to the next pool in its fallback order.
curl -X PATCH "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/load_balancers/pools/$POOL_ID" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"enabled": false}'
</code>
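To confirm the pool actually toggled and the load balancer is serving from its fallback pool, read the pool back; the account and pool IDs are placeholders.
<code># Read the pool back and verify "enabled" is now false.
curl -s "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/load_balancers/pools/$POOL_ID" \
  -H "Authorization: Bearer $CF_API_TOKEN" | jq '.result.enabled'</code>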
Terraform fail-safe tip
Maintain a minimal, pre-approved Terraform plan that can be applied by a junior on-call: Route 53 records + DNS delegations + a static origin set. Keep it in a separate repo with CI disabled for emergency manual apply.
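The apply flow for that emergency repo should stay deliberately boring. A sketch of what the on-call engineer runs, assuming the repo contains only the minimal failover resources and a pre-configured backend (the repository URL is a placeholder):
<code># Apply the pre-approved failover plan, and only that plan.
git clone git@github.com:example-org/dns-emergency.git && cd dns-emergency
terraform init                        # backend settings are checked in and pre-approved
terraform plan -out=failover.tfplan   # review the diff on the incident bridge
terraform apply failover.tfplan       # applies exactly the reviewed plan</code>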
Communication Playbook: Internal & External (first 2 hours)
Clarity and cadence matter as much as technical fixes. Use templates and update frequently.
Initial external status (under 15 minutes)
Template:
We are aware of increased errors/latency affecting [service/region]. Our engineering team is actively investigating. We will provide an update within 30 minutes. Impact: [login / API / CDN assets].
Follow-up update (30–60 minutes)
Template:
Update: We have isolated the issue to [provider / component]. Temporary mitigations are in place (DNS change / traffic reroute / feature gating). Expected recovery: [ETA or "ongoing"]. We will post regular updates every 30 minutes.
Internal incident comms checklist
- Notify product and customer success teams with impact summary and suggested customer responses.
- Enable customer-facing staff with a short FAQ and escalation path for enterprise customers with SLAs.
- Maintain a public incident timeline and a private incident log for forensic data (request IDs, headers, timestamps).
During Recovery: Validate & Harden (2–24 hours)
Once traffic is stabilized, focus on validation and preventing regressions.
8. Validate traffic and performance
- Monitor synthetic checks, RUM, and logs for error rates and latency across all regions.
- Run smoke tests for critical customer flows (a minimal loop follows this list) and verify database consistency after failovers.
- Gradually revert temporary mitigations only after stable metrics persist (recommendation: maintain for at least 1–2 hours of healthy traffic).
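A minimal loop for the smoke tests above; the endpoints are placeholders, and real checks should run from several regions and assert on response bodies, not just status codes.
<code># Quick post-failover smoke test: flag any critical endpoint that is not returning 200.
for url in \
  "https://www.example.com/healthz" \
  "https://api.example.com/v1/status" \
  "https://www.example.com/login"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$url -> $code"
  [ "$code" = "200" ] || echo "ALERT: $url returned $code"
done</code>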
9. Collect forensic evidence
- Snapshot relevant logs and traces, preserve provider incident IDs, and export configuration states (DNS records, load balancer state, Terraform state) for the postmortem; example snapshot commands follow this list.
- Record exact commands and API calls with timestamps — these are critical for audit and SLA dispute resolution.
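A few snapshot commands that cover the items above; the zone ID and target group ARN are placeholders, and exports belong in the private incident log, not a public channel.
<code># Capture DNS, load balancer, and IaC state as they were during the incident.
ts=$(date -u +%Y%m%dT%H%M%SZ)
aws route53 list-resource-record-sets --hosted-zone-id Z12345 > "dns-records-$ts.json"
aws elbv2 describe-target-health --target-group-arn "$TG_ARN" > "target-health-$ts.json"
terraform state pull > "tfstate-$ts.json"   # read-only snapshot of current state</code>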
10. Assess SLA and regulatory obligations
- Calculate downtime windows per service and determine if SLA credits are triggered. Retrieve provider incident IDs as proof for claims.
- If regulated data or outages cross legal reporting thresholds (GDPR, HIPAA, SEC rules, or local telecom regs), notify Compliance and Legal immediately.
Postmortem & Continuous Improvement (24–72 hours)
A quality postmortem builds trust and reduces future risk. Use a blameless format and end with concrete action items and owners.
11. Postmortem checklist
- Timeline: reconstruct minute-by-minute (who did what and when).
- Root cause analysis: separate root cause(s) from contributing factors and highlight provider vs. internal failures.
- Mitigation efficacy: what worked, what didn’t, and why — validate playbook steps and update them.
- Action items: prioritize fixes (short-term mitigations, medium-term automation, long-term architecture changes). Assign owners and deadlines.
12. Update runbooks and test them
- Convert lessons learned into automated runbooks and small, low-risk playbooks that on-call engineers cannot easily misapply.
- Run scheduled chaos/DR drills that include provider outages (simulate CDN blackhole, provider console unavailability, DNS failure). Update SLIs/SLOs accordingly.
Prevention & Hardening: Investments that pay off
These are near-term technical investments that materially reduce outage impact.
- Multi-CDN and multi-DNS: Use multi-CDN orchestration (DNS-based, edge-smart) and at least two independent authoritative DNS providers with automated zone synchronization.
- Short TTLs and staged feature flags: Keep DNS TTLs low for critical records, and use feature flags for rapid fail-open/fail-closed behavior.
- Automated playbooks: Codify emergency plans in scripts and IaC that are versioned and reviewed; maintain a minimal emergency Terraform plan for rapid apply.
- Observability & synthetic checks: Increase geographic synthetic coverage (including mobile networks), and instrument error budgets and alerting thresholds that trigger runbooks automatically.
- Legal & procurement: Negotiate provider SLAs with clear incident reporting and credits; require post-incident root cause statements for major incidents.
- Network security: Adopt RPKI for BGP announcements and monitor for route hijacks; use eBPF for observability and traffic steering at the host level if you operate in colos or edge sites.
Playbook Examples: Two concise scenarios
Scenario A — Cloudflare edge outage (CDN data plane degraded)
- Confirm via the Cloudflare status page and failing synthetic checks; open an incident and assign an IC.
- Switch important zones to origin-first behavior via API or disable problematic Workers/routes to reduce edge errors.
- Enable fallback CDN or S3 static hosting for assets; update DNS weight or CNAMEs with short TTLs.
- Notify customers and provide ETA; monitor RUM and errors.
- Collect logs and request Cloudflare incident ID; run postmortem and add multi-CDN automation.
Scenario B — AWS region control plane issue (console/API slow)
- Determine which services are impacted (control plane vs. data plane). If EC2/EBS are operational but the console/API is slow, use the CLI against healthy regional endpoints (see the sketch after this list) or existing runbook automation to perform failover.
- Promote secondary region for critical services using pre-tested scripts; shift Route 53 records to healthy endpoints.
- Use provider support and account reps for priority escalation; track incident ID and timeline.
- Validate data integrity (DB lag) before cutover and document the change for the postmortem.
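When the console is degraded but the data plane is healthy, pinning the CLI to a working regional endpoint usually still works; the region names and queries below are illustrative.
<code># Bypass the console and query a healthy region's API directly.
aws ec2 describe-instance-status --region us-west-2 --include-all-instances \
  --query 'InstanceStatuses[].{id:InstanceId,state:InstanceState.Name}'

# Some runbooks also pin the endpoint explicitly, not just the region.
aws ec2 describe-availability-zones --region us-west-2 \
  --endpoint-url https://ec2.us-west-2.amazonaws.com</code>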
Templates & Quick References
Internal incident brief (copy/paste)
Incident: [short title]
Impact: [who/what]
Start: [timestamp]
Current status: [mitigation in place]
Next update: [ETA]
IC: [name]
Action required: [list]
External status (copy/paste)
We are investigating an issue affecting [service]. We are actively mitigating and will post updates every [interval]. For enterprise customers, please contact [escalation contact].
Final Recommendations & Future Predictions (2026+)
Expect outages to remain part of cloud operations. However, the following trends will shape how teams manage them:
- Increased automation of multi-provider failover: Orchestration platforms will make DNS and traffic shifts programmable and testable, reducing human error.
- Edge-aware SRE practices: eBPF-based telemetry and edge runbooks will let teams steer traffic at the host and NIC level, improving resilience for latency-sensitive services.
- Stronger BGP/RPKI adoption: As RPKI becomes more widespread, network-based hijack risks will decline, but teams should include BGP verification in their DR plans.
- Shift-left resilience: Chaos engineering for provider outages will be integrated into CI pipelines and runbooks will be codified and validated pre-deployment.
Closing: What to do first after reading this
- Verify you have a single, tested incident runbook with DNS and traffic rerouting playbooks and a public status page template.
- Run a tabletop drill this quarter simulating a CDN or cloud provider outage and update your runbook with any failures from the drill.
- Assign owners to automation tasks: multi-DNS sync, emergency Terraform repo, and synthetic checks expansion.
Call to action: If you want a ready-to-run emergency repo and templates tailored to your stack (Route 53 + Cloudflare, or multi-cloud fallbacks), request our Incident Repo starter kit — tested with SRE teams in production and updated for 2026 best practices.