Chaos Engineering for Headless CMS Dependency Failures

Chaos engineering for CMS dependencies injects controlled degradation (latency, 5xx responses, truncated payloads) into the data pipeline to prove that fallback routing, circuit breakers, and stale-content serving actually fire before a real outage tests them. Most headless stacks treat the content API as always available, but in production GraphQL and REST endpoints throttle, partition by region, drift their schema, and return malformed JSON. When they degrade, frontend hydration stalls, ISR revalidation queues saturate, and CDN caches without stale-serving rules pass origin errors straight to users. The goal isn’t uptime; it’s verifying graceful degradation when the CMS is partially or fully down.

Root-Cause Analysis

Cascading failures trace to three untested assumptions in the data-fetching layer:

Retry loops without backoff or caps. Immediate or uncapped retries exhaust connection pools and trip CMS rate limits (HTTP 429). Without jitter and a hard cap, synchronized retries turn a partial outage into a full denial of service.
No timeout boundary at the edge. CDN routing and ISR handlers wait indefinitely for 200 OK. Build workers and serverless functions hang, burn execution quota, and block deploys.
No typed fallback contract. On 5xx or truncated JSON, components hit undefined property access and unhandled rejections that crash client rendering.

The root cause is treating a volatile third-party endpoint as infallible. Without fault-domain isolation, one CMS hiccup propagates straight into SSR, static generation, and client hydration.
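A typed fallback contract can be as small as a parser that never throws and always returns a known shape. The sketch below is illustrative only: the PageContent fields, the FALLBACK_PAGE constant, and the parsePage name are assumptions, not part of any CMS SDK.

```typescript
// Hypothetical page shape; substitute the fields your CMS actually returns.
interface PageContent {
  title: string;
  body: string;
  isFallback: boolean;
}

// Static content rendered when the CMS payload is missing or malformed.
const FALLBACK_PAGE: PageContent = {
  title: 'Content Temporarily Unavailable',
  body: '',
  isFallback: true,
};

// Narrow an unknown value to a plain object before reading properties.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Parse a raw response body; never throws, always returns a PageContent.
function parsePage(raw: string): PageContent {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return FALLBACK_PAGE; // truncated or malformed JSON
  }
  const data = isRecord(parsed) ? parsed.data : undefined;
  const page = isRecord(data) ? data.page : undefined;
  if (!isRecord(page) || typeof page.title !== 'string') {
    return FALLBACK_PAGE; // valid JSON, but not the shape components expect
  }
  return {
    title: page.title,
    body: typeof page.body === 'string' ? page.body : '',
    isFallback: false,
  };
}
```

Components that consume PageContent can branch on isFallback instead of guarding every property access, which keeps undefined reads out of the render path.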

The resilience layers that contain injected degradation:

flowchart TD
  Inject["Inject latency / 5xx / truncated JSON"] --> MSW["MSW interception layer"]
  MSW --> CB{"Circuit breaker: retries + backoff + jitter"}
  CB -->|"recovered"| OK["Serve fresh content"]
  CB -->|"timeout via AbortController"| ISR["ISR fallback to cached slugs"]
  CB -->|"origin 5xx"| Edge{"CDN cache state"}
  Edge -->|"stale-if-error"| Stale["Serve stale content"]
  Edge -->|"no cache"| FB["Typed fallback UI"]
  ISR --> FB
  Stale --> Assert["Assert fallback in CI"]
  FB --> Assert
  OK --> Assert

Step-by-Step Resolution

Inject chaos at the network-interception layer, set resilience policies in the fetch client, and assert ISR/CDN fallback under simulated degradation.

1. Intercept CMS Traffic with Mock Service Worker

Proxy CMS calls through a layer that injects latency, 503/504s, and malformed payloads. Run this mock layer in CI and staging only; it replaces the CMS response entirely and should never sit in front of production endpoints.

// msw/handlers/cms.ts
import { http, HttpResponse, delay } from 'msw';

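export const cmsHandlers = [
  // GraphQL queries are sent as POST, so intercept POST rather than GET
  http.post('https://api.cms-provider.com/graphql', async () => {
    const roll = Math.random();

    // ~30% of requests: 503 Service Unavailable with a Retry-After hint
    if (roll < 0.3) {
      return HttpResponse.json(
        { error: 'Upstream throttled' },
        { status: 503, headers: { 'Retry-After': '5' } }
      );
    }

    // ~20% of requests: 2.5s of added latency before a valid response
    if (roll < 0.5) {
      await delay(2500);
    }

    // ~10% of requests (illustrative rate): truncated JSON body
    if (roll >= 0.9) {
      return new HttpResponse('{"data":{"page":{"title":"Landing Pa', {
        headers: { 'Content-Type': 'application/json' },
      });
    }

    // Otherwise return a valid payload
    return HttpResponse.json({
      data: {
        page: {
          title: 'Landing Page',
          hero: { /* ... */ }
        }
      }
    });
  })
];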

For wiring this into your test runner deterministically, see Automated Testing for Headless Integrations.
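Random injection makes a failing CI run hard to reproduce. One option is to drive the fault choice from a seeded generator so the same seed always yields the same sequence of faults. The sketch below assumes a small mulberry32-style PRNG; the Fault type, the pickFault and faultSchedule names, and the 10% truncation rate are all hypothetical.

```typescript
type Fault = 'none' | 'error503' | 'latency' | 'truncated';

// Small seeded PRNG (mulberry32 style): the same seed gives the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Map one roll in [0, 1) onto a fault using cumulative weights.
function pickFault(roll: number): Fault {
  if (roll < 0.3) return 'error503'; // 30%
  if (roll < 0.5) return 'latency'; // 20%
  if (roll < 0.6) return 'truncated'; // 10% (hypothetical rate)
  return 'none';
}

// Build a reproducible fault schedule for one test run.
function faultSchedule(seed: number, length: number): Fault[] {
  const next = mulberry32(seed);
  return Array.from({ length }, () => pickFault(next()));
}
```

A handler can then consume the schedule one entry per request, and logging the seed on failure is enough to replay the exact sequence locally.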

2. Enforce Circuit Breaker Logic in Fetch Clients

Set retry limits, exponential backoff with jitter, and explicit failure thresholds. Disable retries on non-idempotent mutations so a partial outage doesn’t produce duplicate writes.

// lib/query-client.ts
import { QueryClient } from '@tanstack/react-query';

export const resilientQueryClient = new QueryClient({
  defaultOptions: {
    queries: {
      retry: 2,
      retryDelay: (attemptIndex) =>
        Math.min(1000 * 2 ** attemptIndex + Math.random() * 200, 5000),
      staleTime: 1000 * 60 * 5,
      gcTime: 1000 * 60 * 15,
      throwOnError: false, // Surface errors via query state instead of throwing to an error boundary
    },
    mutations: {
      retry: 0, // Never retry POST/PUT/PATCH during CMS degradation
    },
  },
});
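The retry settings above bound how often a single query retries, but they do not stop new queries from hitting a CMS that is already failing. A circuit breaker adds that gate. The following is a minimal sketch, not a TanStack Query feature: the CircuitBreaker class, its threshold and cooldown parameters, and the injectable clock are assumptions made for illustration.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  // `now` is injectable so tests can control time.
  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Should the caller attempt a CMS request right now?
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // cooldown elapsed: let probe requests through
    }
    return this.state !== 'open';
  }

  // A success closes the circuit and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failed probe, or too many consecutive failures, opens the circuit.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A fetch wrapper would check canRequest() before calling the CMS, serve cached or fallback content while the circuit is open, and report each outcome through recordSuccess() or recordFailure().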

3. Validate ISR Timeout Boundaries

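Wrap build-time and request-time CMS calls (generateStaticParams, getStaticProps, route handlers) in an AbortController timeout. When the CMS exceeds your latency SLA, fall back to cached data, or to an empty result as in the example below, so build workers never hang.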

// app/blog/[slug]/page.tsx
import { notFound } from 'next/navigation';

const CMS_TIMEOUT_MS = 2000;

export async function generateStaticParams() {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), CMS_TIMEOUT_MS);

  try {
    const res = await fetch('https://api.cms-provider.com/graphql', {
      signal: controller.signal,
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: '{ pages { slug } }' }),
    });

    if (!res.ok) throw new Error(`CMS responded ${res.status}`);
    const json = await res.json();
    return json.data.pages.map((p: { slug: string }) => ({ slug: p.slug }));
  } catch (err) {
    // Fallback to pre-cached slugs or empty array for graceful build completion
    console.warn('ISR param fetch failed, using fallback:', err);
    return [];
  } finally {
    clearTimeout(timeout);
  }
}

Align these timeouts with your CDN’s stale-while-revalidate windows; the Next.js data fetching & caching docs cover how revalidation and cache headers interact.
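The timeout-and-fallback pattern can also be factored into a reusable helper so every CMS call gets the same boundary. This is a sketch under assumptions: withTimeout, CACHED_SLUGS, and fetchSlugs are hypothetical names, and the cached slug list stands in for whatever snapshot your build keeps from its last successful run.

```typescript
// Run an abortable task; resolve to `fallback` if it rejects or exceeds `ms`.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number,
  fallback: T,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  // Rejects when the controller aborts, so tasks that ignore the signal still time out.
  const aborted = new Promise<never>((_, reject) => {
    controller.signal.addEventListener('abort', () => reject(new Error('timeout')), { once: true });
  });
  try {
    return await Promise.race([task(controller.signal), aborted]);
  } catch {
    return fallback;
  } finally {
    clearTimeout(timer);
  }
}

type Slug = { slug: string };

// Hypothetical snapshot of slugs from the last successful build.
const CACHED_SLUGS: Slug[] = [{ slug: 'chaos-engineering' }];

// Usage: fetch slugs with a 2s budget, falling back to the snapshot.
function fetchSlugs(): Promise<Slug[]> {
  return withTimeout(async (signal) => {
    const res = await fetch('https://api.cms-provider.com/graphql', {
      signal,
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: '{ pages { slug } }' }),
    });
    if (!res.ok) throw new Error(`CMS responded ${res.status}`);
    const json = (await res.json()) as { data: { pages: Slug[] } };
    return json.data.pages.map((p) => ({ slug: p.slug }));
  }, 2000, CACHED_SLUGS);
}
```

Returning a snapshot rather than an empty array keeps previously known pages in the static build during an outage, at the cost of maintaining that snapshot.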

4. Test CDN Routing Under Partial Outage

Set cache headers that let edge nodes serve expired content when origin health checks fail. This prevents an invalidation storm during degradation.

Cache-Control: public, max-age=300, stale-while-revalidate=86400, stale-if-error=604800

max-age=300: Fresh content served for 5 minutes.
stale-while-revalidate=86400: For 24 hours after expiry, stale content is served while revalidation runs in the background.
stale-if-error=604800: Serve stale content for up to 7 days if origin returns 5xx or times out.

Simulate origin downtime and assert the CDN reports a cache hit or stale serve (for example X-Cache: HIT or STALE; the header name and values vary by CDN) rather than propagating 502 Bad Gateway to users.
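To reason about what the edge should do at a given cache age, the three directives can be modeled as a small decision function. This is a simplified sketch of the semantics described above, not any CDN's actual implementation; parseCacheControl, decide, and the Decision labels are hypothetical.

```typescript
interface CachePolicy {
  maxAge: number;
  staleWhileRevalidate: number;
  staleIfError: number;
}

// Parse the three directives used above; missing directives default to 0.
function parseCacheControl(header: string): CachePolicy {
  const directives = new Map<string, number>();
  for (const part of header.split(',')) {
    const [name, value] = part.trim().toLowerCase().split('=');
    if (name && value !== undefined) directives.set(name, Number(value));
  }
  return {
    maxAge: directives.get('max-age') ?? 0,
    staleWhileRevalidate: directives.get('stale-while-revalidate') ?? 0,
    staleIfError: directives.get('stale-if-error') ?? 0,
  };
}

type Decision = 'fresh' | 'stale-revalidating' | 'stale-on-error' | 'pass-through';

// Decide how a cached response of a given age (seconds) may be used.
function decide(policy: CachePolicy, ageSeconds: number, originFailed: boolean): Decision {
  if (ageSeconds < policy.maxAge) return 'fresh';
  const staleness = ageSeconds - policy.maxAge; // time since the entry expired
  if (originFailed && staleness <= policy.staleIfError) return 'stale-on-error';
  if (!originFailed && staleness <= policy.staleWhileRevalidate) return 'stale-revalidating';
  return 'pass-through'; // no usable cached copy: the origin result reaches the user
}
```

With the header above, an entry that expired 25 hours ago is outside the revalidation window but still inside the stale-if-error window, so it is served only when the origin is failing.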

5. Automate Chaos Scenarios in CI

Run fault injection in pre-deploy pipelines and assert that components render fallback states within a latency budget. Use Playwright or Cypress to check DOM structure and accessibility attributes during a simulated outage.

// tests/cms-fallback.spec.ts
import { test, expect } from '@playwright/test';

test('renders fallback UI when CMS returns 503', async ({ page }) => {
  // Collect console errors and uncaught exceptions before navigating
  const consoleErrors: string[] = [];
  page.on('console', (msg) => {
    if (msg.type() === 'error') consoleErrors.push(msg.text());
  });
  page.on('pageerror', (err) => consoleErrors.push(err.message));

  // Assumes the CI environment forces the mocked CMS to return 503
  await page.goto('/blog/chaos-engineering');

  await expect(page.locator('[data-testid="cms-fallback-banner"]')).toBeVisible();
  await expect(page.locator('h1')).toHaveText('Content Temporarily Unavailable');

  // Verify hydration completed without client-side errors
  expect(consoleErrors).toHaveLength(0);
});

Production Deployment & Observability

Control the blast radius. Enable fault injection in staging first, then roll into production behind a feature flag scoped to internal traffic or low-risk routes. Track three resilience metrics:

Error budget consumption: 5xx rate and timeout percentage against SLA.
Fallback activation rate: how often stale-if-error or client fallbacks fire.
Hydration mismatch count: target zero React hydration warnings while degraded.
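The first two metrics reduce to simple ratios over request samples. The sketch below shows one way to compute them; the RequestSample shape, the summarize function, and the convention of recording timeouts as status 0 are assumptions, and sloTarget is expected to be a fraction below 1 such as 0.99.

```typescript
interface RequestSample {
  status: number; // HTTP status from the CMS; 0 is used here for a timeout
  servedFallback: boolean; // true if stale or fallback content was served
}

interface ResilienceReport {
  errorRate: number; // share of requests that failed (5xx or timeout)
  fallbackRate: number; // share of requests answered with fallback content
  budgetConsumed: number; // 1.0 means the whole error budget is used
}

// Summarize a window of samples against an availability target.
function summarize(samples: RequestSample[], sloTarget: number): ResilienceReport {
  const total = samples.length;
  if (total === 0) return { errorRate: 0, fallbackRate: 0, budgetConsumed: 0 };
  const errors = samples.filter((s) => s.status === 0 || s.status >= 500).length;
  const fallbacks = samples.filter((s) => s.servedFallback).length;
  const errorRate = errors / total;
  const allowedErrorRate = 1 - sloTarget; // e.g. 0.01 for a 99% target
  return {
    errorRate,
    fallbackRate: fallbacks / total,
    budgetConsumed: errorRate / allowedErrorRate,
  };
}
```

For example, 2 failures in 100 requests against a 99% target gives an error rate of 0.02 and consumes roughly twice the budget for that window.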

Instrument the fetch layer with distributed tracing to correlate CMS latency spikes with render degradation. Run this alongside the rest of your Data Fetching & Caching Strategies and a CMS dependency stops being a single point of failure.