Generating XML sitemaps from headless CMS routes

Sitemap generators that traverse the filesystem or server routing table don’t work in a headless stack — there’s no filesystem to walk, only paginated API endpoints whose locale variants rarely match the final URLs. The sitemap becomes a compiled artifact built during the build or revalidation phase. This guide covers the aggregation pipeline: cursor-paginated fetch with backoff, status filtering, and stream-based XML serialization that won’t exhaust memory on a large content graph.

The Core Challenge

Asynchronous publication is the root problem. You must filter unpublished nodes before route compilation, resolve slug collisions programmatically, and normalize trailing slashes before serialization. Extraction starts with a recursive content-type query over cursor-paginated GraphQL or REST endpoints, then a transformation layer maps internal IDs to canonical paths. A strict routing contract is what keeps orphaned URLs out and ensures crawlers only see valid endpoints.
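The filtering, slash normalization, and collision handling described above can be sketched as a single pure function. This is a minimal illustration, not a specific CMS schema: the `RawEntry` shape, the `status` values, and the first-entry-wins collision policy are all assumptions.

```typescript
// Hypothetical entry shape; field names are assumptions, not a real CMS schema.
interface RawEntry {
  id: string;
  slug: string;
  locale: string;
  status: 'draft' | 'scheduled' | 'published';
}

// Collapse repeated slashes, then strip leading and trailing ones.
function normalizeSlug(slug: string): string {
  return slug.trim().replace(/\/{2,}/g, '/').replace(/^\/+|\/+$/g, '');
}

// Map entry IDs to canonical paths. Unpublished entries are dropped, and
// collisions are reported rather than silently overwritten.
function buildRouteMap(entries: RawEntry[]): {
  routes: Map<string, string>; // id -> canonical path
  collisions: string[];        // human-readable conflict reports
} {
  const routes = new Map<string, string>();
  const owners = new Map<string, string>(); // canonical path -> owning id
  const collisions: string[] = [];

  for (const entry of entries) {
    if (entry.status !== 'published') continue;

    const slug = normalizeSlug(entry.slug);
    // An empty slug is treated as the locale root (assumed convention).
    const path = slug ? `/${entry.locale}/${slug}` : `/${entry.locale}`;

    const owner = owners.get(path);
    if (owner !== undefined && owner !== entry.id) {
      collisions.push(`${path}: ${owner} vs ${entry.id}`);
      continue; // first entry wins (assumed policy)
    }
    owners.set(path, entry.id);
    routes.set(entry.id, path);
  }
  return { routes, collisions };
}
```

Returning the collision list instead of throwing lets the build decide whether a conflict should fail the run or only raise an alert.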

Route Extraction & State Filtering

Draft and scheduled content pollutes production sitemaps unless you filter at the query level. Trigger regeneration only on a publish transition — CMS webhooks or ISR hooks, not routine deploys. Naive scripts break on rate limits, so batch aggressively, back off exponentially on transient failures, and cache intermediate route arrays. Parallelize locale fetches but serialize the final XML assembly to avoid race conditions.
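One way to parallelize locale fetches while keeping assembly deterministic is sketched below. The `collectRoutesByLocale` helper, the injected `fetchLocale` callback, and the in-memory `Map` cache are hypothetical names chosen for illustration; a real pipeline might cache on disk or in a key-value store instead.

```typescript
// Hypothetical fetcher signature: returns every published route for one locale.
type LocaleFetcher<T> = (locale: string) => Promise<T[]>;

async function collectRoutesByLocale<T>(
  locales: string[],
  fetchLocale: LocaleFetcher<T>,
  cache: Map<string, T[]> = new Map()
): Promise<T[]> {
  // Fetch phase: one request chain per locale, all running concurrently.
  const batches = await Promise.all(
    locales.map(async (locale) => {
      const cached = cache.get(locale);
      if (cached) return cached; // reuse the intermediate route array
      const routes = await fetchLocale(locale);
      cache.set(locale, routes);
      return routes;
    })
  );

  // Assembly phase: sequential, in the order of the input locale list, so the
  // result does not depend on which fetch happened to finish first.
  return batches.reduce<T[]>((all, batch) => all.concat(batch), []);
}
```

Because `Promise.all` preserves input order, the concatenated array is stable across runs, which keeps sitemap diffs meaningful.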

Multilingual Routing & Hreflang

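Each locale variant resolves independently, and fallback chains must be evaluated before hreflang mapping. Inject alternate-language links into each <url> block as <xhtml:link> elements with rel="alternate" and an hreflang attribute. Encoding locale alternates as query-parameter variants of one URL is not how sitemaps express hreflang, and it spends crawl budget on near-duplicate URLs. Every variant should list itself as well as its siblings, and the annotations should be reciprocal. A fallback strategy that resolves missing translations to a default locale keeps untranslated pages from competing as duplicates. This is core to Localization & SEO Optimization.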
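A minimal sketch of evaluating fallback chains before emitting hreflang links follows. The `resolveAlternates` helper, the `fallbacks` configuration shape, and the choice to skip locales whose chain resolves to nothing (leaving them to `x-default`) are assumptions for illustration, not a prescribed policy.

```typescript
interface Translation { locale: string; path: string }
interface AlternateLink { lang: string; url: string }

function resolveAlternates(
  translations: Translation[],          // variants that actually exist
  targetLocales: string[],              // locales the site advertises
  fallbacks: Record<string, string[]>,  // hypothetical config: locale -> ordered fallbacks
  defaultLocale: string,
  baseUrl: string
): AlternateLink[] {
  const pathByLocale = new Map<string, string>();
  for (const t of translations) pathByLocale.set(t.locale, t.path);

  // hreflang targets must be fully qualified URLs.
  const toAbsolute = (path: string): string => new URL(path, baseUrl).href;
  const links: AlternateLink[] = [];

  for (const locale of targetLocales) {
    // Try the locale itself first, then its configured fallbacks in order.
    const chain = [locale, ...(fallbacks[locale] ?? [])];
    const hit = chain.find((candidate) => pathByLocale.has(candidate));
    if (hit === undefined) continue; // nothing to advertise for this locale
    links.push({ lang: locale, url: toAbsolute(pathByLocale.get(hit)!) });
  }

  // x-default points at the default locale when that variant exists.
  const defaultPath = pathByLocale.get(defaultLocale);
  if (defaultPath !== undefined) {
    links.push({ lang: 'x-default', url: toAbsolute(defaultPath) });
  }
  return links;
}
```

Since the page's own locale is part of `targetLocales`, the returned set includes the self-referencing link. The same array can be attached to every variant of the page so the annotations stay reciprocal.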

Implementation

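This TypeScript pipeline handles cursor pagination, filters unpublished content, and enforces XML schema compliance. It uses the sitemap library's SitemapStream for serialization. As written, generateSitemapXML() collects the output into a single string with streamToPromise, which is fine for one sitemap file; for very large content graphs, pipe the stream to a file or upload instead of buffering the whole document.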

import { SitemapStream, streamToPromise, EnumChangefreq } from 'sitemap';
import { Readable } from 'stream';

export interface CMSRoute {
  id: string;
  slug: string;
  locale: string;
  updatedAt: string;
  alternateLocales: { locale: string; url: string }[];
}

interface FetchOptions {
  apiEndpoint: string;
  token: string;
  maxRetries?: number;
}

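async function fetchWithBackoff(url: string, headers: Record<string, string>, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url, { headers });
    if (res.ok) return res;

    // Only throttling (429) and server errors (5xx) are worth retrying;
    // other 4xx responses (bad token, wrong endpoint) will not fix themselves.
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable) {
      throw new Error(`Request failed with status ${res.status}; not retrying.`);
    }
    if (attempt === maxRetries - 1) break; // Skip the sleep after the final attempt

    const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, ...
    console.warn(`Received ${res.status}. Retrying in ${delay}ms...`);
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`Failed to fetch routes after ${maxRetries} attempts.`);
}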

export async function fetchPublishedRoutes({ apiEndpoint, token, maxRetries = 3 }: FetchOptions): Promise<CMSRoute[]> {
  const routes: CMSRoute[] = [];
  let cursor: string | null = null;

  do {
    const params = new URLSearchParams({
      limit: '100',
      status: 'published',
      ...(cursor && { cursor })
    });

    const res = await fetchWithBackoff(`${apiEndpoint}?${params}`, {
      Authorization: `Bearer ${token}`,
      Accept: 'application/json'
    }, maxRetries);

    const data = await res.json();
    const normalized = data.results.map((entry: any) => ({
      id: entry.id,
      slug: entry.slug.replace(/^\/+|\/+$/g, ''), // Strip leading and trailing slashes
      locale: entry.locale,
      updatedAt: entry.updated_at,
      alternateLocales: entry.alternateLocales || []
    }));

    routes.push(...normalized);
    cursor = data.pagination?.next_cursor || null;
  } while (cursor);

  return routes;
}

export async function generateSitemapXML(routes: CMSRoute[], baseUrl: string): Promise<string> {
  const stream = new SitemapStream({ hostname: baseUrl });
  
  // Transform CMS routes into sitemap-compatible objects
  const urlObjects = routes.map((route) => ({
    url: `/${route.locale}/${route.slug}`,
    lastmod: route.updatedAt,
    changefreq: EnumChangefreq.WEEKLY,
    priority: 0.7,
    links: route.alternateLocales.map((alt) => ({
      lang: alt.locale,
      url: new URL(alt.url, baseUrl).toString() // Resolves relative paths; absolute URLs pass through
    }))
  }));

  const readable = Readable.from(urlObjects);
  readable.pipe(stream);

  const xmlBuffer = await streamToPromise(stream);
  return xmlBuffer.toString();
}

Integration Flow

The pipeline runs as an event-driven sequence from publish event to a purged, regenerated artifact.

flowchart LR
  Pub["CMS publish / update event"] --> Hook["Webhook listener (serverless)"]
  Hook --> Fetch["fetchPublishedRoutes() cursor pagination + backoff"]
  Fetch --> Filter["Filter to published, resolve hreflang"]
  Filter --> XML["generateSitemapXML() stream serialize"]
  XML --> Store["Write to CDN origin / object storage"]
  Store --> Purge["Purge /sitemap.xml cache"]

Webhook Listener: Deploy a lightweight serverless function or Next.js API route that listens for content.published or content.updated events from your headless provider.
Route Aggregation: Trigger fetchPublishedRoutes() to pull the latest published state. The exponential backoff wrapper ensures resilience against provider-side throttling.
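XML Serialization: Pass the aggregated array into generateSitemapXML(). SitemapStream serializes entries incrementally, but streamToPromise still buffers the finished document in memory, so pipe the stream straight to storage when a single sitemap approaches the size limits.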
Cache Invalidation: Write the resulting XML string to your CDN origin or object storage (e.g., AWS S3, Vercel Blob). Purge the CDN cache for /sitemap.xml immediately after write completion.
Framework Hook: If using Next.js, tie this pipeline to revalidatePath('/sitemap.xml') or leverage on-demand ISR to keep the artifact synchronized without full rebuilds.
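The steps above can be sketched as a small orchestrator with injected dependencies, which keeps the ordering (write first, purge second) explicit and testable. The `PipelineDeps` interface and `handleContentEvent` are hypothetical names; the event strings follow the ones mentioned above, and a real provider will likely also need its unpublish or delete events in the handled set.

```typescript
// Hypothetical dependency bundle; each function wraps one step of the flow.
interface PipelineDeps<R> {
  fetchRoutes: () => Promise<R[]>;
  serialize: (routes: R[]) => Promise<string>;
  writeArtifact: (xml: string) => Promise<void>;
  purgeCache: (path: string) => Promise<void>;
}

// Event names are assumptions; check your provider's webhook documentation.
const HANDLED_EVENTS = new Set(['content.published', 'content.updated']);

async function handleContentEvent<R>(
  eventType: string,
  deps: PipelineDeps<R>
): Promise<'skipped' | 'regenerated'> {
  // Ignore events that cannot change the published URL set.
  if (!HANDLED_EVENTS.has(eventType)) return 'skipped';

  const routes = await deps.fetchRoutes();
  const xml = await deps.serialize(routes);

  // Write must complete before the purge, otherwise the CDN can re-cache
  // the stale artifact between the two steps.
  await deps.writeArtifact(xml);
  await deps.purgeCache('/sitemap.xml');
  return 'regenerated';
}
```

In practice `fetchRoutes` and `serialize` would wrap fetchPublishedRoutes() and generateSitemapXML(), while the write and purge functions wrap whatever storage and CDN clients the deployment uses.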

Pipeline Integration & Performance

Treat the sitemap as a compiled artifact, as in Dynamic Sitemap Generation, and run the pipeline in isolation from the frontend build to avoid deployment bottlenecks. Health-check XML validity before publishing — a schema validator or regex assertion on <urlset> compliance and namespace declarations.
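A lightweight pre-publish health check might look like the sketch below. It is a set of string and regex assertions, not a schema validator, and the `checkSitemapXml` name and the specific checks are illustrative assumptions; run a real XSD validation where correctness matters.

```typescript
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// True when the opening tag carries name="value" (either quote style).
function hasAttr(tag: string, name: string, value: string): boolean {
  return tag.includes(`${name}="${value}"`) || tag.includes(`${name}='${value}'`);
}

// Returns a list of problems; an empty list means the checks passed.
function checkSitemapXml(xml: string): string[] {
  const problems: string[] = [];

  const root = xml.match(/<urlset\b[^>]*>/);
  if (!root) return ['missing <urlset> root element'];

  if (!hasAttr(root[0], 'xmlns', SITEMAP_NS)) {
    problems.push('urlset does not declare the sitemap namespace');
  }
  if (xml.includes('<xhtml:link') && !hasAttr(root[0], 'xmlns:xhtml', XHTML_NS)) {
    problems.push('xhtml:link used without declaring xmlns:xhtml');
  }
  if (!xml.trim().endsWith('</urlset>')) {
    problems.push('document is not closed with </urlset>');
  }

  // Each <url> needs exactly one <loc>, and every <loc> must be absolute.
  const urlCount = (xml.match(/<url>/g) || []).length;
  const locPattern = /<loc>([^<]*)<\/loc>/g;
  let locCount = 0;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    locCount++;
    if (!/^https?:\/\//.test(match[1])) {
      problems.push(`loc is not an absolute URL: ${match[1]}`);
    }
  }
  if (urlCount !== locCount) {
    problems.push(`found ${urlCount} <url> elements but ${locCount} <loc> elements`);
  }
  return problems;
}
```

Gating the write step on an empty problem list means a malformed artifact never replaces the last known-good sitemap.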

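A single sitemap file may hold at most 50,000 URLs and 50MB uncompressed, a limit set by the sitemaps protocol and enforced by Google Search Central. Beyond that, shard by content type or locale and generate a sitemap index that references each shard. Every shard listed in the index must exist at its stated URL, and each page URL should appear in exactly one shard.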
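Sharding by locale and emitting an index can be sketched with plain string building, as below. The shard file naming (`sitemap-<locale>-<n>.xml`) and the helper names are assumptions, and the sketch enforces only the URL-count limit; a byte-size check would have to happen after serialization.

```typescript
const MAX_URLS_PER_SITEMAP = 50000;

interface Shard<T> { name: string; routes: T[] }

// Group routes by locale, then split each group into chunks under the limit.
function shardByLocale<T extends { locale: string }>(
  routes: T[],
  maxPerShard: number = MAX_URLS_PER_SITEMAP
): Shard<T>[] {
  const groups = new Map<string, T[]>();
  for (const route of routes) {
    const group = groups.get(route.locale);
    if (group) group.push(route);
    else groups.set(route.locale, [route]);
  }

  const shards: Shard<T>[] = [];
  groups.forEach((group, locale) => {
    for (let i = 0; i < group.length; i += maxPerShard) {
      const part = Math.floor(i / maxPerShard) + 1;
      shards.push({
        name: `sitemap-${locale}-${part}.xml`, // hypothetical naming scheme
        routes: group.slice(i, i + maxPerShard),
      });
    }
  });
  return shards;
}

function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build a sitemap index whose entries are derived from the same shard list,
// so the index cannot reference a shard that was never generated.
function buildSitemapIndex(baseUrl: string, shards: Shard<unknown>[]): string {
  const entries = shards.map(
    (shard) => `  <sitemap><loc>${escapeXml(new URL(shard.name, baseUrl).href)}</loc></sitemap>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</sitemapindex>',
  ].join('\n');
}
```

Deriving both the shard files and the index from one `Shard[]` array is what keeps the two in agreement.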

Conclusion

A headless sitemap is a data pipeline, not file generation. Deterministic, state-filtered extraction eliminates crawl errors and keeps multilingual indexing compliant. The sitemap must mirror the published content graph and update exactly when publication state changes — cursor pagination, exponential backoff, and stream serialization are what let it scale with content velocity.