Robots.txt configuration for multi-locale headless sites

A robots.txt baked at build time freezes crawler directives to one environment and one routing table. When a CDN caches it aggressively, every regional subpath inherits the same blanket rules, so preview branches get indexed or legitimate locales get blocked. This guide generates robots.txt at runtime from the CMS locale registry: a dynamic route handler, per-locale sitemap references, and strict no-store cache headers. It’s part of Localization & SEO Optimization.

Why Build-Time Generation Fails

Most CI/CD workflows compile robots.txt as a static asset, which locks directives to one deployment and ignores the dynamic routing table. The CDN then caches it and applies blanket rules to every subpath. Subfolder localization breaks as soon as a locale is added or retired, because the file no longer matches the routes actually being served. Decouple the file from static assets and route it through an edge handler or a framework dynamic endpoint.

Runtime Architecture

Query the CMS locale registry at request time for active locales, fallback chains, and environment flags, then build the file in a route handler. This keeps crawler directives aligned with the routing table and lets you block preview branches conditionally — no manual file swaps, no environment-specific build steps.
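The assembly step described above can be sketched as a pure function, kept apart from any framework so it is easy to unit-test. The LocaleEntry shape, the isPreview flag, and the buildRobotsTxt name are hypothetical and not part of any CMS SDK; adapt them to what your registry actually returns.

```typescript
// Hypothetical shape of one entry in the CMS locale registry (an assumption).
interface LocaleEntry {
  code: string;       // e.g. "en" or "de-at"
  active: boolean;    // inactive locales get no sitemap reference
  fallback?: string;  // next locale in the fallback chain, if any
}

interface RobotsOptions {
  baseUrl: string;    // site origin without a trailing slash
  isPreview: boolean; // true for preview or staging deployments
}

// Build the robots.txt body from registry data fetched at request time.
function buildRobotsTxt(locales: LocaleEntry[], opts: RobotsOptions): string {
  const lines: string[] = ['User-agent: *'];

  if (opts.isPreview) {
    // Preview branches: block everything and advertise no sitemaps.
    lines.push('Disallow: /');
    return lines.join('\n') + '\n';
  }

  lines.push('Allow: /');
  for (const locale of locales) {
    if (locale.active) {
      lines.push(`Sitemap: ${opts.baseUrl}/${locale.code}/sitemap.xml`);
    }
  }
  return lines.join('\n') + '\n';
}
```

Keeping the builder free of I/O means the route handler only has to fetch the registry, decide the environment, and pass both in.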

Next.js App Router Implementation

Add a dynamic route handler at app/robots.txt/route.ts. It fetches locale config from the CMS API or env vars and returns a text/plain response with strict cache headers, so the edge can’t serve stale directives.

import { NextResponse } from 'next/server';
import { getActiveLocales } from '@/lib/cms-locale-registry';

// Force dynamic rendering to bypass static generation
export const dynamic = 'force-dynamic';

export async function GET() {
  // Fetch live locale registry from CMS or configuration store
  const locales = await getActiveLocales();
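  // NODE_ENV is 'production' in any production build, including preview
  // deployments on hosts such as Vercel. Prefer the host's deployment flag
  // (VERCEL_ENV on Vercel) when it is set, so preview branches stay blocked.
  const isProduction = (process.env.VERCEL_ENV ?? process.env.NODE_ENV) === 'production';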
  const baseUrl = process.env.NEXT_PUBLIC_SITE_URL || 'https://yourdomain.com';

  // Construct directives programmatically
  const directives = [
    `User-agent: *`,
    isProduction ? `Allow: /` : `Disallow: /`,
    // Dynamically inject locale-specific sitemap references
    ...locales.map(locale => `Sitemap: ${baseUrl}/${locale}/sitemap.xml`),
    `Sitemap: ${baseUrl}/sitemap-index.xml`
  ];

  // Return with strict cache control to prevent CDN stale caching
  return new NextResponse(directives.join('\n'), {
    headers: { 
      'Content-Type': 'text/plain; charset=utf-8', 
      'Cache-Control': 'no-store, max-age=0, s-maxage=0' 
    },
  });
}

Abstracting locale resolution into @/lib/cms-locale-registry keeps the handler environment-agnostic. force-dynamic stops the framework from prerendering the file, and the Cache-Control header tells browsers and intermediate caches not to store the response.
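The guide does not show what sits behind the locale registry module, so the following is only a sketch of one possible shape: the CMS call is injected as a loader, the payload is validated, and a static list is returned when the CMS is unreachable so the handler never throws. The names resolveActiveLocales, RegistryLoader, and FALLBACK_LOCALES, and the assumed payload of { code, active } objects, are all hypothetical.

```typescript
// Hypothetical static fallback used when the CMS is unreachable (an assumption).
const FALLBACK_LOCALES: string[] = ['en'];

// Any async function that returns the raw CMS payload, e.g. a fetch wrapper.
type RegistryLoader = () => Promise<unknown>;

// Resolve active locale codes, lower-cased and de-duplicated.
// Assumes the CMS returns an array of { code: string, active: boolean }.
async function resolveActiveLocales(load: RegistryLoader): Promise<string[]> {
  try {
    const payload = await load();
    if (!Array.isArray(payload)) return FALLBACK_LOCALES;

    const codes = payload
      .filter(
        (entry): entry is { code: string; active: true } =>
          typeof entry === 'object' &&
          entry !== null &&
          typeof entry.code === 'string' &&
          entry.active === true,
      )
      .map((entry) => entry.code.toLowerCase());

    const unique = Array.from(new Set(codes));
    return unique.length > 0 ? unique : FALLBACK_LOCALES;
  } catch {
    // Network or parsing failure: serve the static list instead of a 500.
    return FALLBACK_LOCALES;
  }
}
```

A getActiveLocales export could then be a thin wrapper that passes in a loader calling the real CMS endpoint. Injecting the loader is a design choice that lets the validation and fallback logic be tested without network access.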

Fallback Routes & Canonicalization

Duplicate indexing happens when fallback routes mirror primary content: if the CMS serves /en/ at /default/ or /, crawlers treat them as separate pages. Pair rel="canonical" signals (link tags or HTTP Link headers) with the robots.txt directives to consolidate ranking signals, and append locale-specific sitemap references so crawlers find regional variants. This syncs with Dynamic Sitemap Generation to keep one index across active markets.
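As an illustration of the canonical side of this pairing, here is a minimal sketch that maps a mirrored fallback path to its primary-locale URL and formats it as an HTTP Link header value. The /default prefix and the en primary locale come from the example above; the function and constant names are hypothetical.

```typescript
// Fallback prefixes that mirror the primary locale (taken from the example
// above; replace with the prefixes your CMS actually serves).
const FALLBACK_PREFIXES: string[] = ['/default'];
const PRIMARY_LOCALE = 'en';

// Map a request path to the URL that should be treated as canonical.
function canonicalUrl(baseUrl: string, pathname: string): string {
  // Normalize: ensure a leading slash and drop a trailing one.
  let path = pathname.startsWith('/') ? pathname : `/${pathname}`;
  if (path.length > 1 && path.endsWith('/')) path = path.slice(0, -1);

  for (const prefix of FALLBACK_PREFIXES) {
    if (path === prefix || path.startsWith(`${prefix}/`)) {
      // /default/pricing -> /en/pricing
      return `${baseUrl}/${PRIMARY_LOCALE}${path.slice(prefix.length)}`;
    }
  }
  // The un-prefixed root mirrors the primary locale's home page.
  if (path === '/') return `${baseUrl}/${PRIMARY_LOCALE}`;
  return `${baseUrl}${path}`;
}

// Format the canonical URL as a Link header value.
function canonicalLinkHeader(baseUrl: string, pathname: string): string {
  return `<${canonicalUrl(baseUrl, pathname)}>; rel="canonical"`;
}
```

Note that a crawler can only read a canonical signal on a page it is allowed to fetch, so mirrored paths that rely on canonicalization should not also be disallowed in robots.txt.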

Cache Control at the Edge

Cloudflare Workers or Vercel Edge Middleware can intercept /robots.txt, read the regional context from the request, and inject Disallow rules for non-canonical locale variants when strict regional targeting is required. Per Google’s crawling documentation, when Allow and Disallow rules conflict, the most specific rule (the longest matching path) applies, and a tie goes to the less restrictive rule, which is why programmatic generation matters for complex locale matrices. Always set Cache-Control: no-store, max-age=0 on the endpoint; aggressive caching of directive files can prolong misindexing after rules change (MDN’s Cache-Control reference covers the directive semantics). If the CDN must cache the response at all, add a locale or environment parameter to the cache key so one variant is never served in place of another.
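To check a generated rule set before shipping it, the longest-match behavior described in Google's documentation can be approximated in a few lines. This is a simplified sketch, not Google's parser: it handles plain path prefixes only, ignores the * and $ wildcards, and assumes all rules belong to a single user-agent group. The Rule type and isAllowed function are hypothetical names.

```typescript
type Rule = { type: 'allow' | 'disallow'; path: string };

// Decide whether urlPath may be crawled under the given rules.
// The longest matching rule wins; on equal length, Allow wins.
function isAllowed(rules: Rule[], urlPath: string): boolean {
  let best: Rule | undefined;

  for (const rule of rules) {
    // An empty path matches nothing; non-matching rules are skipped.
    if (rule.path === '' || !urlPath.startsWith(rule.path)) continue;

    const longer = !best || rule.path.length > best.path.length;
    const tieForAllow =
      !!best && rule.path.length === best.path.length && rule.type === 'allow';

    if (longer || tieForAllow) best = rule;
  }

  // No matching rule means the path is crawlable by default.
  return best ? best.type === 'allow' : true;
}

// Example: block a non-canonical locale but keep its blog crawlable.
const exampleRules: Rule[] = [
  { type: 'disallow', path: '/fr-ca/' },
  { type: 'allow', path: '/fr-ca/blog/' },
];
```

With exampleRules, a path under /fr-ca/blog/ resolves to the longer Allow rule, while any other /fr-ca/ path stays blocked. Running generated directives through a check like this in CI is one way to catch a locale that was blocked by accident.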

Conclusion

Generating robots.txt at the edge from a live locale registry, with no-store caching, is what keeps crawler directives in step with the routing table — eliminating duplicate-content risk and keeping non-production content out of the index.