Automated SEO audits for headless CMS deployments
Crawlers stumble over ISR, edge-rendered responses, and deferred hydration — the exact patterns a headless stack relies on — so SEO regressions ship silently. Moving the audit into CI/CD catches them before deploy: validate schema and metadata pre-build, confirm server-injected tags post-build, and check CDN propagation post-deploy. This guide builds that three-phase pipeline with a GitHub Actions workflow and a Playwright audit script that fails the build on missing canonical or Open Graph tags.
Pipeline Phases
The pipeline runs three phases, each targeting a distinct failure mode of decoupled delivery.
flowchart TD
Pre["Pre-build validation"] --> PreC["Schema + metadata templates, hreflang vs routes"]
PreC --> Build["Generator build"]
Build --> Post["Post-build verification"]
Post --> PostC["Parse compiled HTML: server-injected meta, robots.txt, sitemap"]
PostC --> Deploy["Deploy + preview server"]
Deploy --> Synth["Post-deploy synthetic testing"]
Synth --> SynthC["Playwright crawl: status, canonical, og tags"]
SynthC --> Gate{"All routes pass?"}
Gate -->|"yes"| Ship["Allow deploy"]
Gate -->|"no"| Fail["Exit non-zero, block deploy"]
Pre-build validation checks schema conformity and metadata templates before the generator runs. Missing og:/twitter: tags, malformed canonicals, and inconsistent hreflang usually trace back to unvalidated CMS payloads. Parse the content graph so every published node maps to a valid route, and cross-reference locale prefixes against route definitions to prevent canonical collisions — part of broader Localization & SEO Optimization.
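The pre-build checks described above can be sketched as a pure function over the CMS export. This is a minimal illustration, not a definitive implementation: the `ContentNode` shape, the `REQUIRED_META` list, and the `validateNodes` helper are all hypothetical names, and a real CMS payload will likely need a different mapping.

```typescript
// Hypothetical shape of a published CMS node; every field name is an assumption.
interface ContentNode {
  id: string;
  locale: string;               // e.g. "en"
  slug: string;                 // e.g. "/en/pricing"
  canonical: string;            // absolute canonical URL
  meta: Record<string, string>; // e.g. { "og:title": "Pricing" }
}

// Assumed minimum set of social tags; adjust to the project's templates.
const REQUIRED_META = ['og:title', 'og:description', 'twitter:card'];

function validateNodes(
  nodes: ContentNode[],
  locales: string[],
  routes: Set<string>,
): string[] {
  const errors: string[] = [];
  const canonicalOwner = new Map<string, string>();

  for (const node of nodes) {
    // 1. Required metadata must be present and non-empty.
    for (const key of REQUIRED_META) {
      if (!node.meta[key]) errors.push(`${node.id}: missing ${key}`);
    }

    // 2. The slug must sit under a known locale prefix.
    const prefix = `/${node.locale}`;
    const prefixOk = node.slug === prefix || node.slug.startsWith(`${prefix}/`);
    if (!locales.includes(node.locale) || !prefixOk) {
      errors.push(`${node.id}: slug ${node.slug} does not match locale ${node.locale}`);
    }

    // 3. Every published node must map to a defined route.
    if (!routes.has(node.slug)) errors.push(`${node.id}: no route for ${node.slug}`);

    // 4. Two nodes claiming the same canonical is a collision.
    const owner = canonicalOwner.get(node.canonical);
    if (owner !== undefined) {
      errors.push(`${node.id}: canonical collides with ${owner}`);
    } else {
      canonicalOwner.set(node.canonical, node.id);
    }
  }
  return errors;
}
```

Returning a list of messages rather than throwing on the first problem lets the CI step print every broken node in one run; the step can then exit non-zero when the list is non-empty.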
Post-build verification inspects the output directory for routing tables, asset hashes, and cache headers. Edge functions fetching from the CMS often serve stale metadata under aggressive CDN caching, so parse the compiled HTML to confirm meta tags are injected server-side, not deferred to hydration. Validate robots.txt and sitemap integrity here to keep orphaned routes out of production.
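One way to confirm that tags are present in the compiled HTML, rather than injected during hydration, is to inspect the static file contents without running any JavaScript. The sketch below assumes the build emits plain HTML files and uses regular expressions for brevity; `findMissingHeadTags` is a hypothetical helper, and a production check would more likely use a real HTML parser.

```typescript
// Return the contents of <head>, or an empty string if there is none.
function extractHead(html: string): string {
  const match = html.match(/<head\b[^>]*>([\s\S]*?)<\/head>/i);
  return match?.[1] ?? '';
}

// List the required tags that are absent from the static markup.
// Regex matching is a simplification: it assumes quoted attribute values
// and would also match a tag that sits inside an HTML comment.
function findMissingHeadTags(html: string): string[] {
  const head = extractHead(html);
  const missing: string[] = [];

  if (!/<link\b[^>]*\brel=["']canonical["'][^>]*>/i.test(head)) {
    missing.push('canonical');
  }
  for (const prop of ['og:title', 'og:description']) {
    const pattern = new RegExp(`<meta\\b[^>]*\\bproperty=["']${prop}["'][^>]*>`, 'i');
    if (!pattern.test(head)) missing.push(prop);
  }
  return missing;
}
```

Because the function only sees the file as written to disk, a tag that appears solely after client-side hydration is reported as missing, which is the failure this phase is meant to catch.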
Post-deployment synthetic testing confirms CDN propagation and cache consistency across edge nodes. Headless-browser automation simulates real navigation, intercepts requests, and validates status codes. Trigger it on pushes to main or on release tags. The workflow in this guide runs the crawl against a local preview server on localhost:3000, which exercises the rendered output but not the CDN; to check propagation itself, point BASE_URL at the deployed URL. Parallel jobs cut CI time but need isolated ports and deterministic environment variables.
CI/CD Orchestration
GitHub Actions or GitLab CI orchestrates the sequence: install, build, launch a preview server, run validation. This config isolates the audit environment and passes CMS secrets securely.
name: seo-audit-pipeline
on:
  push:
    branches: [main]
    tags: ['v*']
  workflow_dispatch:
jobs:
  audit:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      # Playwright needs its browser binary installed before chromium.launch() can run
      - run: npx playwright install --with-deps chromium
      - run: npm run build
      - name: Start preview server
        run: npm run preview &
        env:
          PORT: 3000
      - name: Wait for server readiness
        run: npx wait-on http://localhost:3000
      - name: Execute SEO audit
        run: npm run audit:seo
        env:
          BASE_URL: http://localhost:3000
          CMS_WEBHOOK_SECRET: ${{ secrets.CMS_WEBHOOK_SECRET }}
          AUDIT_THRESHOLD: 90
Audit Script
The script handles async route resolution and aggregates results. With ISR, a route may not be generated until its first request, so the synthetic navigation also warms the ISR cache. Playwright handles network interception and DOM inspection across Chromium, Firefox, and WebKit; this script uses Chromium only.
import { chromium } from 'playwright';
import { readFileSync } from 'node:fs';
import { parseStringPromise } from 'xml2js';

interface AuditConfig {
  baseUrl: string;
  // Parsed from AUDIT_THRESHOLD but not enforced below: any failing route fails the audit.
  threshold: number;
  sitemapPath: string;
}

interface AuditResult {
  route: string;
  status: number;
  canonical: string | null;
  metaTags: Record<string, string>;
  passed: boolean;
}

// Sitemap <loc> values are normally absolute production URLs, so reduce each
// one to its path before it is joined with the preview server's base URL.
async function parseSitemap(path: string): Promise<string[]> {
  const xml = readFileSync(path, 'utf-8');
  const result = await parseStringPromise(xml);
  return result.urlset.url.map((u: any) => {
    // The second argument only applies if <loc> happens to be relative.
    const loc = new URL(String(u.loc[0]).trim(), 'http://localhost');
    return `${loc.pathname}${loc.search}`;
  });
}

async function runAudit(config: AuditConfig): Promise<AuditResult[]> {
  const routes = await parseSitemap(config.sitemapPath);
  const browser = await chromium.launch({ headless: true });
  const results: AuditResult[] = [];

  try {
    for (const route of routes) {
      const page = await browser.newPage();
      const response = await page.goto(`${config.baseUrl}${route}`, {
        waitUntil: 'networkidle',
        timeout: 10000,
      });

      // getAttribute() on a locator that matches nothing waits and then throws,
      // so check for presence first and record a missing tag as null.
      const canonicalLink = page.locator('link[rel="canonical"]');
      const canonical =
        (await canonicalLink.count()) > 0
          ? await canonicalLink.first().getAttribute('href')
          : null;

      // Collect every named meta tag into a name -> content map.
      const metaTags = await page
        .locator('meta[name], meta[property]')
        .evaluateAll((els) => {
          const tags: Record<string, string> = {};
          els.forEach((el) => {
            const name = el.getAttribute('name') || el.getAttribute('property');
            const content = el.getAttribute('content');
            if (name && content) tags[name] = content;
          });
          return tags;
        });

      const status = response?.status() ?? 0;
      const hasRequiredMeta =
        !!metaTags['og:title'] && !!metaTags['og:description'] && !!canonical;

      results.push({
        route,
        status,
        canonical,
        metaTags,
        passed: status === 200 && hasRequiredMeta,
      });
      await page.close();
    }
  } finally {
    // Release the browser even if a navigation throws (for example on timeout).
    await browser.close();
  }
  return results;
}

// Execution entry point
const config: AuditConfig = {
  baseUrl: process.env.BASE_URL || 'http://localhost:3000',
  threshold: parseInt(process.env.AUDIT_THRESHOLD || '90', 10),
  sitemapPath: './public/sitemap.xml',
};

runAudit(config)
  .then((results) => {
    const failures = results.filter((r) => !r.passed);
    if (failures.length > 0) {
      console.error(`Audit failed: ${failures.length} routes did not meet SEO requirements.`);
      console.table(failures.map(({ route, status, canonical }) => ({ route, status, canonical })));
      process.exit(1);
    }
    console.log(`Audit passed: ${results.length} routes validated successfully.`);
    process.exit(0);
  })
  .catch((err) => {
    // A thrown navigation or parse error must also block the deploy.
    console.error('Audit aborted:', err);
    process.exit(1);
  });
The flow: CI builds the site, serves it locally, and runs the script. The script parses the sitemap, visits each route, and waits for networkidle so that first-request ISR generation completes and deferred client-side requests settle before the DOM is inspected. It extracts canonical and Open Graph tags, checks for HTTP 200, and aggregates results. Any failure exits non-zero and blocks the deploy.
Cache Consistency Edge Cases
The recurring failure is a metadata-injection race: a content webhook fires, and the CDN keeps serving stale HTML while the origin regenerates the route. To catch it, compare each response's Last-Modified against the CMS entry's update time and check Cache-Control for how long the edge may hold the old copy. Dynamic Sitemap Generation gives you the crawlable route set, but the audit must cross-reference that sitemap against the CMS content index to expose orphaned paths and failed regenerations.
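Both checks in this paragraph reduce to small comparisons. The following is a rough sketch under stated assumptions: `checkFreshness` and `diffRoutes` are hypothetical helpers, header names are expected in lower case, and the CMS is assumed to expose an update timestamp per entry and a list of published routes.

```typescript
interface FreshnessCheck {
  stale: boolean;
  reason: string;
}

// Compare the served HTML's Last-Modified with the CMS entry's update time.
// Header keys are assumed to be lower-cased.
function checkFreshness(
  headers: Record<string, string>,
  cmsUpdatedAt: Date,
): FreshnessCheck {
  const lastModified = headers['last-modified'];
  if (!lastModified) {
    return { stale: false, reason: 'no Last-Modified header; cannot compare' };
  }
  const servedAt = new Date(lastModified).getTime();
  if (Number.isNaN(servedAt)) {
    return { stale: false, reason: `unparseable Last-Modified: ${lastModified}` };
  }
  // HTTP dates have one-second resolution, so truncate the CMS time to match.
  const updatedAt = Math.floor(cmsUpdatedAt.getTime() / 1000) * 1000;
  if (servedAt < updatedAt) {
    const cacheControl = headers['cache-control'] ?? '(none)';
    return { stale: true, reason: `HTML predates CMS update; Cache-Control: ${cacheControl}` };
  }
  return { stale: false, reason: 'HTML is at least as new as the CMS entry' };
}

// Cross-reference sitemap routes against the CMS content index.
function diffRoutes(sitemapRoutes: string[], cmsRoutes: string[]) {
  const inSitemap = new Set(sitemapRoutes);
  const inCms = new Set(cmsRoutes);
  return {
    // In the sitemap but with no published CMS entry behind it.
    orphaned: sitemapRoutes.filter((r) => !inCms.has(r)),
    // Published in the CMS but absent from the sitemap (possible failed regeneration).
    missing: cmsRoutes.filter((r) => !inSitemap.has(r)),
  };
}
```

In the Playwright script, `response.headers()` returns lower-cased header names, so its result could be passed to `checkFreshness` directly. A missing Last-Modified is reported as "cannot compare" rather than as stale, since not every origin or CDN sends that header.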
Validate the sitemap XML against the sitemaps.org protocol and Google's sitemap guidelines, especially for multilingual hreflang annotations and nested routes. Missing canonicals usually trace to route-mapping mismatches in localized setups. Deterministic build-stage validation helps keep index bloat, duplicate-content issues, and Core Web Vitals (CWV) regressions out of production, turning SEO from a manual checklist into a build gate.
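For hreflang in particular, one check that is easy to automate is reciprocity: Google ignores hreflang annotations unless the alternate pages point back to each other. The sketch below assumes the sitemap's alternates have already been parsed into a plain map; the `HreflangMap` type and `findNonReciprocal` helper are hypothetical and not part of any library.

```typescript
// url -> { hreflang code -> alternate URL }, as parsed from the sitemap.
type HreflangMap = Record<string, Record<string, string>>;

// Report every alternate link that is not returned by its target.
function findNonReciprocal(entries: HreflangMap): string[] {
  const problems: string[] = [];

  for (const [url, alternates] of Object.entries(entries)) {
    for (const [lang, href] of Object.entries(alternates)) {
      if (href === url) continue; // self-reference needs no return link

      const target = entries[href];
      if (!target) {
        problems.push(`${url} -> ${href} (${lang}): target not in sitemap`);
        continue;
      }
      if (!Object.values(target).includes(url)) {
        problems.push(`${href} does not link back to ${url}`);
      }
    }
  }
  return problems;
}
```

An empty result means every alternate is listed in the sitemap and links back to its source; anything else can be printed and used to fail the post-build phase.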