Content Audit Workflows Before Headless Migration
Migrating off a monolithic CMS is a structural transformation, not a data export — and it fails when legacy content reaches the frontend without deterministic validation. The recurring symptoms are broken component contracts, stale preview environments, and cascading build failures. A production-safe audit pipeline maps implicit dependencies, normalizes draft states, and sanitizes asset references before decoupling begins.
What actually breaks
Failed migrations rarely stem from infrastructure. They come from three data-integrity failures that surface only when payloads hit the delivery layer:
- Implicit relational dependencies. Monolithic systems store content as rendered HTML blobs, inline styles, and implicit joins. Extracted via REST or GraphQL, these yield unstructured payloads that violate component interfaces; without explicit foreign keys, relational queries return `null` or malformed arrays and break hydration and ISR fallbacks.
- State parity degradation. Draft and published states live in legacy flags (`post_status`, `is_published`, `visibility`) rather than API-native versioning. Token-based preview systems expecting `draft`/`published` enums render stale content, fail auth handshakes, or skip cache invalidation.
- Orphaned asset references. Media URLs embed absolute paths, legacy CDN prefixes, or cache-buster query strings (`?v=1.2.3`). During static generation these trigger 404 cascades, block builds, or bypass webhook-triggered rebuilds when the origin returns `301` redirects the bundler can’t follow.
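To make the third failure mode concrete, here is a minimal sketch of asset URL normalization. The prefixes, the `normalizeAssetUrl` name, and the list of cache-buster parameters are all assumptions for illustration; substitute the origins and parameters your legacy system actually emits.

```javascript
// Hypothetical prefixes: replace with your legacy origins and target CDN.
const LEGACY_PREFIXES = [
  'http://www.example.com/wp-content/uploads/',
  'https://cdn-old.example.com/uploads/'
];
const TARGET_PREFIX = 'https://assets.example.com/media/';

// Query parameters treated as cache-busters (an assumed list).
const CACHE_BUSTER_PARAMS = ['v', 'ver'];

function normalizeAssetUrl(rawUrl) {
  let url;
  try {
    // Relative paths throw here and are reported rather than guessed at.
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: 'not an absolute URL', url: rawUrl };
  }

  // Strip cache-busters but keep any other query parameters.
  for (const param of CACHE_BUSTER_PARAMS) url.searchParams.delete(param);
  const cleaned = url.toString();

  // Rewrite only URLs that start with a known legacy prefix.
  const prefix = LEGACY_PREFIXES.find(p => cleaned.startsWith(p));
  if (!prefix) return { ok: false, reason: 'unknown origin', url: cleaned };

  return { ok: true, url: TARGET_PREFIX + cleaned.slice(prefix.length) };
}
```

Returning a reason instead of throwing lets the audit collect every unresolvable reference in one pass rather than stopping at the first one.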
The audit pipeline
The four gates run in order, each blocking ingestion until legacy content clears it:
flowchart LR
A["Legacy content"] --> B["Gate 1: schema contract"]
B --> C["Gate 2: draft state normalization"]
C --> D["Gate 3: asset URL hygiene"]
D --> E["Gate 4: reference graph validation"]
E --> F{"valid?"}
F -->|yes| G["Ingest to headless CMS"]
F -->|no| H["Quarantine / flag for review"]
Each gate checks the following:
- Schema contract definition. Map legacy content types to headless models. Flag deprecated fields, inline HTML needing component extraction, and implicit relationships requiring explicit foreign-key resolution.
- Draft state normalization. Extract revision history, map legacy status flags to headless enums, and confirm preview tokens resolve to the correct revision hash — keeping editorial workflows and frontend routing in parity.
- Asset URL hygiene. Crawl every media reference, strip legacy query parameters, normalize paths to target CDN prefixes, and verify `200`/`304` responses before ingestion.
- Reference graph validation. Resolve cross-content relationships (related posts, taxonomy, navigation) and flag dangling references before they cause routing errors.
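The sequencing itself can be sketched as a small runner that stops at the first failing gate and quarantines the item. The `gates` array and `runGates` function are hypothetical, and each check is deliberately simplified; real gates would do far more than these one-line tests.

```javascript
// Each gate returns a list of error strings; an empty list means the item passes.
// Gate names mirror the four gates above; the checks are simplified assumptions.
const gates = [
  {
    name: 'schema contract',
    check: item => (typeof item.title === 'string' && item.title.length > 0 ? [] : ['title is empty'])
  },
  {
    name: 'draft state normalization',
    check: item =>
      ['draft', 'published', 'archived'].includes(item.status) ? [] : [`unmapped status "${item.status}"`]
  },
  {
    name: 'asset URL hygiene',
    check: item =>
      item.featured_image === null || /^https:\/\//.test(item.featured_image)
        ? []
        : ['featured_image is not an https URL']
  },
  {
    name: 'reference graph validation',
    // knownIds holds every ID present in the export being audited.
    check: (item, knownIds) =>
      item.related_ids.filter(id => !knownIds.has(id)).map(id => `dangling reference ${id}`)
  }
];

function runGates(items) {
  const knownIds = new Set(items.map(item => item.id));
  const result = { ingest: [], quarantine: [] };

  for (const item of items) {
    let failure = null;
    // Stop at the first failing gate so later gates never see unvalidated input.
    for (const gate of gates) {
      const errors = gate.check(item, knownIds);
      if (errors.length > 0) {
        failure = { id: item.id, gate: gate.name, errors };
        break;
      }
    }
    if (failure) result.quarantine.push(failure);
    else result.ingest.push(item);
  }
  return result;
}
```

One limitation of this sketch: a reference to an item that is itself quarantined still counts as resolved, so a stricter pipeline would rerun the reference gate against the final ingest set.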
Schema validation and draft state mapping
This Node.js routine fetches legacy content, normalizes state flags, validates payloads against the target schema with Zod, and emits a structured audit report. It assumes a REST-style endpoint that returns one JSON array of items per request; pagination and concurrency control are left to the caller.
import { randomUUID } from 'node:crypto';
import { fileURLToPath } from 'node:url';
import { fetch } from 'undici';
import { z } from 'zod';

// Target headless schema contract
const ArticleSchema = z.object({
  id: z.string().uuid(),
  title: z.string().min(1, 'Title cannot be empty'),
  slug: z.string().regex(/^[a-z0-9-]+$/, 'Slug must be lowercase alphanumeric with hyphens'),
  status: z.enum(['draft', 'published', 'archived']),
  body_html: z.string().min(1, 'Body HTML is required'),
  featured_image: z.string().url().nullable(),
  related_ids: z.array(z.string().uuid()).default([])
});

// Legacy status flags mapped to headless enums
const STATUS_MAP = {
  publish: 'published',
  draft: 'draft',
  trash: 'archived',
  private: 'archived',
  pending: 'draft'
};

export async function auditLegacyContent(endpoint, token) {
  const response = await fetch(endpoint, {
    // GET request without a body, so send Accept rather than Content-Type
    headers: { Authorization: `Bearer ${token}`, Accept: 'application/json' }
  });
  if (!response.ok) {
    throw new Error(`Legacy API returned HTTP ${response.status}: ${response.statusText}`);
  }

  const legacyData = await response.json();
  if (!Array.isArray(legacyData)) {
    throw new Error('Expected the legacy API to return a JSON array of items');
  }

  const auditReport = { valid: [], invalid: [], warnings: [] };

  for (const item of legacyData) {
    // Own-property lookup, so a status such as "constructor" never resolves to an inherited member
    const knownStatus = Object.hasOwn(STATUS_MAP, item.status);

    // Normalize payload to match headless contract
    const payload = {
      // Legacy IDs that are not UUID strings fail validation below; the random UUID only covers items with no ID at all
      id: item.guid || item.id || randomUUID(),
      title: item.title?.rendered || item.title || '',
      slug: item.slug || '',
      status: knownStatus ? STATUS_MAP[item.status] : 'archived',
      body_html: item.content?.rendered || item.body_html || '',
      featured_image: item._embedded?.['wp:featuredmedia']?.[0]?.source_url || null,
      // Legacy term IDs that are not UUIDs are reported as invalid until they are remapped
      related_ids: item._embedded?.['wp:term']?.[0]?.map(t => String(t.id)) || []
    };

    const result = ArticleSchema.safeParse(payload);
    if (result.success) {
      auditReport.valid.push(result.data);
    } else {
      auditReport.invalid.push({
        id: payload.id,
        status: payload.status,
        errors: result.error.flatten().fieldErrors
      });
    }

    // Report unmapped legacy flags instead of archiving them silently
    if (!knownStatus) {
      auditReport.warnings.push(`Item ${payload.id} has unmapped status "${item.status}"; defaulted to archived`);
    }
    // Warn on potential asset or state mismatches
    if (payload.status === 'draft' && !payload.featured_image) {
      auditReport.warnings.push(`Draft ${payload.id} missing featured image (may break preview fallbacks)`);
    }
  }

  return auditReport;
}

// CLI execution example
if (process.argv[1] === fileURLToPath(import.meta.url)) {
  const LEGACY_ENDPOINT = process.env.LEGACY_API_URL;
  const AUTH_TOKEN = process.env.LEGACY_API_TOKEN;

  if (!LEGACY_ENDPOINT || !AUTH_TOKEN) {
    console.error('Missing LEGACY_API_URL or LEGACY_API_TOKEN environment variables');
    process.exit(1);
  }

  auditLegacyContent(LEGACY_ENDPOINT, AUTH_TOKEN)
    .then(report => {
      console.log(`✅ Valid: ${report.valid.length}`);
      console.log(`❌ Invalid: ${report.invalid.length}`);
      console.log(`⚠️ Warnings: ${report.warnings.length}`);
      console.log(JSON.stringify(report.invalid, null, 2));
    })
    .catch(err => {
      console.error('Audit failed:', err.message);
      process.exit(1);
    });
}
Execution notes
- Run in CI or a staging container, and pipe the JSON output to a migration orchestrator that skips invalid payloads.
- Use `undici` for native `fetch` and connection pooling; for high-volume audits, cap concurrency with `p-limit` to avoid legacy API throttling.
- Check HTTP status explicitly. Legacy CDNs often return `301`/`302` for moved assets, so resolve final URLs or flag them for manual review.
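The explicit status check in the last note could look something like the sketch below. `classifyAssetStatus` and `summarizeAssetChecks` are hypothetical helpers, and treating `307`/`308` as redirects alongside `301`/`302` is an assumption beyond what the notes above state. The input is assumed to be a list of `{ url, status }` pairs collected by whatever crawler performs the requests.

```javascript
// Maps an HTTP status code from an asset check to an audit outcome.
function classifyAssetStatus(status) {
  if (status === 200 || status === 304) return 'ok';
  // Redirects need their final URL resolved, or a manual review.
  if ([301, 302, 307, 308].includes(status)) return 'redirect';
  if (status >= 400 && status <= 599) return 'broken';
  // Anything else (for example 204 or 206) is unexpected for a media asset.
  return 'review';
}

// Groups checked URLs by outcome so each bucket can be handled separately.
function summarizeAssetChecks(checks) {
  const summary = { ok: [], redirect: [], broken: [], review: [] };
  for (const { url, status } of checks) {
    summary[classifyAssetStatus(status)].push(url);
  }
  return summary;
}
```

Keeping redirects in their own bucket, rather than folding them into failures, makes it possible to rewrite moved assets automatically while still blocking on genuinely missing ones.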
Wiring the audit into CI/CD
A pre-deploy job should run the validator against a staging clone of the legacy database. If invalid items exceed a set share of the audited total (e.g. more than 2%), fail the pipeline and surface the Zod report to editors.
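A threshold check of this kind might be sketched as follows, operating on the `{ valid, invalid, warnings }` report shape produced by the audit routine. The `shouldFailPipeline` name and the choice to fail on an empty audit are assumptions; the 2% default simply mirrors the example figure above.

```javascript
// Returns true when the share of invalid items exceeds maxInvalidRatio.
function shouldFailPipeline(report, maxInvalidRatio = 0.02) {
  const total = report.valid.length + report.invalid.length;
  // Assumption: an audit that saw no items at all is treated as a failure,
  // since it usually means the export or the endpoint is misconfigured.
  if (total === 0) return true;
  return report.invalid.length / total > maxInvalidRatio;
}

// In a CI script, the result would typically set the exit code:
//   if (shouldFailPipeline(report)) process.exit(1);
```

Comparing a ratio rather than a raw count keeps the same threshold meaningful as the content set grows.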
When normalizing draft states, align preview infrastructure with Preview & Draft Workflow Patterns so token-based auth resolves to the exact revision hash the audit validated — eliminating the common mismatch where preview routes render published payloads instead of draft revisions.
Map audit outputs to your Legacy System Decoupling Strategies before the content sync, so relational graphs, asset pipelines, and state machines stay synchronized across both environments.
Finally, validate media references against HTTP status codes during the asset-hygiene phase: static generators fail on 4xx/5xx unless fallback routing or placeholder assets are configured. Strict schema contracts and deterministic state mapping upfront eliminate most post-migration debugging and yield predictable, cache-friendly delivery.