Extract Body Contents from Full HTML Documents in PHP with DOMDocument
Introduction
Content editors, page builders, and import tools often receive two different shapes of HTML:
- a fragment, such as <p>Contact us</p>
- a full document, such as <!doctype html><html><head>...</head><body>...</body></html>
If your component expects a fragment, rendering a full document inside it can create invalid markup, duplicate metadata, or layout bugs. The fix is not a full sanitizer. It is a small normalization step: when the input is a complete document, extract only the inner contents of <body>. When the input is already a fragment, leave it alone.
This article shows a practical PHP helper using DOMDocument.
The Goal
The helper should follow three rules:
- empty input returns an empty string
- normal HTML fragments return unchanged
- full HTML documents return only the body children
For example:
$html = '<!doctype html>
<html>
<head><title>Ignored</title></head>
<body>
<section>
<h2>Booking Form</h2>
<p>Choose a time.</p>
</section>
</body>
</html>';
echo bodyContents($html);
Output:
<section>
<h2>Booking Form</h2>
<p>Choose a time.</p>
</section>
But this fragment stays exactly as-is:
echo bodyContents('<p>Already a fragment</p>');
// <p>Already a fragment</p>
The Helper
Here is the full implementation:
function bodyContents(string $html): string
{
    $trimmed = trim($html);
    // Rule 1: empty or whitespace-only input returns an empty string.
    if ($trimmed === '') {
        return '';
    }
    // Rule 2: input without document markers is a fragment; return it untouched.
    if (! preg_match('/<html\b|<body\b|<!doctype/i', $trimmed)) {
        return $html;
    }
    $dom = new DOMDocument('1.0', 'UTF-8');
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    // The XML encoding hint asks libxml to read the input as UTF-8.
    $loaded = $dom->loadHTML(
        '<?xml encoding="UTF-8">' . $trimmed,
        LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED
    );
    libxml_clear_errors();
    // Fall back to the original input rather than dropping content.
    if (! $loaded) {
        return $html;
    }
    $body = $dom->getElementsByTagName('body')->item(0);
    if (! $body) {
        return $html;
    }
    // Rule 3: serialize each child of <body>, leaving out the wrapper itself.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}
The function is intentionally conservative. It only parses input that looks like a full document. That avoids sending every small fragment through DOMDocument, which can normalize whitespace, repair tags, or slightly change output formatting.
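To see why fragments are kept away from the parser, here is a minimal sketch that round-trips a sloppy fragment through DOMDocument and compares the result with the input. The helper name roundTrip() and the sample markup are illustrative assumptions, not part of the helper above.

```php
<?php
// Hypothetical helper: parse a fragment and serialize it straight back.
function roundTrip(string $fragment): string
{
    // Remember the caller's libxml error mode so it can be restored.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($fragment, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return (string) $dom->saveHTML();
}

$input = '<p>Unclosed paragraph';
$output = roundTrip($input);

// The parser closes the <p> tag, so the string is no longer byte-identical.
var_dump($output === $input); // bool(false)
```

Even for well-formed fragments, the serialized string is not guaranteed to match the input, which is why the helper returns fragments without parsing them.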
Why Detect Before Parsing?
DOMDocument::loadHTML() is useful, but it is not a no-op formatter. It parses and repairs markup. For document-shaped HTML, that is exactly what we want. For simple fragments, it may be unnecessary and surprising.
This guard keeps the common path simple:
if (! preg_match('/<html\b|<body\b|<!doctype/i', $trimmed)) {
    return $html;
}
The pattern catches the usual signs of a full document:
- an <html> tag
- a <body> tag
- a <!doctype> declaration
If none of those markers exist, the helper assumes the input is already suitable for inline rendering.
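The guard can be exercised on its own. The sketch below wraps the same pattern in a function with the hypothetical name looksLikeDocument() and runs it against a few sample inputs chosen for illustration:

```php
<?php
// Hypothetical wrapper around the same pattern the helper uses.
function looksLikeDocument(string $html): bool
{
    return preg_match('/<html\b|<body\b|<!doctype/i', $html) === 1;
}

// A plain fragment has none of the markers.
var_dump(looksLikeDocument('<p>Contact us</p>'));                     // bool(false)

// The match is case-insensitive, so uppercase markup is detected too.
var_dump(looksLikeDocument('<!DOCTYPE html><html></html>'));          // bool(true)
var_dump(looksLikeDocument('<BODY class="page"><p>Hi</p></BODY>'));   // bool(true)

// The \b word boundary keeps longer tag names from matching.
var_dump(looksLikeDocument('<bodyguard>not a body tag</bodyguard>')); // bool(false)
```

The check is textual, not structural: a marker that appears inside a comment or a script string also matches. In that case the input simply goes to the parser, and the helper's fallbacks decide what is returned.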
Parsing Without Warnings
Real-world editor HTML may include HTML5 tags, incomplete documents, or markup that libxml complains about. The PHP manual for DOMDocument::loadHTML notes that parsing behavior depends on libxml and that modern HTML can produce warnings.
For a normalization helper, warnings should not leak into logs or responses, so the parser uses internal error handling:
libxml_use_internal_errors(true);
$loaded = $dom->loadHTML($source, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
libxml_clear_errors();
The flags limit what libxml adds around the parsed markup: LIBXML_HTML_NOIMPLIED stops it from inserting implied <html> and <body> elements, and LIBXML_HTML_NODEFDTD stops it from adding a default doctype. The function still treats parse failure as non-fatal:
if (! $loaded) {
    return $html;
}
That fallback matters. If the helper cannot confidently extract a body, it should return the original input rather than silently dropping content.
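One detail the snippet above leaves open is that libxml_use_internal_errors() changes process-wide state. The following sketch shows one way to restore the previous setting and to keep the collected messages for debugging. The function name parseQuietly() and its return shape are assumptions made for illustration, not part of the helper.

```php
<?php
// Hypothetical helper: returns [DOMDocument|null, list of libxml messages].
// The caller must pass a non-empty string; on PHP 8, loadHTML() throws a
// ValueError for empty input.
function parseQuietly(string $source): array
{
    // libxml_use_internal_errors() returns the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument('1.0', 'UTF-8');
    $loaded = $dom->loadHTML($source, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

    // Copy the messages out before clearing the internal buffer.
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$loaded ? $dom : null, $messages];
}

[$dom, $messages] = parseQuietly('<html><body><section><p>Hello</p></section></body></html>');

printf("parsed: %s, messages: %d\n", $dom !== null ? 'yes' : 'no', count($messages));
```

Restoring the earlier value keeps the helper from silently changing error handling for unrelated XML or HTML code that runs later in the same request.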
Preserving UTF-8
The implementation prepends an XML encoding hint before parsing:
'<?xml encoding="UTF-8">' . $trimmed
This is a common DOMDocument workaround for UTF-8 content. Without an explicit encoding signal, non-ASCII characters may be interpreted incorrectly depending on the input and libxml behavior.
If your source documents already contain reliable charset metadata, you may not need this exact approach. For pasted snippets and CMS content, the explicit hint is a useful defensive default.
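Because the effect depends on the libxml build, it is worth checking on your own stack. The sketch below parses the same snippet with and without the hint and compares the text that comes back. The function name firstParagraphText() and the sample string are assumptions for illustration, and no particular result is claimed here.

```php
<?php
// Hypothetical diagnostic: parse a snippet and read back the text of its
// first <p>, with or without the XML encoding hint.
function firstParagraphText(string $html, bool $withHint): string
{
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadHTML(($withHint ? '<?xml encoding="UTF-8">' : '') . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $p = $dom->getElementsByTagName('p')->item(0);

    return $p === null ? '' : $p->textContent;
}

// This file is assumed to be saved as UTF-8.
$html = '<p>Café à la carte</p>';

// Compare both results against the original text on your own PHP build.
var_dump(firstParagraphText($html, true) === 'Café à la carte');
var_dump(firstParagraphText($html, false) === 'Café à la carte');
```

If the two lines disagree on your build, the hint is doing real work there and should stay in the helper.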
Extracting Only Body Children
Once the document is loaded, the important step is serializing each child of <body>:
$body = $dom->getElementsByTagName('body')->item(0);
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}
DOMDocument::saveHTML() can serialize either the full document or a specific node. The PHP manual documents that optional node parameter on DOMDocument::saveHTML. Passing each child node gives us roughly the equivalent of browser innerHTML, although libxml's serializer may format the markup slightly differently than a browser would.
This avoids returning the <body> wrapper itself.
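The same loop can be pulled into a small reusable function. This is a sketch under the assumption that a generic helper is wanted. The name innerHtml() is hypothetical and the sample document is made up for the comparison.

```php
<?php
// Hypothetical helper: serialize the children of an element, not the
// element itself.
function innerHtml(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // ownerDocument is the DOMDocument the element belongs to.
        $html .= $element->ownerDocument->saveHTML($child);
    }

    return $html;
}

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<html><body><h2>Booking Form</h2><p>Choose a time.</p></body></html>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$body = $dom->getElementsByTagName('body')->item(0);

echo innerHtml($body), PHP_EOL;      // children only, no <body> wrapper
echo $dom->saveHTML($body), PHP_EOL; // the same nodes wrapped in <body>
```

Printing both strings side by side makes the difference visible: the second includes the wrapper element that a fragment-only component should not receive.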
Using It in a View Pipeline
A common use case is normalizing the shape of stored template HTML before handing it to a component:
$template = bodyContents((string) $storedTemplate);
return view('components.dynamic-form', [
    'template' => $template,
]);
In Laravel Blade, assuming the helper has been moved into a static method on an App\Support\Html class, that might look like this:
@php
    $template = \App\Support\Html::bodyContents((string) $template);
@endphp
<livewire:dynamic-form :template="$template" />
Now editors can paste either a fragment or a full HTML document, and the component still receives the shape it expects.
Important Security Boundary
Important: This helper does not sanitize HTML. It only extracts body contents from document-shaped input.
If users can submit untrusted HTML, run a proper sanitizer after extraction. For example, use an allowlist-based sanitizer such as HTML Purifier or the sanitizer already approved in your stack.
The safe pipeline is:
- normalize document shape with bodyContents()
- sanitize untrusted tags and attributes
- render with the escaping rules appropriate for your framework
Do not treat DOM parsing as a security filter.
PHP 8.4 Note
For new code on PHP 8.4+, also review Dom\HTMLDocument, which PHP recommends for HTML5-aware parsing. DOMDocument::loadHTML() remains widely available and works well for this narrow extraction task, but modern HTML parsing is a moving target.
If you need precise HTML5 tree construction, prefer the newer parser. If you only need a small compatibility helper in a PHP 8.1/8.2/8.3 application, DOMDocument is still a practical choice.
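For comparison, a rough equivalent of the extraction step on the newer API might look like the sketch below. It assumes PHP 8.4 or later. The function name bodyContentsHtml5() is hypothetical, and the empty-input and detection guards from the original helper are left out for brevity.

```php
<?php
// Sketch for PHP 8.4+ only; bodyContentsHtml5() is a hypothetical name.
function bodyContentsHtml5(string $html): string
{
    // LIBXML_NOERROR keeps HTML5 parse errors from surfacing as warnings.
    $doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;
    }

    $inner = '';
    foreach ($body->childNodes as $child) {
        // saveHtml() with a node argument serializes just that node.
        $inner .= $doc->saveHtml($child);
    }

    return $inner;
}

// Only run the demo where the class is available.
if (class_exists('Dom\HTMLDocument')) {
    echo bodyContentsHtml5(
        '<!doctype html><html><head><title>Ignored</title></head>'
        . '<body><p>Choose a time.</p></body></html>'
    ), PHP_EOL;
}
```

An HTML5 parser builds the html, head, and body elements even when the input is a bare fragment, so the detect-before-parse guard is still what keeps fragments byte-for-byte unchanged.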
Conclusion
bodyContents() is small, but it prevents a common rendering problem: full pasted HTML documents leaking into components that expect fragments.
The useful patterns are:
- detect document-shaped input before parsing
- keep fragment input unchanged
- parse with quiet libxml error handling
- extract body children with saveHTML($child)
- fall back to the original input when parsing fails
That gives your content pipeline a stable HTML shape without pretending to solve sanitization, validation, or full HTML cleanup.
