Extract Body Contents from Full HTML Documents in PHP with DOMDocument

Introduction

Content editors, page builders, and import tools often receive two different shapes of HTML:

a fragment, such as <p>Contact us</p>
a full document, such as <!doctype html><html><head>...</head><body>...</body></html>

If your component expects a fragment, rendering a full document inside it can create invalid markup, duplicate metadata, or layout bugs. The fix is not a full sanitizer. It is a small normalization step: when the input is a complete document, extract only the inner contents of <body>. When the input is already a fragment, leave it alone.

This article shows a practical PHP helper using DOMDocument.

The Goal

The helper should follow three rules:

empty input returns an empty string
normal HTML fragments return unchanged
full HTML documents return only the body children

For example:

php

$html = '<!doctype html>
<html>
    <head><title>Ignored</title></head>
    <body>
        <section>
            <h2>Booking Form</h2>
            <p>Choose a time.</p>
        </section>
    </body>
</html>';

echo bodyContents($html);

Output (exact indentation and surrounding whitespace follow the source document and the libxml version):

html

<section>
    <h2>Booking Form</h2>
    <p>Choose a time.</p>
</section>

But this fragment stays exactly as-is:

php

echo bodyContents('<p>Already a fragment</p>');
// <p>Already a fragment</p>

The Helper

Here is the full implementation:

php

function bodyContents(string $html): string
{
    $trimmed = trim($html);

    if ($trimmed === '') {
        return '';
    }

    if (! preg_match('/<html\b|<body\b|<!doctype/i', $trimmed)) {
        return $html;
    }

    $dom = new DOMDocument('1.0', 'UTF-8');

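    // Collect libxml warnings internally, then restore the caller's setting.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML(
        '<?xml encoding="UTF-8">' . $trimmed,
        LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);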

    if (! $loaded) {
        return $html;
    }

    $body = $dom->getElementsByTagName('body')->item(0);

    if (! $body) {
        return $html;
    }

    $inner = '';

    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }

    return $inner;
}

The function is intentionally conservative. It only parses input that looks like a full document. That avoids sending every small fragment through DOMDocument, which can normalize whitespace, repair tags, or slightly change output formatting.

Why Detect Before Parsing?

DOMDocument::loadHTML() is useful, but it is not a no-op formatter. It parses and repairs markup. For document-shaped HTML, that is exactly what we want. For simple fragments, it may be unnecessary and surprising.
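
To make the "parses and repairs" point concrete, here is a minimal sketch that round-trips a fragment through DOMDocument. The function name roundTrip() is hypothetical and exists only for this illustration. The exact serialized string depends on the libxml version, so the sketch only checks that the result differs from the input.

```php
<?php

// Hypothetical helper for illustration: parse a fragment and serialize it back.
function roundTrip(string $fragment): string
{
    $dom = new DOMDocument();

    $previous = libxml_use_internal_errors(true);
    // Without LIBXML_HTML_NOIMPLIED, libxml wraps the fragment in <html><body>.
    $dom->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $fragment;
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$input = '<p>one<p>two';
$output = roundTrip($input);

// The parser closes the open <p> tags, so the result no longer matches the input.
var_dump($output === $input);
```

A fragment that was stored one way and comes back another is the kind of surprise the guard is meant to avoid.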

This guard keeps the common path simple:

php

if (! preg_match('/<html\b|<body\b|<!doctype/i', $trimmed)) {
    return $html;
}

The pattern catches the usual signs of a full document:

an <html> tag
a <body> tag
a <!doctype> declaration

If none of those markers exist, the helper assumes the input is already suitable for inline rendering.
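
The guard can be tried in isolation. The sketch below wraps the same pattern in a function named looksLikeDocument(), a hypothetical name that is not part of the helper, and runs it over a few sample inputs.

```php
<?php

// Same pattern as the guard in the helper; the function name is hypothetical.
function looksLikeDocument(string $html): bool
{
    return preg_match('/<html\b|<body\b|<!doctype/i', $html) === 1;
}

$samples = [
    '<p>Contact us</p>',
    '<!DOCTYPE html><html><body></body></html>',
    '<BODY class="page"><p>Hi</p></BODY>',
    '<htmlx>not an html tag</htmlx>',
    '<p>Escaped &lt;body&gt; text</p>',
    '<pre><body> shown as sample text</pre>',
];

foreach ($samples as $sample) {
    printf("%-45s %s\n", $sample, looksLikeDocument($sample) ? 'document' : 'fragment');
}
```

The last sample shows the limit of the heuristic: a literal <body inside a fragment still matches, so that input is sent through the parser. The check is a cheap filter, not a proof that the input is a full document.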

Parsing Without Warnings

Real-world editor HTML may include HTML5 tags, incomplete documents, or markup that libxml complains about. The PHP manual for DOMDocument::loadHTML notes that parsing behavior depends on libxml and that modern HTML can produce warnings.

For a normalization helper, warnings should not leak into logs or responses, so the parser uses internal error handling:

php

$previous = libxml_use_internal_errors(true);
$loaded = $dom->loadHTML($source, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the caller's setting

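The flags keep libxml from adding a default doctype (LIBXML_HTML_NODEFDTD) and implied <html> and <body> wrappers (LIBXML_HTML_NOIMPLIED). One consequence is that input without an explicit <body> tag produces no body element, so the helper falls back to the original input. The function also treats parse failure as non-fatal: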

php

if (! $loaded) {
    return $html;
}

That fallback matters. If the helper cannot confidently extract a body, it should return the original input rather than silently dropping content.
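
While debugging, it can help to look at what libxml complained about before the errors are cleared. The following is a sketch of one way to do that, not part of the helper. The function name loadHtmlQuietly() is hypothetical, and $source is assumed to be non-empty because loadHTML() rejects an empty string.

```php
<?php

/**
 * Hypothetical debugging variant: returns the libxml errors instead of discarding them.
 *
 * @return array{0: bool, 1: LibXMLError[]}
 */
function loadHtmlQuietly(DOMDocument $dom, string $source): array
{
    $previous = libxml_use_internal_errors(true);

    try {
        $loaded = $dom->loadHTML($source, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
        $errors = libxml_get_errors();
    } finally {
        // Always empty the error buffer and restore the caller's setting.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    return [$loaded, $errors];
}

$dom = new DOMDocument();
[$loaded, $errors] = loadHtmlQuietly($dom, '<body><section><p>Hi</p></body>');

foreach ($errors as $error) {
    // LibXMLError exposes level, code, line, column and message.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Which messages appear, if any, depends on the libxml version, so the loop may print nothing on some systems.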

Preserving UTF-8

The implementation prepends an XML encoding hint before parsing:

php

'<?xml encoding="UTF-8">' . $trimmed

This is a common DOMDocument workaround for UTF-8 content. Without an explicit encoding signal, non-ASCII characters may be interpreted incorrectly depending on the input and libxml behavior.

If your source documents already contain reliable charset metadata, you may not need this exact approach. For pasted snippets and CMS content, the explicit hint is a useful defensive default.
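
The effect of the hint can be checked with a short experiment. This is a sketch under the assumption that the source string is UTF-8 and carries no charset metadata. The function name firstParagraphText() is hypothetical.

```php
<?php

// Hypothetical helper for illustration: text of the first <p>, parsed with or without the hint.
function firstParagraphText(string $html, bool $withHint): string
{
    $dom = new DOMDocument();

    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(($withHint ? '<?xml encoding="UTF-8">' : '') . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $p = $dom->getElementsByTagName('p')->item(0);

    return $p === null ? '' : $p->textContent;
}

$html = '<p>Café Zürich</p>'; // UTF-8 source, no charset metadata

// With the hint, the text should round-trip intact.
var_dump(firstParagraphText($html, true) === 'Café Zürich');

// Without the hint, the result depends on libxml and may be garbled.
var_dump(firstParagraphText($html, false) === 'Café Zürich');
```

Running both calls against the libxml version on your servers is a quick way to confirm whether the hint is doing any work in your environment.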

Extracting Only Body Children

Once the document is loaded, the important step is serializing each child of <body>:

php

$body = $dom->getElementsByTagName('body')->item(0);

$inner = '';

foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

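DOMDocument::saveHTML() can serialize either the full document or a specific node. The PHP manual documents that optional node parameter on DOMDocument::saveHTML. Passing each child node gives us roughly the equivalent of a browser's innerHTML, although libxml's serialization can differ from a browser's in details such as whitespace and entity encoding.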

This avoids returning the <body> wrapper itself.
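
The difference between serializing the body node and serializing its children is easy to see side by side. The sketch below uses a hypothetical innerHtml() function that applies the same loop to any element.

```php
<?php

// Hypothetical generic helper: serialize the children of an element, not the element itself.
function innerHtml(DOMElement $element): string
{
    $html = '';

    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }

    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><b>bold</b><i>italic</i></body></html>');

$body = $dom->getElementsByTagName('body')->item(0);

echo $dom->saveHTML($body), "\n"; // includes the <body> wrapper
echo innerHtml($body), "\n";      // children only
```

Keeping the loop in a function of its own also makes it reusable for other containers, such as a single wrapper <div>.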

Using It in a View Pipeline

A common use case is normalizing the shape of stored template HTML before handing it to a component:

php

$template = bodyContents((string) $storedTemplate);

return view('components.dynamic-form', [
    'template' => $template,
]);

In Laravel Blade, assuming the helper is exposed as a static method on a class such as App\Support\Html, that might look like this:

blade

@php
    $template = \App\Support\Html::bodyContents((string) $template);
@endphp

<livewire:dynamic-form :template="$template" />

Now editors can paste either a fragment or a full HTML document, and the component still receives the shape it expects.

Important Security Boundary

Important

This helper does not sanitize HTML. It only extracts body contents from document-shaped input.

If users can submit untrusted HTML, run a proper sanitizer after extraction. For example, use an allowlist-based sanitizer such as HTML Purifier or the sanitizer already approved in your stack.

The safe pipeline is:

normalize document shape with bodyContents()
sanitize untrusted tags and attributes
render with the escaping rules appropriate for your framework

Do not treat DOM parsing as a security filter.
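
A short experiment shows why. The sketch below uses a stripped-down extraction function, extractBody(), which is a hypothetical stand-in for the full helper, and feeds it markup with a script element and an inline event handler.

```php
<?php

// Minimal body extraction for illustration only; not the full helper from this article.
function extractBody(string $document): string
{
    $dom = new DOMDocument();

    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($document);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $document;
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$untrusted = '<html><body><p onclick="steal()">Hi</p><script>alert(1)</script></body></html>';
$extracted = extractBody($untrusted);

// Both the script element and the event handler survive extraction.
var_dump(str_contains($extracted, '<script>'));
var_dump(str_contains($extracted, 'onclick'));
```

Extraction changes the shape of the markup, not its contents, which is why the sanitizing step has to come afterwards.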

PHP 8.4 Note

For new code on PHP 8.4+, also review Dom\HTMLDocument, which PHP recommends for HTML5-aware parsing. DOMDocument::loadHTML() remains widely available and works well for this narrow extraction task, but it relies on libxml's older HTML parser, which does not fully follow the HTML5 parsing rules that browsers use.

If you need precise HTML5 tree construction, prefer the newer parser. If you only need a small compatibility helper in a PHP 8.1/8.2/8.3 application, DOMDocument is still a practical choice.
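
For comparison, here is a rough sketch of the same extraction on PHP 8.4+. It assumes the Dom\HTMLDocument class with its body property and the innerHTML property on elements, all introduced in 8.4. The function name bodyContentsHtml5() is hypothetical.

```php
<?php

// Sketch for PHP 8.4+ only; the function name is hypothetical.
function bodyContentsHtml5(string $html): string
{
    // LIBXML_NOERROR keeps parse errors from being reported as warnings.
    $doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

    // Fall back to the input if the document has no body element.
    return $doc->body?->innerHTML ?? $html;
}

if (PHP_VERSION_ID >= 80400) {
    echo bodyContentsHtml5(
        '<!doctype html><html><head><title>Ignored</title></head><body><p>Hi</p></body></html>'
    );
}
```

The HTML5 parser builds a body element even for bare fragments, so a detection guard like the one in this article is still needed if fragments must pass through untouched.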

Conclusion

bodyContents() is small, but it prevents a common rendering problem: full pasted HTML documents leaking into components that expect fragments.

The useful patterns are:

detect document-shaped input before parsing
keep fragment input unchanged
parse with quiet libxml error handling
extract body children with saveHTML($child)
fall back to the original input when parsing fails

That gives your content pipeline a stable HTML shape without pretending to solve sanitization, validation, or full HTML cleanup.
