Convert HTML to Markdown in PHP with league/html-to-markdown

Introduction

Converting HTML to Markdown is a common need when working with web scraping, CMS migrations, RSS feed processing, or preparing content for LLMs. The league/html-to-markdown library provides a robust, battle-tested solution with over 24 million downloads.

This article covers installation, configuration options, table support, and security considerations for untrusted input.

Installation

Install via Composer:

bash

composer require league/html-to-markdown

The library requires PHP 7.2+ with the xml, libxml, and dom extensions (enabled by default on most distributions).

Basic Usage

php

use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter();

$html = '<h1>Hello World</h1><p>This is a <strong>test</strong>.</p>';
$markdown = $converter->convert($html);

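echo $markdown;
// Hello World
// ===========
//
// This is a **test**.
//
// With the default 'setext' header style, H1 and H2 are underlined;
// pass ['header_style' => 'atx'] to get "# Hello World" instead.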

Configuration Options

The converter accepts an array of options:

php

$converter = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script style',
    'hard_break' => true,
    'strip_placeholder_links' => true,
]);

Key Options Explained

Option	Default	Description
`strip_tags`	`false`	Remove HTML tags without Markdown equivalents, keeping their content
`remove_nodes`	`''`	Space-separated list of tags to remove entirely (including content)
`hard_break`	`false`	Convert `<br>` to `\n` instead of `  \n` (two trailing spaces plus a newline); suits GFM-style output
`strip_placeholder_links`	`false`	Remove `<a>` tags without `href` attribute
`header_style`	`'setext'`	Use `'atx'` for `#` style headers on H1/H2
`preserve_comments`	`false`	Keep HTML comments in output
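
To see how two of these options change the output, here is a minimal sketch that runs the same input through a default converter and a customized one. It assumes the package was installed with Composer and that the autoloader sits at vendor/autoload.php next to the script; the sample HTML is invented for illustration.

```php
<?php
// Assumption: Composer's autoloader is at this path relative to the script.
require __DIR__ . '/vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Invented input: a heading plus a paragraph containing a line break.
$html = '<h1>Release notes</h1><p>First line<br>Second line</p>';

// Library defaults: Setext headers for H1/H2, <br> as two spaces + newline.
$default = new HtmlConverter();

// ATX (# style) headers and plain-newline breaks, per the table above.
$custom = new HtmlConverter([
    'header_style' => 'atx',
    'hard_break' => true,
]);

echo $default->convert($html), "\n\n---\n\n", $custom->convert($html), "\n";
```

Comparing the two halves of the output side by side is usually the quickest way to decide which settings match the Markdown flavor your downstream tools expect.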

Recommended Configuration for Web Scraping

When converting scraped HTML, you typically want to strip navigation, scripts, and other non-content elements:

php

$converter = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script head style noscript nav footer aside header',
    'hard_break' => true,
    'strip_placeholder_links' => true,
]);

Adding Table Support

Table conversion is not enabled by default because tables aren't part of the original Markdown spec. Add support with the TableConverter:

php

use League\HTMLToMarkdown\HtmlConverter;
use League\HTMLToMarkdown\Converter\TableConverter;

$converter = new HtmlConverter(['strip_tags' => true]);
$converter->getEnvironment()->addConverter(new TableConverter());

$html = '
<table>
    <tr><th>Name</th><th>Role</th></tr>
    <tr><td>Alice</td><td>Developer</td></tr>
    <tr><td>Bob</td><td>Designer</td></tr>
</table>';

echo $converter->convert($html);
// | Name | Role |
// | --- | --- |
// | Alice | Developer |
// | Bob | Designer |

Practical Example: Web Scraping Pipeline

Here's a complete utility function for converting scraped HTML to clean Markdown:

php

use League\HTMLToMarkdown\Converter\TableConverter;
use League\HTMLToMarkdown\HtmlConverter;

function htmlToMarkdown(string $html): string
{
    $converter = new HtmlConverter([
        'strip_tags' => true,
        'remove_nodes' => 'script head style noscript nav footer aside header',
        'hard_break' => true,
        'strip_placeholder_links' => true,
    ]);

    $converter->getEnvironment()->addConverter(new TableConverter());

    return $converter->convert($html);
}

// Usage
$html = file_get_contents('https://example.com/article');
$markdown = htmlToMarkdown($html);
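
Note that file_get_contents() returns false on failure, which would not be a useful input for the converter. One possible way to guard against that is sketched below; fetchHtml() is a hypothetical helper (not part of the library), and the 10-second timeout is an arbitrary assumption.

```php
<?php
/**
 * Fetch a document and fail loudly instead of passing `false` downstream.
 * Hypothetical helper, not part of league/html-to-markdown.
 */
function fetchHtml(string $source, int $timeoutSeconds = 10): string
{
    // The timeout applies to http(s) sources; local paths ignore it.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // @ silences the PHP warning so the failure surfaces as an exception.
    $html = @file_get_contents($source, false, $context);

    if ($html === false) {
        throw new RuntimeException("Could not fetch HTML from {$source}");
    }
    if (trim($html) === '') {
        throw new RuntimeException("Empty response from {$source}");
    }

    return $html;
}
```

With this in place, the usage line becomes `$markdown = htmlToMarkdown(fetchHtml('https://example.com/article'));`, and a failed request raises an exception rather than silently producing empty Markdown.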

Preprocessing: Cleaning Syntax-Highlighted Code

When scraping documentation sites, code blocks often contain <span> tags for syntax highlighting. These can clutter your Markdown output. Clean them before conversion:

php

function removeSpansFromCode(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $spans = $xpath->query('//pre//span | //code//span');

    foreach ($spans as $span) {
        while ($span->childNodes->length > 0) {
            $span->parentNode->insertBefore($span->childNodes->item(0), $span);
        }
        $span->parentNode->removeChild($span);
    }

    return trim($dom->saveHTML());
}

// Usage: clean HTML before converting
$cleanHtml = removeSpansFromCode($html);
$markdown = htmlToMarkdown($cleanHtml);
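
One caveat when preprocessing with DOMDocument: loadHTML() does not assume UTF-8 when the markup carries no charset declaration, so accented or non-Latin text in a fragment can come out garbled. A common workaround, sketched here under the assumption that the mbstring extension is available, is to turn non-ASCII characters into numeric entities before parsing. loadHtmlUtf8() is a hypothetical helper name.

```php
<?php
/**
 * Load an HTML fragment into DOMDocument without mangling UTF-8 text.
 * Hypothetical helper; requires the mbstring extension.
 */
function loadHtmlUtf8(string $html): DOMDocument
{
    // Rewrite every non-ASCII character as a numeric entity (e.g. é -> &#233;)
    // so the parser's encoding guess no longer matters.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument('1.0', 'UTF-8');

    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $dom;
}

$dom = loadHtmlUtf8('<p>Crème brûlée</p>');
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

The same entity-encoding step could be applied inside removeSpansFromCode() before its loadHTML() call if your scraped pages contain non-ASCII text.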

Security Considerations

Important

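By default, the library preserves tags that have no Markdown equivalent, such as <script>, <iframe>, and <div>. When processing untrusted user input, always enable strip_tags, which drops such tags but keeps their text, or list them in remove_nodes, which drops them together with their content.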
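
A quick way to check what each option leaves behind is to run both configurations on the same input. The sketch below assumes Composer's autoloader at vendor/autoload.php; the untrusted sample string is invented, and no exact output is shown because whitespace may vary between library versions.

```php
<?php
// Assumption: Composer's autoloader is at this path relative to the script.
require __DIR__ . '/vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Invented sample of untrusted input.
$untrusted = '<p>Hello</p><script>alert("xss")</script>';

// Drops unsupported tags; see the options table for what happens to their text.
$stripOnly = new HtmlConverter(['strip_tags' => true]);

// Additionally removes the listed elements together with their content.
$removeNodes = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script style iframe',
]);

echo "strip_tags only:\n", $stripOnly->convert($untrusted), "\n\n";
echo "with remove_nodes:\n", $removeNodes->convert($untrusted), "\n";
```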

For user-generated content, combine with HTML Purifier for additional safety:

php

// HTMLPurifier and HTMLPurifier_Config are global classes, so they need no
// `use` import unless this file declares its own namespace.
use League\HTMLToMarkdown\HtmlConverter;

function safeHtmlToMarkdown(string $untrustedHtml): string
{
    // First, sanitize with HTML Purifier
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);
    $cleanHtml = $purifier->purify($untrustedHtml);

    // Then convert to Markdown
    $converter = new HtmlConverter([
        'strip_tags' => true,
        'remove_nodes' => 'script style iframe object embed',
    ]);

    return $converter->convert($cleanHtml);
}

Common Issues

DOMDocument Not Found

On CentOS or minimal PHP installations, you may see:

Fatal error: Class 'DOMDocument' not found

Fix by installing the PHP XML extension:

bash

# CentOS/RHEL
sudo yum install php-xml

# Ubuntu/Debian
sudo apt-get install php-xml
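
Before or after installing the package, you can confirm which of the required extensions PHP actually sees. This is a small sketch using only built-in functions; the extension names are the ones listed in the Installation section.

```php
<?php
// Extensions named in the installation requirements.
$required = ['dom', 'libxml', 'xml'];

// Keep only the names that PHP reports as not loaded.
$missing = array_values(array_filter($required, function (string $ext): bool {
    return !extension_loaded($ext);
}));

if ($missing === []) {
    echo "All required extensions are loaded.\n";
} else {
    echo 'Missing extensions: ', implode(', ', $missing), "\n";
}
```

Run it with the same PHP binary that serves your application, since the CLI and the web server can load different php.ini files.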

Malformed HTML Warnings

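Libxml warns about malformed HTML, which is common in scraped content. The suppress_errors option silences these warnings. It is listed as true in the library's documented defaults, so setting it explicitly mainly documents intent (pass false to see the warnings while debugging):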

php

$converter = new HtmlConverter(['suppress_errors' => true]);
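
If you want to inspect what the parser is complaining about instead of hiding it, PHP's libxml functions can collect the messages. The helper below is a hypothetical debugging aid built only on DOMDocument and libxml; it illustrates the general mechanism and makes no claim about how the library implements suppress_errors internally.

```php
<?php
/**
 * Parse HTML with DOMDocument and return libxml's parse messages
 * instead of letting them surface as PHP warnings.
 * Hypothetical helper for debugging; not part of the library.
 */
function collectHtmlParseErrors(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    (new DOMDocument())->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $messages;
}

foreach (collectHtmlParseErrors('<p>Unclosed <b>bold</p></div>') as $message) {
    echo $message, "\n";
}
```

Which messages appear depends on the libxml2 version PHP was built against, so treat the list as diagnostic output, not a stable contract.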

Conclusion

The league/html-to-markdown library handles the complexity of HTML-to-Markdown conversion with sensible defaults and extensive customization. Key takeaways:

Use strip_tags and remove_nodes to clean unwanted elements
Add TableConverter for table support
Always sanitize untrusted input before processing
Preprocess syntax-highlighted code blocks for cleaner output

For LLM pipelines or content processing, this combination of configuration options provides clean, readable Markdown from almost any HTML source.
