Convert HTML to Markdown in PHP with league/html-to-markdown

Introduction

Converting HTML to Markdown is a common need when working with web scraping, CMS migrations, RSS feed processing, or preparing content for LLMs. The league/html-to-markdown library provides a robust, battle-tested solution with over 24 million downloads.

This article covers installation, configuration options, table support, and security considerations for untrusted input.

Installation

Install via Composer:

bash
composer require league/html-to-markdown

The library requires PHP 7.2+ with the xml, libxml, and dom extensions (enabled by default on most distributions).

Basic Usage

php
use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter();

$html = '<h1>Hello World</h1><p>This is a <strong>test</strong>.</p>';
$markdown = $converter->convert($html);

echo $markdown;
// Hello World
// ===========
//
// This is a **test**.

Configuration Options

The converter accepts an array of options:

php
$converter = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script style',
    'hard_break' => true,
    'strip_placeholder_links' => true,
]);

Key Options Explained

| Option | Default | Description |
| --- | --- | --- |
| strip_tags | false | Remove HTML tags without Markdown equivalents, keeping their content |
| remove_nodes | '' | Space-separated list of tags to remove entirely (including content) |
| hard_break | false | Convert <br> to "\n" instead of "  \n" (two trailing spaces); true gives GFM style |
| strip_placeholder_links | false | Remove <a> tags without an href attribute |
| header_style | 'setext' | Use 'atx' for # style headers on H1/H2 |
| preserve_comments | false | Keep HTML comments in output |
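
To make two of these options concrete, here is a minimal sketch that converts the same snippet once with the defaults and once with header_style and hard_break changed. It assumes a Composer project with the library installed and autoloading from vendor/autoload.php; the sample HTML and variable names are made up for illustration.

```php
<?php
require 'vendor/autoload.php'; // assumes a Composer install in the current directory

use League\HTMLToMarkdown\HtmlConverter;

// Made-up sample input: one H1 and a paragraph containing a line break.
$html = '<h1>Title</h1><p>First line<br>Second line</p>';

// Defaults: setext-style H1/H2, <br> rendered with two trailing spaces.
$default = new HtmlConverter();

// ATX-style headers and bare newlines for <br>.
$gfm = new HtmlConverter([
    'header_style' => 'atx',
    'hard_break' => true,
]);

$defaultMarkdown = $default->convert($html);
$gfmMarkdown = $gfm->convert($html);

// Print both results separated by a divider for comparison.
echo $defaultMarkdown, "\n\n-----\n\n", $gfmMarkdown, "\n";
```

With the defaults the title should come out underlined with = characters; with 'atx' it should start with a # instead.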

When converting scraped HTML, you typically want to strip navigation, scripts, and other non-content elements:

php
$converter = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script head style noscript nav footer aside header',
    'hard_break' => true,
    'strip_placeholder_links' => true,
]);

Adding Table Support

Table conversion is not enabled by default because tables aren't part of the original Markdown spec. Add support with the TableConverter:

php
use League\HTMLToMarkdown\HtmlConverter;
use League\HTMLToMarkdown\Converter\TableConverter;

$converter = new HtmlConverter(['strip_tags' => true]);
$converter->getEnvironment()->addConverter(new TableConverter());

$html = '
<table>
    <tr><th>Name</th><th>Role</th></tr>
    <tr><td>Alice</td><td>Developer</td></tr>
    <tr><td>Bob</td><td>Designer</td></tr>
</table>';

echo $converter->convert($html);
// | Name | Role |
// | --- | --- |
// | Alice | Developer |
// | Bob | Designer |

Practical Example: Web Scraping Pipeline

Here's a complete utility function for converting scraped HTML to clean Markdown:

php
use League\HTMLToMarkdown\Converter\TableConverter;
use League\HTMLToMarkdown\HtmlConverter;

function htmlToMarkdown(string $html): string
{
    $converter = new HtmlConverter([
        'strip_tags' => true,
        'remove_nodes' => 'script head style noscript nav footer aside header',
        'hard_break' => true,
        'strip_placeholder_links' => true,
    ]);

    $converter->getEnvironment()->addConverter(new TableConverter());

    return $converter->convert($html);
}

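// Usage
$html = file_get_contents('https://example.com/article');
if ($html === false) {
    // file_get_contents() returns false when the request fails
    throw new RuntimeException('Could not fetch the article.');
}
$markdown = htmlToMarkdown($html);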

Preprocessing: Cleaning Syntax-Highlighted Code

When scraping documentation sites, code blocks often contain <span> tags for syntax highlighting. These can clutter your Markdown output. Clean them before conversion:

php
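function removeSpansFromCode(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    // NOIMPLIED/NODEFDTD stop libxml from adding <html>/<body> and a doctype.
    // Note: loadHTML() assumes ISO-8859-1 unless the markup declares a charset.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    // Find every <span> nested inside <pre> or <code>.
    $xpath = new DOMXPath($dom);
    $spans = $xpath->query('//pre//span | //code//span');

    foreach ($spans as $span) {
        // Move the span's children in front of it, then drop the empty span.
        while ($span->firstChild !== null) {
            $span->parentNode->insertBefore($span->firstChild, $span);
        }
        $span->parentNode->removeChild($span);
    }

    return trim($dom->saveHTML());
}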

// Usage: clean HTML before converting
$cleanHtml = removeSpansFromCode($html);
$markdown = htmlToMarkdown($cleanHtml);
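
The function above suits complete pages. For fragments, two DOMDocument behaviors can get in the way: loadHTML() assumes ISO-8859-1 when no charset is declared, and LIBXML_HTML_NOIMPLIED can nest sibling top-level elements under the first one. A possible variant, sketched here under those assumptions, wraps the fragment in a document that declares UTF-8 and returns only the children of <body>. The name unwrapSpansInCode and the sample input are hypothetical.

```php
<?php
// Hypothetical helper: unwrap <span> elements inside <pre>/<code> in an HTML fragment.
function unwrapSpansInCode(string $fragment): string
{
    // Wrap the fragment in a full document that declares UTF-8, so libxml
    // neither falls back to ISO-8859-1 nor has to guess a root element.
    $wrapped = '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $fragment . '</body></html>';

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parse warnings
    $dom->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the caller's setting

    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query('//pre//span | //code//span')) as $span) {
        // Move the children in front of the span, then remove the empty span.
        while ($span->firstChild !== null) {
            $span->parentNode->insertBefore($span->firstChild, $span);
        }
        $span->parentNode->removeChild($span);
    }

    // Serialize only the children of <body>, dropping the wrapper again.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return trim($out);
}

// Made-up fragment with two top-level elements and highlighted code.
$fragment = '<p>Café</p><pre><code><span class="k">echo</span> <span class="n">$x</span>;</code></pre>';
$cleaned = unwrapSpansInCode($fragment);
echo $cleaned, "\n";
```

The result can be passed to htmlToMarkdown() in the same way as the output of removeSpansFromCode().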

Security Considerations

Important

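By default, the library preserves tags it does not recognize, such as <script>, <iframe>, and <div>. When processing untrusted user input, always enable strip_tags, remove_nodes, or both. Note that strip_tags removes only the tags and keeps their content, so use remove_nodes for elements such as <script> whose content should disappear as well.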
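
The difference is easiest to see by running the same input through a default converter and a stricter one. The following is a minimal sketch, assuming a Composer install with autoloading from vendor/autoload.php; the input string and variable names are made up for illustration.

```php
<?php
require 'vendor/autoload.php'; // assumes a Composer install in the current directory

use League\HTMLToMarkdown\HtmlConverter;

// Made-up untrusted input with a wrapper <div> and an inline script.
$untrusted = '<div><p>Hello</p><script>alert(1)</script></div>';

// Defaults: no stripping, no node removal.
$permissive = new HtmlConverter();

// Strip tags without a Markdown equivalent and drop risky elements entirely.
$strict = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script style iframe object embed',
]);

$permissiveMarkdown = $permissive->convert($untrusted);
$strictMarkdown = $strict->convert($untrusted);

// Dump both results to compare what survives the conversion.
var_dump($permissiveMarkdown, $strictMarkdown);
```

The strict result should contain the paragraph text but neither the <div> wrapper nor anything from the script element.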

For user-generated content, combine with HTML Purifier for additional safety:

php
use HTMLPurifier;
use HTMLPurifier_Config;
use League\HTMLToMarkdown\HtmlConverter;

function safeHtmlToMarkdown(string $untrustedHtml): string
{
    // First, sanitize with HTML Purifier
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);
    $cleanHtml = $purifier->purify($untrustedHtml);

    // Then convert to Markdown
    $converter = new HtmlConverter([
        'strip_tags' => true,
        'remove_nodes' => 'script style iframe object embed',
    ]);

    return $converter->convert($cleanHtml);
}

Common Issues

DOMDocument Not Found

On CentOS or minimal PHP installations, you may see:

Fatal error: Class 'DOMDocument' not found

Fix by installing the PHP XML extension:

bash
# CentOS/RHEL
sudo yum install php-xml

# Ubuntu/Debian
sudo apt-get install php-xml
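
To confirm the fix, a short preflight script can report which of the extensions named in the Installation section are loaded. This is only a sketch using PHP's built-in extension_loaded(); run it with the same PHP binary or SAPI that serves your application, since the CLI and the web server may load different extensions.

```php
<?php
// Extensions named in the Installation section.
$required = ['xml', 'libxml', 'dom'];

// Collect every required extension that is not loaded.
$missing = [];
foreach ($required as $extension) {
    if (!extension_loaded($extension)) {
        $missing[] = $extension;
    }
}

if ($missing === []) {
    echo "All required extensions are loaded.\n";
} else {
    echo 'Missing extensions: ' . implode(', ', $missing) . "\n";
}
```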

Malformed HTML Warnings

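The suppress_errors option controls whether libxml warnings for malformed HTML (common with scraped content) are hidden. Current releases enable it by default, so passing it explicitly mainly documents the intent; set it to false when you want to see the warnings while debugging: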

php
$converter = new HtmlConverter(['suppress_errors' => true]);
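
For context, the warnings in question come from libxml, which PHP's DOM extension uses to parse HTML. The sketch below uses only built-in functions to show how such warnings can be collected instead of printed. Whether the library uses exactly this mechanism internally is an assumption, and the malformed sample markup is made up.

```php
<?php
// Made-up malformed input: unclosed <b>, stray closing tag, unknown element.
$malformed = '<p>Unclosed <b>bold</p></i><foo>unknown</foo>';

// Buffer libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($malformed);

// Fetch whatever the parser reported, then reset the buffer and the setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with line, code, and message properties.
    printf("line %d: %s\n", $error->line, trim($error->message));
}

echo count($errors), " parser message(s) collected.\n";
```

Which messages appear, if any, depends on the installed libxml2 version, so the script prints a count rather than expecting specific text.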

Conclusion

The league/html-to-markdown library handles the complexity of HTML-to-Markdown conversion with sensible defaults and extensive customization. Key takeaways:

  • Use strip_tags and remove_nodes to clean unwanted elements
  • Add TableConverter for table support
  • Always sanitize untrusted input before processing
  • Preprocess syntax-highlighted code blocks for cleaner output

For LLM pipelines or content processing, this combination of configuration options provides clean, readable Markdown from almost any HTML source.