Convert HTML to Markdown in PHP with league/html-to-markdown
Introduction
Converting HTML to Markdown is a common need when working with web scraping, CMS migrations, RSS feed processing, or preparing content for LLMs. The league/html-to-markdown library provides a robust, battle-tested solution with over 24 million downloads.
This article covers installation, configuration options, table support, and security considerations for untrusted input.
Installation
Install via Composer:
composer require league/html-to-markdown
The library requires PHP 7.2+ with the xml, libxml, and dom extensions (enabled by default on most distributions).
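The extension requirement can be verified at startup before any conversion runs. The following is a minimal sketch using only PHP built-ins; `missingExtensions()` is a hypothetical helper name, not part of the library.

```php
<?php
// Hypothetical preflight helper (not part of league/html-to-markdown).
// Returns the names of the required extensions that are NOT loaded.
function missingExtensions(array $required = ['dom', 'libxml', 'xml']): array
{
    return array_values(array_filter(
        $required,
        static function (string $ext): bool {
            return !extension_loaded($ext);
        }
    ));
}

$missing = missingExtensions();

if ($missing === []) {
    echo 'All required extensions are loaded.' . PHP_EOL;
} else {
    echo 'Missing PHP extensions: ' . implode(', ', $missing) . PHP_EOL;
}
```

Running a check like this in a deployment script surfaces a missing extension as a readable message instead of a fatal error at conversion time.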
Basic Usage
require 'vendor/autoload.php'; // Composer autoloader
use League\HTMLToMarkdown\HtmlConverter;
$converter = new HtmlConverter();
$html = '<h1>Hello World</h1><p>This is a <strong>test</strong>.</p>';
$markdown = $converter->convert($html);
echo $markdown;
// Output (H1 uses the default setext header style):
// Hello World
// ===========
//
// This is a **test**.
Configuration Options
The converter accepts an array of options:
$converter = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script style',
    'hard_break' => true,
    'strip_placeholder_links' => true,
]);
Key Options Explained
| Option | Default | Description |
|---|---|---|
| `strip_tags` | `false` | Remove HTML tags without Markdown equivalents, keeping their content |
| `remove_nodes` | `''` | Space-separated list of tags to remove entirely (including content) |
| `hard_break` | `false` | Convert `<br>` to `\n` instead of two trailing spaces plus `\n` (GFM style) |
| `strip_placeholder_links` | `false` | Remove `<a>` tags without an `href` attribute |
| `header_style` | `'setext'` | Use `'atx'` for `#`-style headers on H1/H2 |
| `preserve_comments` | `false` | Keep HTML comments in output |
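To see how these options change the output, the sketch below converts the same snippet with the defaults and with three options switched on. It assumes the package is installed via Composer in the working directory; the sample HTML is made up for illustration.

```php
<?php
// Assumes league/html-to-markdown was installed with Composer in this directory.
require 'vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Illustrative input: a heading, a line break, and a tag with no Markdown equivalent.
$html = '<h1>Title</h1><p>line one<br>line two</p><p>A <span>plain</span> span</p>';

// Library defaults: setext headers, <br> as two spaces + newline, unknown tags kept.
$default = new HtmlConverter();

// Same input with ATX headers, bare-newline breaks, and unknown tags stripped.
$tuned = new HtmlConverter([
    'header_style' => 'atx',
    'hard_break'   => true,
    'strip_tags'   => true,
]);

$defaultMarkdown = $default->convert($html);
$tunedMarkdown   = $tuned->convert($html);

echo "--- defaults ---\n", $defaultMarkdown, "\n";
echo "--- tuned ---\n", $tunedMarkdown, "\n";
```

Comparing the two outputs side by side is a quick way to decide which options a given source needs before committing to a configuration.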
Recommended Configuration for Web Scraping
When converting scraped HTML, you typically want to strip navigation, scripts, and other non-content elements:
$converter = new HtmlConverter([
    'strip_tags' => true,
    'remove_nodes' => 'script head style noscript nav footer aside header',
    'hard_break' => true,
    'strip_placeholder_links' => true,
]);
Adding Table Support
Table conversion is not enabled by default because tables aren't part of the original Markdown spec. Add support with the TableConverter:
use League\HTMLToMarkdown\HtmlConverter;
use League\HTMLToMarkdown\Converter\TableConverter;
$converter = new HtmlConverter(['strip_tags' => true]);
$converter->getEnvironment()->addConverter(new TableConverter());
$html = '
<table>
<tr><th>Name</th><th>Role</th></tr>
<tr><td>Alice</td><td>Developer</td></tr>
<tr><td>Bob</td><td>Designer</td></tr>
</table>';
echo $converter->convert($html);
// | Name | Role |
// | --- | --- |
// | Alice | Developer |
// | Bob | Designer |
Practical Example: Web Scraping Pipeline
Here's a complete utility function for converting scraped HTML to clean Markdown:
use League\HTMLToMarkdown\Converter\TableConverter;
use League\HTMLToMarkdown\HtmlConverter;
function htmlToMarkdown(string $html): string
{
    $converter = new HtmlConverter([
        'strip_tags' => true,
        'remove_nodes' => 'script head style noscript nav footer aside header',
        'hard_break' => true,
        'strip_placeholder_links' => true,
    ]);
    $converter->getEnvironment()->addConverter(new TableConverter());
    return $converter->convert($html);
}
// Usage
$html = file_get_contents('https://example.com/article');
if ($html === false) {
    // file_get_contents() returns false when the request fails
    throw new RuntimeException('Failed to fetch the page');
}
$markdown = htmlToMarkdown($html);
Preprocessing: Cleaning Syntax-Highlighted Code
When scraping documentation sites, code blocks often contain <span> tags for syntax highlighting. These can clutter your Markdown output. Clean them before conversion:
function removeSpansFromCode(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    // Collect parser warnings internally instead of emitting them
    libxml_use_internal_errors(true);
    // Parse as a fragment: do not add <html>/<body> wrappers or a doctype
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    // Every <span> nested anywhere inside <pre> or <code>
    $spans = $xpath->query('//pre//span | //code//span');
    foreach ($spans as $span) {
        // Move the span's children in front of it, then drop the empty span
        while ($span->childNodes->length > 0) {
            $span->parentNode->insertBefore($span->childNodes->item(0), $span);
        }
        $span->parentNode->removeChild($span);
    }
    return trim($dom->saveHTML());
}
// Usage: clean HTML before converting
$cleanHtml = removeSpansFromCode($html);
$markdown = htmlToMarkdown($cleanHtml);
Security Considerations
Important
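By default, the library preserves tags that have no Markdown equivalent, such as `<script>`, `<iframe>`, and `<div>`. When processing untrusted user input, always enable `strip_tags` and list dangerous elements in `remove_nodes`: `strip_tags` drops only the tags and keeps their content, while `remove_nodes` drops the content as well.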
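The difference between the two options matters most for `<script>`. The sketch below runs the same input through a converter with `strip_tags` only and through one that also lists `script` in `remove_nodes`. It assumes a Composer install in the working directory, and the input string is purely illustrative.

```php
<?php
// Assumes league/html-to-markdown was installed with Composer in this directory.
require 'vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Illustrative untrusted input: a paragraph followed by an inline script.
$html = '<p>Hello</p><script>alert("x")</script>';

// strip_tags removes unknown tags but keeps whatever text they contain.
$stripOnly = new HtmlConverter(['strip_tags' => true]);

// remove_nodes drops the listed elements together with their content.
$withRemoveNodes = new HtmlConverter([
    'strip_tags'   => true,
    'remove_nodes' => 'script',
]);

$strippedMarkdown = $stripOnly->convert($html);
$removedMarkdown  = $withRemoveNodes->convert($html);

// Report whether the script body leaked into each result.
echo 'strip_tags only, script body present: ';
var_dump(strpos($strippedMarkdown, 'alert') !== false);
echo 'with remove_nodes, script body present: ';
var_dump(strpos($removedMarkdown, 'alert') !== false);
```

A check like this is worth keeping as a regression test for any configuration that handles user-supplied HTML.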
For user-generated content, combine with HTML Purifier for additional safety:
// HTMLPurifier classes live in the global namespace;
// these two imports only matter inside a namespaced file
use HTMLPurifier;
use HTMLPurifier_Config;
use League\HTMLToMarkdown\HtmlConverter;
function safeHtmlToMarkdown(string $untrustedHtml): string
{
    // First, sanitize with HTML Purifier
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);
    $cleanHtml = $purifier->purify($untrustedHtml);
    // Then convert to Markdown
    $converter = new HtmlConverter([
        'strip_tags' => true,
        'remove_nodes' => 'script style iframe object embed',
    ]);
    return $converter->convert($cleanHtml);
}
Common Issues
DOMDocument Not Found
On CentOS or minimal PHP installations, you may see:
Fatal error: Class 'DOMDocument' not found
Fix by installing the PHP XML extension:
# CentOS/RHEL
sudo yum install php-xml
# Ubuntu/Debian
sudo apt-get install php-xml
Malformed HTML Warnings
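The converter suppresses libxml warnings for malformed HTML (common with scraped content) by default. Set `suppress_errors` explicitly to make that choice visible, or set it to false to surface the warnings while debugging:
$converter = new HtmlConverter(['suppress_errors' => true]);
Conclusion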
The league/html-to-markdown library handles the complexity of HTML-to-Markdown conversion with sensible defaults and extensive customization. Key takeaways:
- Use `strip_tags` and `remove_nodes` to clean unwanted elements
- Add `TableConverter` for table support
- Always sanitize untrusted input before processing
- Preprocess syntax-highlighted code blocks for cleaner output
For LLM pipelines or content processing, this combination of configuration options provides clean, readable Markdown from almost any HTML source.

