Stop using preg_* on HTML and start using \Dom\HTMLDocument instead
https://shkspr.mobi/blog/2025/05/stop-using-preg_-on-html-and-use-domhtmldocument/
It is a truth universally acknowledged that a programmer in possession of some HTML will eventually try to parse it with a regular expression.
This makes many people very angry and is widely regarded as a bad move.
In the bad old days, it was somewhat understandable for a PHP coder to run a quick-and-dirty preg_replace() on a scrap of code. They could probably control the input, and there wasn't a great way to manipulate an HTML5 DOM.
Rejoice, sinners! PHP 8.4 is here to save your wicked souls. There's a new HTML5 parser which makes everything better and stops you having to write brittle regexen.
Here are a few tips, mostly notes to myself, which I hope you'll find useful.
Sanitise HTML
This is the most basic example. This loads HTML into a DOM, tries to fix all the mistakes it finds, and then spits out the result.
PHP
$html = '<p id="yes" id="no"><em>Hi</div><h2>Test</h3><img />';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
echo $dom->saveHTML();
It uses LIBXML_HTML_NOIMPLIED because we don't want a full HTML document with a doctype, head, body, etc.
If you want Pretty Printing, you can use my library.
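To see what LIBXML_HTML_NOIMPLIED actually changes, here's a minimal sketch that parses the same fragment with and without the flag. The fragment and the variable names are only illustrative, and I'd expect the exact serialisation to vary a little between PHP versions.

```php
<?php
// Requires PHP 8.4 or later for \Dom\HTMLDocument.
$fragment = '<p>Hi</p>';

// Without the flag, the parser adds the implied <html>, <head> and <body>.
$full     = \Dom\HTMLDocument::createFromString( $fragment, LIBXML_NOERROR, "UTF-8" );
$fullHtml = $full->saveHTML();

// With the flag, only the markup that was passed in is kept.
$partial     = \Dom\HTMLDocument::createFromString( $fragment, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
$partialHtml = $partial->saveHTML();

echo $fullHtml, "\n", $partialHtml, "\n";
```

Use the first form when you want a whole page, and the second when you're cleaning up a snippet that will be dropped into a larger template.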
Get the plain text
OK, so you've got the DOM. How do you get the text of the body without any of the surrounding HTML?
PHP
$html = '<p><em>Hello</em> World!</p>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );
echo $dom->body->textContent;
Note, this doesn't replace images with their alt text.
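If you do want the alt text, one possible approach is to swap each image for a text node before reading textContent. This is a rough sketch rather than a complete solution: the sample HTML is made up, and it only handles img elements, not other replaced content.

```php
<?php
// Requires PHP 8.4 or later for \Dom\HTMLDocument.
$html = '<p>Hello <img src="cat.jpg" alt="cat"> World!</p>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

// Swap every image for a text node holding its alt text (empty if there is none).
foreach ( $dom->querySelectorAll( "img" ) as $img ) {
	$alt = $img->getAttribute( "alt" ) ?? "";
	$img->replaceWith( $dom->createTextNode( $alt ) );
}

$text = $dom->body->textContent;
echo $text;
```

Bear in mind this changes the DOM, so parse a fresh copy if you still need the original images afterwards.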
Get a single element
You can use the same querySelector() function as you do in JavaScript!
PHP
$element = $dom->querySelector( "h2" );
That returns a reference to the element inside the DOM, not a copy, which means you can run:
PHP
$element->setAttribute( "id", "interesting" );
echo $dom->querySelector( "h2" )->attributes["id"]->value;
And you will see that the DOM has been manipulated!
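One thing to watch: querySelector() gives you null when nothing matches, so it's worth guarding before you call methods on the result. Here's a small sketch of that, using getAttribute() to read the value back. The sample HTML is just for illustration.

```php
<?php
// Requires PHP 8.4 or later for \Dom\HTMLDocument.
$html = '<h2>Test</h2>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

// querySelector() returns null when nothing matches, so guard before using it.
$element = $dom->querySelector( "h2" );
if ( $element !== null ) {
	$element->setAttribute( "id", "interesting" );
}

// The nullsafe operator avoids a fatal error if the element is missing.
$id      = $dom->querySelector( "h2" )?->getAttribute( "id" );
$missing = $dom->querySelector( "h3" );

var_dump( $id, $missing );
```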
Search for multiple elements
Suppose you have a bunch of headings and you want to get all of them. You can use the same querySelectorAll() function as you do in JavaScript!
To get all headings, in the order they appear:
PHP
$headings = $dom->querySelectorAll( "h1, h2, h3, h4, h5, h6" );
foreach ( $headings as $heading ) {
	// Do something
}
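As an example of "do something", here's a sketch that turns the headings into an indented outline. The sample HTML and the $outline variable are my own inventions for the demo; it assumes every match is an h1 to h6, which the selector guarantees.

```php
<?php
// Requires PHP 8.4 or later for \Dom\HTMLDocument.
$html = '<h1>Title</h1><p>Intro</p><h2>First</h2><h3>Detail</h3><h2>Second</h2>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

$outline = [];
foreach ( $dom->querySelectorAll( "h1, h2, h3, h4, h5, h6" ) as $heading ) {
	// localName is the lower-case tag name, e.g. "h2"; drop the "h" to get the level.
	$level     = (int) substr( $heading->localName, 1 );
	$outline[] = str_repeat( "  ", $level - 1 ) . trim( $heading->textContent );
}

echo implode( "\n", $outline ), "\n";
```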
Advanced Search
Suppose you have a bunch of links and you want to find only those which point to "example.com/test/". Again, you can use the same attribute selectors as you would in CSS or JavaScript:
PHP
$dom->querySelectorAll( "a[href^=https\:\/\/example\.com\/test\/]" );
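If all those backslashes make your eyes hurt, CSS also lets you quote the attribute value, which avoids the escaping. A minimal sketch, with three made-up links to show which ones match:

```php
<?php
// Requires PHP 8.4 or later for \Dom\HTMLDocument.
$html = '<a href="https://example.com/test/one">One</a>'
      . '<a href="https://example.com/other">Two</a>'
      . '<a href="http://example.com/test/three">Three</a>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

// Quoting the attribute value means the URL needs no backslash escapes.
$links = $dom->querySelectorAll( 'a[href^="https://example.com/test/"]' );

$matches = [];
foreach ( $links as $link ) {
	$matches[] = $link->getAttribute( "href" );
}

print_r( $matches );
```

The ^= operator is a plain prefix match, so the http:// link and the /other link are both skipped.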
Replacing content
Sadly, it isn't quite as simple as setting the innerHTML. Each search returns a node. That node may have children. Those children will also be nodes which, themselves, may have children, and so on.
Let's take a simple example:
PHP
$html = '<h2>Hello</h2>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
$element = $dom->querySelector( "h2" );
$element->childNodes[0]->textContent = "Goodbye";
echo $dom->saveHTML();
That changes "Hello" to "Goodbye".
But what if the element has child nodes?
PHP
$html = '<h2>Hello <em>friend</em></h2>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
$element = $dom->querySelector( "h2" );
$element->childNodes[0]->textContent = "Goodbye";
echo $dom->saveHTML();
That outputs <h2>Goodbye<em>friend</em></h2>. Only the first child, the text node "Hello " (trailing space included), was replaced, which is why the space has vanished and the <em> is untouched. So think carefully about the structure of the DOM and what you want to replace.
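Here's a sketch of two ways you might handle that, depending on what you're after. Neither is "the" right answer, and the str_replace() call is just a stand-in for whatever text change you need.

```php
<?php
// Requires PHP 8.4 or later for \Dom\HTMLDocument.
$html  = '<h2>Hello <em>friend</em></h2>';
$flags = LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED;

// Option 1: replace everything inside the element, child elements included.
$dom = \Dom\HTMLDocument::createFromString( $html, $flags, "UTF-8" );
$dom->querySelector( "h2" )->textContent = "Goodbye";
$replacedAll = $dom->saveHTML();

// Option 2: keep child elements and only edit text nodes directly inside the <h2>.
$dom = \Dom\HTMLDocument::createFromString( $html, $flags, "UTF-8" );
foreach ( $dom->querySelector( "h2" )->childNodes as $child ) {
	if ( $child->nodeType === XML_TEXT_NODE ) {
		$child->textContent = str_replace( "Hello", "Goodbye", $child->textContent );
	}
}
$replacedText = $dom->saveHTML();

echo $replacedAll, "\n", $replacedText, "\n";
```

Option 1 throws away the <em> entirely; option 2 keeps it, along with the space before it.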
Adding a new node
This one is tricky! Let's suppose you have this:
HTML
<div id="page">
   <main>
      <h2>Hello</h2>
You want to add an <h1> before the <h2>. Here's how to do this.
First, you need to construct the DOM:
PHP
$html = '<div id="page"><main><h2>Hello</h2>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
Next, you need to construct an entirely new DOM for your new node.
PHP
$newHTML = "<h1>Title</h1>";
$newDom = \Dom\HTMLDocument::createFromString( $newHTML, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
Next, extract the new element from the new DOM, and import it into the original DOM:
PHP
$element = $dom->importNode( $newDom->firstChild, true );
The element now needs to be inserted somewhere in the original DOM. In this case, get the h2 and tell its parent node to insert the new node before it:
PHP
$h2 = $dom->querySelector( "h2" );
$h2->parentNode->insertBefore( $element, $h2 );
echo $dom->saveHTML();
Out pops:
HTML
<div id="page">
   <main>
      <h1>Title</h1>
      <h2>Hello</h2>
   </main>
</div>
An alternative is to use the appendChild() method. Note that it appends it to the end of the children. For example:
PHP
$div = $dom->querySelector( "#page" );
$div->appendChild( $element );
echo $dom->saveHTML();
Produces:
HTML
<div id="page">
   <main>
      <h2>Hello</h2>
   </main>
   <h1>Title</h1>
</div>
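When the new node is as simple as a heading, you may not need a second DOM at all. Here's a sketch of a different route from the one above: create the element inside the existing document with createElement() and place it with before(). This isn't the method used earlier in this post, just another option that I'd reach for with small nodes.

```php
<?php
// Requires PHP 8.4 or later for \Dom\HTMLDocument.
$html = '<div id="page"><main><h2>Hello</h2>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );

// Build the new element inside the existing document, so no import is needed.
$h1 = $dom->createElement( "h1" );
$h1->textContent = "Title";

// before() inserts the new node as the previous sibling of the <h2>.
$dom->querySelector( "h2" )->before( $h1 );

$result = $dom->saveHTML();
echo $result;
```

Importing from a second DOM is still handy when the new content arrives as a string of HTML with its own nested markup.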
And more?
I've only scratched the surface of what the new 8.4 HTML Parser can do. I've already rewritten lots of my yucky old preg_ code to something which (hopefully) is less likely to break in catastrophic ways.
If you have any other tips, please leave a comment.