PHP and the Negative Lookahead

I am, quite unabashedly, a sucker for a good regex question. I dig regular expressions. I love their power, their flexibility and their precision. Often I’ll use a regex even when I don’t have to because I can be absolutely precise about what I want even at the expense of a millisecond or two.

Being the sucker that I am, I was thoroughly hooked when @benrasmusen asked for regex help on Twitter yesterday.

He had a set of text that looked something like this:

<p>With more than 20 years’ experience recording, mixing, and mastering, I have an in-depth understanding of the mindsets involved in each step of a project. The focus of the mixer is the balances of the individual elements and how they relate to create a song. By contrast, the mastering engineer focuses on the song as a whole.</p>
<p>In mixing numerous songs for an album, the mixing engineer typically has a hard time creating consistency across all the songs. But the mastering engineer seeks to maintain energy levels and sonic flow among all the songs (except when not appropriate), creating a unified signature for the record. Those involved in recording a project often become emotionally attached to it, whereas the mastering engineer can provide an objective ear.</p>

He also had a set of terms that he wanted to find in that text so that he could replace each instance of the term with markup that injected a link around the text. For example, the word “mastering” might be replaced with:

<a href="#" title="A link to information about mastering">mastering</a>

The problem he was having was that the title attribute of an injected link occasionally contained a word that matched a term due to be replaced in a later iteration. For example, in his case, the title of the link around “mastering” included the word “mix”. “Mix” was another term that had to be replaced with its own link, and that replacement happened after the replacement of “mastering”. Result: the instance of “mix” in the title attribute of the “mastering” link was itself replaced with a link, and he ended up with nested links – a new link nested right within the title attribute of another.
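To make the failure concrete, here’s a minimal sketch of the naive approach. The sample sentence, the term list and the title text are all invented for illustration; they aren’t Ben’s actual data, and I’m assuming the replacements ran in a simple loop.

```php
<?php
// Hypothetical term => title map; the titles are made up for this example.
$terms = [
    'mastering' => 'How mastering differs from a mix',
    'mix'       => 'What goes into a mix',
];

$text = '<p>A good mix makes mastering easier.</p>';

// Naive approach: replace each term in turn with no awareness of markup.
$naive = $text;
foreach ($terms as $term => $title) {
    $link  = '<a href="#" title="' . $title . '">' . $term . '</a>';
    $naive = preg_replace('/\b' . preg_quote($term, '/') . '\b/i', $link, $naive);
}

// The second pass also rewrites the "mix" inside the first link's title.
echo $naive, PHP_EOL;
```

The “mastering” pass is harmless on its own. It’s the “mix” pass that does the damage, because by then the title attribute injected in the first pass is just more text to search.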

Oy.

Over the years, I think I’ve become a competent user of regex and perhaps even proficient. I am not, however, a guru. There are a number of concepts that elude me on a practical level, though I understand the theory well enough. One of these concepts is the lookahead and another is its sister concept, the lookbehind.

In this case, I knew enough to know that a lookahead (a negative lookahead, specifically) was needed, but not enough to know exactly how to write it. So I looked. And now I’m documenting it for my own later reference. What he needed was this regex:

// No "g" modifier: PHP's preg_replace() already replaces every match.
$regex = '/\b' . preg_quote($term, '/') . '\b(?![^<]*>)/i';

This finds every word-delimited instance of the replacement term that is not followed by zero or more characters other than “<” and then a “>”. In other words, it skips any instance that is followed by a “>” before the next tag opens, because that means the term is sitting inside a tag, such as in an attribute value.
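Here’s a rough sketch of how that pattern might be dropped into a replacement loop. The function name linkTerms(), the sample text and the titles are my own placeholders, and using preg_quote() and preg_replace_callback() is an assumption on my part rather than what Ben actually ran.

```php
<?php
// Hypothetical helper: wraps each term in a link, skipping matches inside tags.
function linkTerms(string $text, array $terms): string
{
    foreach ($terms as $term => $title) {
        // Negative lookahead: reject a match if a ">" follows before any "<".
        $regex = '/\b' . preg_quote($term, '/') . '\b(?![^<]*>)/i';
        $text  = preg_replace_callback(
            $regex,
            function (array $m) use ($title): string {
                // $m[0] is the matched text, so its original case is kept.
                return '<a href="#" title="' . htmlspecialchars($title) . '">' . $m[0] . '</a>';
            },
            $text
        );
    }
    return $text;
}

// Invented sample data for illustration only.
$terms = [
    'mastering' => 'How mastering differs from a mix',
    'mix'       => 'What goes into a mix',
];

$linked = linkTerms('<p>A good mix makes mastering easier.</p>', $terms);
echo $linked, PHP_EOL;
```

With this sample, the “mix” in the paragraph text gets its link, while the “mix” inside the title attribute of the “mastering” link is left alone, since a “>” follows it before any “<” does.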

It’s not perfect, of course, but it’s workable, I think.
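One imperfection I can demonstrate, as far as I can tell: a bare “>” in ordinary text looks just like the end of a tag to the lookahead, so a term that precedes it is skipped. The sample sentences below are invented for illustration.

```php
<?php
$regex = '/\bmix\b(?![^<]*>)/i';

// A literal ">" that is not part of a tag still satisfies the lookahead,
// so this perfectly legitimate occurrence of the term is not matched.
$unescaped = '<p>Keep the mix > the sum of its parts.</p>';

// With the character written as an entity, the term is matched again.
$escaped = '<p>Keep the mix &gt; the sum of its parts.</p>';

var_dump(preg_match($regex, $unescaped));
var_dump(preg_match($regex, $escaped));
```

If the text being processed is properly encoded HTML, this case shouldn’t come up much, which is part of why I’d still call the pattern workable.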

2 Comments on PHP and the Negative Lookahead

  1. Ben Rasmusen said...

    I didn’t realize you had blogged about this! I just wanted to say thanks again for the help!

  2. Rob Wilkerson said...

    Yeah, little reminder for myself. It was an interesting problem that I might want/have to revisit to solve a different problem down the line. Half of my posts are mostly pseudo-sticky-notes to myself. This site might as well be a refrigerator with magnets all over it.

    Hopefully the solution is still working. :-)