Home      Updates      RSS Feed      About      Code      Contact

  Regular expressions not so regular? February  7 2012  9:50 PM EST

Now, if you are someone who has heard of regular expressions and HTML/XML, you have also probably heard that regular expressions cannot parse HTML as HTML is not a regular language. If you've dug a little deeper, or read any of the comments surrounding this, you may also have heard that the term "regular expression" has come to mean something more than the official computer science term. So, can these super-regex match HTML?

When I first asked this question -- PCRE supports recursive regular expressions. Does that mean it _can_ parse HTML? -- I got several "answers" about HTML not being regular, and that I shouldn't even be thinking about parsing HTML with regex as I should be using a real parser. I was linked to this website , which gives a proof (over my head, at the moment) that XML is not regular, and therefore cannot be matched by a regular expression. None of those "answers" deterred me in my pursuit of knowledge, however.

The article mentions that regular expressions cannot guarantee palindrome-edness, which I knew for a fact PCRE could (at least to a limited extent). Take this for example:

^(.)(.).\2\1$
This matches five letter palindromes using PCRE backreferences. That gave me some backup to the argument that "regex are not regular expressions", so, off I went to wikipedia. There I found out that the definition of regular expression is not that complex. Wikipedia states that the simplest regular expressions are the empty set, empty string, and any single character in the alphabet. Then anything coming from the operations:

  1. Concatenation, so { "a" }{ "b" } becomes { "ab" }
  2. Alternation, so { "a" } | { "b" } becomes { "a", "b" }
  3. Kleene Star , which is the same star used in nearly every other regex. It means the preceding thing 0 or more times.

Parenthesis are used for grouping (to give different precedence to the operators). The ? and + included in many flavours of regex are simplified cases of the three operations listed above. The same is also true with the squiggly braces { and }.

At this point it is really late, so I'm now just going to give you the progression my regex has taken, and hopefully come back later to fill in the gaps. Please, if you see this still here in the future, send me an email. The contact link is at the bottom/top.

The first thing I got was something that could recursively match tags.

^(<([^>]+)>((?1)+|[^<]*)</\2>)$

Then I added tags that didn't have a separate closing one, like <img />

^(<([^>]+)>((?1)+|[^<]*)</\2>|<[^/]+/>)$

Then we got to comments. I was inexperienced with (read never before used) negative/positive lookahead/behind, so it took a bit of jiggering to get to this:

^(<([^>]+)>((?1)+|[^<]*)</\2>|<[^/]+/>|<!--(.*(?<!-->)|)-->)$

The very next thing was the annoying special cases for <!DOCTYPE> and <?xml?> at the very start. That yielded:

^((<\?xml.*(?!<!\?>)\?>)?<!DOCTYPE[^>]*>)?(<([^>]+)>((?3)+|[^<]*)</\4>|<[^/]+/>|<!--(.*(?<!-->)|)-->)$

The next thing I did (after this was originally posted, but not long after ;)) was to fix the script issue ajanata came up with by requiring an exact end to the start tag, and to nail down the allowed tag names.

^((<\?xml.*(?!<!\?>)\?>)?<!DOCTYPE[^>]*>)?(<(\w+)(?:\s+([^>]+))>((?3)+|.*(?!<</\4>))</\4>|<[^/]+/>|<!--(.*(?<!-->)|)-->)$

Everything up until this point had just seemed like special cases. Then ajanata pointed me to the spec, and I saw the link for tables. I still haven't confirmed it, but something that would definitely discount PCRE for HTML matching would be if the td count inside of each tr of the same table need to match (allowing for the possibility of colspan). At that point I switched pages to the XML spec ;)

Addendum: I still haven't come back to write the rest of this post. It's now the 11th, and I haven't done a whole lot more with the XML matching regex. I did, however, improve the palindrome matching one so that it can match arbitrarily long palindromes (up to recursion depth * 2 + 1 long):

s/^((.)((?1)|.|)\2)?(.*)$/This is (?1:not )a palindrome: \1\4
This uses conditional replacement, which apparently a lot of regex implementations do not implement. But, then, I doubt it is used very often.

Another addendum: I was just asked how the palindrome matching regex works, so I'll put a copy of my reply here.

Let's handle the first group first. Here's a simpler version that only matches 2 and 3 character palindromes:

((.)(.|)\2)
That matches a character, saving it to the second group. If you're familiar with the \2 syntax to access it in a replacement, it is the same here, a back reference to the currently captured second group's contents.

You can match "squares" -- words that are the same letters twice -- using this, by doing something like:
(.+)\1
that is, anything followed by itself.

So, the the we have so far:

((.)(.|)\2)
matches any character, followed by any character or nothing (the "middle" of the palindrome), finished with the same character as the beginning.

Now here is where it get's fun :) Just as you can reference what you have already captured, you can also reference previous patterns. So
([abc][def])(?1)
says to match one of a, b, c followed by d, e, f; and then another group of letters a, b, or c followed by d, e, or f. Luckily, we don't have to fully finish a pattern to reference it. Why should this be allowed? So you can reference a pattern inside of itself! Yes, we just achieved a recursive regex.
((.)(.|(?1)|)\2)
That is the same guy as we started with, but we changed the middle of the palindrome from looking like just one or 0 characters into looking like a palindrome.

There are a few modifiers so that all strings match, and so the palindrome must be the whole line. In practice, I modify the string beforehand to strip out punctuation, space and case (though the ignore case modifier may have been better). This is actually working on a bot of mine living in #uakroncs on slashnet.
Basically, either the first group matches (which is the definition of a palindrome), or the final group matches (the "(.*)" bit). In the replacement string I use conditional replacement. It's syntax is

(?[group]true-replacement:false-replacement)
Group here can either be a number or the name of one of the groups. So, if group 1 matched, it is a palindrome and we put the empty string in there. Otherwise it is not a palindrome, so we insert the word "not" in there. Finally, only \1 or \4 has anything in it, so we can just put them both at the end to get the input back out.

 

My name is Jeff Chapman; You can reach me at: [email protected]