Why is it that regex cannot match an XML element?

By | January 12, 2018
Questions:

This article argues that regular expressions cannot match nested structures because regexes are finite automatons.

He then offers a list of problems in which the answer states that the following cannot be solved using regexes:

  1. matching an XML element
  2. matching a C/VB/C# math expression
  3. matching a valid regex

Since 2 & 3 can conceivably contain brackets; this nesting is unsolvable for regexes.
But why is it impossible to match an XML element ? (He didn’t provide examples).

Answers:

You can match a limited subset of HTML tags, if you know in advance the tags to be matched.

But you can’t (reliably or nicely) parse arbitrary HTML. It is not a regular language.

Questions:
Answers:

How would you match this valid XML with regex?

<!--<d>>--<<--><div class='foo' id="bar" inline></div>

It’s like making a wooden car. Sure you can try to do it, but why?

But then comes the part of parsing the XML. How would you extract a set of possibly infinite attributes from an infinite set of elements using a finite set of groups? It’s just not possible due to the nature and structure of regex.

Questions:
Answers:

There are theoretical answers, based on what kind of grammar XML is and what kind of grammar regular expressions can match. These answers are sometimes flawed by the fact that most regular expression libraries we use today can do things that the formal regular expressions defined in computer science can’t do (like back-references).

And there are practical answers. The practical answer is: don’t do it because it’s the wrong tool for the job, your code will be hard to write and unmaintainable, it will be inefficient, it will have bugs, and no-one will know how to change it when the structure of the document changes. And because there are better tools for the job, called XML parsers.

Questions:
Answers:

Regular expressions are free of state. To parse an XML file, you need state. A < might signal the opening of an XML element. If it’s inside a comment <!-- < --> or an attribute value "<" though it means something else. Using Regexen you can only express things in terms of things that come before or after other things. To correctly parse < as opening an XML element you’d need to express something along the lines of:

< but not after <!-- if <!-- wasn’t followed by --> and not after " if " was not closed but only if " was an attribute because " as a text value has no influence on the next < and if not…

And that’s only for a simple <, not even covering all the possibilities. There are a handful of XML special chars that all have the same kind of circular conditions. Constructing a Regex that expresses all these conditions correctly for all cases is virtually impossible. It’s trivial with a state machine though.

Leave a Reply

Your email address will not be published. Required fields are marked *