Extended Regular Expressions: Succinctness and Decidability

(Dominik D. Freydenberger)

Abstract

Most modern implementations of regular expression engines allow the use of variables (also called backreferences). The resulting extended regular expressions (which, in the literature, are also called practical regular expressions, rewbr, or regex) are able to express non-regular languages.

The present paper demonstrates that extended regular-expressions cannot be minimized effectively (neither with respect to length, nor number of variables), and that the tradeoff in size between extended and “classical” regular expressions is not bounded by any recursive function. In addition to this, we prove the undecidability of several decision problems (universality, regularity, and cofiniteness) for extended regular expressions. Furthermore, we show that all these results hold even if the ex- tended regular expressions contain only a single variable.

Versions of the Paper

Theory of Computing Systems. Final version, preprint.
STACS 2011 (★ invited to special issue). Final version (open access).
more information

Additional Comments

By combining the coding techniques from this paper with the simulation of the Collatz function from that paper, it is possible to construct regular expressions of the following form (written in Python style):
^(((ba{n-1}b[ab]*)|(ba{n+1}[ab]*)|(a[ab]*)|([ab]*a)|([ab]*bb[ab]*)|(b)|([ab]*aab)|([ab]*b(aa)*ab(aaaaaa)*(()|(a|aa|aaa|aaaaa))b[ab]*))|([ab]*b(?P<x>(a*))(((?P=x)b(?P=x)a)|((?P=x)ab(?P=x){6}aaaaa)|((?P=x)(aa)+b(?P=x)b)|((?P=x)(aa)+ab(?P=x){6}aaaab))[ab]*))$
By choosing a value for n (and replacing n-1 and n+1 appropriately), we obtain a regular expression that is equivalent to ^[ab]*$ if and only if iterating the Collatz function on starting value n leads to the cycle 4, 2, 1. This is quite easy to see after reading both papers — but considering the length of those, I would be willing to provide a more detailed explanation, if someone asks nicely.
Note the interesting contrast between classical and extended regular expressions: For the former, equivalence testing (and minimality) are PSPACE-complete problems. But adding a single variable (that ranges over the language 0^*) makes this problem undecidable. Obviously, we could restrict variables to a finite amount of values. Then, using a reasoning that is similar to the one in the paper for tradeoffs for finite languages, we would be able to use a brute-force minimization algorithm. But this would be neither elegant, nor practical. Instead, I consider it more promising to examine classes of extended regular expressions that restrict other parameters: For the price of losing some regular languages, we now have a chance at efficient minimization. (This is the topic of my current research project).