Dominik D. Freydenberger

Project Summary

Introduction
Objectives

Introduction

The goal of this project is the development of new variants of regular expressions with backreferencing, in particular variants that can be optimized efficiently.

Regular expressions are a formal model from theoretical computer science that is used to define formal languages. Since their introduction in 1956, they have been extensively studied not only in formal language theory, but also in areas as, for example, database theory or bioinformatics.

In parallel to this development, practical computer science has adopted regular expressions as a standard model for the definition of patterns in strings. Common applications include the search in texts or data bases, filter rules for servers, and intrusion detection systems.

Over the years, regular expressions in practical computer science evolved into a model that is significantly different from the classical regular expressions in theory. One important difference is the adoption of a backreference operator, commonly called backreferences. These backreferences are used in almost all modern software libraries for regular expressions.

On the one hand, extending regular expressions with backreferences increases their expressive power significantly. On the other, it also reduces their manageability. For example, matching regular expressions with backreferences is an NP-hard problem.

In many applications of regular expressions, the same expression is used multiple times. For efficiency reasons, it is preferable to use a regular expression (with backreferences) that is of minimal length. Hence, an algorithm that converts a regular expression (with backferences) into an equivalent expression of minimal length would be of great practical use.

As shown in Freydenberger (2013), such a minimization algorithm does not exist. This holds even if the expression is restricted to use only a single backreference. These results also apply to a number of different minimization criteria.

Objectives

The central task of this project is to identify classes of restricted regular expressions with backreferences that can be minimized effectively and efficiently. In contrast to previous theoretical examinations of regular expressions with backreferences, this project will focus on other structural restrictions than the number of backreferences. In particular, it will be necessary to examine classes that do not have enough power to define the full class of regular languages.

A secondary goal is the application of the obtained results to closely related areas, in particular to the learning of regular expressions with backreferences, and the transfer of these results to related models, in particular various query languages from database theory.