Fast Learning of Restricted Regular Expressions and DTDs

(Dominik D. Freydenberger, Timo Kötzing)

Abstract

We study the problem of generalizing from a finite sample to a language taken from a predefined language class. The two language classes we consider are subclasses of the regular languages that are significant for the specification of XML documents: the classes corresponding to so-called chain regular expressions (CHAREs) and to single-occurrence regular expressions (SOREs). The previous literature gives a number of algorithms for generalizing to SOREs that provide a trade-off between the quality of the solution and speed. Furthermore, a fast but non-optimal algorithm for generalizing to CHAREs is known. For each of the two language classes, we give an efficient algorithm that returns a minimal generalization from the given finite sample to an element of the fixed language class; such generalizations are called descriptive. In this sense of descriptivity, both our algorithms are optimal.
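To make the two expression classes concrete, the following Python sketch checks whether a given expression is syntactically a SORE or a CHARE. It is only an illustration of the class definitions, not the learning algorithms of the paper. It assumes the usual definitions: a SORE uses every alphabet symbol at most once, and a CHARE is a SORE that is a concatenation of factors, each a disjunction of symbols optionally followed by `?`, `*` or `+`. The function names and the input syntax (single lowercase letters as symbols, `|` for disjunction) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumed toy syntax: symbols are single lowercase letters; the operators
# are parentheses, "|" (disjunction) and the postfix "?", "*", "+".
OPERATORS = set("()|*+?")

# One CHARE factor: a symbol or a parenthesized disjunction of symbols,
# optionally followed by one of "?", "*", "+".
FACTOR = r"(?:\([a-z](?:\|[a-z])*\)|[a-z])[?*+]?"
CHAIN = re.compile(rf"(?:{FACTOR})+")


def is_single_occurrence(expr: str) -> bool:
    """Return True if every alphabet symbol occurs at most once in expr."""
    symbols = [ch for ch in expr if ch not in OPERATORS and not ch.isspace()]
    return all(count == 1 for count in Counter(symbols).values())


def is_chain_expression(expr: str) -> bool:
    """Return True if expr is a single-occurrence chain of factors."""
    compact = "".join(expr.split())  # ignore whitespace
    return CHAIN.fullmatch(compact) is not None and is_single_occurrence(compact)
```

Under these assumptions, `a(b|c)*d?` is both a SORE and a CHARE, `(ab|c)*` is a SORE but not a CHARE because a factor contains a concatenation, and `a(b|a)*` is neither because `a` occurs twice.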

Versions of the Paper

Theory of Computing Systems. Final version, preprint.
ICDT 2013 (invited to special issue). Final version, preprint.

Additional Comments

An implementation of the algorithms is available online.