sore-inf

Introduction

sore-inf is a tool that learns deterministic single occurrence regular expressions (SOREs) from positive examples. This is an implementation of the learning algorithms from the paper that Timo Kötzing and I published at ICDT 2013, containing bugfixes from the journal version (see there for links to both versions). For further explanation, see the paper. If you want to learn DTDs, use the dtd-inf tool.

Installation

You need to install Python 3 on your computer (I do not know or care whether Python 2 will work). Download the package, unpack it. You can then run python3 sore-inf.py --help. (Depending on your system, you can give sore-inf.py executable rights and run it directly.)

Example usage

./sore-inf.py abc acb c

Computes a deterministic SORE for the sample consisting of the words abc, acb, and c. Note that the prettifcation algorithm uses character classes, e.g., instead of (a|b|c), it writes [a-c].

Implementation notes

Right now, we only allow letters that consist of single characters. As this is a tool that is mostly directed to theorists, it would be useful to have letters like b1 or A327, but we couldn't be bothered to implement this. (In fact, if you want to do this: most of the code already supports this, you would only need to change some parsing code in src/dtd-inf/sore.py.)
The code has not been audited or debugged in large scale (it should work, but don't bet the farm on it). It is also not particularly optimized, neither for efficiency, nor for readability. (But it should be quite fast.) Also, same files use white space as indentation, others tabs. Sorry.
The way the prettification algorithm is used is particularly ugly: Instead of prettifying the inferred regular expression in a tree type data structure, the expression is inferred as string, which is then parsed and prettified. A nice implementation would skip the intermediate string representation. This is one of the reasons why, if your element names contain any fancy unicode stuff, something might break.
Apart from the algorithms in the paper, this implementation also uses an additional post-processing step that makes the expressions prettier (e.g., turns nested expressions like (a|(b|(c|d))) into (a|b|c|d), etc.) The language theoretic details can be found in an eternally half-finished technical report that is available from me.

Authors and license

The core inference algorithm was implemented by Dominik D. Freydenberger and uses this implementation of Tarjan's Algorithm by Dries Verdegem (which, to our knowledge, is in the public domain). The prettification algorithm is a part of the M.O.D.O.D. library, which was designed (only for DREs) by Dominik D. Freydenberger and implemented by Christoph Burschka. The creation of the M.O.D.O.D. library was generously supported by the program "Nachwuchswissenschaftler/innen im Fokus" (Goethe University). We put this stuff under the MIT License, and the source code is already included.