sore-inf
Introduction
sore-inf is a tool that learns deterministic single occurrence regular expressions (SOREs) from positive examples. This is an implementation of the learning algorithms from the paper that Timo Kötzing and I published at ICDT 2013, containing bugfixes from the journal version (see
there for links to both versions).
For further explanation, see the paper. If you want to learn DTDs, use the
dtd-inf tool.
Installation
You need to install
Python 3 on your computer (I do not know or care whether Python 2 will work). Download the
package, unpack it. You can then run
python3 sore-inf.py --help
. (Depending on your system, you can give
sore-inf.py
executable rights and run it directly.)
Example usage
./sore-inf.py abc acb c
Computes a deterministic SORE for the sample consisting of the words
abc
,
acb
, and
c
. Note that the prettifcation algorithm uses character classes, e.g., instead of
(a|b|c)
, it writes
[a-c]
.
Implementation notes
Right now, we only allow letters that consist of single characters. As this is a tool that is mostly directed to theorists, it would be useful to have letters like b1
or A327
, but we couldn't be bothered to implement this. (In fact, if you want to do this: most of the code already supports this, you would only need to change some parsing code in src/dtd-inf/sore.py
.)
The code has not been audited or debugged in large scale (it should work, but don't bet the farm on it). It is also not particularly optimized, neither for efficiency, nor for readability. (But it should be quite fast.) Also, same files use white space as indentation, others tabs. Sorry.
The way the prettification algorithm is used is particularly ugly: Instead of prettifying the inferred regular expression in a tree type data structure, the expression is inferred as string, which is then parsed and prettified. A nice implementation would skip the intermediate string representation. This is one of the reasons why, if your element names contain any fancy unicode stuff, something might break.
Apart from the algorithms in the paper, this implementation also uses an additional post-processing step that makes the expressions prettier (e.g., turns nested expressions like (a|(b|(c|d)))
into (a|b|c|d)
, etc.) The language theoretic details can be found in an eternally half-finished technical report that is available from me.
Authors and license
The core inference algorithm was implemented by Dominik D. Freydenberger and uses
this implementation of Tarjan's Algorithm by Dries Verdegem (which, to our knowledge, is in the public domain). The prettification algorithm is a part of the
M.O.D.O.D. library, which was designed (only for DREs) by Dominik D. Freydenberger and implemented by
Christoph Burschka. The creation of the M.O.D.O.D. library was generously supported by the program "Nachwuchswissenschaftler/innen im Fokus" (Goethe University). We put this stuff under the MIT License, and the source code is already included.