dtd-inf

Introduction

dtd-inf is an XML schema inference tool that learns DTDs from positive examples. It implements the learning algorithms from the paper that Timo Kötzing and I published at ICDT 2013, and it includes the bugfixes from the journal version (see there for links to both versions).

dtd-inf computes the most precise element type declarations that are possible for the input XML, with the restriction that every declaration may use each element name only once. While this might sound quite restrictive, it seems to be enough for most applications. If you do not care about DTDs, but do care about regular expressions, you can use the sore-inf tool (also, the flag -d and the hidden bonus flags that are mentioned in the README might be of interest to you).
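To make the "only once" restriction concrete, here is a small sketch that checks whether a content model mentions every element name at most once. It is not part of dtd-inf; the function name is_single_occurrence and the simplified name pattern are assumptions made for this illustration only.

```python
import re

# Simplified pattern for XML names, optionally preceded by '#' (as in #PCDATA).
# This is an assumption for the sketch, not the full XML Name production.
NAME = re.compile(r"#?[A-Za-z_:][\w.:\-]*")

def is_single_occurrence(content_model):
    """Return True if no element name appears twice in the content model."""
    # Skip keywords such as #PCDATA; they are not element names.
    names = [tok for tok in NAME.findall(content_model) if not tok.startswith("#")]
    return len(names) == len(set(names))

# Each of title, author, year, date occurs exactly once.
ok_model = "(title, author+, (year | date)?)"
# Here the name 'a' occurs twice, so the restriction is violated.
bad_model = "(a, b, a)"

print(is_single_occurrence(ok_model))
print(is_single_occurrence(bad_model))
```

A declaration like the second one is outside of what dtd-inf can produce, as the restriction described above rules out repeated element names.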

Installation

You need to install Python 3 on your computer (I do not know or care whether Python 2 will work). Download the package and unpack it. You can then run python3 dtd-inf.py --help. (Depending on your system, you can give dtd-inf.py executable rights and run it directly.)

Example usage

My favorite example is the Mondial XML file. It is complicated enough to produce some interesting output, but not so large that parsing it takes forever. So it's perfect for toying around.

./dtd-inf.py mondial.xml
Computes a DTD for the file.
./dtd-inf.py mondial.xml -j
Omits the doctype stuff around the element type declarations.
./dtd-inf.py mondial.xml -js
Also omits all empty (boring) element type declarations.
./dtd-inf.py mondial.xml -j -e country city
Only learns the element type declarations for the elements country and city. You can also read from multiple files, e.g.
./dtd-inf.py file1.xml file2.xml -e elt1 elt2
Here, the help that is automatically generated by argparse is a little bit misleading: First, specify the list of files, then use the flags, to avoid ambiguous statements like ./dtd-inf.py -e elt1 elt2 file1.xml file2.xml.
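The following sketch shows why the order matters. It assumes that the file list is a positional argument and that -e accepts one or more names (both with nargs="+"); the actual declarations in dtd-inf.py may look different, and the names build_parser, try_parse and the destination "elements" are made up for this example.

```python
import argparse

def build_parser():
    # Hypothetical parser; the real dtd-inf.py may declare its options differently.
    parser = argparse.ArgumentParser(prog="dtd-inf-sketch")
    parser.add_argument("files", nargs="+")
    parser.add_argument("-e", dest="elements", nargs="+", default=None)
    return parser

def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects the input."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage message to stderr and exits on errors.
        return None

# Files first, flags afterwards: both lists are filled as intended.
good = try_parse(["file1.xml", "file2.xml", "-e", "elt1", "elt2"])

# Flag first: -e greedily takes all four values, so no file names remain.
bad = try_parse(["-e", "elt1", "elt2", "file1.xml", "file2.xml"])
```

Under these assumptions, the second call fails because argparse cannot tell where the element names end and the file names begin.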

Implementation notes

Authors and license

The core inference algorithm was implemented by Dominik D. Freydenberger and uses this implementation of Tarjan's Algorithm by Dries Verdegem (which, to our knowledge, is in the public domain). The prettification algorithm is a part of the M.O.D.O.D. library, which was designed (only for DREs) by Dominik D. Freydenberger and implemented by Christoph Burschka. The creation of the M.O.D.O.D. library was generously supported by the program "Nachwuchswissenschaftler/innen im Fokus" (Goethe University). We put this stuff under the MIT License, and the source code is already included.