A Logic for Document Spanners
(Dominik D. Freydenberger)
Document spanners are a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren. One of the central models in this framework are core spanners, which are based on regular expressions with variables that are then extended with an algebra.
As shown by Freydenberger and Holldack, there is a connection between core spanners and ECreg, the existential theory of concatenation with regular constraints.
The present paper further develops this connection by defining SpLog, a fragment of ECreg that has the same expressive power as core spanners. This equivalence extends beyond equivalence of expressive power, as we show the existence of polynomial time conversions between this fragment and core spanners. This even holds for variants of core spanners that are based on automata instead of regular expressions. Applications of this approach include an alternative way of defining relations for spanners, insights into the relative succinctness of various classes of spanner representations, and a pumping lemma for core spanners.
Versions of the Paper
- The conference version contained a brief mentioned how (U)CRPQs with string equalities directly map to fragements of SpLog. The situation is a little bit more complicated, as this holds only if the path queries have some natural restriction. Details on this can be found in Section 7 of the full version.
- If you are interested in graph databases, skip the conference version and head directly to the journal version (see above).
- As mentioned in the Abstract, this paper continues work that was started in a previous paper by Mario Holldack and me.
- Markus Schmid and I use the polynomial time transformation of regex to formulas in our paper on deterministic regex.
- Benny Kimelfeld, Liat Peterfreund, and I used the notion of functional automata and the idea of defining spanner semantics through ref-words in this paper on evaluating spanners (mostly without string equalities).
- Thinking about it a few months later, I think that I should have chosen the title A Logic for Core Spanners instead, instead of the more general Document Spanners, considering that SpLog has the same expressive power as core spanners. I have to admit that for some time, my intuition considered core spanners and document spanners to be synonyms – obviously, nobody would ever want to work with spanners that are more powerful than core spanners (that would be like wanting to use more than 640k of RAM). Anyway, I have to think about renaming this paper for the journal version (which would break the current layout of my publications page, but be in tradition with the original spanners paper by Fagin et al., which was also renamed between the two versions). (Added later: I completely forgot to rethink this, and submitted the paper with the original title. That is one way of resolving this.)