Document Spanners: From Expressive Power to Decision Problems

(Dominik D. Freydenberger, Mario Holldack)

Supported by the project Taming Extended Regular Expressions

Abstract

We examine document spanners, a formal framework for information extraction that was introduced by Fagin et al. A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models -- namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.
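
To make the notions of span relation and string equality selection from the abstract concrete, here is a small brute-force sketch in Python. It is an illustration only and not the construction used in the paper: the function names (all_spans, extract, select_equal) are hypothetical, spans are written as 0-based half-open pairs (i, j) to match Python slicing, and Python's re module stands in for a single-variable regex formula of the shape "anything, then x bound to a match of the pattern, then anything".

```python
import re


def all_spans(doc):
    """All half-open spans (i, j) with 0 <= i <= j <= len(doc).

    Assumption: 0-based indices chosen to match Python slicing.
    """
    n = len(doc)
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]


def extract(doc, pattern):
    """Toy one-variable spanner: every span whose content fully matches
    `pattern`, wherever it occurs in `doc`. Brute force, for illustration.
    """
    rx = re.compile(pattern)
    return {(i, j) for (i, j) in all_spans(doc) if rx.fullmatch(doc[i:j])}


def select_equal(doc, xs, ys):
    """Toy string equality selection: keep the pairs (x, y) of spans that
    cover equal substrings, even if the spans themselves differ.
    """
    return {
        (x, y)
        for x in xs
        for y in ys
        if doc[x[0]:x[1]] == doc[y[0]:y[1]]
    }


if __name__ == "__main__":
    doc = "aab"
    spans = extract(doc, "a+")
    print(sorted(spans))
    print(sorted(select_equal(doc, spans, spans)))
```

Unlike an ordinary regex search, which reports one match per starting position, the sketch returns every span that satisfies the formula, which is the sense in which a spanner maps a string to a relation. The equality selection compares the substrings that two spans cover rather than the spans themselves, so two different occurrences of the same text are related.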

Versions of the Paper

Theory of Computing Systems. Final version (open access)
ICDT 2016 (★ invited to special issue). Final version, full version (with proofs)

Additional Comments

In a more recent paper, I extended the conversion of core spanner representations to logical formulas into SpLog, a fragment of the existential theory of concatenation with regular constraints that has exactly the same expressive power as core spanners.
The conference version contains an annoying little mistake: We (twice) claim that every EC^reg can be converted into an equi-satisfiable word equation, with a polynomial bound on the blowup. The latter claim is wrong: The constructions we cite can lead to an exponential blowup. Luckily, the results remain unaffected.