## Complexity Bounds for Relational Algebra over Document Spanners

(Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, Markus Kröll)

### Abstract

We investigate the complexity of evaluating queries in Relational
Algebra (RA) over the relations extracted by regex formulas (i.e.,
regular expressions with capture variables) over text documents.
Such queries, also known as the regular document spanners, were
shown to have an evaluation with polynomial delay for every positive
RA expression (i.e., consisting of only natural joins, projections
and unions); here, the RA expression is fixed and the input consists
of both the regex formulas and the document. In this work, we
explore the implication of two fundamental generalizations. The
first is adopting the "schemaless" semantics for spanners, as
proposed and studied by Maturana et al. The second is going beyond
the positive RA to allowing the difference operator.

We show that each of the two generalizations introduces
computational hardness: it is intractable to compute the natural
join of two regex formulas under the schemaless semantics, and the
difference between two regex formulas under both the ordinary and
schemaless semantics. Nevertheless, we propose and analyze syntactic
constraints, on the RA expression and the regex formulas at hand,
such that the expressive power is fully preserved and, yet,
evaluation can be done with polynomial delay. Unlike the previous
work on RA over regex formulas, our technique is not (and provably
cannot be) based on the static compilation of regex formulas, but
rather on an ad-hoc compilation into an automaton that incorporates
both the query and the document. This approach also allows us to
include black-box extractors in the RA expression.

### Versions of the Paper

### Additional Comments