Grammars for Document Spanners
A new grammar-based language for defining information extractors from textual content, based on the document spanners framework of Fagin et al., is proposed. While previously studied languages for document spanners are mainly built upon regex formulas, which are regular expressions extended with variables, this new language is based on context-free grammars. The expressiveness of these grammars is compared with previously studied classes of spanners, and the complexity of their evaluation is discussed. An enumeration algorithm is presented that outputs the results with constant delay after preprocessing that is cubic in the length of the input document.
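To illustrate the kind of extraction that motivates moving from regex formulas to context-free grammars, the following minimal sketch extracts every span of a document whose content is a non-empty balanced parenthesis string, a property no regular expression can recognize. It is only an illustration, not the grammar formalism or the enumeration algorithm of the paper: the function name `balanced_spans` is hypothetical, the scan is a brute-force one, and spans are written as pairs (i, j) standing for [i, j⟩ with 1-based, end-exclusive positions, which is assumed here to match the convention of the spanner framework.

```python
def balanced_spans(doc: str) -> list[tuple[int, int]]:
    """Return all spans (i, j), 1-based and end-exclusive, whose content
    is a non-empty balanced string over '(' and ')'.

    Hypothetical helper for illustration only: a brute-force quadratic scan,
    not the constant-delay enumeration algorithm described in the abstract.
    """
    spans = []
    n = len(doc)
    for start in range(n):
        depth = 0  # number of currently open parentheses
        for end in range(start, n):
            ch = doc[end]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            else:
                break  # any other character ends the candidate span
            if depth < 0:
                break  # a closing parenthesis without a matching opener
            if depth == 0:
                # Convert 0-based inclusive [start, end] to 1-based [i, j>
                spans.append((start + 1, end + 2))
    return spans


if __name__ == "__main__":
    print(balanced_spans("a(())b()"))
```

On the document `a(())b()`, the sketch returns the spans (2, 6), (3, 5), and (7, 9), covering `(())`, the inner `()`, and the final `()`. A context-free grammar with a capture variable around a nonterminal for balanced strings would describe the same relation declaratively.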