An excellent entry in Joe’s blog on this paper.
In our quest for the killer app for BayesStore, a database that supports machine learning models and inference as first-class citizens, information extraction (IE) naturally comes to the table. This application produces huge amounts of noisy, incomplete data during the extraction process, which in many cases is directly driven by machine learning models!
Current text analytics solutions can only perform offline analytics: the documents in a text database are exported and processed using machine learning packages, and the extraction results are then imported back into the database, where users can pose relational queries over them. However, this approach is especially problematic when the data is too large (increasingly true nowadays), when the data is constantly changing, or when the model itself is changing. In those scenarios, offline precomputation is either impossible (the data is too large) or quickly becomes stale (the data or the model changes).
Thus, we use the BayesStore design principles to support a new type of machine learning model, the conditional random field (CRF), and its inference algorithm, Viterbi, which in turn support efficient on-line information extraction. By “on-line”, we mean that inference results are not precomputed and loaded into the database (the state of the art), but are computed at query time! Moreover, since BayesStore stores the CRF model, it can answer queries beyond the maximum-likelihood extraction. One example shown in the paper is conditional inference: when users supply evidence (e.g., some of the token labels), an extraction with lower probability may be returned, conditioned on that evidence.
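To make the idea of conditional inference concrete, here is a minimal Python sketch of Viterbi over a linear-chain model, with optional evidence that pins the label at given positions. This is not BayesStore's implementation: the toy scores, the label names, and the `viterbi` and `score` functions are all assumptions made up for illustration, and scores are treated as additive log-scale values where higher is better.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical toy model (not from the paper): additive, log-scale scores.
LABELS = ["street", "city"]
TOKENS = ["Main", "Street", "Berkeley"]
EMIT = {
    ("Main", "street"): 2.0, ("Main", "city"): 0.5,
    ("Street", "street"): 2.0, ("Street", "city"): 0.0,
    ("Berkeley", "street"): 0.5, ("Berkeley", "city"): 2.0,
}
TRANS = {
    ("street", "street"): 1.0, ("street", "city"): 0.5,
    ("city", "street"): -1.0, ("city", "city"): 0.5,
}


def score(prev: Optional[str], label: str, token: str) -> float:
    """Score of giving `token` the label `label` after the label `prev`."""
    s = EMIT.get((token, label), 0.0)
    if prev is not None:
        s += TRANS[(prev, label)]
    return s


def viterbi(
    tokens: Sequence[str],
    labels: Sequence[str],
    score_fn: Callable[[Optional[str], str, str], float],
    evidence: Optional[Dict[int, str]] = None,
) -> Tuple[List[str], float]:
    """Best labeling of a non-empty token sequence.

    `evidence` maps a position to a label the user has fixed; at those
    positions only the given label is considered.
    """
    evidence = evidence or {}

    def allowed(i: int) -> List[str]:
        return [evidence[i]] if i in evidence else list(labels)

    # table[i][y] = (best score of a labeling of tokens[0..i] ending in y,
    #                label at position i-1 on that best labeling)
    table = [{y: (score_fn(None, y, tokens[0]), None) for y in allowed(0)}]
    for i in range(1, len(tokens)):
        row = {}
        for y in allowed(i):
            prev, best = max(
                ((p, ps + score_fn(p, y, tokens[i]))
                 for p, (ps, _) in table[-1].items()),
                key=lambda cand: cand[1],
            )
            row[y] = (best, prev)
        table.append(row)

    # Pick the best final label, then follow back-pointers to the start.
    last = max(table[-1], key=lambda y: table[-1][y][0])
    total = table[-1][last][0]
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(table[i][path[-1]][1])
    path.reverse()
    return path, total


# Unconstrained: (['street', 'street', 'city'], 7.5)
unconstrained = viterbi(TOKENS, LABELS, score)
# With evidence that token 0 is a city: (['city', 'street', 'city'], 4.0)
conditioned = viterbi(TOKENS, LABELS, score, evidence={0: "city"})
```

In this toy run, the conditioned answer scores lower than the unconstrained one, which is the behavior described above: evidence can force the system to return an extraction that is not the overall most likely one.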
The paper shows how to relationalize the model, the documents, and the Viterbi dynamic programming algorithm, and how to make them efficient. Moreover, the paper presents algorithms for optimizing inference when it is followed by select and join conditions. It also presents a simple cost model for choosing among the different implementations of select-topk queries.
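As a rough illustration of what storing the documents, the model, and the Viterbi table as relations might look like, here is a small sketch using SQLite through Python's `sqlite3` module. The table names (`token_tbl`, `factor_tbl`, `v_tbl`), the toy scores, and the `viterbi_sql` function are hypothetical, and the paper's actual schema and SQL may well differ. Driving the recurrence from a Python loop, one SQL statement per position, is a simplification chosen to keep the example short.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple

# A factor row is (token, prev_label, label, score);
# prev_label is None for the first position of a document.
Factor = Tuple[str, Optional[str], str, float]


def viterbi_sql(tokens: Sequence[str], factors: Sequence[Factor]) -> Tuple[List[str], float]:
    """Viterbi for one non-empty document, with all state kept in tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE token_tbl (pos INTEGER PRIMARY KEY, token TEXT)")
    db.execute("CREATE TABLE factor_tbl (token TEXT, prev_label TEXT, label TEXT, score REAL)")
    db.execute("CREATE TABLE v_tbl (pos INTEGER, label TEXT, score REAL, prev_label TEXT)")
    db.executemany("INSERT INTO token_tbl VALUES (?, ?)", list(enumerate(tokens)))
    db.executemany("INSERT INTO factor_tbl VALUES (?, ?, ?, ?)", factors)

    # Base case: the first token has no previous label.
    db.execute("""
        INSERT INTO v_tbl
        SELECT t.pos, f.label, f.score, NULL
        FROM token_tbl t JOIN factor_tbl f ON f.token = t.token
        WHERE t.pos = 0 AND f.prev_label IS NULL""")

    # Recurrence: join the previous row of v_tbl with the factors and keep the
    # best predecessor per label. v.label is a bare column next to MAX();
    # SQLite takes its value from the row that attains the maximum, which is
    # SQLite-specific behavior and not portable SQL.
    for pos in range(1, len(tokens)):
        db.execute("""
            INSERT INTO v_tbl
            SELECT t.pos, f.label, MAX(v.score + f.score), v.label
            FROM token_tbl t
            JOIN factor_tbl f ON f.token = t.token
            JOIN v_tbl v ON v.pos = t.pos - 1 AND v.label = f.prev_label
            WHERE t.pos = ?
            GROUP BY t.pos, f.label""", (pos,))

    # Best final label, then follow prev_label back to position 0.
    label, total = db.execute(
        "SELECT label, score FROM v_tbl WHERE pos = ? ORDER BY score DESC LIMIT 1",
        (len(tokens) - 1,)).fetchone()
    path = []
    for pos in range(len(tokens) - 1, -1, -1):
        path.append(label)
        label = db.execute(
            "SELECT prev_label FROM v_tbl WHERE pos = ? AND label = ?",
            (pos, label)).fetchone()[0]
    db.close()
    path.reverse()
    return path, total


# Hypothetical toy model, flattened into factor rows.
DOC = ["Main", "Street", "Berkeley"]
EMIT_SQL = {"Main": {"street": 2.0, "city": 0.5},
            "Street": {"street": 2.0, "city": 0.0},
            "Berkeley": {"street": 0.5, "city": 2.0}}
TRANS_SQL = {("street", "street"): 1.0, ("street", "city"): 0.5,
             ("city", "street"): -1.0, ("city", "city"): 0.5}
FACTORS: List[Factor] = []
for tok, by_label in EMIT_SQL.items():
    for lab, e in by_label.items():
        FACTORS.append((tok, None, lab, e))
        for (p, y), t in TRANS_SQL.items():
            if y == lab:
                FACTORS.append((tok, p, lab, e + t))

# Returns (['street', 'street', 'city'], 7.5) for this toy model.
result = viterbi_sql(DOC, FACTORS)
```

Once the dynamic programming table lives in a relation like `v_tbl`, selections and joins on the extracted labels become ordinary SQL over that table, which gives a sense of why a relational formulation opens the door to query optimization.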
This paper is a first step towards on-line information extraction. It shows how to represent one type of state-of-the-art extraction model, how to implement one important inference algorithm, how to optimize and build a cost model for some queries involving both relational operators and top-k inference, and finally how to venture beyond top-k through conditional inference.
We are currently working on extending this paper to support the full relational and inference algebra over probabilistic extracted information.