Data + Model + View

September 13, 2009

Berkeley DBSeminar 2009: Scalable Data Analytics — Intro

Filed under: announcement — Daisy @ 4:10 pm

The Database Group Seminar (DBSeminar) at UC Berkeley invites outside speakers from academia and industry to talk about their most recent research and ideas. It takes place on Fridays and is preceded by a lunch (DBlunch) and a friendly round-table chat.

This semester I am hosting the DBSeminar, which has a Scalable Data Analytics theme, with speakers from Systems, Languages, Algorithms/Machine Learning and Visualization. We already have a bunch of interesting talks lined up, including professors from UPenn, Amherst, UWashington, and Rice, and people from Yahoo! Research, Google, Microsoft, Cloudera, Fox Interactive Media, and Truviso (an on-line analytics company that sprang from our group!).

The first talk is next Friday, by Erik Meijer @ Microsoft, the father of LINQ. I will dedicate several follow-up posts to the seminar talks as the semester goes on. If you are interested, you can see the full schedule at http://db.cs.berkeley.edu/dblunch.php.

May 10, 2009

BayesStore talk @ Berkeley ML-tea

Filed under: discussion — Daisy @ 1:19 am

Today I gave a talk to the ML group at Berkeley. We had a full hour of very interactive discussion with a group of 20 students. I talked with two of Michael Jordan’s students afterwards, Ariel Kleiner and Alex, who were particularly interested in BayesStore.

Some of the interesting comments were:

- The BayesStore system may be useful in two scenarios: (1) during the model training and debugging process, which could last for one’s thesis (a lot of ML academics live in this world and are agnostic about how the models they develop get used in real life), and (2) after the model is trained, as a query interface that serves the model and the data in production.
- What is the difference, to the users, between a traditional database and a BayesStore probabilistic database? My answer was: not much, except that you can query over uncertain data or latent variables (i.e. predicted data).
- Learning was also discussed. They were willing to buy my argument that learning is usually done on smaller training data and can be performed outside of the database, and that the resulting model can then be imported into the database to serve queries (at model tuning time or production time).
- We also discussed why we want to implement the inference operators in the database engine rather than call out to an external CRF library, etc. My answer was that data has gravity: you would like to place the processing as close to the data as possible. In-database inference operators also open up optimization opportunities.
- Another comment was that we cannot always apply textbook models as-is to real data, and I got the point across that we want to use all the smarts that statisticians and machine learning researchers come up with in producing the model.

In general, I received very positive feedback. People were excited and really bought into our BayesStore vision, both as an interface to support analytics over data using SML models and as a useful tool during the model debugging/tuning phase, which is great.

May 2, 2009

DeclarativeIE talk @ Stanford DB group

Filed under: discussion — Daisy @ 1:15 am

The visit to Stanford was very nice! We had a full room of people (~30). There was a group of people from IBM too, including Mohan. As you can expect, Jennifer and Hector asked some very detailed questions about ARIES. :-) They also talked about how to change the format of their Qual exam.

My talk on declarative IE went first. Jeff and Jennifer asked questions about the model representation in the relational database. More specifically:

(1) How is the factor table generated for tokens not in the training set? (ans: currently the entries are materialized, but we are developing feature extraction pipelines to generate the factor table entries for new tokens)

(2) How do you model correlations other than those in the CRF model? (ans: we focus on CRF in the declarativeIE paper, but much more complicated correlations can be supported by BayesStore, which needs different inference algorithms than those used for CRF)

Jeff asked in what ways, and why, we claim we can do a better job than the machine learning folks. (ans: we only have initial answers: optimization and reuse in complex queries involving both relational and inference operators, and the potential combination of multiple models. It is also about programmability.)

Hector and Jennifer also raised the question of why we choose to store the model rather than just the inference results. They are more in the mindset of a traditional data warehouse, where you clean all the data before you put it into the database. The answer is that (1) we cannot afford to store the full distribution of the model, (2) the stored inference result is a view over the model, representing only part of the distribution, (3) in some cases there is just too much data to compute the full inference, (4) in other cases we can have user feedback, which would change the inference results, and (5) some queries are interested in low-probability possible worlds.

Hector also raised another interesting address segmentation problem: in some other countries the address strings are not as rigid as in the US, and are more like “go 2 miles, turn right, and it’s behind the white house”. :-)

April 16, 2009

Declarative IE

Filed under: papers — Daisy @ 5:37 am

An excellent entry in Joe’s blog on this paper

In our quest for the killer app for BayesStore, a database that supports machine learning models and inference as first-class citizens, information extraction (IE) naturally comes to the table. This application has huge amounts of noisy, incomplete data generated in the extraction process, which in many cases is directly driven by machine learning models!

Current text analytics solutions can only perform offline analytics: the documents in a text database are exported and processed using machine learning packages, and the extraction results are then imported into the database, where users can pose relational queries over them. However, this solution is especially problematic when the data is too large (more and more true nowadays), when the data is constantly changing, or when the model is also changing. In those scenarios, the offline precomputation is either impossible (the data is too large) or quickly becomes stale (the data and model change).

Thus, we use the BayesStore design principles to support a new type of machine learning model, the conditional random field (CRF), and its inference algorithm, Viterbi, which in turn support efficient on-line information extraction. By “on-line”, we mean that inference results are not precomputed and loaded into the database (the state of the art), but are computed at query time! Moreover, since BayesStore stores the CRF model, it can compute queries beyond the maximum-likelihood extraction. One example shown in the paper is conditional inference: when users give evidence (e.g. some token labels), a lower-probability extraction might be returned, conditioned on the evidence.
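
To make the difference between maximum-likelihood and conditional inference concrete, here is a minimal Python sketch of Viterbi over a linear-chain model, with an optional `evidence` argument that pins some token labels. Everything in it is an assumption for illustration: the function name `viterbi`, the toy address tokens, the label set and the hand-picked scores are made up, and the scores are simple additive (log-space style) numbers rather than a trained CRF. It is not the implementation from the paper.

```python
NEG_INF = float("-inf")

def viterbi(tokens, labels, emit, trans, evidence=None):
    """Return (best_score, best_labeling) for a linear-chain model.

    emit[(token, label)] and trans[(prev_label, label)] are additive scores;
    missing entries count as 0.0 (an assumption of this toy example).
    evidence maps a token position to the label the user has fixed for it.
    """
    evidence = evidence or {}

    def allowed(pos, label):
        # A label is allowed unless the user pinned a different one here.
        return pos not in evidence or evidence[pos] == label

    score = [{} for _ in tokens]  # score[i][y]: best prefix score ending in y
    back = [{} for _ in tokens]   # back[i][y]: label at i-1 on that best prefix

    for y in labels:
        score[0][y] = emit.get((tokens[0], y), 0.0) if allowed(0, y) else NEG_INF

    for i in range(1, len(tokens)):
        for y in labels:
            best_prev = max(labels,
                            key=lambda p: score[i - 1][p] + trans.get((p, y), 0.0))
            back[i][y] = best_prev
            if allowed(i, y):
                score[i][y] = (score[i - 1][best_prev]
                               + trans.get((best_prev, y), 0.0)
                               + emit.get((tokens[i], y), 0.0))
            else:
                score[i][y] = NEG_INF  # labeling contradicts the evidence

    last = max(labels, key=lambda y: score[-1][y])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return score[-1][last], path

# Hypothetical toy model for an address string.
TOKENS = ["2181", "Shattuck", "Berkeley"]
LABELS = ["num", "street", "city"]
EMIT = {("2181", "num"): 2.0, ("Shattuck", "street"): 1.0,
        ("Shattuck", "city"): 0.5, ("Berkeley", "city"): 2.0}
TRANS = {("num", "street"): 1.0, ("street", "city"): 1.0,
         ("num", "city"): 0.25, ("city", "city"): 0.25}

print(viterbi(TOKENS, LABELS, EMIT, TRANS))                # best labeling
print(viterbi(TOKENS, LABELS, EMIT, TRANS, {1: "city"}))   # token 1 pinned
```

With these toy scores the unconstrained call labels the tokens num, street, city, while pinning token 1 to "city" returns num, city, city with a lower score. A precomputed top-1 result could not answer the second query, which is why the model itself has to be available at query time.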

The paper shows how to relationalize the model, the documents and the Viterbi dynamic programming algorithm, and how to make them efficient. Moreover, the paper shows algorithms for optimizing inference when it is followed by select and join conditions. It also shows a simple cost model for picking among the different implementations of select-topk queries.
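
As a rough illustration of what relationalizing can look like, the sketch below stores a document and a factor table as relations in an in-memory SQLite database and runs each Viterbi step as a join plus a GROUP BY. The schema (`doc`, `factor`, `v`), the `'START'` marker, the toy scores and the Python driver loop are all assumptions made up for this example; they are not the schema or the SQL used in the paper.

```python
import sqlite3

def viterbi_sql(tokens, factors):
    """Viterbi where the document, the model and the DP table are relations.

    factors: rows (token, prev_label, label, score); prev_label is 'START'
    for the first position (a convention invented for this sketch).
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE doc(pos INTEGER PRIMARY KEY, token TEXT);
        CREATE TABLE factor(token TEXT, prev_label TEXT, label TEXT, score REAL);
        CREATE TABLE v(pos INTEGER, label TEXT, score REAL, prev_label TEXT);
    """)
    conn.executemany("INSERT INTO doc VALUES (?, ?)", list(enumerate(tokens)))
    conn.executemany("INSERT INTO factor VALUES (?, ?, ?, ?)", factors)

    # Base case: scores of the first token.
    conn.execute("""
        INSERT INTO v
        SELECT d.pos, f.label, f.score, NULL
        FROM doc d JOIN factor f ON f.token = d.token
        WHERE d.pos = 0 AND f.prev_label = 'START'""")

    # One DP step per position: join the previous column of v with the
    # factor table and keep the best predecessor for every label.
    # SQLite-specific: with a single MAX(), the bare column p.label is taken
    # from the row that attains the maximum. Other engines would need a
    # window function or a self-join here.
    for pos in range(1, len(tokens)):
        conn.execute("""
            INSERT INTO v
            SELECT d.pos, f.label, MAX(p.score + f.score), p.label
            FROM doc d
            JOIN factor f ON f.token = d.token
            JOIN v p ON p.pos = d.pos - 1 AND p.label = f.prev_label
            WHERE d.pos = ?
            GROUP BY d.pos, f.label""", (pos,))

    # Pick the best final label, then follow the stored back-pointers.
    label, score = conn.execute(
        "SELECT label, score FROM v WHERE pos = ? ORDER BY score DESC LIMIT 1",
        (len(tokens) - 1,)).fetchone()
    path = [label]
    for pos in range(len(tokens) - 1, 0, -1):
        (label,) = conn.execute(
            "SELECT prev_label FROM v WHERE pos = ? AND label = ?",
            (pos, label)).fetchone()
        path.append(label)
    conn.close()
    path.reverse()
    return score, path

# Hypothetical toy document and factor table.
TOKENS = ["2181", "Shattuck", "Ave"]
FACTORS = [
    ("2181", "START", "num", 2.0), ("2181", "START", "street", 0.0),
    ("Shattuck", "num", "num", 0.0), ("Shattuck", "num", "street", 2.0),
    ("Shattuck", "street", "num", 0.0), ("Shattuck", "street", "street", 1.0),
    ("Ave", "num", "num", 0.0), ("Ave", "num", "street", 1.0),
    ("Ave", "street", "num", 0.0), ("Ave", "street", "street", 2.0),
]

print(viterbi_sql(TOKENS, FACTORS))
```

Once the dynamic program is expressed as joins and aggregates like this, a later selection or join on the extracted labels sits in the same query plan as the inference, which is the kind of setting where the optimizations mentioned above can apply.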

This paper is the first step towards on-line information extraction. It shows how to represent one type of state-of-the-art extraction model, how to implement one important inference algorithm, how to optimize and build a cost model for some queries involving both relational operators and top-k inference, and finally how to venture beyond top-k through conditional inference.

We are currently working on extending this paper to support the full relational and inference algebra over the probabilistic extracted information.

March 9, 2009

BayesStore: VLDB08

Filed under: papers — Daisy @ 5:06 am

BayesStore is a novel data management system that stores, manages and supports both relational and inference queries over probabilistic data based on statistical or machine learning models. The idea was first developed during my research internship at Intel in the summer of 2006. The BayesStore probabilistic data management system (PDBMS) is built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system.

Early approaches to building a PDBMS relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures. However, these approaches introduce a gap between the statistical models used for probabilistic analytics and the uncertainty model in the PDBMS. Our solution to this “model-mismatch” problem is to support statistical models, evidence data and inference algorithms as first-class citizens in a PDBMS.

The VLDB08 paper describes (1) the relational representation of a machine learning model and its evidence data; (2) the algorithms for probabilistic relational operators over the first-order Bayesian network (FOBN) model; and (3) the optimizations for probabilistic relational operators and inference operators over FOBN. One potential application of the FOBN model is sensor network data, such as the Intel dataset.

This work is innovative in several ways: (1) it is one of the few projects with the clear goal of marrying machine learning with data management systems (the only other research group we are aware of is led by Prof. Deshpande at Maryland), (2) the data model presented in the paper has a clean separation of data and model, and is very expressive in defining random variables and factors, (3) it is probably the first paper that talks about probabilistic relational operators over machine learning models, and (4) the paper also demonstrates the natural representation, the expressive power and the computational benefit of first-order models.

March 6, 2009

Introduction

Filed under: Uncategorized — Daisy @ 7:10 am

Hello, I am a graduate student at UC Berkeley and a member of the database research group. My thesis advisor is Michael J. Franklin, and I am co-advised by Joseph M. Hellerstein and Minos Garofalakis.

My research interests lie in scalable, accessible, on-line data analytics based on statistical and machine learning models. In this vein, I am interested in (1) Data: data management systems for scalable processing of data analytics tasks, (2) Model: statistical and machine learning models for advanced large-scale data analytics tasks (such as collaborative filtering, predictive analytics and profile extraction), and (3) View: user interface and programming language design for such data analytics systems, for different users (e.g. model developers, users, etc.).

As a graduate student, I enjoy collaborating with my advisors, my fellow grad students, researchers at industrial research labs, and smart and motivated undergrad students. I am inspired by the people around me every day.

Last but not least, I love Berkeley! Go Bears!
