Data + Model + View

July 22, 2010

Analysts of the Big Data Era

Filed under: in-database SML,query-driven SML,talks — Daisy @ 7:21 am

This is the opening of Brian Dolan’s talk in the Berkeley DB seminar, Fall 2009: Welcome to the Petabyte Age.

And here is the welcome message from Hal Varian, Google’s Chief Economist:

I keep saying the sexy job in the next ten years
will be statisticians. People think I’m joking, but
who would have guessed that computer engineers
would’ve been the sexy job of the 1990s?
The ability to take data—to be able to understand
it, to process it, to extract value from it, to
visualize it, to communicate it—that’s going to be
a hugely important skill in the next decades, not
only at the professional level but even at the
educational level for elementary school kids,
for high school kids, for college kids. Because
now we really do have essentially free and
ubiquitous data. So the complementary scarce
factor is the ability to understand that data and
extract value from it.
Hal Varian, Google’s Chief Economist
Source: “The McKinsey Quarterly”, Jan 2009.

I was fortunate to meet two professionals who hold the HOTTEST job of this era, big data analyst, and to get a peek at the kinds of problems they are solving and how they are solving them.

Brian worked as the director of research analytics for Fox Audience Network. As of 2009, he was dealing with 5 billion rows of new data per day, and trillions of rows to query in total. Clearly, how to store, manage, and learn from such massive data brings unprecedented challenges and opportunities in many domains, including data warehousing, data management (distributed computing), and large-scale statistical analysis. In his day-to-day work, Brian struggles with these problems to answer questions about on-line advertising from executives and salesmen. An example question could be: “How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?” And a follow-up query could be: “How are these people similar to those that visited Nissan?” Such questions cannot be answered by current DB/Data Mining/SML/BI tools. The way Brian deals with this is to be an expert in both data management and statistical machine learning, modifying data storage, management, and computation to handle Big Data. There is no integrated solution for answering such analytical, yet highly valuable, questions.

The same problem happened to Roger, head of the data analyst team at O’Reilly. They are analyzing a database of job postings, and their tasks involve queries that analyze the trend of certain jobs in a specific industrial sector. Currently, they pull data out of the database and analyze it in R. However, they have to go through painful down-sampling and pre-processing inside the database so that they can pipe a reasonably sized data set into R, because passing all the data is just not an option: 1) R cannot handle it; 2) piping data in and out of the DB is really expensive. Also, such tasks depend heavily on natural language processing.

I think the in-database, query-driven NLP/ML research that I am part of cuts right through these problems, enabling data analysts to have an integrated system that performs scalable data storage, management, and statistical analysis.


July 8, 2010

In-database Statistical Log Analysis

Filed under: in-database SML,papers,query-driven SML — Daisy @ 10:39 pm

The paper Splash: Ad-Hoc Querying of Data and Statistical Models, by Lujun Fang and Kristen LeFevre of UMich, presents a system called Splash, which integrates statistical modeling as aggregate functions in SQL and uses SQL queries to drive ad-hoc model creation and inference analysis.

The motivating application is log analysis: recording and detecting inappropriate access and misuse over audit logs (e.g., database access logs). An example given is that Kaiser Permanente recently fired fifteen employees for inappropriately viewing the medical records of Nadya Suleman, the highly-publicized mother of octuplets. Few tools have been built to allow auditors to easily, systematically, and proactively analyze such logs to comply with the legislation and regulations governing sensitive data.

The API of Splash is the following: 1) a feature extractor features(), which takes as input a database record and produces a feature vector; 2) a profile aggregation function profile(D), where D is the set of feature vectors extracted from the data, which generates a probability density function as the profile of that data. The profile(D) function can be used with a GROUP BY clause to generate profiles at different granularities. To interact with profile objects, there are 3) sim(feature, profile), which performs classification, i.e., decides whether a data item belongs to the class represented by a profile; and 4) sim(profile, profile), which compares the similarity of two classes through KL-divergence.
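To make the API concrete, here is a minimal Python sketch of how I imagine these four functions fit together. The function names follow the paper, but the feature extractor, the histogram-based profile, and the symmetrized-KL similarity are my own simplifications, not the paper’s actual implementation:

```python
import math
from collections import Counter

def features(record):
    # Hypothetical feature extractor: which tables a log record touched.
    return record["tables_accessed"]

def profile(feature_vectors):
    """Aggregate a set of feature vectors into a discrete probability
    density; here, simply a normalized histogram of feature occurrences."""
    counts = Counter(f for vec in feature_vectors for f in vec)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) over the union of both supports, smoothing zeros with eps.
    keys = set(p) | set(q)
    return sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps))
               for k in keys)

def sim(profile_a, profile_b):
    # Symmetrized KL divergence: 0 for identical profiles, larger = less similar.
    return (kl_divergence(profile_a, profile_b)
            + kl_divergence(profile_b, profile_a))
```

In the real system, profile() is registered as a SQL aggregate, so a single GROUP BY query can build one profile per user, per role, or per time window in one pass over the log.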

In addition, the paper describes how to generate representative data items that describe a profile, which can be used, for example, to explain an anomalous profile to a system administrator. The paper also describes ways to achieve efficiency through materialization and compression of profiles.

The experiments examine performance, scalability, and different optimizations. They also compare performing log analysis in SQL using the above API versus in Weka, and observe the following: it is relatively easy to express conventional tasks (e.g., simple classification) in both systems, primarily because Weka provides a custom API for these tasks. However, when an analysis task involves additional data processing, or compound tasks, it is necessary to embed calls to the Weka API in a larger (custom-coded) program, which is inconvenient and time-consuming for ad-hoc analysis. In contrast, compound tasks can usually be expressed quite easily in Splash.

The above results show the flexibility gained from a high-level abstraction of statistical models and inference in a language like SQL, and the performance benefits of supporting statistical functions and inference inside a database system. This is exactly the motivation and approach that BayesStore is taking. I think this paper shows a cool API and a good application for in-database statistical-model-based analysis.

July 6, 2010

K/Kdb — scalable array processing language and database

Filed under: commercial products,in-database SML,languages,talks — Daisy @ 12:04 am

Arthur Whitney gave a talk a while back (in March) at the Par Lab seminar about his array processing language K, the database system kdb that supports it, and K’s more general version: the Q programming language.

According to Arthur, Q is similar to SQL but simpler and more expressive. Q implements all relational capabilities as well as time-series analysis, and efficiently supports atoms, lists, maps, tables, and functions. Tables are stored as lists of associative arrays (stored in columns). The K and Q languages are designed with financial applications in mind. The company Kx Systems has implemented the kdb/tick and kdb/taq financial applications on top of kdb.

The main features of the K programming language are its conciseness and its built-in primitives for array operations and parallelism. K parallelizes computation through data partitioning embedded in the “f each x” construct; the attribute on which to partition the data can be specified by the user.
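To illustrate the idea (this is not kdb’s actual implementation, just my own Python sketch of the semantics), an “f each x” combinator applies f independently to each element, or to each partition when a partition key is given; because the applications are independent, kdb can farm them out to parallel workers, while this version runs them serially:

```python
from itertools import groupby

def each(f, xs, key=None):
    """A sketch of k's "f each x": apply f independently to each element,
    or, with a partition key, to each partition of xs."""
    if key is None:
        return [f(x) for x in xs]
    xs = sorted(xs, key=key)  # groupby needs its input sorted by the key
    return {k: f(list(group)) for k, group in groupby(xs, key=key)}

# e.g. average trade price per symbol, partitioned on the "sym" attribute
trades = [{"sym": "IBM", "price": 100.0}, {"sym": "IBM", "price": 102.0},
          {"sym": "AAPL", "price": 50.0}]
avg_price = each(lambda part: sum(t["price"] for t in part) / len(part),
                 trades, key=lambda t: t["sym"])
```

The per-partition calls share no state, which is exactly what makes this pattern trivially parallelizable over cores or machines.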

The main selling point of kdb is its scalability and efficiency: it handles millions of records per second of real-time data and billions of rows of historical data in analytic workloads for financial applications. kdb is one of the early in-memory column-based database systems, predating the recent C-Store work.

I think kdb is competing head-to-head with streaming databases in the financial sector. It might be more efficient by getting rid of all the database overhead, although they do not have any comparisons or benchmarks. R can be used on top of Kx to do more sophisticated analytics, while kdb itself supports only simple functions, data selection, and the “f each x” construct.

One interesting note is that Kx Systems started with one customer and built itself up without any venture capital funding; it now has many big financial institutions as customers. Ras brought up the overhead of parallelization (loading data onto a GPU or multicore), which is why it is better to chain up many operations over the same data.

May 6, 2010

AMP class final projects

Filed under: class — Daisy @ 10:37 pm

During the semester, the two assignments were both to take some large datasets (e.g., Twitter data), ranging from several GB to a TB, and do some analytics over them. Here are some lessons learned. There are lots of real-life problems involving big data, and lots of tools to cope with them (e.g., MR, Pig, Hive, Sawzall, DryadLINQ), but limited reports about what’s easy and what’s hard.

The challenges with the assignment on the Twitter data were: 1) dirty data; 2) it is hard to choose the “proper” tools; 3) it is hard to check the correctness of the answers; 4) little effort goes into producing generally applicable code; 5) debugging is difficult.

For big data, there is a lot of variance in the data format, which makes it hard to write robust code. Distributed execution is tricky to debug: developers cannot easily attach a debugger, and logs can be hard to find. Problems also lie in bad language integration, with many bugs caused by mismatches between framework and language types, error handling, etc.

Everyone agreed that there is an acute need for better tooling. Frameworks need better error handling and recovery, automatic test case generation, and summarization of the data that your code cannot handle. Better language support for distributed computing is also needed: DryadLINQ is the best so far, whereas UDFs are currently pretty miserable to write in Pig and Hive.

Finally, there is a need for better training for programmers, as well as non-technical professionals, so that they can perform and make use of data analytics over distributed computing environments on big data.

Different class projects were presented. One is looking at using MR to parallelize different entity resolution algorithms. Another is to perform inference on real-time streams of traffic GPS sensor readings. A third is to design a strongly typed language over Avro in which it is easy to write UDFs and queries, reducing analytics tasks like the ones in the assignment from days of work to a couple of lines of code.

February 4, 2010


Filed under: artificial intelligence,class — Daisy @ 9:15 pm

Recently the word “crowdsourcing” has received a lot of attention in the tech world. The basic assumption behind the successful application of crowdsourcing is that human beings are better and faster than computers at some tasks (e.g., image tagging). If such a task can be partitioned into many little tasks, each of which can be carried out by a single person, then crowdsourcing might be the right technique for the problem. In this sense, crowdsourcing is also a form of “cloud computing”, only the “cloud” consists of people rather than computers.

There are different incentives: 1. money, 2. fun, 3. fame, 4. teamwork. Different applications can use one or more of them.

The criteria for problems that can be solved by, or benefit from, crowdsourcing is an important topic. I have already mentioned two: 1. easy for humans, hard for computers; 2. parallelizable. The third criterion is “verifiable” or “falsifiable”, i.e., there is a way, or an authority, to verify or falsify a contribution. That way can be computation, or human judges.

The tooling around crowdsourcing is also interesting. A reputation system is needed to rate contributions, which could be anonymous. Game-theoretic strategies might be needed to design the crowdsourcing incentives and scheme.

Digging Into Data Challenge

Filed under: Uncategorized — Daisy @ 7:41 pm

The Digging into Data Challenge is a grand competition organized by four international research organizations (JISC, NEH, NSF, SSHRC) from three countries: Canada, the UK, and the US. Its goal is to focus the attention of the social science and humanities research communities on large-scale data analysis and its potential applications to a wide range of scholarly resources, and to encourage innovative large-scale data analytics approaches that help scholars ask new questions about, and gain new insights into, the world. The challenge was launched in January 2009, and the awardees were announced in December 2009, with projects spanning music (23,000 hours), speech, 18th-century letters (53,000), criminal records, authorship identification of cultural heritage materials, and more. It is remarkable that all of these winning projects involve researchers from multiple countries trying to use large-scale data analytic power to find relationships and discrepancies and to perform computation over large heterogeneous datasets. A collection of data repositories that research teams can access, including digital art collections, digital libraries, and archeology, can be found here.

January 25, 2010

Three ML Techniques — can they be parallelized?

Filed under: class,machine learning,Uncategorized — Daisy @ 6:17 pm

In the AMP class today, Michael Jordan discussed three ML techniques in the context of scalable ML algorithms. Most ML tools deal with data that fits in memory, and until recently that was more than enough.

There are different data types, problems, and models studied in Machine Learning and Statistics. Past literature has parallelized machine learning algorithms through the Map-Reduce interface (Andrew Ng et al.), through an iterative interface (SPARK), or through customized partitioning and communication protocols (various parallel graphical-model inference algorithms). The following three ML algorithms/techniques are not yet parallelized and have the potential to benefit from parallelism, with different degrees of difficulty.

Theme 1: Kernel methods (most complex)

quadratic programming (maximizing the margin)

typical kernel methods have complexity O(n^3), where n is the number of data points and the data is gigabytes in size. One way forward is to figure out how to throw away data points more and more intelligently given a computation budget; more data points, better results (?)

Ref: Ling Huang’s recent work on using representative data points to process millions of data points

Theme 2: Ensemble Methods

Netflix Prize winner has 300 models ensembled through voting

bootstrap (resampling from the synthetic “ground truth”) — a simple way to put error bars on knowledge derived from data
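As a concrete illustration of the bootstrap point (my own toy example, not from the lecture), here is a Python sketch that puts an error bar on a sample mean by resampling with replacement; each resample is independent, which is also what makes the bootstrap so friendly to parallel execution:

```python
import random

def bootstrap_stderr(data, stat, n_resamples=1000, seed=0):
    """Estimate the standard error of a statistic by resampling the
    data with replacement and measuring how the statistic varies."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [stat([data[rng.randrange(n)] for _ in range(n)])
                 for _ in range(n_resamples)]
    mean_est = sum(estimates) / n_resamples
    var = sum((e - mean_est) ** 2 for e in estimates) / (n_resamples - 1)
    return var ** 0.5

def mean(xs):
    return sum(xs) / len(xs)

# an error bar on the mean of a tiny sample
se = bootstrap_stderr([2.0, 4.0, 6.0, 8.0, 10.0], mean)
```

Since the resamples never communicate, the loop over n_resamples maps directly onto map-reduce or any embarrassingly parallel framework.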

Theme 3: Random Forests — champion of accuracy

decision tree (supervised learning) + bootstrap

decision trees are unstable, highly variable

really easy to use map-reduce to parallelize

clustering forest

January 22, 2010

NIPS 2009 Workshop: Large-Scale Machine Learning – Parallelism and Massive Datasets

Filed under: big data,conference,machine learning,talks — Daisy @ 3:12 am

I was trying to find some state-of-the-art scalability problems (use cases), models, and data sets for my research when I stumbled across the NIPS 2009 workshop videos. For one thing, having conference talk videos online is fascinating for learning. The workshop focused on “Large-Scale Machine Learning: Parallelism and Massive Datasets”, and a number of the talks are very interesting.

Edwin Pednault from IBM Watson talked about Hadoop-ML, an infrastructure for rapid implementation of parallel, reusable analytics. It builds a layer on top of Hadoop to shield ML model and algorithm developers from the instability and optimization problems of the control and communication layers of parallel systems such as Hadoop. They built a higher-level API for developing machine learning models, which then gets translated into map-reduce. The API has 5 functions to be defined for each ML model: 1. initialization, 2. beginIteration, 3. processRecord, 4. mergeTask, and 5. endIteration. Edwin illustrated the API with two examples: decision trees and the k-means algorithm.
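Without having read the paper, here is my guess at how k-means might look against such a five-function API. The callback names follow the talk outline, but the driver loop and all of the details are my own sketch, not Hadoop-ML’s actual interface:

```python
class KMeansModel:
    """A sketch of k-means written against five per-iteration callbacks.
    A framework would shard records across tasks: processRecord runs per
    record inside a task, mergeTask combines per-task partial sums."""

    def initialization(self, k, seeds):
        self.centers = [list(s) for s in seeds[:k]]

    def beginIteration(self):
        # fresh per-task partial state: per-center (coordinate sums, count)
        return [([0.0] * len(self.centers[0]), 0) for _ in self.centers]

    def processRecord(self, state, record):
        # assign the record to its nearest center and accumulate its coords
        j = min(range(len(self.centers)),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(record, self.centers[i])))
        sums, count = state[j]
        state[j] = ([s + x for s, x in zip(sums, record)], count + 1)
        return state

    def mergeTask(self, s1, s2):
        return [([a + b for a, b in zip(x[0], y[0])], x[1] + y[1])
                for x, y in zip(s1, s2)]

    def endIteration(self, state):
        # recompute each center; keep the old one if it attracted no points
        self.centers = [[s / c for s in sums] if c else center
                        for (sums, c), center in zip(state, self.centers)]

def fit(model, shards, iters):
    """Serial stand-in for the framework's map-reduce driver."""
    for _ in range(iters):
        partials = []
        for shard in shards:
            state = model.beginIteration()
            for rec in shard:
                state = model.processRecord(state, rec)
            partials.append(state)
        merged = partials[0]
        for p in partials[1:]:
            merged = model.mergeTask(merged, p)
        model.endIteration(merged)
```

The appeal of the shape is clear: processRecord maps onto mappers, mergeTask onto combiners/reducers, and the framework owns all scheduling and communication.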

It is an interesting idea to build a higher-level API on top of parallel systems; however, without looking at the paper, I do not see why the 5-method API is much easier than the map-reduce programming model. To be fair, I do not think there is an answer yet as to what the right interface is for developing ML models while shielding statisticians and ML developers from lower-level system issues such as data processing, management, optimization, communication, etc. For this same reason, I am betting on a different high-level programming language: SQL, from the database community. A lot of recent work has shown that SQL can express a broad range of SML models, and a SQL implementation can leverage 40 years of research on parallel databases and query optimization, as well as the more recent Hadoop parallel system through HadoopDB.

I really enjoyed the talk by Pedro Domingos from U. of Washington. He discussed a number of inference and learning algorithms over large Markov Networks with “millions of variables, billions of features, thousands of latent variables, and strong dependencies”. For inference in particular, he discussed lazy inference and relational cutting planes by Sebastian Riedel at UMass, and (approximate) lifted inference algorithms. In terms of learning, he mentioned relational clustering and pathfinding. There is a very interesting connection between those algorithms/techniques and relational databases. In the Q&A session, Pedro mentioned that learning/inference in large graphical models is cursed because there are three difficult problems that need to be cracked: structure learning, weight learning, and inference. Maybe the solution is to learn structures that have easy-to-learn weights and easy inference. One direction is certainly to learn first-order RVs and dependencies. This was exciting to me, both because his vision draws a closer connection with relational databases, and because two years ago my research on BayesStore started out using first-order Bayesian Networks.

It is interesting to me that scalability issues do not necessarily need to be solved by parallelism; they may be solved by a smart serial algorithm instead. It is fair to say that we might need to look at “dumb” algorithms, because they are likely the ones that are easier to parallelize, but a “smart” algorithm might be scalable enough on a single computer.

Then there were two talks from Computer Vision, including one from Bill Freeman of MIT about which problems have been solved (with references to commercial products) and which problems they would like the ML community to solve.

Thore Graepel from MSR talked about their Bayesian probabilistic classification model (not Naive Bayes) for computational advertising. He touched upon how learning from streaming data creates the need for on-line learning algorithms and infrastructure: the learning and inference need to be done in production systems, rather than offline in a batch system. He also mentioned the need to put error bars/confidence on the models or the output of the classifications, for example based on the number of training examples. Of course, the model also involves exploration of the rest of the ad space.

One interesting observation is that computational advertising is a natural system that combines Algorithm, Machine and People — you have the Bayesian classification algorithm, you need machine to parallelize online learning algorithm with data streaming in, and naturally you have the user click streams as a feedback to the goodness of your model.

Alex Gray from Georgia Tech talked about possible “dwarfs” for supporting machine learning algorithms on top of a parallel infrastructure. He listed five: 1. linear algebra, 2. graph algorithms, 3. N-body problems, 4. optimization, and 5. map-reduce.

I think he is correct that we need more programming models beyond map-reduce to support the development of machine learning models over parallel systems. One of them might well be an iterative programming model, which is prevalent in learning problems cast as optimization (e.g., fitting a linear regression by gradient descent). I am not sure what other programming models are suitable for ML, but they will need to differ in their patterns of communication and their complexity of computation and data.

Daniel Walker talked about his research project with Google on efficient algorithms for training conditional maximum entropy models; the goal is to reduce communication cost. Training a conditional maximum entropy model is an optimization problem that needs N iterations, and in each iteration the updated parameters at the master node need to be broadcast to the worker nodes. They designed an ensemble weight-mixture model, which eliminates this communication cost. This work can be compared to the SPARK project, whose approach is not to redesign the model to be more computationally efficient, but to redesign the communication pattern to optimize for such algorithms using Nexus.
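If I understood the idea correctly, the weight-mixture trick is roughly this: train a separate model on each shard with no per-iteration broadcast, then combine the independently learned weights at the very end. A toy Python sketch, assuming a uniform average over shards (the actual work also considers non-uniform mixtures):

```python
def mixture_weights(per_shard_weights):
    """Combine independently trained maxent weight vectors (one dict of
    feature -> weight per shard) by uniform averaging, so workers never
    need a per-iteration parameter broadcast."""
    n = len(per_shard_weights)
    keys = set().union(*per_shard_weights)
    return {k: sum(w.get(k, 0.0) for w in per_shard_weights) / n
            for k in keys}
```

The only communication left is one gather of final weight vectors, versus N broadcast rounds for the synchronized version.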

Apart from those, there were also more technical papers on parallel algorithms for junction trees, belief propagation, etc.

January 21, 2010

AMP class

Filed under: class — Daisy @ 3:22 am

This semester I am going to sit in on Mike Franklin’s graduate class; Mike Franklin also happens to be my research adviser. The vision of the class is to explore the research frontiers and shape the scope of the future AMP lab: Algorithms, Machines, and People.

Algorithms are the machine learning models and algorithms; Machines are the cloud computing systems and infrastructure; and People are the necessary component in the loop for making a real, usable system, through techniques such as Mechanical Turk and crowdsourcing.

A number of industrial speakers are lined up to talk about their relevant research and problems in this space. The schedule of the course can be found here.

This AMP vision is closely related to my research, which is to provide in-database, scalable (Machines) algorithms (Algorithms) for advanced analytics based on statistical models. The vision is closely related to this blog as well. This exploratory class follows the Berkeley tradition of starting a big research project with one exploratory graduate course.

January 20, 2010

Data Exhaust Alchemy

Filed under: big data,events — Daisy @ 7:27 am

I attended a VLAB event at Stanford organized by Roger Magoulas from O’Reilly. The event, entitled “Data Exhaust Alchemy”, was about how to turn the Web’s “waste” into “gold” through data analytics. It was a great introduction to the set of companies that are generating revenue from information extracted from Web data. The panel included Mike (JB) John-Baptiste from PeerSet, DJ Patil from LinkedIn, Jeff Hammerbacher from Cloudera, Mark Breier, and Pablos Holman.

One of the highlights of the event was certainly when the panelists were asked about the role of databases in such SML-based analytics tasks. Obviously, Jeff is on the NoSQL side of the boat for such tasks, claiming that relational databases are good for structured data, transactions, and simple analytics, but not for advanced analytics involving SML models, although he did acknowledge that SQL is being used and researched for supporting different SML models. I agree that Hadoop/Map-Reduce has tremendous momentum among statisticians and machine learning developers, but SQL has been shown to be able to express multiple linear regression, Naive Bayes, PageRank, SVMs, CRFs, etc. So the NoSQL argument that SQL cannot support SML models does not hold.
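To see why SQL can express such models, note that simple linear regression reduces to five SUM/COUNT aggregates, exactly the shape of computation a single SQL query produces in one scan. Here is the arithmetic as a Python sketch (a SQL version would just be SELECT COUNT(*), SUM(x), SUM(y), SUM(x*y), SUM(x*x) followed by the same two formulas):

```python
def linear_fit(pairs):
    """Least-squares fit of y = slope * x + intercept from five running
    aggregates (n, sum x, sum y, sum xy, sum x^2), the kind of totals a
    single SQL aggregate query over a table can produce."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

Because the aggregates decompose over rows, a parallel database can compute them shard by shard and merge, with no changes to the query.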

Then DJ took up the question and made an analogy: imagine you have a leak in a pipe and a handyman comes over; would you like him to have only one tool, or many tools? Of course, you want him to have many tools so that he can solve different kinds of problems, just as LinkedIn internally uses many different tools to solve different infrastructure problems (LinkedIn uses Cloudera as well as commercial databases). I guess the question I have is: if there were a framework that made it easy to apply SML in-database (or in a parallel database, for that matter), would data analysts and/or business analysts use it? Or would they still prefer to craft their solutions from a bunch of tools?

The other big kicks were Pablos reading off personal information directly from people’s credit cards using an RFID reader bought for $8 on eBay, and his match-making service built from an open-source spam filter, which works by sending information on the candidates one likes/dislikes (non-spam/spam) as emails.

