Data + Model + View

January 22, 2010

NIPS 2009 Workshop: Large-Scale Machine Learning – Parallelism and Massive Datasets

Filed under: big data,conference,machine learning,talks — Daisy @ 3:12 am

I was trying to find some state-of-the-art scalability problems (use cases), models, and data sets for my research when I stumbled across the NIPS 2009 Workshop videos. For one thing, having conference talks available as online videos is fascinating for learning. The workshop focused on “Large-Scale Machine Learning: Parallelism and Massive Datasets”, and a number of the talks are very interesting.

Edwin Pednault from IBM Watson talked about an infrastructure for rapid implementation of parallel, reusable analytics called Hadoop-ML. It builds a layer on top of Hadoop to shield ML model and algorithm developers from the instability and optimization problems of the control and communication layers of parallel systems such as Hadoop. They built a higher-level API for developing machine learning models, which then gets translated into map-reduce jobs. The API has 5 functions to be defined for each ML model: 1. initialization, 2. beginIteration, 3. processRecord, 4. mergeTask, and 5. endIteration. Edwin also illustrated the API with two examples: decision trees and the k-means algorithm.
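
To make the shape of the API concrete, here is a minimal sketch of how k-means might be expressed against such a 5-method interface. The method names follow the talk; the signatures, class name, and everything else are my own guesses, not Hadoop-ML’s actual API.

    # Hypothetical sketch in the spirit of the 5-method API; the method names
    # follow the talk, the signatures and everything else are my assumptions.
    import random

    class KMeansModel:
        def __init__(self, k):
            self.k = k

        def initialization(self, sample):
            # Pick k starting centroids, e.g. from a small sample of the data.
            self.centroids = [list(p) for p in random.sample(sample, self.k)]

        def beginIteration(self):
            # Reset per-task accumulators at the start of each pass over the data.
            self.sums = [[0.0] * len(c) for c in self.centroids]
            self.counts = [0] * self.k

        def processRecord(self, point):
            # Map side: assign one record to its nearest centroid.
            j = min(range(self.k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(point, self.centroids[i])))
            self.counts[j] += 1
            self.sums[j] = [s + x for s, x in zip(self.sums[j], point)]

        def mergeTask(self, other):
            # Reduce side: combine partial sums and counts from another task.
            for i in range(self.k):
                self.counts[i] += other.counts[i]
                self.sums[i] = [a + b for a, b in zip(self.sums[i], other.sums[i])]

        def endIteration(self):
            # Recompute centroids; return True if another map-reduce pass is needed.
            new = [[s / c for s in sums] if c else old
                   for sums, c, old in zip(self.sums, self.counts, self.centroids)]
            changed = new != self.centroids
            self.centroids = new
            return changed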

Building a higher-level API on top of parallel systems is an interesting motivation; however, without looking at the paper, I do not see why the 5-method API is much easier to use than the map-reduce programming model. To be fair, I do not think there is a settled answer to what the right interface is for developing ML models, one that shields statisticians and ML developers from lower-level system issues such as data processing, management, optimization, and communication. For this same reason, I am betting on a different high-level programming language: SQL, from the database community. A lot of recent work has shown that SQL can express a broad range of statistical ML models, and a SQL implementation can leverage 40 years of research on parallel databases and query optimization, as well as the more recent Hadoop parallel systems through HadoopDB.
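
As a toy illustration of the SQL argument (my own example, not from the workshop): the sufficient statistics of a Naive Bayes classifier are just grouped counts, which is exactly the kind of aggregation a parallel database already knows how to execute and optimize. sqlite3 stands in here for whatever engine you would actually run this on.

    # Toy illustration (mine, not from the workshop): the sufficient statistics
    # of Naive Bayes are grouped counts, which is exactly what SQL aggregation does.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (label TEXT, feature TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [("spam", "viagra"), ("spam", "free"), ("ham", "meeting"),
                      ("ham", "free"), ("spam", "free")])

    # Class priors and per-class feature counts in two aggregate queries;
    # a parallel database would push these GROUP BYs down to the partitions.
    priors = conn.execute(
        "SELECT label, COUNT(*) FROM events GROUP BY label").fetchall()
    likelihood_counts = conn.execute(
        "SELECT label, feature, COUNT(*) FROM events "
        "GROUP BY label, feature").fetchall()

    print(priors)             # e.g. [('ham', 2), ('spam', 3)]
    print(likelihood_counts)  # counts that plug into P(feature | label)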

I really enjoyed the talk by Pedro Domingos from U. of Washington. He discussed a number of inference and learning algorithms over large Markov networks with “millions of variables, billions of features, thousands of latent variables, and strong dependencies”. For inference in particular, he discussed lazy inference, the relational cutting planes work by Sebastian Riedel at UMass, and (approximate) lifted inference algorithms. In terms of learning, he mentioned relational clustering and pathfinding. There is a very interesting connection between those algorithms/techniques and relational databases. In the Q&A session, Pedro mentioned that learning/inference in large graphical models is cursed because there are three difficult problems that need to be cracked: structure learning, weight learning, and inference. Maybe the solution is to learn structures that have easy-to-learn weights and easy inference. One direction is certainly to learn first-order random variables and dependencies. This was exciting to me, because his vision draws a closer connection with relational databases, and because two years ago my research on BayesStore started out using first-order Bayesian networks.

It is interesting to me that scalability issues do not necessarily need to be solved by parallelism; sometimes a smart serial algorithm is enough. It is fair to say that we might need to look at “dumb” algorithms, because they are likely the ones that are easier to parallelize, but a “smart” algorithm might be scalable enough on a single computer.

Then there were two talks from computer vision; in one of them, Bill Freeman from MIT talked about which problems have been solved (with references to commercial products) and which problems they would like the ML community to solve.

Thore Graepel from MSR talked about their Bayesian probabilistic classification model (not Naive Bayes) for computational advertising. He touched upon how learning from streaming data creates the need for online learning algorithms and infrastructure. He mentioned that learning and inference need to be done in production systems, rather than offline in a batch system. He also mentioned the need to put error bars/confidence on the models or the output of the classifications, for example based on the number of training examples. Of course, the model also involves exploration of the rest of the ad space.
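
I will not try to reproduce their model here, but as a much simpler stand-in, a streaming Beta-Bernoulli update on a per-ad click-through rate already shows the two ingredients he emphasized: an online update per impression, and an error bar that shrinks with the number of training examples (and that could drive exploration of the ad space).

    # Not the MSR model: a much simpler Beta-Bernoulli stand-in showing how a
    # streaming Bayesian update gives both a click-through estimate and an
    # error bar that tightens as more training examples arrive.
    import math

    class BetaCTR:
        def __init__(self, alpha=1.0, beta=1.0):
            self.alpha, self.beta = alpha, beta   # uniform prior over the CTR

        def update(self, clicked):
            # One online update per impression as it streams in.
            if clicked:
                self.alpha += 1
            else:
                self.beta += 1

        def mean(self):
            return self.alpha / (self.alpha + self.beta)

        def stddev(self):
            a, b = self.alpha, self.beta
            return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

    ad = BetaCTR()
    for clicked in [0, 0, 1, 0, 1, 0, 0, 0]:
        ad.update(clicked)
    print(f"estimated CTR {ad.mean():.3f} +/- {ad.stddev():.3f}")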

One interesting observation is that computational advertising is a natural system that combines Algorithms, Machines, and People: you have the Bayesian classification algorithm, you need machines to parallelize the online learning algorithm as data streams in, and you naturally have the user click streams as feedback on the goodness of your model.

Alex Gray from Georgia Tech talked about possible “dwarfs” for supporting machine learning algorithms on top of a parallel infrastructure. He listed five: 1. linear algebra, 2. graph algorithms, 3. N-body problems, 4. optimization, and 5. map-reduce.

I think he is correct that we need more programming models beyond map-reduce to support the development of machine learning models on parallel systems. One of them might well be an iterative programming model, which is prevalent whenever a learning problem is cast as an optimization problem (e.g., linear regression solved by gradient descent). I am not sure what other programming models are suitable for ML, but they will need to differ in their communication patterns and in the complexity of their computation and data.
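
As a concrete instance of the iterative pattern I have in mind (my own sketch, not from any of the talks): batch gradient descent for linear regression, where every iteration maps over the data partitions to compute partial gradients, reduces them into a single update, and then broadcasts the new weights for the next pass.

    # My own sketch of the iterative pattern: each pass maps over the data
    # partitions to compute partial gradients, reduces them to one update,
    # and then the new weights are broadcast for the next iteration.
    def partial_gradient(w, partition):
        # Gradient of squared error on one partition: sum_i (w.x_i - y_i) * x_i
        g = [0.0] * len(w)
        for x, y in partition:
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            g = [gj + err * xj for gj, xj in zip(g, x)]
        return g

    def train(partitions, dim, lr=0.1, iters=500):
        w = [0.0] * dim
        n = sum(len(p) for p in partitions)
        for _ in range(iters):                                      # iterative outer loop
            grads = [partial_gradient(w, p) for p in partitions]    # "map"
            total = [sum(gs) for gs in zip(*grads)]                 # "reduce"
            w = [wj - lr * gj / n for wj, gj in zip(w, total)]      # broadcast new w
        return w

    # Toy data, y = 2*x + 1, with the bias folded in as a constant feature.
    data = [((x, 1.0), 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0]]
    print(train([data[:2], data[2:]], dim=2))   # close to [2.0, 1.0]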

Daniel Walker talked about his research project with Google on efficient algorithms for training conditional maximum entropy models. The goal is to reduce communication cost. Training a conditional maximum entropy model is an optimization problem that needs N iterations, and in each iteration the updated parameters at the master node need to be broadcast to the worker nodes. They designed an ensemble weight-mixture model, which eliminates this per-iteration communication cost. This work can be compared to the Spark project, whose approach is not to redesign the model itself, but to redesign the communication pattern to optimize for such algorithms using Nexus.
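
Roughly, my understanding of the mixture idea (a simplification of my own, with binary logistic regression standing in for the conditional maxent model): each worker trains on its own shard with no communication at all, and the final model is a single average of the per-shard weight vectors.

    # My simplification of the weight-mixture idea, with binary logistic regression
    # standing in for the conditional maxent model: each worker trains on its own
    # shard with no communication, and the final weights are a single average.
    import math

    def train_logistic(shard, dim, lr=0.1, epochs=50):
        # Plain SGD on one shard; in the real setting this runs on one worker.
        w = [0.0] * dim
        for _ in range(epochs):
            for x, y in shard:
                p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
                w = [wj + lr * (y - p) * xj for wj, xj in zip(w, x)]
        return w

    def weight_mixture(shards, dim):
        # One merge at the end instead of a broadcast on every iteration.
        models = [train_logistic(s, dim) for s in shards]
        return [sum(ws) / len(models) for ws in zip(*models)]

    shards = [[((1.0, 0.0), 1), ((0.0, 1.0), 0)],
              [((1.0, 1.0), 1), ((0.0, 0.0), 0)]]
    print(weight_mixture(shards, dim=2))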

Apart from these, there were also more technical talks on parallel algorithms for junction trees, belief propagation, etc.
