Data + Model + View

May 6, 2010

AMP class final projects

Filed under: class — Daisy @ 10:37 pm

During the semester, both assignments asked us to take a large dataset (e.g., Twitter data), ranging from several GB to several TB, and run some analytics over it. Here are some lessons learned. There are plenty of real-life big-data problems and plenty of tools for tackling them (e.g., MapReduce (MR), Pig, Hive, Sawzall, DryadLINQ), but few reports on what is easy and what is hard.

The challenges with the assignment on the Twitter data were: 1) dirty data, 2) difficulty choosing the “proper” tools, 3) difficulty checking the correctness of the answer, 4) little effort put into producing generally applicable code, 5) difficulty debugging.

Big data comes with a lot of variance in data format, which makes it hard to write robust code. Distributed execution is tricky to debug: developers cannot easily attach a debugger, and logs can be hard to find. Poor language integration is another problem, with many bugs caused by mismatches between framework and language types, error handling, etc.
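To make the format-variance point concrete, here is a minimal sketch of a defensive per-line parser in Python. The record layout is an assumption for illustration only (a JSON object with a `text` field and a `user` field that is either a nested object with `screen_name` or a bare string). It is not the actual layout of the class dataset, and `parse_tweet` and `count_tweets_per_user` are hypothetical names.

```python
import json
from collections import Counter


def parse_tweet(line):
    """Return (user, text) for a usable record, or None if it is unusable.

    Assumed (hypothetical) layout: a JSON object with a "text" string and a
    "user" that is either {"screen_name": ...} or a bare string.
    """
    try:
        record = json.loads(line)
    except ValueError:  # malformed or truncated JSON
        return None
    if not isinstance(record, dict):
        return None
    user = record.get("user")
    if isinstance(user, dict):  # nested form: pull out the name
        user = user.get("screen_name")
    text = record.get("text")
    if not isinstance(user, str) or not isinstance(text, str):
        return None  # missing field or unexpected type
    return user, text


def count_tweets_per_user(lines):
    """Count tweets per user, skipping (and counting) unusable lines."""
    counts = Counter()
    skipped = 0
    for line in lines:
        parsed = parse_tweet(line)
        if parsed is None:
            skipped += 1
            continue
        counts[parsed[0]] += 1
    return counts, skipped
```

Returning the number of skipped lines alongside the result is a small design choice that helps with the correctness problem: a job that silently drops a large share of its input looks the same as one that drops nothing unless the count is reported.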

Everyone agreed that there is an acute need for better tooling. Frameworks need better error handling and recovery, automatic test-case generation, and summarization of the data that your code cannot handle. Better language support for distributed computing is also needed: DryadLINQ is the best so far, whereas UDFs are currently pretty miserable to write in Pig and Hive.
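As a rough illustration of what "summarization of the data that your code cannot handle" could look like, here is a small Python sketch. It is not a feature of any of the frameworks mentioned. `run_with_failure_summary` is a hypothetical helper that applies a user function to each record, catches failures, and groups them by exception type with a few sample inputs.

```python
from collections import defaultdict


def run_with_failure_summary(func, records, samples_per_kind=3):
    """Apply func to every record; collect results and a failure summary.

    The summary maps each exception type name to a count and a few sample
    records, so the offending inputs can be inspected without digging
    through logs.
    """
    results = []
    failures = defaultdict(lambda: {"count": 0, "samples": []})
    for record in records:
        try:
            results.append(func(record))
        except Exception as exc:  # deliberately broad: we want every failure
            entry = failures[type(exc).__name__]
            entry["count"] += 1
            if len(entry["samples"]) < samples_per_kind:
                entry["samples"].append(record)
    return results, dict(failures)


def second_field_as_int(line):
    """Example user function: parse the second comma-separated field."""
    return int(line.split(",")[1])


if __name__ == "__main__":
    results, failures = run_with_failure_summary(
        second_field_as_int, ["a,1", "b,x", "c", "d,4", "e,y"]
    )
    print(results)
    for kind, info in failures.items():
        print(kind, info["count"], info["samples"])
```

In a real distributed job the per-worker summaries would also need to be merged; this sketch only covers the single-process case.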

Finally, programmers as well as non-technical professionals need better training to perform, and make use of, data analytics on big data in distributed computing environments.

Several class projects were presented. One looks at using MP to parallelize different entity resolution algorithms. Another performs inference on real-time streams of traffic GPS sensor readings. A third designs a strongly typed language over Avro in which UDFs and queries are easy to write, which makes analytics tasks like the ones in the assignment very easy: from days of work down to a couple of lines of code.
