This is the opening of Brian Dolan’s talk at the Berkeley DB seminar, Fall 2009: Welcome to the Petabyte Age.
And here is the welcome message from Hal Varian, Google’s Chief Economist:
I keep saying the sexy job in the next ten years
will be statisticians. People think I’m joking, but
who would have guessed that computer engineers
would’ve been the sexy job of the 1990s?
The ability to take data—to be able to understand
it, to process it, to extract value from it, to
visualize it, to communicate it—that’s going to be
a hugely important skill in the next decades, not
only at the professional level but even at the
educational level for elementary school kids,
for high school kids, for college kids. Because
now we really do have essentially free and
ubiquitous data. So the complementary scarce
factor is the ability to understand that data and
extract value from it.
Hal Varian, Google’s Chief Economist
Source: “The McKinsey Quarterly”, Jan 2009.
I was fortunate to meet two professionals who have the HOTTEST job of this era, big data analyst, and to get a peek at what kinds of problems they are solving and how they are solving them.
Brian worked as the director of research analytics for Fox Audience Network. As of 2009, he was dealing with 5 billion rows of new data per day, and trillions of rows to query in total. Clearly, how to store, manage, and learn from data at this scale brings unprecedented challenges and opportunities in many domains, including data warehousing, data management (distributed computing), and large-scale statistical analysis. In his day-to-day work, Brian has to wrestle with these problems to answer questions from executives and salesmen about online advertising. An example question could be: “How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?” And a follow-up query could be: “How are these people similar to those that visited Nissan?” These questions cannot be answered by current DB, data mining, statistical machine learning (SML), or BI tools. Brian copes by being an expert in both data management and statistical machine learning, and by modifying data storage, management, and computation to handle Big Data. There is no integrated solution for answering such analytical, yet highly valuable, questions.
The same problem confronts Roger, head of the data analysis team at O’Reilly. His team analyzes a database of job postings, and their tasks involve queries that track the trend of certain jobs in a specific industry sector. Currently, they pull data out of the database and analyze it in R. However, they first have to go through painful down-sampling and pre-processing in the database so that they can pipe a reasonably sized data set into R, because passing all the data is just not an option: 1) R cannot handle it, and 2) piping data in and out of the database is really expensive. Such tasks also depend heavily on natural language processing.
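To make the contrast concrete, here is a minimal sketch of the idea of pushing the summarization into the database so that only a small result ever leaves it. This is not Roger’s actual setup: SQLite stands in for the real database, the `postings` table, its columns, and the `monthly_counts` helper are all hypothetical, and a simple keyword match on the title is only a crude stand-in for real natural language processing.

```python
import sqlite3


def monthly_counts(conn, sector, keyword):
    """Count matching postings per month inside the database.

    Only the small per-month summary is returned to the client,
    instead of every raw posting row.
    """
    query = """
        SELECT substr(posted_on, 1, 7) AS month, COUNT(*) AS n
        FROM postings
        WHERE sector = ? AND title LIKE ?
        GROUP BY month
        ORDER BY month
    """
    return conn.execute(query, (sector, f"%{keyword}%")).fetchall()


# Hypothetical toy data; a real table would hold millions of postings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (posted_on TEXT, sector TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [
        ("2009-08-03", "software", "Data Analyst"),
        ("2009-08-17", "software", "Senior Data Analyst"),
        ("2009-09-01", "software", "Data Analyst"),
        ("2009-09-02", "finance", "Data Analyst"),
        ("2009-09-05", "software", "Web Developer"),
    ],
)

trend = monthly_counts(conn, "software", "analyst")
print(trend)
```

The aggregated `trend` is tiny no matter how large `postings` grows, so handing it to R (or any other statistics tool) is cheap. The hard part, which this sketch does not address, is doing the language processing and statistical modeling inside the database as well.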
I think the in-database, query-driven NLP/ML research that I am part of cuts right through those problems: it aims to give data analysts an integrated system that can perform scalable data storage, management, and statistical analysis.