A paper, Splash: Integrated Ad-Hoc Querying of Data and Statistical Models, by Lujun Fang and Kristen LeFevre of UMich, presents a system called Splash, which integrates statistical modeling into SQL as aggregate functions and uses SQL queries to drive ad-hoc model creation and inference.
The motivating application is log analysis: detecting inappropriate access and misuse recorded in audit logs (e.g., database access logs). An example given is that Kaiser Permanente recently fired fifteen employees for inappropriately viewing the medical records of Nadya Suleman, the highly publicized mother of octuplets. Few tools have been built that let auditors easily, systematically, and proactively analyze such logs to check compliance with the laws and regulations governing sensitive data.
The API of Splash is the following: 1) a feature extractor features(), which takes a database record as input and produces a feature vector; 2) a profile aggregation function profile(D), where D is the set of features extracted from the data, which generates a probability density function as the profile of all the data. The profile(D) function can be used with a GROUP BY clause to generate profiles at different granularities. To interact with profile objects, there are 3) sim(feature, profile), which performs classification, i.e., decides whether a data item belongs to the class represented by the profile, and 4) sim(profile, profile), which compares the similarity of two classes through KL-divergence.
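To make the four operations described above concrete, here is a minimal Python sketch of the same idea. It is only an illustration under my own assumptions, not Splash's implementation: the record fields, the smoothed categorical distribution used as the profile, the log-probability score, and the names sim_item and sim_profiles (standing in for the two forms of sim) are all hypothetical, and the grouping loop plays the role of GROUP BY.

```python
import math
from collections import Counter, defaultdict

# Hypothetical access-log records; the field names are assumptions for illustration.
log = [
    {"user": "alice", "dept": "cardiology", "action": "read"},
    {"user": "alice", "dept": "cardiology", "action": "read"},
    {"user": "alice", "dept": "cardiology", "action": "read"},
    {"user": "bob", "dept": "cardiology", "action": "read"},
    {"user": "bob", "dept": "obstetrics", "action": "read"},
    {"user": "bob", "dept": "obstetrics", "action": "read"},
]

def features(record):
    """Feature extractor: map one record to a (hashable) feature vector."""
    return (record["dept"], record["action"])

def profile(feature_list, vocabulary, alpha=1.0):
    """Aggregate features into a categorical distribution.

    Additive smoothing (alpha) keeps every probability positive, so the
    KL divergence below is always finite. This choice is an assumption.
    """
    counts = Counter(feature_list)
    total = len(feature_list) + alpha * len(vocabulary)
    return {f: (counts[f] + alpha) / total for f in vocabulary}

def sim_item(feature, prof):
    """Score one item against a profile: its log-probability under the profile."""
    return math.log(prof[feature])

def sim_profiles(p, q):
    """Compare two profiles with KL divergence D(p || q); 0 means identical."""
    return sum(p[f] * math.log(p[f] / q[f]) for f in p)

# All feature vectors seen in the log, shared by every profile.
vocabulary = sorted({features(r) for r in log})

# Equivalent of: SELECT user, profile(features(r)) FROM log GROUP BY user
groups = defaultdict(list)
for r in log:
    groups[r["user"]].append(features(r))
profiles = {user: profile(fs, vocabulary) for user, fs in groups.items()}

# How unusual is an obstetrics read for alice, and how far apart are the two users?
item_score = sim_item(("obstetrics", "read"), profiles["alice"])
user_distance = sim_profiles(profiles["alice"], profiles["bob"])
```

A low item_score flags an access that does not fit the user's profile, and a large user_distance flags two users whose access patterns differ. Note that KL divergence is asymmetric, so the order of the two profiles matters.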
In addition, the paper describes how to generate representative data items that characterize a profile, which can be used, for example, to explain an abnormal profile detected by a system administrator. It also describes ways to improve efficiency through materialization and compression of profiles.
The experiments examine performance scalability and the effect of different optimizations. The paper also compares performing log analysis in SQL using the above API against doing it in Weka, and observes that it is relatively easy to express conventional tasks (e.g., simple classification) in both systems, primarily because Weka provides a custom API for these tasks. However, when the analysis involves additional data processing or compound tasks, the calls to the Weka API must be embedded in a larger custom-coded program, which is inconvenient and time-consuming for ad-hoc analysis. In contrast, compound tasks can usually be expressed quite easily in Splash.
The above results show the flexibility gained from a high-level abstraction of statistical models and inference in a language like SQL, and the performance benefits of supporting statistical functions and inference inside a database system. This is exactly the motivation and approach that BayesStore is taking. I think this paper shows a cool API and a good application for in-database, statistical model-based analysis.