PROJECT INFORMATION:

Award: IIS-0546155

Title: CAREER: Memory-Constrained Predictive Data Mining

PI: Slobodan Vucetic, Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122

 

Memory-constrained learning – a cookbook for mining large databases using limited memory computational devices


Data mining is a burgeoning field that concentrates on the development of computationally efficient and accurate tools for knowledge discovery from very large data sets. Many successful data mining algorithms scale linearly or even better with data size and can learn with high accuracy from millions of examples on a regular PC. Interestingly, linear or sub-linear scaling is often not enough in a today’s world where data collections in sciences, medicine, engineering, and business are reaching terabytes (8 followed by 12 zeros bits of information) or even petabytes (8 followed by 15 zeros bits of information) and where there is an increasing use of microcontrollers whose memory and computational capacity are highly limited. A critical question in this new world is how to optimize learning from data that exceeds the available computational resources by orders of magnitude.

 

In the CAREER project “Memory-Constrained Predictive Data Mining,” Prof. Vucetic and his students from Temple University are demonstrating that highly accurate learning could be achieved using devices with seemingly inferior processing power and memory capacity. In their approach, a computer is treated as a reservoir that sequentially observes a large data stream and, at each given moment, maintains a selected number of examples and a prediction model that describes the data. After each observed stream example, the reservoir content is updated such that the accuracy is increased while the memory constraint is kept satisfied. “Not all examples are created equal,” explains Prof. Vucetic, “some are highly informative, while others are simply confirming that winters are cold. The success of the reservoir approach lies in the ability to select and maintain the most informative examples.” The proposed approach prefers the examples on which the current predictor is the most uncertain. In addition to this intuitively appealing criterion, the ability to peek into every stream example also gives an opportunity to reduce noise of reservoir examples. “If you throw a coin once,” says Prof. Vucetic, “you will have no idea if it is biased or not; if you throw it thousand times the answer will be obvious to everyone.” Following this observation, the reservoir approach proposed by the Temple group maintains a summary about every set of similar examples instead of maintaining the individual examples.

 

The proposed reservoir approach was tested on 10 large databases (some of them containing more than 1 million examples) from various domains, where it was assumed that the reservoir can maintain information about only 100 examples. In most cases, the achieved accuracy was within only a few percent of the ideal solution achieved when the reservoir size was not limited. On the other hand, the accuracy was significantly larger (often by more than 10%) than when the randomly selected 100 examples were maintained in the reservoir (see the Figure). This result provides strong support for the hypothesis that innovative data mining methods can enable high-quality learning from large data sets using resource-limited computing devices. It promises that we should be able to successfully tackle large-scale knowledge discovery and decision support problems in areas such as the geosciences, the biomedical sciences, particle physics, multimedia databases, the internet, or business applications.

 

The ongoing activities on the project are focused on development of the specialized memory-constrained algorithms for the most popular data mining algorithms such as support vector machines, Gaussian processes, and decision trees. The effort is also concentrated on improving robustness of the developed algorithms such that they are virtually parameter-free and easy to use even by the non-experts. Finally, the scientists on the project are developing data compression techniques to boost the reservoir capacity by attribute discretization, data ordering, and compression of the prediction models. The preliminary results are very promising and indicate that high-quality knowledge discovery is possible using devices ranging from powerful workstations to handheld computers and cell phones to small, inexpensive sensors.

 

 

 

Figure1. left) Noisy checkerboard data – the goal is to discriminate between black and yellow dots and the achievable accuracy is 90%, middle) 100 randomly selected examples and the trained prediction model that has 76% accuracy, right) 100 examples selected by the reservoir algorithm and the trained prediction model that has 88% accuracy