PROJECT INFORMATION:
Award: IIS-0546155
Title: CAREER: Memory-Constrained
Predictive Data Mining
PI: Slobodan Vucetic, Department
of Computer and Information Sciences,
Memory-constrained
learning – a cookbook for mining large databases using limited memory
computational devices
Data mining is a burgeoning field that concentrates on the development of
computationally efficient and accurate tools for knowledge discovery from very large
data sets. Many successful data mining algorithms scale linearly or even better
with data size and can learn with high accuracy from millions of examples on a
regular PC. Interestingly, linear or sub-linear scaling is often not enough in
a today’s world where data collections in sciences, medicine, engineering, and
business are reaching terabytes (8 followed by 12 zeros bits of information) or
even petabytes (8 followed by 15 zeros bits of information) and where there is
an increasing use of microcontrollers whose memory and computational capacity are
highly limited. A critical question in this new world is how to optimize
learning from data that exceeds the available computational resources by orders
of magnitude.
In the CAREER project “Memory-Constrained Predictive
Data Mining,” Prof. Vucetic and his students from
The proposed reservoir approach was tested on 10
large databases (some of them containing more than 1 million examples) from
various domains, where it was assumed that the reservoir can maintain
information about only 100 examples. In most cases, the achieved accuracy was
within only a few percent of the ideal solution achieved when the reservoir size
was not limited. On the other hand, the accuracy was significantly larger
(often by more than 10%) than when the randomly selected 100 examples were
maintained in the reservoir (see the
Figure). This result provides strong support for the hypothesis that
innovative data mining methods can enable high-quality learning from large data
sets using resource-limited computing devices. It promises that we should be
able to successfully tackle large-scale knowledge discovery and decision
support problems in areas such as the geosciences, the biomedical sciences,
particle physics, multimedia databases, the internet, or business applications.
The ongoing activities on the project are focused on
development of the specialized memory-constrained algorithms for the most
popular data mining algorithms such as support vector machines, Gaussian processes,
and decision trees. The effort is also concentrated on improving robustness of
the developed algorithms such that they are virtually parameter-free and easy
to use even by the non-experts. Finally, the scientists on the project are developing
data compression techniques to boost the reservoir capacity by attribute
discretization, data ordering, and compression of the prediction models. The
preliminary results are very promising and indicate that high-quality knowledge
discovery is possible using devices ranging from powerful workstations to
handheld computers and cell phones to small, inexpensive sensors.


Figure1. left)
Noisy checkerboard data – the goal is to discriminate between black and yellow
dots and the achievable accuracy is 90%, middle)
100 randomly selected examples and the trained prediction model that has 76%
accuracy, right) 100 examples
selected by the reservoir algorithm and the trained prediction model that has 88%
accuracy