Department
of Computer and Information Sciences
Temple
University
303 Wachman Hall
1805 N.
Broad Street
Philadelphia
PA 19122
215-204-5774
http://www.ist.temple.edu/~vucetic
In the environment where new large-scale problems are
emerging in various disciplines and pervasive computing applications are
becoming more common, there is an urgent need for techniques that provide
efficient and accurate knowledge discovery by limited-capacity computing
devices. The objective of this project is to address this need by developing memory-constrained
predictive data mining algorithms that operate when data size exceeds the
available memory capacity. The approach is based on the integration of data
mining and data compression techniques to optimally utilize the memory for data
and model storage, learning, and ancillary operations. The methodology will be
thoroughly evaluated on a range of real-life problems that includes learning
from large sorted databases, biased data and nonstationary data. Various memory
constraints will be considered pertaining to devices ranging from powerful
workstations to handheld computers and cell phones to small, inexpensive
sensors. This research will reveal the memory lower bounds for accurate
learning from different types of data and by different types of learning
algorithms. The educational component of the project seeks to integrate
research into computer science instruction by designing exciting courses,
exploring effective teaching techniques, introducing research to undergraduate
and graduate students, and involving underrepresented student groups in
research. Broader impacts of the project will be in extending the frontiers of
computer and information science and in facilitating knowledge discovery in
various scientific, engineering, and business disciplines. Teaching materials
and research results, including developed software and databases, will be
widely disseminated to promote learning and enhance scientific understanding.
Bencella Wisner, presented results of her research as a poster titled "Analysis and Prediction of Traffic Volume in Large Transportation Systems" at the 10th Annual Philadelphia Alliance for Minority Participation (AMP) Research Symposium, where she received the second prize in the Mathematics and Computer Sciences Category.
During the first two years of the project, research activities included efforts on all three specific aims of the project, with the emphasis on the first and second aim. The following is a summary of these activities.
AIM 1: Reservoir Data Mining Algorithms. In this aim, the memory-constrained problem is relaxed by assuming that the memory can hold at most M examples, where M is much smaller than the available number of examples N. The goal is to retain the most informative sample of size M, after sequentially reading all available examples in a single pass.
Memory-constrained data mining introduces many challenges when learning from large data sets. It is often not practical or possible to keep the entire data set in main memory and often the examples could only be observed in a single run in the order in which they are presented. Traditional reservoir-based approach, where only a subset of observed examples is kept in the memory, performs well in this situation. Its major drawback is that the examples not included in the final reservoir are ignored. To remedy this situation we proposed a modification to the baseline reservoir algorithm. Instead of keeping the actual target values of reservoir examples, an estimate of their conditional expectation is kept and updated online as new examples are observed from the stream. If the estimate is successful, the resulting reservoir will have considerably reduced noise compared to the baseline reservoir algorithm. To get the estimate for the j-th reservoir example, target values of all similar observed examples are averaged. We developed a statistically-based procedure for neighborhood update that starts from a large neighborhood and proceeds by reducing the neighborhood size as soon as the statistical test indicates it.
We are exploring a reservoir approach that is based on the well-known observation that selective sampling leads to more accurate predictors than random sampling. The underlying idea is that it is beneficial to oversample regions of attribute space that are difficult to learn or where prediction is the most uncertain. Our approach is based on representing a data set with a number of prototypes that are updated in an online fashion and maintained in the reservoir. Each prototype is a representative of a small region of the attribute space. To guide selection and updating of the prototype our approach relies on the incremental-decremental support vector classifier. This is an online algorithm that is able to optimally update the existing classifier when an example is added or removed from the training data set. After each stream example is observed, our algorithm decides if it should be included in the reservoir as a new prototype. We explored several criteria and the most successful is the one that allows inclusion only when the new example is within the classification margin. When a new example is added to the reservoir, to maintain the constant number of the prototypes, the two closest prototypes are merged. The incremental-decremental algorithm is employed to update the support vector classifier based on the changes in the reservoir content.
AIM 2: Compression of Datasets and Prediction Models. This aim explores how to compress the data to maximize the effective size of the reservoir. A critical component for the optimal use of the available memory is the compression of data and prediction models.
Attribute quantization followed by compression can result in an effective increase of the reservoir size. The main challenge is to find the best tradeoff between quantization level and the number of examples that can be fitted in the reservoir. To perform the quantization, ancillary data and code should be stored in the memory. Our current efforts are focused on distributed data mining scenario where attributes are collected by several sensors and transmitted to a centralized location where learning is taking place. We are exploring how to quantize attributes from each sensor to minimize transmission costs and allow high quality learning. Our approach is based on vector quantization where each sensor observation is represented by its nearest codevector and where codevector label is optimized to increase the learning accuracy. As an alternative to attribute quantization, we are also exploring attribute selection. Our efforts are focusing on attribute selection using side information about attributes in microarray classification application. The proposed approach is looking at the functional properties of differentially expressed genes and is using only a subset of genes with most characteristic functional properties.
The reservoir approach that uses selective sampling and incremental-decremental support vector machines studied in Task 1.2 naturally leads to model compression. This is because SVM model is described by support vectors – decreasing their amount leads to memory savings for model storage. We studied a simple extension of our original algorithm that quantizes reservoir support vectors. Here, we used simple scalar quantization where b-bit representation is used instead of the standard 32-bit (single precision) or 64-bit (double precision) representation of the attributes. We explored the best tradeoffs between number of bits used and prediction accuracy.
AIM 3. Integrating Data Mining and Compression Algorithms. The aim is to perform an extensive study over a number of scenarios, including different data types and different levels of memory and processing constraints. In the second year, we focused on collecting and processing several excellent benchmark data sets for the planned evaluation. The first is remote sensing data related to aerosol retrieval collected by NASA's MISR and MODIS instruments. The second is EpiSims simulated data corresponding to a particular instance of infection outbreak in Portland, OR. The data consists of 5 data tables with detailed information for about 1.6 million people and 240 thousand locations in Portland, including people's movement, activities, social contacts, and the infection spread. The third is real-life Minneapolis traffic data collected in 30-second intervals since year 2000 by over 4,000 loop detectors. We are in the process of establishing memory-unconstrained upper bounds on accuracy. The accuracies will be useful when comparing with memory-constrained approaches developed in this project. We started studying an important class of multiple instance learning problems that is common to remote sensing applications. There, spatially contiguous regions are being observed by remote sensing instruments and a subset of observed regions is being labeled through accurate ground-based measurements. This set of labeled regions is then used to train a predictive model to be used over unlabeled regions. The issue that arises is that each region results in a bag of instances that contain multiple observations that are all labeled the same. An interesting property of such data from the perspective of this project is that individual bags can contain thousands of examples, that number of bags can also be very large (e.g tens of thousands), and that their number increases in time. These are exactly conditions that require use of memory-constrained data mining methods. As a preliminary study we proposed two methods for multiple instance regression where the goal is to select a small subset of the available examples and use them to train a highly accurate predictor.
Education and outreach activities. Activities for integration of research and education involved supervising 3 undergraduate independent studies where students were involved in data collection, processing, and memory-unconstrained data mining of the 3 data sets of the aim 3 mentioned above. One of them (Bencella Wisner) did her summer internship on the task of Minneapolis traffic data collection and prediction of traffic patterns. During 8 weeks of summer 2007 she was funded as a research assistant from the project. The 5 graduate students involved in the project coupled their research with the independent study courses I supervised, where they got familiar with the related work in the field. The preparation is ongoing towards establishment of the Scientific Data Mining Lab at the PI's department whose one of the major goal will be to promote the activities on this project among undergraduate and graduate students through student seminars, invited talks, and regular lab meetings. Establishment of the lab is planned for Spring 2008.
The research has already resulted in some interesting findings.
Reservoir methods with estimation of conditional expectation. We derived an analytical expression for relationship between neighborhood size, data density, regression function complexity, and quality of conditional expectation estimate. We showed that the optimal neighborhood size is directly proportional to target noise and inversely proportional to data density and complexity of the target function. Our experiments on synthetic two and three dimensional data streams representing a complex regression function showed that our approach for conditional expectation estimation is highly successful and that it achieves good performance regardless of the level of target noise. In all experiments, the quality of reservoir points gradually achieved that of noise-free version of the original reservoir examples. The success of target noise reduction is directly translated into capability to learn highly accurate prediction models.
Reservoir methods with selective sampling. We evaluated our selective sampling algorithm that uses incremental-decremental SVM on more than 10 large synthetic and real data sets with sizes from 10,000 to 1,000,000 examples from UCI data repository. The results clearly indicate success of the proposed algorithms. In the experiments we used reservoir sizes ranging from 50 to 500 examples. In all cases, the accuracy of our procedure was much higher than when using the random reservoir sample. Interestingly, as the size of the data stream increased, the accuracy of the resulting SVM increased and often approached the accuracy of the memory-unconstrained SVM. This is a significant result in that it illustrates that high-quality prediction models can be constructed from large data streams using highly resource-limited computational devices.
Data compression by attribute quantization. We are comparing our prototype-based algorithm for decentralized estimation with several existing approaches, including the algorithm that builds regression trees for each distributed sensor. Our results indicate that our memory-light quantization procedure can result in near-optimal prediction accuracy. Our future efforts will include evaluation of a number of scenarios including varying number of sensors, training data size, quantization levels, and sensor memory and processing power.
Prediction model compression. Our preliminary results indicate that incorporating support vector quantization as a part of our reservoir selective sampling algorithm can result in significant improvements in memory utilization. In fact, for the fixed memory size, the quantization can effectively increase the reservoir several times which can, in turn, lead to significant improvements in learning accuracy. Our future efforts will focus on the fine tuning of the procedure and exploring the best tradeoffs between reservoir capacity (number of support vector that could be maintained) and quantization coarseness.
Integrating Data Mining and Compression Algorithms. As a part of the initial analysis of the three benchmark data sets we collected, we observed that an important challenge in each of them is attribute selection. Attribute extraction and selection, if done properly, can largely reduce dimensionality of the data and result in significant increases in the number of examples that can be held in the memory. We are experimenting with several promising attribute extraction and selection algorithms for each of the three data sets. On the problem of multiple instance regression, we developed two novel algorithms for removal of noisy instances and tested them on synthetic and real-life remote sensing data. Our experiments showed that they are more successful than the existing approaches that include standard supervised learning, an aggregation algorithm that represents each bag as a single example, and the prime multiple instance algorithm.
Contributions within Discipline:
The research addresses a vital question of efficient knowledge discovery from large volumes of data available in various scientific, engineering, and business-oriented domains. The merit of the ongoing activities stems form integration of ideas from data mining and data compression. The methodology resulting from this project will have broad impact on computer science, and, more generally, on the whole field of information technology.
Contributions to Other Disciplines:
The project is addressing learning from a large variety of scientific, engineering, and business-related data sets. The methodologies that are being developed will facilitate knowledge discovery from such data and will thus contribute to a large array of disciplines.
Contributions to Education and Human Resources:
The project has already involved 7 graduate students and 4 undergraduate students. They gained valuable research skills in computer science and information technology.