Investigators
Lazarevic Aleksandar
Mehr I.
Obradovic Zoran
Raghavendra C.
Venkateswaran R.
Vucetic Slobodan
Problem
The objective is to develop automated data mining techniques for extracting useful knowledge from massive data set in parallel and distributed environment. The novel learning algorithms should have good computational scale-up properties and with the accuracy similar to what can be obtained by running extensive simulations on centralized data sequentially.
Results
- The Boosting Algorithm for Parallel and Distributed Learning
We propose a framework for parallel and distributed boosting algorithms (lazarevic01jb) intended for efficient integrating specialized classifiers learned over very large, distributed and possibly heterogeneous databases that cannot fit into main computer memory. Our parallel boosting algorithm is designed for tightly coupled systems with a small number of processors and a shared memory, with an objective of achieving the maximal prediction accuracy in fewer iterations than boosting on a single processor. The distributed boosting algorithm is proposed primarily for learning from several disjoint data sites when the data cannot be merged together, although it can also be used for parallel learning where a massive data set is partitioned into several disjoint subsets for a more efficient analysis. At each boosting round, the proposed method combines classifiers from all sites and creates a classifier ensemble on each site. The final classifier is constructed as an ensemble of all classifier ensembles built on disjoint data sets. - Data Reduction
Recently, we started developing performance controlled data reduction procedures aimed at facilitating repeated data use with complex and slow learning algorithms as well as at allowing an efficient data transfer and a centralized storage. Our approach consists of identifying and eliminating redundant data followed by compressing driving attributes based on a sensitivity analysis. Using our algorithm and allowing small modeling accuracy loss impressive database reductions were achieved on various real-life domains (e.g. compression up to 13 thousand times with 1% to 5% accuracy loss for a large forestry spatial data. These results indicate that transferring reduced data sets from multiple locations to a centralized site for an accurate knowledge discovery might often be possible in practice (vucetic00b). - Distributed Clustering
A novel distributed method for learning from heterogeneous spatial databases is proposed. Similar regions in multiple databases are identified by independently applying a spatial clustering algorithm on all sites, followed by transferring convex hulls corresponding to identified clusters and their integration. For each discovered region, the local regression models are built and transferred among data sites. The proposed method is shown to be computationally efficient and fairly accurate when compared to an approach where all the data are available at a central location (lazarevic00c). - Parallel and Distributed Neural Network Learning
A highly parallel neural network learning approach based on repetitive bounded depth trajectory branching is proposed (mehr96j), (mehr94).The algorithm was shown to improve accuracy with a much faster convergence by exploring a number of alternative trajectories in parallel in order to find one that avoids local minima.
In another approach cooperative efforts of several neural networks were used for more efficient gradient descent learning that is less depending on selection of an appropriate learning rate (venki94). Results on a network of heterogeneous workstations demonstrate that a team of few cooperative learners can be much more efficient than the best of corresponding individuals.
In another distributed algorithm, several single population genetic algorithms with different cross-over and mutation parameters are run as a set of processes that cooperate periodically and exchange information to solve the problem more efficiently (venki96).The algorithm is less stochastic than the standard genetic algorithm. System indicates that a distributed implementation can be used for application to real-life large scale problems.

