UVM Theses and Dissertations
Format:
Print
Author:
Li, Zhao
Dept./Program:
Computer Science
Year:
2012
Degree:
PhD
Abstract:
Clustering analysis, an unsupervised learning paradigm that partitions objects into groups of similar items, has long played an important role in data mining. Given a data matrix, in contrast to traditional one-sided clustering of either its rows or its columns, co-clustering treats the matrix symmetrically, so that a partitioning of the rows induces a partitioning of the columns and vice versa. By clustering the rows and columns of a data matrix simultaneously, co-clustering methods have been successfully applied in text mining, gene expression analysis, image retrieval, and recommendation systems. An important issue in applying co-clustering methods, however, is how to choose a good algorithm and make it practical in real applications. Several research questions arise: how to relate different co-clustering algorithms to one another; how to effectively discover the structure of high-dimensional data; how to design a fast co-clustering algorithm with competitive results; and how to determine the number of co-clusters automatically. In this dissertation, we propose effective strategies as follows to address these questions from the perspective of matrix computation.
1. A Weighting Scheme for Co-clustering: By proposing a weighting scheme, we establish the relationship between different co-clustering algorithms; the weighting scheme also leads to better clustering results.
2. Nonnegative Matrix Tri-factorization based on Orthogonal Subspaces: By effectively adjusting the orthogonality of the factor matrices, the parts-based structure of high-dimensional data is clearly revealed (a brief sketch of the tri-factorization model follows this list).
3. Fast Co-clustering by Ranking and Sampling: By discriminately sampling representative data points, the computational complexity of co-clustering analysis is reduced to linear time while maintaining competitive results.
4. Single Updated Matrix Decomposition: By incorporating alternating least squares into nonnegative matrix tri-factorization, only one small factor matrix needs to be updated, which tremendously reduces the space and time complexities. Meanwhile, our method outperforms state-of-the-art co-clustering algorithms on large datasets.
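For illustration, the sketch below shows the tri-factorization view of co-clustering that underlies contributions 2 through 4: a nonnegative data matrix X is approximated as F S G^T, and row and column cluster assignments are read off the factors F and G. The multiplicative update rules, rank choices, and toy data are generic textbook choices assumed for demonstration only; they are not the dissertation's weighting, orthogonality, sampling, or single-update algorithms.

import numpy as np

def nmtf_coclustering(X, row_k, col_k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix X (m x n) as F (m x row_k) @ S (row_k x col_k) @ G.T."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, row_k))
    S = rng.random((row_k, col_k))
    G = rng.random((n, col_k))
    for _ in range(n_iter):
        # Standard multiplicative updates for the Frobenius objective; they keep all factors nonnegative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    # Hard co-cluster assignments: each row/column goes to its largest factor weight,
    # so the row partition and column partition are induced jointly from one factorization.
    return F.argmax(axis=1), G.argmax(axis=1)

if __name__ == "__main__":
    # Toy block-structured matrix with two row clusters and two column clusters.
    X = np.block([[np.ones((5, 4)) * 5.0, np.ones((5, 6)) * 0.1],
                  [np.ones((7, 4)) * 0.1, np.ones((7, 6)) * 5.0]])
    row_labels, col_labels = nmtf_coclustering(X, row_k=2, col_k=2)
    print("row clusters:", row_labels)
    print("column clusters:", col_labels)

On this toy input the recovered row and column labels align with the planted blocks, which is the behavior the co-clustering formulation is meant to capture; the dissertation's methods address the efficiency and model-selection issues this naive version ignores.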
We have also built an online search engine named "Descriptive Clustering Meta Search Engine" to address several application problems in web information retrieval, such as determining the number of clusters and describing the resulting clusters.