Feature representation and selection are major challenges in automatic document clustering. We developed a novel method for this problem based on an LSI (Latent Semantic Indexing) probabilistic subspace model. The top-ranking conceptual terms or term clusters are selected to represent the corpora according to their global and local statistical contributions to the LSI term space. Each term or document is then represented by a signature: the distribution of its local statistical contribution across the top-ranking LSI concept dimensions. Finally, two novel similarity measures are applied between the concept signatures and the document signatures, which bridge the LSI subspaces and significantly improve the performance of the cluster...
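The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions of "statistical contribution" or of the two similarity measures, so everything below is an assumption made for illustration: contributions are taken to be squared LSI coordinates, signatures are those contributions normalized to sum to one, and plain cosine similarity stands in for the paper's measures. The function names (`lsi_signatures`, `top_terms`, `cosine`) and the toy matrix `X` are hypothetical.

```python
import numpy as np


def lsi_signatures(X, k):
    """Build term and document signatures from a terms-by-documents matrix X.

    Assumption: a signature is the normalized squared coordinate of a term
    or document on each of the top-k LSI dimensions. This is an illustrative
    choice, not the definition used in the paper.
    """
    # Truncated SVD gives the rank-k LSI space.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_coords = U[:, :k] * s[:k]      # shape: (n_terms, k)
    doc_coords = Vt[:k, :].T * s[:k]    # shape: (n_docs, k)

    def to_signature(coords):
        weights = coords ** 2
        totals = weights.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0       # avoid dividing by zero for empty rows
        return weights / totals

    return to_signature(term_coords), to_signature(doc_coords), term_coords


def top_terms(term_coords, n):
    """Rank terms by total squared coordinate (assumed 'global contribution')."""
    global_contribution = (term_coords ** 2).sum(axis=1)
    return np.argsort(global_contribution)[::-1][:n]


def cosine(a, b):
    """Cosine similarity, used here as a stand-in for the paper's measures."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Toy corpus: terms 0-1 occur only in documents 0-1, terms 2-3 only in 2-3.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

term_sig, doc_sig, term_coords = lsi_signatures(X, k=2)
selected = top_terms(term_coords, n=2)
same_topic = cosine(doc_sig[0], doc_sig[1])
cross_topic = cosine(doc_sig[0], doc_sig[2])
```

In this toy setting, documents that share a topic end up with nearly identical signatures, while documents from different topics have nearly orthogonal ones. Comparing distributions over concept dimensions rather than raw coordinates is what lets terms and documents be matched in a common space.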
Due to the availability of internet-based abstract services and patent databases, bibliometric analy...
This paper presents work in progress on clustering methods that identify semantic concepts in a docu...
represents terms and documents by the distribution signatures of their statistical contribution acro...
Most document clustering algorithms operate in a high dimensional bag-of-words space. The inherent p...
Document clustering is a popular tool for automatically organizing a large collection of texts. Clus...
We propose a novel document clustering method, which aims to cluster the documents into different s...
The statistical characteristics of dimensionality in latent semantic analysis (LSA) space were studi...
In this paper, a comparative analysis of text document clustering algorithms based on latent semanti...
Text clustering is an established technique for improving quality in information retrieval, for both...
This paper proposes a novel document clustering method based on Probabilistic Latent Semantic Indexi...
Clustering is the problem of discovering “meaningful” groups in given data. The first and...