Abstract. Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “frac...
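The algorithm described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `concept_vector` and `spherical_kmeans`, the parameters `n_iter` and `seed`, the random choice of initial concept vectors, and the use of dense NumPy arrays (rather than sparse matrices) are all assumptions made for brevity. The sketch assigns each unit-normalized document to the concept vector with the highest cosine similarity, then recomputes each concept vector as the cluster centroid rescaled to unit Euclidean norm.

```python
import numpy as np


def concept_vector(members):
    """Centroid of the rows of `members`, rescaled to unit Euclidean norm."""
    centroid = members.mean(axis=0)
    norm = np.linalg.norm(centroid)
    return centroid / norm if norm > 0 else centroid


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Illustrative spherical k-means; returns (labels, concept_vectors)."""
    X = np.asarray(X, dtype=float)
    # Normalize every document vector to unit length (zero rows are left as-is).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)

    # Assumption: initialize with k distinct documents picked at random.
    rng = np.random.default_rng(seed)
    concepts = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(n_iter):
        # For unit vectors, the dot product equals the cosine similarity.
        labels = np.argmax(X @ concepts.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old concept vector if a cluster empties
                concepts[j] = concept_vector(members)

    # Final assignment against the last set of concept vectors.
    labels = np.argmax(X @ concepts.T, axis=1)
    return labels, concepts


# Toy term-count matrix: two documents use terms 0-1, two use terms 2-3.
X_toy = [[3, 1, 0, 0],
         [2, 1, 0, 0],
         [0, 0, 1, 4],
         [0, 0, 2, 3]]
labels, concepts = spherical_kmeans(X_toy, k=2)
```

Because the k clusters are disjoint, every document receives exactly one label, and each row of `concepts` has unit Euclidean norm. For realistic corpora with 95 to 99% sparsity, a sparse matrix representation would be the more practical choice than the dense arrays used here.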
A clustering algorithm that exploits special characteristics of a data set may lead to superior resu...
Document clustering is a popular tool for automatically organizing a large collection of texts. Clus...
Abstract: Clustering is the problem of discovering “meaningful” groups in given data. The first and...
Data accumulate, and there is a growing need for automated systems for partitioning data into groups, ...
Nowadays a typical document corpus may contain more than 5000 documents. It is almost impossib...
In this paper, a new approach to text clustering is proposed. Based on the concept-relational decomp...
The breakneck progress of computers and the web makes it easier to collect and store large amounts of infor...
Clustering text documents is a fundamental task in modern data analysis, requiring approaches which ...
Most document clustering algorithms operate in a high dimensional bag-of-words space. The inherent p...
Abstract. An invaluable portion of scientific data occurs naturally in text form. Given a large unlab...
In this paper we consider the problem of clustering collections of very short texts using subspace c...
This study focuses on high-dimensional text data clustering, given the inability of K-means to proce...
The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a po...