Effective representation learning is critical for short text clustering because short text corpora are sparse, high-dimensional, and noisy. Existing pre-trained models (e.g., Word2vec and BERT) have greatly improved the expressiveness of short text representations, providing more condensed, low-dimensional, and continuous features than the traditional Bag-of-Words (BoW) model. However, these models are trained for general purposes and are thus suboptimal for the short text clustering task. In this paper, we propose two methods that exploit the unsupervised autoencoder (AE) framework to further tune the short text representations produced by these pre-trained text models for better clustering performance. In our first method Structur...
Text data mining is a growing research field where machine learning and NLP are important technologie...
Data sparseness, the evident characteristic of short text, is caused by the diversity of l...
Clustering narrow domain short texts is considered to be a complex task because of the intrinsic fea...
Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF repr...
Recently there has been an increase in interest towards clustering short text ...
Supervised and unsupervised learning have been the focus of critical research in the areas of machin...
Extracting knowledge from text data is a complex task that is usually performed by first structuring...
One decisive problem of short text classification is the severe curse of dimensionality when utilizing...
Aiming at the sparsity of short text features, lack of context, and the inability of word embedding ...
Classifying short texts to one category or clustering semantically related texts is challen...
Clustering aided classification methods are based on the assumption that the learned clusters under ...
This paper addresses the problem of learning to classify texts by exploiting information derived fro...
This paper presents SimCTC, a simple contrastive learning (CL) framework that greatly advances the s...
Clustering has been employed to expand training data in some semi-supervised learning methods. Clust...
So far, various methods have been used to classify text. One of the methods of text classification i...