An empirical configuration study of a common document clustering pipeline

Eklund, Anton
Forsman, Mona
Drewes, Frank

Publication date

January 2023

DOI

10.3384/nejlt.2000-1533.2023.4396

Abstract

Document clustering is frequently used in applications of natural language processing, e.g. to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no less than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PC...
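The three-stage shape of the pipeline described in the abstract (vectorize, reduce dimensions, cluster) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the bag-of-words vectorizer stands in for BERT or Doc2Vec, the variance-based coordinate selection stands in for PCA or UMAP, and a plain K-Means with a deterministic farthest-point initialisation stands in for a library K-Means or HDBSCAN. The function names, the toy documents, and the value `n_dims=4` are all assumptions chosen for illustration and are unrelated to the 15-dimension finding reported above.

```python
import math
from collections import Counter


def vectorize(docs):
    """Toy stand-in for BERT/Doc2Vec: L2-normalised bag-of-words counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        v = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def reduce_dims(vectors, n_dims):
    """Toy stand-in for PCA/UMAP: keep the n_dims highest-variance coordinates."""
    n = len(vectors)
    dims = range(len(vectors[0]))
    means = [sum(v[j] for v in vectors) / n for j in dims]
    variances = [sum((v[j] - means[j]) ** 2 for v in vectors) / n for j in dims]
    keep = sorted(dims, key=lambda j: variances[j], reverse=True)[:n_dims]
    return [[v[j] for j in keep] for v in vectors]


def kmeans(points, k, n_iter=20):
    """Lloyd's K-Means with farthest-point initialisation (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels


# Hypothetical toy corpus: two sports headlines, two finance headlines.
docs = [
    "striker scores late goal",
    "keeper saves penalty goal",
    "bank raises interest rates",
    "bank shares fall as interest rates rise",
]

labels = kmeans(reduce_dims(vectorize(docs), n_dims=4), k=2)
print(labels)  # [0, 0, 1, 1]
```

Because each stage only consumes the output of the previous one, any stage can be swapped independently, which is what makes a configuration study over vectorizer, reducer, and clusterer combinations possible.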
