Abstract—To mine large digital libraries in humanistically meaningful ways, we need to divide them by genre. This is a task that classification algorithms are well suited to assist, but they need adjustment to address the specific challenges of this domain. Digital libraries pose two problems of scale not usually found in the article datasets used to test these algorithms. 1) Because libraries span several centuries, the genres being identified may change gradually across the time axis. 2) Because volumes are much longer than articles, they tend to be internally heterogeneous, and the classification task also requires segmentation. We describe a multilayered solution that trains hidden Markov models to segment volumes, and uses ensembles of...
In large-scale digital libraries, it is not uncommon that some bibliographic fields in metadata reco...
The current excitement regarding machine learning has spurred enthusiasm amongst collection hold...
The sizes of modern digital libraries have grown beyond our capacity to comprehend manually. Thus we...
Using regularized logistic regression and hidden Markov models, we predict genre at the page level i...
Large digital collections offer new avenues of exploration for literary scholars. But their potentia...
This workset is data in support of the article "Mapping Mutable Genres in Structurally Complex Volum...
Abstract. In the traditional setting, text categorization is formulated as a concept learning proble...
This paper shows how statistical machine learning can be used to identify bibliography styles, thu...
We propose a generative model based on latent Dirichlet allocation for mining distinct topics in doc...
This paper examines automated genre classification of text documents and its role in enabling the ef...
In this thesis, we compare the bag of words approach with doc2vec document embeddings on the task ...
Text categorization is typically formulated as a concept learning problem where each instance is a s...
This thesis treats the sociotechnical notion of genre as a conflation of a communicative situation a...
Digital media collections hold an unprecedented source of knowledge and data about the world. Y...
A topic model of 29,341 volumes of fiction, written in English and published between 1880 and 1999. ...