Consortial collections have led to unprecedented scales of digitized corpora, but the insights that they enable are hampered by the complexities of access, particularly to in-copyright or orphan works. Pursuing a principle of non-consumptive access, we developed the Extracted Features (EF) dataset, a dataset of quantitative counts for every page of nearly 5 million scanned books. The EF dataset includes unigram counts, part-of-speech tagging, header and footer extraction, counts of characters at both sides of the page, and more. Distributing book data with features already extracted saves resource costs associated with large-scale text use, improves the reproducibility of research done on the dataset, and opens the door to datasets on copyrighted b...
While digital libraries based on page images and automatically generated text have made possible ma...
Text mining and information visualization techniques applied to large-scale historical and literary ...
A wealth of digital texts and the proliferation of automated research methodologies enable researche...
The emergence of large multi-institutional digital libraries has opened the door to aggregate-level ...
The digitization of millions of books under programs such as Google Book Search and Microsoft Live S...
We report on the work undertaken developing a web environment that allows users to search over 1 tri...
We present an automatic, learned model for the extraction of poetry from digitally scanned books. Th...
Research literature contains some of the most important information we have assembled as human speci...
The method by which users have traditionally exploited digital resources such as Early English Books...
Guest presentation in Projects in Rare Book Digitization course (Pratt University, LIS 666) on analy...
Companies including Jellybooks and Amazon have introduced analytics to collect, analyze and monetize...
The core of this experiment is the use of the entity-fishing algorithm, as created and deployed by D...
The article is a researcher's eye view of the value of the library catalog not only as a database to...