Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, oft...
Post-OCR is an important processing step that follows optical character recognition (OCR) and is mea...
Cultural heritage institutions increasingly make their collections digitally available. Consequently...
Crowdsourcing approaches for post-correction of Optical Character Recognition (OCR) output have been...
Humanities scholars increasingly rely on digital archives for their research instead of time-consumi...
Humanities scholars increasingly rely on digital archives for their research in place of...
Bias in the retrieval of documents can directly influence the information access of a digital librar...
Important legacy paper documents are digitized and collected in online accessible archives. This ena...
Historical newspapers are increasingly accessed digitally for different purposes both by p...
Iterating with new and improved OCR solutions enforces decision making when it comes to targeting th...
Effects of Optical Character Recognition (OCR) quality on historical information retrieval have so f...
Digitization of historical documents is a challenging task in many digital humanities projects. A po...
The user expectation from a digitized collection is that a full text search can be performed and tha...