In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the OCR output is limited, especially with regard to the oldest parts of the collection. Approaches such as crowd-sourcing have been used in this field to improve the quality of the texts, but in this case the volume of the materials makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objecti...
impresso. Media Monitoring of the Past is an interdisciplinary research project in which a team of c...
Effects of Optical Character Recognition (OCR) quality on historical information retrieval have so f...
The user expectation from a digitized collection is that a full text search can be performed and tha...
Increased digitization of historical newspapers by cultural heritage institutions has allowed human...
The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera pub...
The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera pub...
The availability of large digital archives of historical newspaper content has transformed the histo...
The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera pub...
This paper presents experiments on Optical character recognition (OCR) as a combination of Ocropy so...
We present the results of text reuse detection, based on the corp...
In 2022, it is commonplace that digital historical newspapers (DHN) have become increasingly avai...
Book chapter that documents the “Mapping Texts” project, an experiment focused on the problem of OCR...
Large text corpora are indispensable for natural language processing. However, in various fields suc...
In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tes...