Large text corpora are indispensable for natural language processing. In fields such as literature and the humanities, however, many documents of interest exist only as scanned images and have not been converted to text. Optical character recognition (OCR) converts scanned document images into text, but it often misrecognizes characters when scan quality is low, which substantially degrades the quality of the resulting corpora. This paper addresses corpus construction for historical newspapers. We present a corpus construction method based on a pipeline of image processing, OCR, and filtering. To improve corpus quality, we further propose integrating OCR error correction. ...
We present an approach for automatic detection and correction of OCR-induced misspellings in histori...
This paper tackles the task of named entity recognition (NER) applied to digitized historical texts ...
Historical documents pose a challenge for character recognition due to various reasons such as font ...
Optical character recognition (OCR) for historical documents is a complex procedure subject to a uni...
A trend to digitize historical paper-based archives has emerged in recent years, with the advent of ...
The use of OCR software to convert printed characters to digital text is a fundamental tool within d...
For more than a decade, Republican magazines and newspapers have been collected by institutes and pr...
Crowdsourcing approaches for post-correction of OCR output (Optical Character Recognition) have been...
Digital humanities research that requires the digitization of medium-scale, project-specific texts c...
This poster was presented at the National eScience Symposium and describes the research project t...