The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpus-Assisted Discourse Studies. However, OCR software is not fully accurate, and the resulting error rate may compromise the qualitative analysis in such studies. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for enhancing the quality of historical corpora. We applied this methodology to two case studies on early 20th-century newspapers, analysing the metaphors used to represent migration and pandemics. The outcome of this project is a set of rules which are, eventually, valid f...
Historical documents pose a challenge for character recognition due to various reasons such as font ...
This article aims to quantify the impact optical character recognition (OCR) has on the quantitative...
We present an approach for automatic detection and correction of OCR-induced misspellings in histori...
Large text corpora are indispensable for natural language processing. However, in various fields suc...
At a time when the quantity of - more or less freely - available data is incre...
Optical character recognition (OCR) for historical documents is a complex procedure subject to a uni...
Digital libraries make it possible not only to improve the preservation of documents and to facilitate access b...
The increasing availability of historical sources in a digital form has led to calls for new forms o...
The study of texts using a qualitative approach remains the dominant modus operandi in humanities re...
In this paper we describe our efforts in reducing and correcting OCR errors in the context of buildi...
A trend to digitize historical paper-based archives has emerged in recent years, with the advent of ...