Optical character recognition (OCR) is crucial for deeper access to historical collections. OCR needs to account for orthographic variation, typefaces, and language evolution (i.e., new letters and word spellings), which are the main sources of character, word, and word-segmentation transcription errors. For digital corpora of historical prints, these errors are further exacerbated by low scan quality and a lack of language standardization. For the task of post-hoc OCR correction, we propose a neural approach that combines recurrent (RNN) and deep convolutional (ConvNet) networks. At the character level, we flexibly capture errors and decode the corrected output using a novel attention mechanism. Accounti...
The goal of this work is to develop statistical natural language models and processing techniques b...
Preserving historical archival heritage involves not only physical measures to safeguard these valua...
In this paper we describe our efforts in reducing and correcting OCR errors in the context of buildi...
For indexing the content of digitized historical texts, optical character recognition (OCR) errors a...
Over the past few decades, large archives of paper-based documents such as books and newspapers have...
Optical Character Recognition (OCR) is the extraction of textual data from scanned text documents to fa...
Optical character recognition (OCR) for historical documents is a complex procedure subject to a uni...
In this paper we present a novel approach to the automatic correction of OCR-i...
A trend to digitize historical paper-based archives has emerged in recent years, with the advent of ...
Born-analog documents contain enormous knowledge which is valuable to our society. For the purpose o...