In this paper, we propose a novel method that extends sequence-to-sequence models to accurately process sequences much longer than those seen during training while remaining sample- and resource-efficient, and we support it with thorough experimentation. To investigate the effectiveness of our method, we apply it to the task of correcting documents already processed with Optical Character Recognition (OCR) systems, using character-based sequence-to-sequence models. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve new state-of-the-art performance in five of them. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual cor...
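The split-and-combine strategy named in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it states how the individual corrections are combined, so the per-position majority vote used here is an assumption, not the authors' scheme. The names `split_windows`, `combine_by_vote`, `correct_document`, and the parameter `correct_window` are hypothetical, and the sketch further assumes that each corrected window keeps the length of its input, which a real sequence-to-sequence model would not guarantee.

```python
from collections import Counter


def split_windows(text, n, step=1):
    """Return (start, window) pairs that cover text with character n-grams."""
    if len(text) <= n:
        return [(0, text)]
    starts = list(range(0, len(text) - n + 1, step))
    if starts[-1] != len(text) - n:
        starts.append(len(text) - n)  # make sure the tail is covered
    return [(s, text[s:s + n]) for s in starts]


def combine_by_vote(length, corrected):
    """Assumed combination rule: majority vote per character position."""
    votes = [Counter() for _ in range(length)]
    for start, window in corrected:
        for offset, ch in enumerate(window):
            votes[start + offset][ch] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


def correct_document(text, correct_window, n=5, step=1):
    """Correct each window independently, then merge the overlapping outputs."""
    corrected = []
    for start, window in split_windows(text, n, step):
        fixed = correct_window(window)  # stand-in for the trained model
        if len(fixed) != len(window):
            raise ValueError("this sketch assumes length-preserving corrections")
        corrected.append((start, fixed))
    return combine_by_vote(len(text), corrected)


def toy_corrector(window):
    """Hypothetical stand-in for a model: maps the digit 0 to the letter o."""
    return window.replace("0", "o")


result = correct_document("hell0 w0rld", toy_corrector, n=5)
```

Because the windows overlap, most characters receive several independent corrections, which is what lets a model trained on short sequences be applied to a document of arbitrary length. Handling corrections that insert or delete characters would require aligning each output with its window before voting, which this sketch leaves out.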
Optical Character Recognition (OCR) is the extraction of textual data from scanned text documents to fa...
We consider the isolated spelling error correction problem as a specific subproblem of the more gene...
This paper describes a new method for language modelling and reports its application to handwritten ...
One of the major challenges of using historical document collections for research is the fact that O...
In this paper, stochastic error-correcting parsing is proposed as a powerful and flexible method to ...
Sequence learning describes the process of understanding the spatio-temporal relations in a sequenc...
This paper describes the second round of the ICDAR 2019 competition on post-OCR text correction and ...
Optical character recognition (OCR) is one of the most popular techniques used for converting printe...
Optical character recognition (OCR) remains a difficult problem for noisy documents or documents not...
Post-processing is a crucial step in improving the performance of the OCR process. In this paper, we pre...
Post-OCR is an important processing step that follows optical character recognition (OCR) and is mea...
For indexing the content of digitized historical texts, optical character recognition (OCR) errors a...
The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digita...
Combining character level and word level RNNs for post-OCR error detection Pos...