Most low-resource languages lack the resources needed to create even a substantial monolingual corpus. Text in these languages is often available in government proceedings, but mainly as Portable Document Format (PDF) files that use legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging because the legacy fonts and printer-friendly encoding are not optimised for text extraction. Therefore, we propose a simple, automatic, and novel approach that scales to Tamil, Sinhala, and English and to many documents. For this purpose, we enhanced the performance of Tesseract 4.1.1 by employing LSTM-based training on many legacy fonts to recognise printed characters in the above languages. Espec...
The word error rate of any optical character recognition (OCR) system is usually substantially below...
Even though the technological and digital world is expanding more quickly, there are still many things t...
The robustness of a typical Handwritten character recognition system relies on the availability of c...
Optical Character Recognition (OCR) refers to the process of converting printed Tamil te...
systems for Indic scripts are not robust enough for recognizing arbitrary collection of printed docu...
The task of printed Optical Character Recognition (OCR), though considered "solved" by many, stil...
Language modeling has witnessed remarkable advancements in recent years, with Large Language Models ...
This paper focuses on recognizing Tamil characters that are handwritten and displaying their digital...
Handwritten character and number recognition remains challenging after decades of study of offline I...
As it is the seventh most-spoken language and fifth most-spoken native language in the world, the do...
Analyzing existing machine translation approaches for Sinhala-Tamil official government documents h...
We present an early version of a complete Optical Character Recognition (OCR) system for Tamil newsp...
Optical Character Recognition (OCR) is the machine conversion of handwritten or typed data into mach...