A classifier to determine page quality from an Optical Character Recognition (OCR) perspective is developed. It classifies a given page image as either good (i.e., high OCR accuracy is expected) or bad (i.e., low OCR accuracy is expected). The classifier is based on measuring the amount of white speckle, the amount of broken pieces, and the overall size information in the page. Two different sets of test data were used to evaluate the classifier: the Test dataset containing 439 pages and the Magazine dataset containing 200 pages. The classifier recognized 85% of the pages in the Test dataset correctly. However, approximately 40% of the low-quality pages were misclassified as good. To solve this problem, the classifier was modified to re...
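The abstract above reports two different figures: overall accuracy, and the share of low-quality pages misclassified as good. The following is a minimal sketch of how those two numbers can diverge. It is not the authors' evaluation code; the function name `evaluate`, the label strings, and the ten-page toy data are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def evaluate(true_labels: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Return (overall accuracy, fraction of bad pages predicted as good).

    Labels are the strings "good" and "bad" (a hypothetical encoding).
    """
    if len(true_labels) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    pairs = list(zip(true_labels, predicted))
    correct = sum(1 for t, p in pairs if t == p)
    bad_total = sum(1 for t, _ in pairs if t == "bad")
    bad_missed = sum(1 for t, p in pairs if t == "bad" and p == "good")

    accuracy = correct / len(pairs)
    # If there are no bad pages at all, the miss rate is reported as 0.0.
    miss_rate = bad_missed / bad_total if bad_total else 0.0
    return accuracy, miss_rate


# Hypothetical toy data: 8 good pages and 2 bad pages.
# One of the two bad pages is wrongly predicted as good.
true_labels = ["good"] * 8 + ["bad"] * 2
predicted = ["good"] * 8 + ["good", "bad"]

accuracy, miss_rate = evaluate(true_labels, predicted)
print(f"overall accuracy: {accuracy:.2f}")
print(f"bad pages misclassified as good: {miss_rate:.2f}")
```

Because good pages dominate the toy set, a single missed bad page barely lowers overall accuracy while the miss rate on bad pages is large, which is the kind of imbalance the abstract describes.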
Mass digitization of historical documents is a challenging problem for optical character recognition...
Iterating with new and improved OCR solutions enforces decision making when it comes to targeting th...
Though there are millions of websites on the internet, half of the ones we come across do not provid...
Systems that predict optical character recognition (OCR) accuracy of an input image by a given OCR s...
We propose a set of metrics that evaluate the uniformity, sharpness, continuity, noise, stroke wi...
We present a method for automatically selecting the best filter to treat poorly printed documents us...
OCR often performs poorly on degraded documents. One approach to improving performance is to determi...
The World Wide Web and search engines are widely used, and getting good results from searches is imp...
Optical Character Recognition (OCR) is the mechanical or electronic translation of scanned images of...
This document notes most of the research I had done for the National Library of the Netherlands (Kon...
Over the past years, considerable effort has been put into digitising library collections. As part o...
Machine understanding of documents has become a fundamental element in applications dealing with lar...
With the growth of web data, how to estimate web page quality effectively and rapidly becomes more a...
The purpose of this research is to measure the amount of image quality loss due to the effect of scr...
The millions of pages of historical documents that are digitized in libraries are increasingly used ...