Legal documents often have a complex layout with many different headings, headers and footers, side notes, etc. For further processing, it is important to extract these individual components correctly from a legally binding document, for example a signed PDF. A common approach is to classify each (text) region of a page using its geometric and textual features. This approach works well when the training and test data have a similar structure and when the documents of the collection to be analyzed have a rather uniform layout. We show that the use of global page properties can improve the accuracy of text element classification: we first classify each page into one of three layout types. After that, we can train a classifier for ...
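The two-stage idea in this abstract (classify the page layout first, then classify the text regions) can be illustrated with a minimal sketch. The abstract is truncated, so the second stage shown here, one region classifier per layout type, is an assumption and not necessarily the authors' design. The nearest-centroid classifier, the layout names (`plain`, `two_column`, `side_notes`), the feature vectors, and the `TwoStageClassifier` class are all hypothetical and chosen only to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean


def fit_centroids(samples):
    """Return {label: centroid} from (feature_vector, label) pairs."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def nearest_label(centroids, features):
    """Pick the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=distance)


class TwoStageClassifier:
    """Stage 1 predicts a page layout type; stage 2 uses a region model
    trained only on pages of that layout (assumed design, see lead-in)."""

    def __init__(self):
        self.page_model = {}
        self.region_models = {}

    def fit(self, pages):
        # pages: list of (page_features, layout_type, regions), where
        # regions is a list of (region_features, region_label) pairs.
        self.page_model = fit_centroids([(pf, lt) for pf, lt, _ in pages])
        by_layout = defaultdict(list)
        for _, layout, regions in pages:
            by_layout[layout].extend(regions)
        self.region_models = {
            layout: fit_centroids(regions) for layout, regions in by_layout.items()
        }
        return self

    def predict(self, page_features, region_features_list):
        layout = nearest_label(self.page_model, page_features)
        model = self.region_models[layout]
        return layout, [nearest_label(model, rf) for rf in region_features_list]


# Toy data (invented). Page features: [column count, share of margin text].
# Region features: [font size, relative vertical position].
pages = [
    ([1.0, 0.0], "plain",
     [([12.0, 0.1], "heading"), ([10.0, 0.5], "body")]),
    ([2.0, 0.0], "two_column",
     [([14.0, 0.1], "heading"), ([12.0, 0.5], "body")]),
    ([1.0, 0.3], "side_notes",
     [([11.0, 0.1], "heading"), ([9.0, 0.5], "body"), ([7.0, 0.5], "side_note")]),
]

clf = TwoStageClassifier().fit(pages)

# The same region features are labeled differently depending on the page layout.
plain_result = clf.predict([1.0, 0.05], [[12.0, 0.1]])
two_column_result = clf.predict([2.0, 0.0], [[12.0, 0.1]])
```

In this toy setup a 12 pt region near the top of the page is labeled as a heading on a `plain` page but as body text on a `two_column` page, which is the kind of ambiguity that a global page property can resolve.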
In this paper, a new dataset is proposed for page layout analysis of born-digital documents. By extr...
This thesis explores the domain of document analysis and document classification within the PDF docu...
(Automatic) document classification is generally defined as content-based assignment of one or more ...
In this paper, a machine learning approach to support the user during the correction of the layout a...
We present a general approach for the hierarchical segmentation and labeling of document layout stru...
With an abundance of legal documents now available in electronic format, legal scholars and practiti...
Searching in a large heterogeneous collection of scanned document images often produce...
This paper presents an efficient technique for document page layout structure extraction and classif...
The availability of large, heterogeneous repositories of electronic documents is increasing rapidly,...
A paginated legal bundle is an indexed version of all the evidence documents considered relevant to ...
Since most legal documents are released in digital form nowadays, it has become more and more importa...
The current spread of digital documents has raised the need for effective content-based retrieval techni...
In this article, we present our work on baseline detection in images of histor...
Every day, thousands of digital documents are generated with useful information for companies, publi...
The purpose of document layout analysis is to locate textlines and text regions in docum...