In this paper, we present a generator of semi-structured documents (SSDs). The generator provides samples of administrative documents that are useful for training information extraction systems. It also handles document annotation, which is generally difficult and time-consuming to do by hand. We propose a general structure for SSDs and show that it works on three SSD types: invoices, payslips and receipts. Both the content and the layout are governed by random variables, so that they can be varied to obtain different samples. These document types are similar enough to share a common global model, with particularities specific to each type. The generator outputs the documents on ...
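To make the idea of content and layout driven by random variables concrete, here is a minimal sketch. It is not the authors' generator: the field pools, the `generate_invoice` function, the plain-text output and the annotation format are all assumptions chosen for illustration. It draws the content (vendor, invoice number, items) and one layout choice (header order) at random, and records an annotation for each labeled field as the document is produced, so no manual annotation step is needed.

```python
import random

# Hypothetical field pools; the abstract does not list the paper's actual variables.
VENDORS = ["Acme SARL", "Globex SA", "Initech"]
ITEMS = [("Paper ream", 4.50), ("Toner", 39.90), ("Stapler", 12.00)]


def generate_invoice(seed):
    """Return (lines, annotations) for one synthetic invoice (illustrative only)."""
    rng = random.Random(seed)  # seeded so that each sample is reproducible

    # Content variables: vendor, invoice number, purchased items and quantities.
    vendor = rng.choice(VENDORS)
    number = f"INV-{rng.randint(1000, 9999)}"
    items = rng.sample(ITEMS, k=rng.randint(1, len(ITEMS)))
    quantities = [rng.randint(1, 5) for _ in items]
    total = sum(price * qty for (_, price), qty in zip(items, quantities))

    # Layout variable: the order in which the header fields appear.
    header = [("vendor", vendor), ("invoice_number", number)]
    rng.shuffle(header)

    lines, annotations = [], []
    for label, value in header:
        # The annotation is written at generation time, alongside the text.
        annotations.append({"label": label, "line": len(lines), "text": value})
        lines.append(value)
    for (name, price), qty in zip(items, quantities):
        lines.append(f"{name} x{qty} @ {price:.2f}")
    total_text = f"{total:.2f}"
    annotations.append({"label": "total", "line": len(lines), "text": total_text})
    lines.append(f"TOTAL {total_text}")
    return lines, annotations
```

Calling `generate_invoice` with different seeds yields different samples, each paired with its ground-truth labels. Payslips or receipts could reuse the same skeleton with their own field pools.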
An overwhelming number of legal documents is available in digital form. However, most of the texts a...
With the fast expansion of the World Wide Web, more and more semi-structured web documents appear on the...
Like for many text understanding and generation tasks, pre-trained languages m...
Information extraction from semi-structured documents comprises contents detection, wrappe...
The number of semi-structured documents that is produced is steadily increasing. Thus, it wi...
The day-to-day working of an organization produces a massive volume of unstructured data in the form...
The paper presents a new method for extracting information from semi-structured resources, based on ...
In recent times, semi-supervised clustering has been an area that has received a lot of attention....
Annotation is the process of adding information to a document which is useful for extracting t...
The increasing amount of available semistructured data demands efficient mechanisms to store, proces...
The number of domains and tasks where information extraction tools can be used needs to be increased...
Information extraction from printed documents is still a crucial problem in many interorganizational...
Nowadays we are speaking about Web 2.0, which means the web of documents rather than the web of data...
The semi-structured information available in HTML and similar documents provides valuable information...
This article presents Xed, a reverse engineering tool for PDF documents, which extracts the original...