Recovering physical and logical structure from electronic documents is still an open issue. In this paper, we propose a flexible and efficient approach for recovering document structures from PDF files. After a brief introduction to the PDF format and its major features, we report on our evaluation of existing tools and prior work on PDF content extraction and analysis. To overcome the weaknesses of these systems, we propose a new analysis strategy based on an intermediate representation, called XCDF, which represents physical structures in a canonical way. The paper then describes the PDF reverse engineering workflow and focuses on the logical restructuring of documents. Finally, the paper concludes with potential futur...
Information can include text, pictures and signatures that can be scanned into a document format, su...
The automated discovery of logical structure in text documents is an important problem that has rece...
Most of the electronic documents available from today's huge number of electronic information sources...
This article presents Xed, a reverse engineering tool for PDF documents, which extracts the original...
Abstract. Accessing the structured content of PDF documents is a difficult task, requiring pre-proces...
Summary: We present a progress report on our ongoing project of reverse engineering scientific PDF do...
PDF has become a very common format for exchanging printable documents. Further, it can be easily genera...
Nowadays, PDF documents have become a dominant knowledge repository for both academia and indus...
The PDF format plays a crucial role in the field of electronic academic literature publishing, but d...
This paper describes a tool for recombining the logical structure from an XML document with the type...
The availability of large, heterogeneous repositories of electronic documents is increasing rapidly,...
A strategy for document analysis is presented which uses Portable Document Format (PDF the underlyin...
Abstract. Tables are a common structuring element in many documents, such as PDF files. To reuse suc...
Documents are often marked up in XML-based tagsets to delineate major structural components such as ...