XCDF: A Canonical and Structured Document Format

Bloechle, Jean-Luc; Rigamonti, Maurizio; Hadjar, Karim; Lalanne, Denis; Ingold, Rolf

doi:10.1007/11669487_13

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3872)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

Accessing the structured content of PDF documents is a difficult task that requires pre-processing and reverse engineering techniques. In this paper, we first present different methods for accomplishing this task, based either on document image analysis or on electronic content extraction. We then propose XCDF, a canonical format with well-defined properties, as a suitable solution for representing structured electronic documents and as an entry point for further research and development. The system and methods used for reverse engineering PDF documents into this canonical format are also presented. Finally, we present current applications of this work in various domains, ranging from data mining to multimedia navigation, all of which benefit from our canonical format to access PDF document content and structures.

Author information

Authors and Affiliations

DIVA Group, DIUF, University of Fribourg, Pérolles 2 – Bd de Pérolles 90, 1700, Fribourg, Switzerland
Jean-Luc Bloechle, Maurizio Rigamonti, Karim Hadjar, Denis Lalanne & Rolf Ingold

Editor information

Editors and Affiliations

Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012, Bern, Switzerland
Horst Bunke
DocRec Ltd, 34 Strathaven Place, 7001, Atawhai, Nelson, New Zealand
A. Lawrence Spitz

About this paper

Cite this paper

Bloechle, JL., Rigamonti, M., Hadjar, K., Lalanne, D., Ingold, R. (2006). XCDF: A Canonical and Structured Document Format. In: Bunke, H., Spitz, A.L. (eds) Document Analysis Systems VII. DAS 2006. Lecture Notes in Computer Science, vol 3872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11669487_13

DOI: https://doi.org/10.1007/11669487_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32140-8
Online ISBN: 978-3-540-32157-6
eBook Packages: Computer Science, Computer Science (R0)
