Abstract
Accessing the structured content of PDF documents is a difficult task that requires pre-processing and reverse engineering techniques. In this paper, we first present different methods for accomplishing this task, based either on document image analysis or on electronic content extraction. We then propose XCDF, a canonical format with well-defined properties, as a suitable solution for representing structured electronic documents and as an entry point for further research and development. The system and methods used for reverse engineering PDF documents into this canonical format are also presented. Finally, we present current applications of this work in various domains, ranging from data mining to multimedia navigation, all of which benefit from our canonical format to access PDF document content and structures.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bloechle, JL., Rigamonti, M., Hadjar, K., Lalanne, D., Ingold, R. (2006). XCDF: A Canonical and Structured Document Format. In: Bunke, H., Spitz, A.L. (eds) Document Analysis Systems VII. DAS 2006. Lecture Notes in Computer Science, vol 3872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11669487_13
DOI: https://doi.org/10.1007/11669487_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32140-8
Online ISBN: 978-3-540-32157-6
eBook Packages: Computer Science, Computer Science (R0)