Abstract
Automatically determining the publication date of a document is a complex task, since a document may contain only a few intra-textual hints about its publication date. Yet it has many important applications. Indeed, the number of digitized historical documents is constantly increasing, but their publication dates are not always properly identified via OCR acquisition. Accurate knowledge of publication dates is crucial for many applications, e.g. studying the evolution of document topics over a certain period of time.
In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system combines several individual systems, relying on both supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. Our system detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.
Keywords
- Support Vector Machine
- Publication Date
- Cosine Similarity
- Training Corpus
- Support Vector Machine Parameter
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Garcia-Fernandez, A., Ligozat, AL., Dinarelli, M., Bernhard, D. (2011). When Was It Written? Automatically Determining Publication Dates. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_22
DOI: https://doi.org/10.1007/978-3-642-24583-1_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24582-4
Online ISBN: 978-3-642-24583-1
eBook Packages: Computer Science (R0)