Abstract
A single logical entity can be referred to by several different names across a large text corpus. We present an algorithm for finding all such co-reference sets in a large corpus. Our algorithm involves three steps: morphological similarity detection, contextual similarity analysis, and clustering. Finally, we present experimental results on a large corpus of real news text to analyze the performance of our techniques.
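The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measures or the clustering method, so everything below is an assumption chosen for illustration. Morphological similarity is approximated with `difflib.SequenceMatcher`, contextual similarity with cosine similarity over bag-of-words counts, and clustering with connected components via union-find. The function names, thresholds, and toy contexts are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from math import sqrt


def morphological_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def contextual_similarity(ctx_a: Counter, ctx_b: Counter) -> float:
    """Cosine similarity between two bag-of-words context vectors."""
    dot = sum(ctx_a[w] * ctx_b[w] for w in ctx_a.keys() & ctx_b.keys())
    norm_a = sqrt(sum(v * v for v in ctx_a.values()))
    norm_b = sqrt(sum(v * v for v in ctx_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_names(contexts, morph_threshold=0.6, ctx_threshold=0.5):
    """Link name pairs that pass both thresholds (assumed values),
    then return the connected components as candidate co-reference sets."""
    parent = {name: name for name in contexts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(contexts, 2):
        if (morphological_similarity(a, b) >= morph_threshold
                and contextual_similarity(contexts[a], contexts[b]) >= ctx_threshold):
            parent[find(a)] = find(b)

    groups = {}
    for name in contexts:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy data (hypothetical): each name maps to counts of words seen near it.
contexts = {
    "George Bush": Counter("president white house iraq".split()),
    "George W. Bush": Counter("president white house texas".split()),
    "Kate Bush": Counter("singer album music tour".split()),
}
clusters = cluster_names(contexts)
```

On this toy input the two "George" variants fall into one set because they are similar both in spelling and in context, while "Kate Bush" stays alone because its context shares no words with the others. The all-pairs loop is quadratic in the number of names, so a corpus-scale system would need to restrict which pairs are compared.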
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lloyd, L., Mehler, A., Skiena, S. (2006). Identifying Co-referential Names Across Large Corpora. In: Lewenstein, M., Valiente, G. (eds) Combinatorial Pattern Matching. CPM 2006. Lecture Notes in Computer Science, vol 4009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780441_3
DOI: https://doi.org/10.1007/11780441_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35455-0
Online ISBN: 978-3-540-35461-1
eBook Packages: Computer Science (R0)