Abstract
Over the past 15 years the government has funded research in information extraction, with the goal of developing the technology to extract entities, events, and their interrelationships from free text for further analysis. A crucial component of linking entities across documents is the ability to recognize when different name strings are potential references to the same entity. Given the extraordinary range of variation international names can take when rendered in the Roman alphabet, this is a daunting task. This paper surveys existing technologies for name matching and for accomplishing pieces of the cross-document extraction and linking task. It proposes a direction for future work in which existing entity extraction, coreference, and database name matching technologies would be harnessed for cross-document coreference and linking capabilities. The extension of name variant matching to free text will add important text mining functionality for intelligence and security informatics toolkits.
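To make the core problem concrete, here is a minimal sketch of flagging two name strings as potential references to the same entity. It is not the method of the paper or of any surveyed system: it uses generic string similarity from Python's standard `difflib` module as a stand-in, and the function names (`normalize`, `similarity`, `is_potential_match`) and the 0.7 threshold are assumptions chosen only for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name, strip diacritics, and collapse punctuation and spacing."""
    # Decompose accented characters so the accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep letters; turn anything else (hyphens, apostrophes, digits) into a space.
    letters_only = "".join(
        ch if ch.isalpha() else " " for ch in without_marks.lower()
    )
    return " ".join(letters_only.split())


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] for how alike two normalized name strings are."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_potential_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as a candidate match. The threshold is an arbitrary assumption."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [("Muhammad", "Mohammed"), ("José García", "Jose Garcia"), ("Smith", "Nakamura")]
    for left, right in pairs:
        print(left, "|", right, "|", round(similarity(left, right), 2),
              "|", is_potential_match(left, right))
```

A character-level score like this one knows nothing about transliteration conventions, name order, or culture-specific naming patterns, which is the kind of variation the abstract describes as making the task daunting.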
© 2003 Springer-Verlag Berlin Heidelberg
Patman, F., Thompson, P. (2003). Names: A New Frontier in Text Mining. In: Chen, H., Miranda, R., Zeng, D.D., Demchak, C., Schroeder, J., Madhusudan, T. (eds) Intelligence and Security Informatics. ISI 2003. Lecture Notes in Computer Science, vol 2665. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44853-5_3
DOI: https://doi.org/10.1007/3-540-44853-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40189-6
Online ISBN: 978-3-540-44853-2
eBook Packages: Springer Book Archive