Abstract
Word sense disambiguation (WSD) is crucial in natural language processing. Both unsupervised knowledge-based and supervised methods try to disambiguate ambiguous words through context. However, both suffer from data sparsity, a common problem in natural language, and supervised methods have long been limited in all-words WSD tasks. This paper collects all publicly available contexts to enrich the sense representations of ambiguous words and applies these contexts to the simplified Lesk algorithm and to our M-IMS system. Evaluations on the concatenation of several benchmark fine-grained all-words WSD datasets show that simplified Lesk improves significantly, by 9.4%, and that our M-IMS improves as well.
1 Introduction
Word sense disambiguation (WSD) is an open problem in natural language processing: identifying the sense of a word used in a given context. It is considered a cornerstone of machine translation, information extraction and retrieval, parsing, and question answering. Unfortunately, all WSD methods depend heavily on knowledge sources such as text corpora, which may be unlabeled or annotated with word senses [1]. Inevitably, these knowledge sources all suffer from data sparsity to varying degrees. Beyond sparsity, it is commonly agreed that supervised methods are restricted in all-words tasks because labeled data for the full lexicon is sparse and difficult to obtain [2], while knowledge-based methods, which require only an external knowledge source, are better suited to all-words tasks [4]. In summary, this paper is chiefly concerned with data sparsity and the adaptability of supervised algorithms. Accordingly, our two main contributions are as follows:
-
We relieve data sparsity by assembling almost all publicly available contextual texts from different corpora.
-
We modify It Makes Sense (IMS) [7] by embedding a knowledge-based method, so that the knowledge-based classifier takes over whenever the supervised classifier cannot be applied.
2 Methodology
2.1 Corpora Sources
The first main point of this paper is the use of more corpora, with massive instance sentences uniformly annotated against a single sense repository. We use five publicly available corpora annotated with WordNet senses: WordNet itself, SemCor [3], OMSTI [6], MASC, and GMB. WordNet serves not only as the lexical dictionary and sense repository but also as a source of example sentences.
2.2 M-IMS
Preprocessing and Feature Extraction. Preprocessing converts the varied texts from the different corpora into formatted instance sentences. In contrast to IMS, we include two additional procedures: Standardization and Sense Mapping. Standardization unifies the formats and preserves text with POS tags, annotations, and lemmas, while Sense Mapping resolves the annotation-version problem by keying annotations on the sense key.
Feature Extraction is conducted on the massive contexts (MC) in the same way as in IMS. One small modification to the surrounding-words feature is that surrounding words are drawn only from the current sentence, not from adjacent sentences, because we disambiguate ambiguous words at the sentence level.
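The sentence-restricted surrounding-words feature can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the small stop-word list are assumptions for the example.

```python
# Sketch of the surrounding-words feature, restricted to the current
# sentence (unlike IMS, which may also look at adjacent sentences).
# `extract_surrounding_words` and STOP_WORDS are illustrative names.

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "are"}

def extract_surrounding_words(tokens, target_index):
    """Collect lowercase word features from the same sentence,
    excluding the target word itself and stop words."""
    features = []
    for i, tok in enumerate(tokens):
        if i == target_index:
            continue
        tok = tok.lower()
        if tok.isalpha() and tok not in STOP_WORDS:
            features.append(tok)
    return features

tokens = ["The", "bank", "raised", "its", "interest", "rate"]
print(extract_surrounding_words(tokens, 1))
# ['raised', 'its', 'interest', 'rate']
```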
Classification. The other major contribution of this paper lies in the modification made here. Classification comprises three components: Supervised Classification, a Decision Component, and Knowledge-Based Classification.
Supervised Classification and Knowledge-Based Classification. The supervised classification part is almost the same as the classifier in IMS. For the knowledge-based part, we select simplified Lesk as the disambiguation algorithm. The way simplified Lesk computes similarity, by counting the overlap between a sense's gloss and the context, fits the characteristics of the MC well.
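The gloss-context overlap at the heart of simplified Lesk can be sketched as below. The toy sense inventory is illustrative (real glosses would come from WordNet, extended with the MC's example sentences); only the overlap-scoring logic reflects the algorithm named above.

```python
# Minimal sketch of simplified Lesk: score each candidate sense by the
# word overlap between its gloss signature and the target word's
# context, and return the highest-scoring sense.

def simplified_lesk(context_tokens, sense_inventory):
    """sense_inventory maps a sense id to its gloss (plus examples)."""
    context = set(t.lower() for t in context_tokens)
    best_sense, best_score = None, -1
    for sense, gloss_text in sense_inventory.items():
        signature = set(gloss_text.lower().split())
        score = len(context & signature)  # overlap count
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy inventory for the noun "bank" (glosses are illustrative).
senses = {
    "bank%1": "sloping land beside a body of water river",
    "bank%2": "financial institution that accepts deposits money interest",
}
ctx = "the bank raised its interest rate on deposits".split()
print(simplified_lesk(ctx, senses))  # bank%2
```

Enriching each sense's signature with the massive contexts increases the chance of a non-zero overlap, which is why the method pairs naturally with the MC.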
Decision Component. The rhombus with a question mark in Fig. 1 represents the decision component. It determines whether the knowledge-based method is introduced into the disambiguation. We suggest two boundary conditions for this decision:
-
Strict condition: the decision outputs yes (y) only if the annotated instances for a word cover all senses of that word with the same part of speech; otherwise it outputs no (n).
-
Loose condition: the decision outputs yes as long as the annotated instances for a word cover at least one sense of that word with the same part of speech.
This paper adopts a relatively loose setting: as long as the annotated instances for a word cover at least two senses of that word with the same part of speech, we consider the trained model helpful.
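The adopted setting can be sketched as a simple coverage check; the function and parameter names below are illustrative, not from the paper.

```python
# Sketch of the decision component under the adopted setting: route a
# target word to the supervised classifier only if the training
# annotations cover at least `min_covered` of its senses with the
# matching part of speech; otherwise fall back to simplified Lesk.

def use_supervised(annotated_senses, senses_with_pos, min_covered=2):
    """annotated_senses: senses seen in training for this lemma/POS.
    senses_with_pos: all senses of the word with that POS."""
    covered = set(annotated_senses) & set(senses_with_pos)
    return len(covered) >= min_covered

# Two senses of the target POS covered by training data -> supervised.
print(use_supervised({"s1", "s2"}, {"s1", "s2", "s3"}))  # True
# Only one sense covered -> delegate to the knowledge-based classifier.
print(use_supervised({"s1"}, {"s1", "s2", "s3"}))        # False
```

Setting `min_covered` to the full sense count recovers the strict condition, and setting it to 1 recovers the loose condition.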
3 Experiments and Results
The first experiment shows the ability of massive contextual texts to relieve data sparsity. The second compares M-IMS against IMS, simplified Lesk, and other baselines. We use the concatenation of the five standardized datasets (Sem-Union) from [5] as the test dataset.
3.1 Results
Table 1 shows that contextual texts such as instance sentences are well suited to the word-overlap matching pattern. Furthermore, the increment contributed by our MC provides annotations for previously uncovered senses and relieves data sparsity to a certain degree.
Table 2 shows that simplified Lesk with MC obtains a much better performance, pushing overlap-based algorithms to a new high. Moreover, M-IMS uniformly outperforms IMS both on SemCor and on MC, though not by a significant margin, implying that the performance of the knowledge-based algorithm still needs to be improved.
4 Conclusion
This paper mainly addresses data sparsity in WSD with massive contexts and the adaptability of supervised methods. Note that this work is still in progress; we shall release MC in later work, along with a relevant API and detailed documentation, to enable various applications.
References
Borah, P.P., Talukdar, G., Baruah, A.: Approaches for word sense disambiguation-a survey. IJRTE 3(1), 35–38 (2014)
Chaplot, D.S., Salakhutdinov, R.: Knowledge-based word sense disambiguation using topic models. arXiv preprint arXiv:1801.01900 (2018)
Miller, G.A., Leacock, C., Tengi, R., Bunker, R.T.: A semantic concordance. In: Proceedings of the Workshop on Human Language Technology, pp. 303–308. ACL (1993)
Miller, T., Biemann, C., Zesch, T., Gurevych, I.: Using distributional similarity for lexical expansion in knowledge-based word sense disambiguation. In: Proceedings of the 24th COLING, pp. 1781–1796 (2012)
Raganato, A., Camacho-Collados, J., Navigli, R.: Word sense disambiguation: a unified evaluation framework and empirical comparison. In: Proceedings of the 15th Conference of the EACL, vol. 1, pp. 99–110 (2017)
Taghipour, K., Ng, H.T.: One million sense-tagged instances for word sense disambiguation and induction. In: Proceedings of the 19th CoNLL, pp. 338–344 (2015)
Zhong, Z., Ng, H.T.: It makes sense: a wide-coverage word sense disambiguation system for free text. In: Proceedings of the ACL 2010 System Demonstrations, pp. 78–83. ACL (2010)
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (61772288), and the Natural Science Foundation of Tianjin City (18JCZDJC30900).
Liu, Yf., Wei, J. (2019). Word Sense Disambiguation with Massive Contextual Texts. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science(), vol 11448. Springer, Cham. https://doi.org/10.1007/978-3-030-18590-9_60