Abstract
This paper reviews the use of two statistical metrics for automatically identifying important keywords associated with a concept, such as a gene, by mining the scientific literature. Starting with a subset of MEDLINE® abstracts that contain the name or synonyms of a gene in their titles, the metrics contrast the prevalence of specific words in these documents against a broader “background set” of abstracts. If a word occurs substantially more often in the document subset associated with a gene than in the background set, which acts as a reference, then the word is viewed as capturing some specific attribute of the gene.
The keywords thus identified may be used as gene features in clustering algorithms. Since the background set is the reference against which keyword prevalence is contrasted, the authors hypothesize that different background document sets can lead to somewhat different sets of keywords being identified as specific to a gene. Two background sets are discussed that suit two somewhat different purposes, namely characterizing the function of a gene and clustering a set of genes based on their shared functional similarities. Experimental results that reveal the significance of the choice of background set are presented.
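The contrast between gene-specific abstracts and a background set can be illustrated with a small sketch. The abstract does not name the two metrics, so the code below uses a simple smoothed ratio of document frequencies as a stand-in; it is an assumption for illustration, not the authors' method. The toy documents, the function names `doc_fraction` and `prevalence_scores`, and the `smoothing` parameter are all hypothetical.

```python
from collections import Counter

def doc_fraction(docs, smoothing=0.0):
    """Return a function giving the (smoothed) fraction of docs containing a word."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))  # count each word once per document
    total = len(docs) + smoothing
    return lambda word: (counts[word] + smoothing) / total

def prevalence_scores(gene_docs, background_docs, smoothing=0.5):
    """Score each word in gene_docs by how much more prevalent it is than in the background.

    Illustrative ratio only (assumption): gene-set document fraction divided by
    the smoothed background document fraction.
    """
    gene_frac = doc_fraction(gene_docs)
    bg_frac = doc_fraction(background_docs, smoothing)
    vocab = {w for doc in gene_docs for w in doc.lower().split()}
    return {w: gene_frac(w) / bg_frac(w) for w in vocab}

# Hypothetical toy abstracts for one gene and two candidate background sets.
gene_docs = [
    "brca1 repairs dna damage",
    "brca1 mutation and dna repair",
    "brca1 in tumor suppression",
]
background_general = [
    "gene expression in tumor cells",
    "protein binds dna",
    "tumor growth in mice",
    "cell cycle and gene regulation",
]
background_cancer = [
    "tumor suppression by p53",
    "dna damage in tumor cells",
    "tumor mutation burden",
    "dna repair pathways",
]

scores_general = prevalence_scores(gene_docs, background_general)
scores_cancer = prevalence_scores(gene_docs, background_cancer)
```

With these toy inputs, a word such as "dna" scores higher against the general background than against the cancer-focused one, because it is more common in the latter. This mirrors the hypothesis that the choice of background set changes which words look specific to a gene.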
Notes
1. This is also sometimes called the collection frequency of the term in the set of documents, and counts the total number of occurrences of the term in all the documents of the collection. It differs from the document frequency of a term in a collection of documents in that the document frequency just counts how many documents contain the term (with no distinction on the number of occurrences).
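The distinction drawn in this note can be made concrete with a short sketch. The documents and function names are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
def collection_frequency(docs, term):
    """Total number of occurrences of term across all documents."""
    return sum(doc.lower().split().count(term) for doc in docs)

def document_frequency(docs, term):
    """Number of documents that contain term at least once."""
    return sum(1 for doc in docs if term in doc.lower().split())

# Hypothetical toy collection.
docs = [
    "kinase binds kinase",
    "kinase activity",
    "membrane transport",
]

cf = collection_frequency(docs, "kinase")  # counts every occurrence
df = document_frequency(docs, "kinase")    # counts documents, not occurrences
```

Here "kinase" occurs three times in total but in only two documents, so the two counts differ.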
Acknowledgments
The authors acknowledge that the MEDLINE® data used in this research are covered by a license agreement supported by the U.S. National Library of Medicine. Thanks are also due to Professor Rajnish Singh (Kennesaw State University) for her assistance in relation to evaluating the keywords for the various genes, and for her help in other ways related to this work.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Dasigi, V.G., Karam, O., Pydimarri, S. (2019). Impact of Context on Keyword Identification and Use in Biomedical Literature Mining. In: Arai, K., Bhatia, R., Kapoor, S. (eds) Proceedings of the Future Technologies Conference (FTC) 2018. FTC 2018. Advances in Intelligent Systems and Computing, vol 880. Springer, Cham. https://doi.org/10.1007/978-3-030-02686-8_38
DOI: https://doi.org/10.1007/978-3-030-02686-8_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02685-1
Online ISBN: 978-3-030-02686-8
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)