Abstract
GOT is a Python 3 software toolkit for taxonomic content analysis of text collections. The structure of the toolkit follows an in-house methodology for processing a collection of texts using a domain taxonomy. The method includes the following steps: (1) computing a matrix of relevance between texts and taxonomy leaf topics using a purely structural string-to-text relevance measure based on suffix trees that represent the texts and are annotated with substring frequencies, (2) obtaining fuzzy clusters of taxonomy leaf topics using a method involving both additive and spectral properties, and (3) finding the most specific generalizations of the fuzzy clusters in the rooted tree of the taxonomy. Such a generalization parsimoniously lifts a cluster to its “head subject” in the higher ranks of the taxonomy, covering the cluster tightly by minimizing the number of errors of two kinds, “gaps” and “offshoots”. The efficiency of this methodology was illustrated in an analysis of research tendencies in Data Science: the findings led to insights into research tendencies that could not be derived using more conventional techniques. The toolkit can be used either as a whole or through its individual modules, including a visualization module. The GOT toolkit supports two usage scenarios: (a) a console mode for use from the command line and (b) an import mode for use in Python 3 source code.
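Step (3) can be illustrated with a minimal sketch. It is not the GOT implementation and does not use its API. It assumes a crisp (non-fuzzy) cluster, a single head subject, a toy taxonomy with hypothetical topic names, and illustrative penalty weights (`gap_penalty`, `offshoot_penalty`); all function names are made up for this example. Here a gap is counted as a maximal subtree under the head subject that contains no cluster topic, and an offshoot as a cluster topic lying outside the head subject.

```python
# Toy taxonomy as a child -> parent map (hypothetical topic names).
PARENT = {
    "ml": "root", "db": "root",
    "clustering": "ml", "classification": "ml", "regression": "ml",
    "indexing": "db", "transactions": "db",
}


def children(node):
    """Direct children of a node, in a stable order."""
    return sorted(c for c, p in PARENT.items() if p == node)


def leaves_under(node):
    """Set of leaf topics in the subtree rooted at `node`."""
    kids = children(node)
    if not kids:
        return {node}
    result = set()
    for kid in kids:
        result |= leaves_under(kid)
    return result


def count_gaps(node, cluster):
    """Count maximal subtrees under `node` that contain no cluster leaf."""
    if not (leaves_under(node) & cluster):
        return 1  # the whole subtree is one gap
    return sum(count_gaps(kid, cluster) for kid in children(node))


def lift(cluster, gap_penalty=0.2, offshoot_penalty=0.9):
    """Pick the single head subject with the lowest total penalty.

    Penalty = 1 (for the head) + gap_penalty * gaps
              + offshoot_penalty * offshoots.
    The weights are assumptions chosen for illustration only.
    Returns (head, number_of_gaps, number_of_offshoots).
    """
    best, best_penalty = None, float("inf")
    for head in ["root"] + sorted(PARENT):
        covered = leaves_under(head)
        if not (covered & cluster):
            continue  # a head must cover at least one cluster leaf
        gaps = count_gaps(head, cluster)
        offshoots = len(cluster - covered)
        penalty = 1 + gap_penalty * gaps + offshoot_penalty * offshoots
        if penalty < best_penalty:
            best, best_penalty = (head, gaps, offshoots), penalty
    return best


print(lift({"clustering", "classification"}))
```

With this toy taxonomy, the cluster {clustering, classification} is lifted to `ml` at the cost of one gap (`regression`) and no offshoots. Lifting it all the way to `root` would add a second gap (`db`), while keeping a single leaf as the head would leave the other leaf as an offshoot.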
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Frolov, D., Mirkin, B. (2021). GOT: Generalization over Taxonomies, a Software Toolkit for Content Analysis with Taxonomies. In: Rocha, Á., Adeli, H., Dzemyda, G., Moreira, F., Ramalho Correia, A.M. (eds) Trends and Applications in Information Systems and Technologies. WorldCIST 2021. Advances in Intelligent Systems and Computing, vol 1366. Springer, Cham. https://doi.org/10.1007/978-3-030-72651-5_49
DOI: https://doi.org/10.1007/978-3-030-72651-5_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72650-8
Online ISBN: 978-3-030-72651-5
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)