Abstract
Corpus management systems are widely used to solve problems of human-computer interaction. Many tools exist for managing language corpora, for example Sketch Engine [1], Manatee [2], and EXMARaLDA [3]. We developed a system that, on the one hand, accounts for specific features of Turkic languages and, on the other hand, offers new search functions and components.
The corpus management system “Tugan Tel” (http://tugantel.tatar) is designed specifically for the National Corpus of Tatar, but it can also be used with corpora of other Turkic languages and of non-Turkic languages. The system supports searching for lexical units, morphological and lexical search, searching for syntactic units, n-gram search, named entity extraction, and other functions.
The semantic model of Tatar language data representation is the core of the system. Corpus data are stored, processed, and searched using open source tools (the MariaDB DBMS and the Redis data store).
Developing a corpus management search engine involves three basic stages: developing the data model, developing the system architecture, and developing the database architecture. The collection and processing of corpus data must also be considered.
The main task of our research is to identify and describe solutions for storing, collecting, and processing corpus data. The developed data model can be used for supervised and unsupervised document classification, as well as for corpus exploration. The proposed solutions have been implemented in the corpus management system that is currently used to represent and process the data of the National Corpus of Tatar “Tugan Tel”.
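As a rough illustration of the kind of lookup that n-gram search over a key-value store implies, the following minimal sketch builds an in-memory index from n-grams to sentence identifiers. It is not the authors' data model: the function names, the whitespace tokenizer, the toy English sentences, and the use of a plain Python dictionary in place of a store such as Redis are all assumptions made for the example.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(sentences, n):
    """Map each n-gram to the sorted list of sentence ids that contain it.

    Hypothetical helper: a dict stands in for a key-value store here.
    """
    index = defaultdict(set)
    for sent_id, text in enumerate(sentences):
        # Assumption: lowercasing and whitespace splitting as tokenization.
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(sent_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


def search(index, query):
    """Look up a query whose token count equals the n used for the index."""
    key = tuple(query.lower().split())
    return index.get(key, [])


# Toy placeholder corpus (not real corpus data).
SENTENCES = [
    "the corpus stores annotated texts",
    "annotated texts are indexed by the system",
    "the system answers search queries",
]
INDEX = build_index(SENTENCES, 2)
```

With this toy data, `search(INDEX, "annotated texts")` returns the identifiers of the first two sentences. A production system would additionally need morphological annotation and persistent storage, which this sketch leaves out.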
References
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Suchomel, V.: The sketch engine: ten years on. Lexicography 1(1), 7–36 (2014)
Rychlý, P.: Manatee/Bonito – a modular corpus manager. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 65–70, December 2007
Schmidt, T., Wörner, K.: EXMARaLDA – creating, analyzing and sharing spoken language corpora for pragmatics research. Pragmat.-Q. Publ. Int. Pragmat. Assoc. 19(4), 565 (2009)
Memcached: A distributed memory object caching system. https://memcached.org/. Accessed 30 June 2018
MemcacheDB: Wikipedia. https://en.wikipedia.org/wiki/MemcacheDB. Accessed 30 June 2018
MemcacheDB: Bauman National Library. https://en.bmstu.wiki/MemcacheDB. Accessed 30 June 2018
Nelson, J.: Mastering Redis. Packt Publishing Ltd, Birmingham (2016)
How fast is Redis? – Redis. https://redis.io/topics/benchmarks. Accessed 30 June 2018
FoundationDB | Home. https://www.foundationdb.org/. Accessed 30 June 2018
Performance: FoundationDB 5.2. https://apple.github.io/foundationdb/performance.html. Accessed 30 June 2018
Sphinx | Open Source Search Engine. http://sphinxsearch.com/. Accessed 30 June 2018
Gormley, C., Tong, Z.: Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O’Reilly Media, Inc., Newton (2015)
Bartholomew, D.: Getting Started with MariaDB. Packt Publishing Ltd, Birmingham (2013)
Nevzorova, O., Mukhamedshin, D., Gataullin, R.: Developing corpus management system: architecture of system and database. In: Proceedings of the 2017 International Conference on Information and Knowledge Engineering. CSREA Press, United States of America, pp. 108–112 (2017)
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Mukhamedshin, D., Suleymanov, D., Nevzorova, O. (2020). Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments). In: Bouhlel, M., Rovetta, S. (eds) Proceedings of the 8th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT’18), Vol.1. SETIT 2018. Smart Innovation, Systems and Technologies, vol 146. Springer, Cham. https://doi.org/10.1007/978-3-030-21005-2_10
DOI: https://doi.org/10.1007/978-3-030-21005-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21004-5
Online ISBN: 978-3-030-21005-2
eBook Packages: Intelligent Technologies and Robotics (R0)