Abstract
Indexing and structuring information on the Internet are prerequisites for effective search, that is, for finding the information that best matches a user's query. Current approaches rely mainly on text-based processing methods, which are time-consuming. The hyperlink structure of the web offers an alternative, but websites also contain information in non-text formats (images, movies, PDF files, etc.). These documents are intended primarily for human perception rather than for automated processing. In this article, we propose a method that addresses this problem through semantic marking of non-text documents based on their context in a hypertext clustering. We also develop an approach to context-independent semantic clustering of a website using web-analytics information; it exploits the internal hypertext structure and user behavior statistics and does not require full-text content analysis. To this end, we represent the hypertext structure of the site as a graph and apply flow simulation algorithms to cluster its pages. We then describe each cluster semantically by a set of keywords. Non-text documents are linked by hyperlinks to some of the web clusters, so we take the keywords extracted for the related cluster as the document's semantic marking. We tested the proposed method on the site sstu.ru.
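The pipeline outlined in the abstract can be illustrated with a minimal sketch. It assumes Markov clustering as one possible flow simulation algorithm; the abstract does not fix the algorithm or its parameters. The six-page link matrix, the per-page keyword lists, the "most linking pages" rule and all function names are hypothetical, chosen only to show the idea. Edge weights could equally be transition counts taken from web-analytics data.

```python
from collections import Counter

def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def markov_cluster(weights, inflation=2.0, iterations=30, eps=1e-6):
    """Cluster graph nodes by simulated flow (a simplified Markov clustering)."""
    n = len(weights)
    # Self-loops are a common convention; they keep every column non-zero.
    m = [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix, i.e. let the flow take two steps.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, which favours strong flows.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Read clusters as connected components of the flow that survived.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

def cluster_keywords(cluster, page_terms, top_k=3):
    """Describe a cluster by its most frequent page keywords."""
    counts = Counter(term for page in cluster for term in page_terms[page])
    return [term for term, _ in counts.most_common(top_k)]

def mark_document(linking_pages, clusters, page_terms, top_k=3):
    """Mark a non-text document with the keywords of the cluster that
    contains most of the pages linking to it (an assumed rule)."""
    best = max(clusters, key=lambda c: sum(1 for p in linking_pages if p in c))
    return cluster_keywords(best, page_terms, top_k)

# Hypothetical site: pages 0-2 and 3-5 form two triangles joined by link 2-3.
links = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Hypothetical keywords already extracted for each page.
page_terms = {
    0: ["admission", "exam"], 1: ["admission", "student"], 2: ["admission", "exam"],
    3: ["photo", "gallery"], 4: ["photo", "campus"], 5: ["photo", "gallery"],
}

clusters = markov_cluster(links)
# An image linked only from page 4 inherits the keywords of page 4's cluster.
marking = mark_document([4], clusters, page_terms, top_k=2)
print(clusters, marking)
```

The sketch uses plain Python lists to stay self-contained. For a real site, a sparse matrix representation and a pruning step would be needed to keep the expansion step tractable.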
Notes
- 1.
- 2. The hyperlink to the image “_MG_0878.JPG” on the page http://photo.sstu.ru/main.php?g2_itemId=889.
- 3. See, for example, https://www.ibm.com/blogs/policy/dataresponsibility-at-ibm/.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Papshev, S., Sytnik, A., Melnikova, N., Bogomolov, A. (2019). Semantic Marking Method for Non-text Documents of Website Based on Their Context in Hypertext Clustering. In: Dolinina, O., Brovko, A., Pechenkin, V., Lvov, A., Zhmud, V., Kreinovich, V. (eds) Recent Research in Control Engineering and Decision Making. ICIT 2019. Studies in Systems, Decision and Control, vol 199. Springer, Cham. https://doi.org/10.1007/978-3-030-12072-6_26
DOI: https://doi.org/10.1007/978-3-030-12072-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12071-9
Online ISBN: 978-3-030-12072-6
eBook Packages: Intelligent Technologies and Robotics (R0)