Semantic Marking Method for Non-text Documents of Website Based on Their Context in Hypertext Clustering

Papshev, Sergey; Sytnik, Alexander; Melnikova, Nina; Bogomolov, Alexey

doi:10.1007/978-3-030-12072-6_26

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 199)

Included in the following conference series:

International Conference on Information Technologies


Abstract

Initial indexing and structuring of information on the Internet are prerequisites for effective search, that is, for finding the information that best matches a user's query. Most existing approaches rely on text-based processing methods, which are time-consuming. The hyperlinked structure of the web offers an alternative, but websites also contain information in non-text formats (images, movies, PDF files, etc.). Such documents are intended primarily for human perception rather than for automated processing. In this article, we propose a method that addresses this problem by semantically marking non-text documents based on their context in hypertext clustering. We also develop an approach to context-independent semantic clustering of a website using web-analytics information; it relies on the internal hypertext structure and user behavior statistics and does not require full-text content analysis. To this end, we represent the hypertext structure of the site as a graph and apply flow simulation algorithms to cluster its pages. We then describe each cluster semantically by a set of keywords. Non-text documents are connected by hyperlinks to some web clusters, so we take the keywords extracted for the related cluster as the semantic marking of the document. We tested the proposed method on the site sstu.ru.
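The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the page names, keyword lists, links, and file names below are hypothetical, hyperlinks are treated as undirected, web-analytics weights are omitted, and a simplified Markov-clustering-style procedure (expansion followed by inflation) stands in for the flow simulation step.

```python
from collections import Counter

# Hypothetical site: page name -> keywords assumed to be known for that page.
pages = {
    "admission": ["admission", "students"],
    "programs": ["admission", "degree"],
    "tuition": ["admission", "fees"],
    "gallery": ["photo", "campus"],
    "events": ["photo", "events"],
    "museum": ["photo", "campus"],
}
# Hypothetical hyperlinks between pages (treated as undirected here).
links = [("admission", "programs"), ("programs", "tuition"), ("admission", "tuition"),
         ("gallery", "events"), ("events", "museum"), ("gallery", "museum"),
         ("tuition", "gallery")]
# Hypothetical non-text documents -> pages that link to them.
non_text = {"photo_001.jpg": ["gallery", "events"], "price_list.pdf": ["tuition"]}

names = list(pages)
index = {name: i for i, name in enumerate(names)}


def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def flow_clusters(n, edges, inflation=2.0, iterations=40, eps=1e-6):
    """Simplified flow simulation; returns one cluster label per node."""
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for a, b in edges:
        m[a][b] = m[b][a] = 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: flow spreads along two-step paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strong flows are boosted, weak flows fade out.
        m = normalize_columns([[x ** inflation for x in row] for row in m])

    # Nodes that still exchange flow are merged into one cluster (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


labels = flow_clusters(len(names), [(index[a], index[b]) for a, b in links])

# Describe each cluster by its two most frequent page keywords.
cluster_keywords = {}
for root in set(labels):
    counts = Counter(kw for name, lab in zip(names, labels) if lab == root
                     for kw in pages[name])
    cluster_keywords[root] = [kw for kw, _ in counts.most_common(2)]

# Mark each non-text document with the keywords of the clusters linking to it.
marks = {}
for doc, sources in non_text.items():
    marks[doc] = []
    for page in sources:
        for kw in cluster_keywords[labels[index[page]]]:
            if kw not in marks[doc]:
                marks[doc].append(kw)

print(marks)
```

In this toy graph the two densely linked groups of pages are joined by a single hyperlink, so the flow simulation separates them, and each file inherits the keywords of the cluster whose pages link to it. A real site would need weighted, directed links and a keyword extraction step, which the sketch leaves out.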

Notes

1. http://www.internetlivestats.com/.
2. The hyperlink to the image “_MG_0878.JPG” on the page http://photo.sstu.ru/main.php?g2_itemId=889.
3. See, for example, https://www.ibm.com/blogs/policy/dataresponsibility-at-ibm/.

Author information

Authors and Affiliations

Yuri Gagarin State Technical University of Saratov, Politehnicheskaya Street 77, Saratov, Russia
Sergey Papshev, Alexander Sytnik & Nina Melnikova
Institute of Precision Mechanics and Control, Russian Academy of Sciences, Rabochaya Street 24, Saratov, Russia
Alexey Bogomolov

Corresponding author

Correspondence to Sergey Papshev.

Editor information

Editors and Affiliations

Institute of Applied Information Technologies and Communication, Yuri Gagarin State Technical University of Saratov, Saratov, Russia
Olga Dolinina
Institute of Applied Information Technologies and Communication, Yuri Gagarin State Technical University of Saratov, Saratov, Russia
Alexander Brovko
Institute of Applied Information Technologies and Communication, Yuri Gagarin State Technical University of Saratov, Saratov, Russia
Vitaly Pechenkin
Institute of Applied Information Technologies and Communication, Yuri Gagarin State Technical University of Saratov, Saratov, Russia
Alexey Lvov
Department of Automation, Novosibirsk State Technical University, Novosibirsk, Russia
Vadim Zhmud
Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Vladik Kreinovich

About this paper

Cite this paper

Papshev, S., Sytnik, A., Melnikova, N., Bogomolov, A. (2019). Semantic Marking Method for Non-text Documents of Website Based on Their Context in Hypertext Clustering. In: Dolinina, O., Brovko, A., Pechenkin, V., Lvov, A., Zhmud, V., Kreinovich, V. (eds) Recent Research in Control Engineering and Decision Making. ICIT 2019. Studies in Systems, Decision and Control, vol 199. Springer, Cham. https://doi.org/10.1007/978-3-030-12072-6_26

DOI: https://doi.org/10.1007/978-3-030-12072-6_26
Published: 29 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12071-9
Online ISBN: 978-3-030-12072-6
eBook Packages: Intelligent Technologies and Robotics (R0)
