1 Introduction

The web keeps growing its vast repository of web resources. A web resource is any identifiable thing on the web (e.g. images, videos, scientific articles, items for sale, etc.). The availability and accessibility of knowledge on the web have influenced user search behaviour. Web users browse by following one piece of available content to the next, usually without a planned search strategy. Thus, they tend to move on quickly from one web resource to another when its content is hard to understand or not directly useful [1]. The lack of a complete indexing or classification of web resources decreases their discoverability and findability. This has drawn attention to the importance of extracting pertinent descriptive information from the growing set of shared web resources to enhance their classification. Therefore, it is relevant to describe each web resource with descriptive “metadata” that express clear and meaningful information by pertinently summarizing its content. In traditional libraries, professional indexers or domain experts use controlled vocabularies to assign terms (“experts’ keywords”) that appropriately identify the main topic of a resource. However, the owners of large sets of various web resources prefer using advanced automatic technologies for the classification process, because relying on professional indexers to maintain the classification of rapidly spreading shared web resources is a costly and intensive task. The need for automatic and semi-automatic processes for expressing a web resource’s main topic has increased throughout this technological century. For instance, extracting the main keywords from a resource’s textual content involves text mining techniques such as natural language processing tools. 
However, the experts’ terms and the content-based main keywords describing a web resource can be incomprehensible to users. The description should therefore also contain non-expert annotations, such as the users’ freely chosen keywords, called tags. Social annotation services (folksonomies) enable users to order, locate and re-find their web resources by themselves. The generated folks’ tags collaboratively classify the shared web resources. Folksonomy denotes the process of using users’ tags to classify different types of web resources. It is also known as collaborative tagging, social classification and social indexing. For example, CiteULike users employ freely chosen tags to share and classify their reference lists. Rather than including only annotations of experts, the use of annotations from non-expert or novice users leads to more comprehensive folksonomies [2]. Furthermore, recent semantic web researchers believe that collaborative tagging is a more reliable knowledge source than free text [3]. Tagging has been popularized by well-known web-based systems such as Flickr, CiteULike, YouTube, del.icio.us and Instagram. Web users assign tags to annotate various types of resources, including images, videos and audio. It is an effective technique that expresses the wisdom of the crowd [4]. Different aspects of folksonomy have been explored in information retrieval [5], social network analysis [6], data mining [7], recommendation systems [8,9,10,11,12], and others.

Despite its popularity, folksonomy lacks semantics [13]. The tags (“folks’ keywords”) are derived from an uncontrolled and unsupervised vocabulary. Social tagging therefore produces inconsistent and ambiguous tags, and irrelevant tags lead to misunderstanding of web resources. Besides the misspelling, synonymy and polysemy of tags, and their infrequency and uncommonness, abbreviations also weaken the description of web resources. For instance, the abbreviation “Ca” has several meanings, such as calcium and cancer. The word “plethora” means a large amount of something, but in medicine it also denotes an excess of a bodily fluid or blood. The lack of semantics paves the way for irrelevant annotations that weaken the web resources’ semantic description and therefore their classification.

This article aims to pertinently describe the semantics of web resources by using collaborative tagging and ontologies. The purpose is to enhance the descriptive annotations of web resources by addressing folksonomy’s weaknesses. Indeed, relevant annotations (“metadata”) will not only improve the semantic description of web resources but will also enhance their clustering and organization. The proposed approach combines semantic annotation strategies to increase the comprehension of web resources. It relies on constructing an emergent semantic of web resources by efficiently gathering their relevant descriptive metadata. This paper exploits the advantages of folksonomy and ontology to extract relevant descriptors (“metadata”) of web resources, namely relevant folksonomy tags, content-based main keywords and matching ontology terms.

The rest of the paper is organized as follows: Sect. 2 presents the motivating applications and the purpose of this paper within the overall research challenges. The related work is reviewed in Sect. 3. Section 4 depicts the proposed approach of the combined emergent semantic strategy to extract pertinent descriptive metadata of web resources. The experimental evaluation is described in Sect. 5. Section 6 presents different alternatives and perspectives of comparing the semantic similarity of web resources. Finally, the conclusion and future directions are delineated in Sect. 7.

Fig. 1 General architecture of the overall research study

2 Research challenges and motivating applications

The main challenges of our research study are addressed within a global project deployed on three levels (see Fig. 1). One of the main motivations is to construct an emergent semantic of web resources (see Fig. 1, Level I). It consists of combining semantic annotation strategies by investigating collaborative tagging [13]. The purpose is to enrich the description of web resources with a combined semantic annotation. Instead of annotating the resources with ontology terms only, we aim to investigate the extent to which collaborative tagging can enhance the resources’ description, comprehension and categorization. The descriptive semantic of each web resource will emerge from different types of descriptive metadata, namely the relevant tags from the folksonomy, the extracted content-based main keywords and the matching terms from a domain ontology. To illustrate this approach, we consider a healthcare scenario. Social media is a powerful tool for raising awareness and advocacy regarding public health issues [14]. Patients can benefit from social media services through networking, exchanging relevant information and receiving medical support. Healthcare leaders are aware of the importance of sharing and spreading knowledge through social interactions. Physicians participate in online communities to communicate and interact with their colleagues and patients [15]. However, social media tools, like folksonomy, present potential risks to patients and healthcare professionals regarding the distribution of poor-quality information [16]. The unsupervised nature of folksonomy tags may reduce their effectiveness in describing interesting resources, thereby hindering the classification and indexing of resources and the users’ searching. 
Therefore, it would be useful to improve web resource description by applying the proposed combined semantic annotation method, which uses not only relevant folksonomy tags but also content-based main keywords and matching ontology terms. The descriptive semantic emerges from the wisdom of the healthcare professionals (ontologies) and the folks’ interactions (folksonomies) describing health-related resources. The emergent semantic of web resources will enhance their organization and clustering using semantic similarities. Consequently, it will increase the chances of discovering interesting resources that users might not yet have come across in their searches. This semantic relatedness of web resources will improve information filtering systems, such as recommender systems, that assist users in selecting relevant resources that best meet their needs and preferences. The emergent semantic (see Fig. 1, Level I) will be used to enhance the context-aware recommender system (CARS) of web resources (see Fig. 1, Level II). A recommender system is a leading tool for speeding up information seeking by retrieving the most relevant items from large information sets. Recommender systems usually employ collaborative filtering (CF), content-based (CB) and hybrid recommendation methods [9]. CF analyzes the behaviours of users (e.g. rating, tagging and liking items) to filter the items of users with similar preference patterns. The CB filtering approach focuses on the content of items (e.g. their keywords, features and characteristics) to suggest items similar to those the user previously preferred. A hybrid recommender system combines two or more filtering approaches. Context awareness in recommender systems filters items based on contextual information provided by the application domain. 
The context is any useful information that has an impact on the users’ interactions with the system [17]. The contextual information can be static (e.g. the user’s date and place of birth, gender and ethnicity) or dynamic (e.g. location, time, the user’s family status and activities). The context information can directly affect the recommendations. For example, in the tourism domain, a user will be interested in visiting a particular site depending not only on his preferences but also on the weather, the time, the proximity and even the season. In the healthcare domain, recommendations based on the user’s preferences might conflict with the user’s health conditions: the system should not recommend nearby candy stores to a diabetic person who likes sugary foods. The recommender system’s computational process incorporates the contextual information in the definition of the features characterizing the item (or resource) and the user profiles. For example, the contextual features can be the location (longitude and latitude data) of both the available tourist places (static contextual information) and the user (dynamic contextual information). The contextual filtering strategy defines the contextual information as features joined to the emergent semantic describing each item to enhance its significance. Therefore, it reduces the search effort of item filtering by discarding part of the available items matching the user’s profile. The closest items to the user’s preferences are selected by computing user-user, item-item and item-user similarities, since the item and user profiles share the same feature space (e.g. the user’s profile is described as a vector of his contextual information and the assigned tags exposing his preferences for certain items; the item’s profile is described as a vector of its contextual information and its emergent descriptive semantic (metadata: relevant folksonomy tags, main keywords and matching terms)). The emergent semantic of resources (or items) makes it possible to construct and explore clusters of semantically related items annotated by a particular user, then extract the tags he used to describe them in order to maintain the specificity of the user profile vector corresponding to each domain. CARS has a great impact on facilitating decision making in many real-world applications. The semantic-based CARS will be deployed in the education, tourism and healthcare application domains (see Fig. 1, Level III).

In tourism, establishing a semantic-based context-aware recommender system will enhance the promotion of cultural heritage by suggesting historical places that suit the visitor’s interests. For example, the CARS filters items by considering the similarity of visitors (based on shared age, ethnicity, gender and assigned tags describing a visited place), the similarity of historical places (based on their similar descriptive annotations) and the geographic proximity (based on the contextual location information). In education, collaborative tagging is an adequate meta-cognitive strategy that successfully engages learners in the learning process [18]. Folksonomy tags add semantics, comprehensible to learners, describing open educational resources (OER) (freely accessible and openly licensed texts, media, e-books, online videos, tutorials, reading reports, etc.). Using collaborative tagging to construct the emergent semantic of educational resources will improve their recommendation. The generated folksonomy will enhance the closeness between the user (the tags’ provider) and the items (described by the user’s tags). In healthcare, the CARS recommends resources about symptoms and therapies, raising awareness and providing useful guidelines to the appropriate end users (the patient, his family and close friends). For health professionals, the health recommender system is a decision support system. Organizing patients’ electronic health records (EHRs) annotated with relevant descriptive metadata (extracted content-based keywords, tags assigned by physicians, matching medical terms) will aid healthcare professionals in decision-making. The semantic annotations describing patients’ EHRs will cluster patients having the same health issues. 
For example, the emergent semantic (Level I) based CARS (Level II) will detect fitting similarities between patients and their archived EHRs, then generate meaningful recommendations for a diabetic patient’s case to prevent complications in diabetes mellitus. This paper mainly focuses on the first step of the proposed architecture (see Fig. 1, Level I and Fig. 2).

Fig. 2 Combined emergent semantic annotation approach

3 Related work

Web resources are annotated with different kinds of short descriptors (“metadata”) depending on whether the main topic originates from text content (keywords), controlled vocabularies (terms) or collaborative tagging systems (tags) [19].

Each web resource usually holds rich text content. Data mining algorithms can extract information from it to retrieve the resource’s relevant keywords. Content-based annotation has the advantage of enabling an automatic keyword extraction process independent of human involvement. The content-based annotation strategy relies on keyphrase or keyword extraction methods that derive main keywords from the web resource’s text content. Existing online RESTful APIs (“semantic annotators”) analyze a text to identify its relevant sequences of words and link them to pertinent Wikipedia pages. However, they are unable to outperform keyword extractors [20]. Automatic keyword extraction approaches (“extractive summary”) are classified into four categories, namely simple statistical, linguistic, machine learning and hybrid approaches [21]. Keyword extraction is improved by machine learning models that combine several features. Two competing methods are the hybrid genetic algorithm GenEx [22] (“Genitor and Extractor”) and KEA [23] (“Keyphrase Extraction Algorithm”), which generate and filter candidates based on the weights of their features. More attention has been given to KEA for its open availability and simplicity of use [24, 25]. These keyword extraction methods have achieved impressive results but require training data. Unsupervised extraction techniques use heuristic filtering to compensate for the lack of training data, relying on complex analysis such as shallow parsing (deep analytics) or on domain-independent statistical methods such as KP-Miner [26]. The main disadvantage of the content-based annotation method is the limited consistency of the resulting keywords, which are based only on the description given by the web resources’ authors. Even though it offers a certain flexibility without a controlled vocabulary, it lacks semantics (e.g. unclustered synonyms).

An expert has an advanced, high level of knowledge about a particular domain [27]. The terms assigned by professional indexers construct a controlled vocabulary (i.e. ontologies and thesauri) depicting a strong knowledge representation by expressing semantic relations [28]. The controlled vocabulary-based annotation method is called the term assignment or subject indexing method. It uses a controlled vocabulary to select the terms that best match the resources’ description. The annotation process tries to find mappings between the web resource’s candidate terms and the concepts’ terms in the controlled vocabulary. It expresses the web resource’s descriptive metadata extracted from knowledge-based concepts. Consequently, the classification of web resources can make use of the semantic relationships in the ontology to achieve enhanced categorization, for example by exploring the relationships among broader or more specific concepts. The term assignment method has been applied in different areas of knowledge organization and retrieval. The Gene Ontology (GO) provides the logical structure of biological terms and their relationships. This bioinformatics initiative maintains the GO annotations relating a specific gene product to a specific ontology term [29]. The authors in [30] used a physician-annotated corpus to identify, extract and rank medical terms from patients’ electronic health record (EHR) notes. The semantic-based recommender system HealthRecSys [10] provides relevant educational health websites to complement the selected health videos. The algorithm selects candidate terms from the textual content of diabetes-related videos and cross-matches them with Bio-Ontology terms. 
Recent methods for automatically identifying a resource’s terms are based on large web knowledge repositories such as Wikipedia, either by constructing a Wikipedia Hierarchical Ontology (WHO) [31] or by using a probabilistic model based on the DBpedia hierarchical model [32]. Other works [33, 34] relied on semantic technology to build a classification and indexing system for web resources (respectively, sports images and building information modeling (BIM) resources). They used ontology theory to semantically describe web resources and thus facilitate their search and retrieval. However, users can only employ the provided concepts’ terms to describe their web resources. Annotating web resources with terms extracted from the controlled vocabulary can cause misunderstanding and incomprehension among non-expert and novice users.

Social tagging has the advantage of producing tags on a large scale. The purpose of the collaborative tagging approach is to generate tags matching the human understanding of the web resource (“abstractive summary”). The authors in [35] consider the large number of user-generated tags on social tagging systems to produce a social classification of web resources. Social tags are helpful for identifying the users’ preferences and the resources’ characteristics. Exploring the tags’ information and their interactions allows the recommendations to be adjusted dynamically [36]. However, collaborative tagging suffers from the inconsistency of tags: polysemous and synonymous tags [37]. A hybrid approach [38] exploited social annotations to describe resources by relating tags to concepts from WordNet and Wikipedia. This strategy associates tags with conceptual entities to improve web resources’ classification. Another alternative to address the tags’ inconsistency problem is to use automatic tag suggestions [19]. Tag recommendations limit the redundancy and ambiguity of tags. A tag recommender system controls the wide variety of tags and reduces the cognitive effort required to assign them. The authors in [39] came up with a method based on user tagging status to improve the quality of tag recommendations. However, they investigated the archived tagging behaviours of users without considering the new user status. Most tag recommendation techniques find similarly tagged resources, then rank the selected collection of tags. This strategy restricts the suggestions to pre-existing tags. Similar approaches have been adopted by combining multiple features (“tag frequency, co-occurrence and document similarity”) [40]. Almost none of the research in the tagging field has explored term assignment and keyword extraction methods to compensate for the failures of tagging methods.

From this analysis, there are three approaches to assigning descriptive annotations (see Table 2). They address the descriptive semantic of web resources with different methodologies, “keyword extraction, term assignment and social tagging” (see Table 1).

Table 1 The approaches of assigning descriptive annotations
Table 2 Comparison of related works

The current controlled vocabulary-based approaches employ background knowledge in the form of hierarchical ontologies [10, 31, 32] or an expert-annotated corpus (thesaurus) [30] to improve the performance of text mining algorithms for extracting resources’ terms. However, maintaining and enriching an ontology amid the rapid growth of shared web resources is expensive in terms of time and the cost of professional indexers’ services. Besides, web resources might have insufficient or absent textual content or inaccessible representative data [35]. Insufficient descriptive data about the resources hampers the automatic text mining tasks. In a folksonomy, multiple authors (folks) produce collections of tags which represent textual descriptive annotations of the large set of web resources. Moreover, semantic web researchers have focused their discussions on social involvement, rather than on coping with the extraction of knowledge from free text [41].

Compared to these related works, we propose a combined annotation method that semantically enriches the description of web resources by exploring collaborative tagging (integrating human cognition) and bridging the advantages of the discussed approaches.

4 Proposed approach: a combined emergent semantic annotation

The proposed approach retrieves relevant tags from the folksonomy and extracts main keywords from the resource’s text content, together with the matching terms of a controlled vocabulary used as a reference. It thus provides a combined semantic annotation describing the web resource’s content. Extracting keywords from a web resource’s text content alone could be inconsistent. For instance, two authors might publish similar web resources described with different main keywords. Consequently, it is relevant to extract their set of matching terms using a controlled vocabulary represented by a lightweight ontology. The steps of the proposed methodology (see Fig. 2) are as follows.

4.1 Content-based main keywords and extracted ontology terms

The extraction of main keywords aims to describe the main topic of a web resource. Machine learning methods handle automatic keyword extraction as a supervised learning problem, which needs a training dataset and classifiers. It has been extensively addressed using the open-source software KEA [23], which uses a supervised machine learning method based on a naive Bayes classifier. KEA is used to automatically extract keywords or key phrases either from free text (content-based main keywords) or with reference to a controlled vocabulary (matching terms). This has encouraged several researchers [42, 43] to adapt or extend KEA to extract keywords from text content. Therefore, our approach considers an extension of KEA’s classifier to extract content-based main keywords and ontology terms. The proposed approach explores folksonomy tags to build a model that learns the extraction strategy from the manually assigned annotations.

Step 1 The extraction of main keywords consists of two stages [42]. The first stage generates candidate keywords by tokenizing the text into sentences, applying a stop word list and extracting candidates (one or more words). The extracted candidates are reduced to their roots by applying a stemmer (e.g. the Lovins stemmer [44]). The second stage filters the candidate keywords, which involves generating features for each candidate. The commonly used features are: the frequency of each candidate (the TFxIDF score combines the word’s frequency with the inverse document frequency to select relevant frequent keywords); the occurrence (a candidate appears at least twice); the type of a candidate (noun phrase, not exceeding trigrams); and the position of the candidate in the text content (beginning and end). In the filtering stage, these features are computed for each candidate as inputs to the machine learning model, which outputs the probability that the candidate is indeed a main keyword.
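The two stages of Step 1 can be illustrated with a minimal Python sketch. It is not KEA’s implementation: the stop word list, the function names `candidates` and `candidate_features`, and the rule that a candidate may neither start nor end with a stop word are assumptions made for illustration. Stemming and sentence splitting are omitted, and the classifier that would consume these features is not shown.

```python
import math
import re
from collections import Counter

# Toy stop word list (assumption); a real system would use a full list.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidates(text, max_len=3):
    """Return (phrase, relative position) pairs for every n-gram of up to
    max_len words that neither starts nor ends with a stop word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = []
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            gram = tokens[i:i + n]
            if len(gram) < n:
                break  # ran past the end of the text
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            found.append((" ".join(gram), i / len(tokens)))
    return found


def candidate_features(doc, corpus):
    """Compute TFxIDF, occurrence count and first position per candidate.
    Assumes `corpus` is a list of texts that contains `doc` itself."""
    cands = candidates(doc)
    counts = Counter(phrase for phrase, _ in cands)
    first_position = {}
    for phrase, position in cands:
        first_position.setdefault(phrase, position)
    phrase_sets = [{p for p, _ in candidates(text)} for text in corpus]
    features = {}
    for phrase, count in counts.items():
        doc_freq = sum(1 for phrases in phrase_sets if phrase in phrases)
        tf = count / len(cands)
        idf = math.log(len(corpus) / doc_freq)
        features[phrase] = {
            "tfidf": tf * idf,
            "occurrences": count,
            "first_position": first_position[phrase],
        }
    return features
```

In this sketch a phrase that occurs in every document of the corpus receives a TFxIDF of zero, which is the intended effect of the inverse document frequency: frequent but non-discriminative candidates are pushed down the ranking.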

Step 2 The extraction of the set of terms matching the text content of a web resource relies on matching each candidate term to the descriptions of the ontology’s concepts [42]. Candidate terms are generated from the text content using normalization techniques: collecting word sequences up to the length of the longest term in the vocabulary, lowercasing, removing stop words and stemming. Then, each candidate term is ranked by its semantic relatedness to all other candidate terms. The more a candidate is related to the others, the more significant it is. The filtering stage helps to avoid ambiguity during the mapping. A machine learning technique computes the probability (“score”) of each candidate keyword and candidate term being, respectively, a content-based main keyword and a matching ontology term. The final sets of main keywords and matching ontology terms are selected by setting a threshold (a limit on the number of top-ranked candidates).
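The normalization and mapping part of Step 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of [42]: the vocabulary is a hypothetical dictionary from concept identifiers to labels, the plural-stripping rule stands in for a real stemmer, and the number of matches replaces both the relatedness ranking and the learned score, which are not reproduced here.

```python
import re
from collections import Counter

# Toy stop word list (assumption).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def normalize(phrase):
    """Lowercase, drop stop words and strip a trailing 's' from longer words.
    The plural stripping is a crude stand-in for a real stemmer."""
    words = [w for w in re.findall(r"[a-z]+", phrase.lower()) if w not in STOPWORDS]
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)


def match_terms(text, vocabulary, top_k=5):
    """Map n-grams of `text` to concepts of `vocabulary` ({concept_id: label}).
    Returns up to top_k (concept_id, match count) pairs, most frequent first."""
    index = {normalize(label): concept for concept, label in vocabulary.items()}
    longest = max(len(key.split()) for key in index)
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for i in range(len(tokens)):
        for n in range(1, longest + 1):
            gram = tokens[i:i + n]
            if len(gram) < n:
                break  # ran past the end of the text
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue  # avoids counting the same match twice
            concept = index.get(normalize(" ".join(gram)))
            if concept is not None:
                hits[concept] += 1
    return hits.most_common(top_k)
```

The `top_k` parameter plays the role of the threshold mentioned above: only the top-ranked candidates are kept as matching ontology terms.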

Step 3 Many supervised systems for extracting main keywords and matching terms are based on KEA, like Maui [42]. However, folksonomy tags have not previously been exploited in the extraction strategy. The Multi-purpose Automatic topic Indexing keyword extraction system (Maui) is KEA’s successor and uses Wikipedia as a reference. Maui uses a supervised algorithm based on a bagged decision trees classifier to rank candidates. The extraction strategy is learned from the manual annotation, which uses the keywords assigned by the resources’ authors. The novelty of our proposed approach lies in exploiting relevant tags extracted from the folksonomy: the manual annotation is created not only with the authors’ keywords but also with the relevant folksonomy tags, which additionally feed the training data. The larger the training data, the more accurate the classifier becomes. However, the approach considers only the relevant tags among the generated folksonomy tags. Using both relevant tags and authors’ keywords in the manual annotation will improve the classifier’s accuracy, and consequently enhance the extraction strategy to obtain more accurate content-based main keywords and matching ontology terms.

4.2 Retrieving relevant folksonomy tags

Folksonomy tags not only describe web resources but also summarize their content (“abstractive summary”) by expressing the users’ understanding. No standard algorithm has yet matched the abstractive summaries produced by humans [21]. Tags also reflect users’ opinions, attract readers and invite them to contribute their own tags. Besides, the keywords of the resources’ authors are often not sufficiently expressive for ordinary users. However, the folksonomy lacks semantics.

Step 4 Tag processing is required in order to handle the low quality of the generated folks’ tags. A spell checker and a blacklist of forbidden words are used to eliminate personal, misspelled and multi-word tags (e.g. “BreastCancer” and “Breast-Cancer”). The folksonomy suffers from inconsistent tags due to its uncontrolled vocabulary. However, applying a stemmer reduces word variations to their stems (e.g. “Infectious” and “Infection” are reduced to their root “Infect”). The consistency of each tag can be assessed by finding it in a thesaurus, or by requiring that it be used by at least two distinct users, depending on the size of the community. To better address the quality degradation of folksonomy, different tag quality measurements are possible by applying guidelines, rules and regulations [45]. The more experts assign a term as a quality tag, the more it is assumed to be relevant. Nonetheless, more comprehensive folksonomies emerge from non-expert or novice users’ tags than from experts’ tags only [46]. Therefore, the proposed approach considers the extraction of tags which are frequently used and understood by many users of the community. A community of users \(U = \{u_h\}\) annotates a set of web resources \(R = \{r_k\}\) with a set of tags \(T = \{t_i\}\), where \(1 \leqslant h \leqslant l\), \(1 \leqslant k \leqslant m\), \(1 \leqslant i \leqslant n\), and \(l\), \(m\) and \(n\) are finite numbers.

We consider a resource \(r_k \in R\) described by a set of tags from \(T\). The extraction of relevant tags describing this resource \(r_k\) is computed by considering the degree of frequency of each tag \(t_i\) (1), denoted by \(DF(r_k,t_i)\).

$$\begin{aligned} DF(r_k,t_i) = \sqrt{FT(r_k,t_i)^2 + FU(r_k,t_i)^2} \end{aligned}$$
(1)

where \(FT(r_k,t_i)\) is the frequency of the tag \(t_i\) annotating the resource \(r_k\) (2);

\(FU(r_k,t_i)\) is the frequency of users who use the tag \(t_i\) to annotate the resource \(r_k\) (3).

$$\begin{aligned} FT(r_k,t_i)= & {} \frac{Number\ of \ times\ the \ tag \ t_i\ is\ used\ to\ describe\ the\ resource \ r_k}{Number \ of \ tags\ used\ to\ describe\ the\ resource\ r_k } \end{aligned}$$
(2)
$$\begin{aligned} FU(r_k,t_i)= & {} \frac{Number\ of \ users \ who\ use\ the \ tag \ t_i\ to\ annotate\ the\ resource\ r_k}{Number\ of\ users\ who\ annotate\ the\ resource\ r_k}\nonumber \\ \end{aligned}$$
(3)

The relevant tags are those with the highest degree of frequency.
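As a worked illustration of Eqs. (1) to (3), the following Python sketch computes the degree of frequency of every tag of one resource. The input format (a list of `(user, resource, tag)` triples) and the function name `degree_of_frequency` are assumptions. The denominator of Eq. (2) is read here as the total number of tag assignments on the resource, which is one possible interpretation of the text.

```python
import math
from collections import Counter


def degree_of_frequency(assignments, resource):
    """Return {tag: DF(resource, tag)} following Eqs. (1)-(3).

    `assignments` is an iterable of (user, resource, tag) triples.
    """
    rows = [(user, tag) for user, res, tag in assignments if res == resource]
    if not rows:
        return {}  # never-tagged resource
    tag_counts = Counter(tag for _, tag in rows)
    total_assignments = len(rows)            # denominator of Eq. (2), as assumed
    annotators = {user for user, _ in rows}  # denominator of Eq. (3)
    scores = {}
    for tag, count in tag_counts.items():
        ft = count / total_assignments
        tag_users = {user for user, t in rows if t == tag}
        fu = len(tag_users) / len(annotators)
        scores[tag] = math.sqrt(ft ** 2 + fu ** 2)  # Eq. (1)
    return scores
```

For instance, if three users annotate a resource with four tag assignments and two of them use the tag “cancer”, then FT = 2/4, FU = 2/3 and DF is approximately 0.83, which ranks “cancer” above a tag used once by a single user (DF approximately 0.42).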

Step 5 The purpose of constructing a hierarchical graph of tags is to highlight the differences between tags having the same meaning (synonymous tags). It constructs taxonomic relationships (broader, narrower) among tags. The hierarchy of tags is built based on the inclusion index [3]. \(I_i(t_i,t_j)\) measures the inclusion of the tag \(t_i\) with respect to the tag \(t_j\) (4). For example, \(I_1(t_1,t_4) > I_4(t_4,t_1)\) indicates that the tag \(t_1\) is more general than the tag \(t_4\) (i.e. the tag \(t_1\) is broader than the tag \(t_4\)). Consequently, each tag \(t_i\) has an inclusion score \(S_i(t_i)\) that identifies how strongly the tag \(t_i\) is related to other tags (5).

For \(t_i, t_j \in T\), \(t_i \ne t_j\)

$$\begin{aligned}&\displaystyle I_i(t_i,t_j) = \frac{Number\ of \ resources \ described \ by \ both \ tags \ t_i \ and \ t_j}{Number\ of \ resources \ described \ with \ the \ tag\ t_j} \end{aligned}$$
(4)
$$\begin{aligned}&\displaystyle S_i(t_i)=\sum _{j=1}^{n} I_i(t_i,t_j) \end{aligned}$$
(5)
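Equations (4) and (5) can be illustrated with a short Python sketch. The input format (a list of `(resource, tag)` pairs) and the function name `inclusion_scores` are assumptions; following the condition \(t_i \ne t_j\), the sum of Eq. (5) skips the term \(j = i\).

```python
from collections import defaultdict


def inclusion_scores(annotations):
    """Return (I, S): I[(ti, tj)] is the inclusion index of Eq. (4) and
    S[ti] the inclusion score of Eq. (5).

    `annotations` is an iterable of (resource, tag) pairs.
    """
    resources_of = defaultdict(set)
    for resource, tag in annotations:
        resources_of[tag].add(resource)
    tags = sorted(resources_of)
    inclusion = {
        (ti, tj): len(resources_of[ti] & resources_of[tj]) / len(resources_of[tj])
        for ti in tags
        for tj in tags
        if ti != tj
    }
    score = {ti: sum(inclusion[ti, tj] for tj in tags if tj != ti) for ti in tags}
    return inclusion, score
```

With a tag “health” assigned to three resources and a tag “diabetes” assigned to one of them, I(health, diabetes) = 1 while I(diabetes, health) = 1/3, so “health” is identified as the broader tag.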

Implicit relationships also play an essential role in enhancing the organization of web resources, for example by defining tag communities clustered into groups of semantically close tags [47]. The association between tags, resources and users will enhance the precision of detecting relevant tags and their semantic relationships (Fig. 3). The generated folks’ tags semantic graph is an undirected graph whose nodes represent the tags, linked together by edges weighted by \(W(t_i,t_j)\). The weight \(W(t_i,t_j)\) identifies the semantic relationship between tags (6). It measures how strongly two tags \(t_i\) and \(t_j\) are semantically related with regard to their common usage by distinct users, \(W_u(t_i,t_j)\) (7), and their joint assignment to describe web resources, \(W_r(t_i,t_j)\) (8).

$$\begin{aligned}&\displaystyle W(t_i,t_j) = \sqrt{ W_r(t_i,t_j)^2 + W_u(t_i,t_j)^2} \end{aligned}$$
(6)
$$\begin{aligned}&\displaystyle W_u(t_i,t_j) = \frac{Number\ of \ users \ who\ use\ both \ tags \ t_i \ and \ t_j}{Number\ of \ users \ in \ U} \end{aligned}$$
(7)
$$\begin{aligned}&\displaystyle W_r(t_i,t_j) = \frac{Number\ of \ resources \ described \ by \ both \ tags \ t_i \ and \ t_j}{Number\ of \ resources \ tagged\ with \ tags \ in \ T} \end{aligned}$$
(8)
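
As an illustration of Eqs. (6) to (8), the following Python sketch computes the edge weight from a list of (user, tag, resource) assignments. It assumes that a user “uses both tags” when both appear anywhere in that user's assignments, and that U and the set of tagged resources are exactly those present in the list. The function name `tag_weights` and the sample data are hypothetical.

```python
from math import sqrt


def tag_weights(assignments, ti, tj):
    """Return (W, W_u, W_r) for tags ti and tj, following Eqs. (6)-(8)."""
    tags_by_user, tags_by_resource = {}, {}
    for user, tag, resource in assignments:
        tags_by_user.setdefault(user, set()).add(tag)
        tags_by_resource.setdefault(resource, set()).add(tag)
    if not tags_by_user:
        return 0.0, 0.0, 0.0
    pair = {ti, tj}
    # Eq. (7): users who used both tags, over all users.
    w_u = sum(pair <= tags for tags in tags_by_user.values()) / len(tags_by_user)
    # Eq. (8): resources carrying both tags, over all tagged resources.
    w_r = sum(pair <= tags for tags in tags_by_resource.values()) / len(tags_by_resource)
    # Eq. (6): Euclidean combination of the two partial weights.
    return sqrt(w_r ** 2 + w_u ** 2), w_u, w_r


# Hypothetical toy assignments: (user, tag, resource).
assignments = [
    ("u1", "cancer", "r1"), ("u1", "breast", "r1"),
    ("u2", "cancer", "r1"), ("u2", "breast", "r2"),
    ("u3", "cancer", "r3"),
    ("u4", "microarray", "r4"),
]
w, w_u, w_r = tag_weights(assignments, "cancer", "breast")
print(w, w_u, w_r)
```

Here two of the four users use both tags (W\(_{u}\) = 0.5) and one of the four resources carries both (W\(_{r}\) = 0.25), which gives W of about 0.559.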

Therefore, the emergent folks’ tags semantic graph (see Fig. 4) is useful for describing the relationships among web resources annotated with connected tags. For instance, the recommender system of tags can take advantage of this graph to recommend semantically close tags. It allows graph-based reasoning about the relationships between tags attributed to different resources. This reasoning can be extended by projecting tags onto the descriptions of the ontology’s concepts. Conversely, the ontology can benefit from the emergent semantic graph of folks’ tags by adding new terms (relevant tags) that clearly describe related contents. The alignment of folksonomy and ontology enriches the descriptions of the ontology’s concepts with additional information drawn not only from new frequently used tags but also from their semantic relationships. The ontology’s concepts are enriched by mapping relevant tags to the attributes of the matching concept, guided by the formalism of the Simple Knowledge Organization System (SKOS) [37]. The SKOS vocabulary [48] is a common data model built on the Resource Description Framework (RDF). Its aim is to describe an ontology’s concepts and their semantic relationships (broader, narrower and related).

Fig. 3 Joint tagged resources driven tags’ graph

Fig. 4 Folks’ tags semantic graph

Step 6 The preference for a tag depends on the user’s motivation. Two types of users are involved in tagging: categorizers, who employ their mental models and personal preferences, and describers, who summarize the resource’s content using mostly synonyms [49]. Users’ interests may change gradually over time, and so may the significance of the tags they use. Consequently, the relevance and significance of the generated tags depend on how close they are to the current period of time [50]. The factors that influence user tagging behavior have a direct impact on the quality of the folksonomy’s tags. Besides, the lack of guidelines induces significant variations in the tags used to describe a web resource. For this reason, our study proposes the recommendation of tags to enhance the quality of the generated folksonomy and improve the tags attributed to web resources. Tag recommendations can help the folksonomy converge to a common vocabulary built from more reliable, descriptive and heterogeneous tags. Accordingly, they can alleviate the drawbacks of folksonomy mentioned before (synonymy and polysemy of tags). The recommender system of tags incentivizes users to annotate a large number of resources. The suggested tags reduce the cognitive load users face when choosing appropriate tags to describe a resource. The tags previously assigned to web resources influence users’ choices when assigning new descriptive tags [51]. However, little attention is given to new web resources, because the suggestion of tags relies only on the tags previously assigned to the same or similar web resources. Consequently, it is pertinent to recommend the main keywords and matching ontology terms of a new, never-tagged resource in the cold start.
Therefore, the database of the recommender system of tags is fed with relevant folksonomy tags as well as with the extracted main keywords and their matching ontology terms. The recommender system of tags thus narrows the gap between the uncontrolled nature of tags and the conceptual terms of the ontology.
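
The cold-start behaviour described above can be sketched as a simple fallback. This is only an illustrative Python sketch under stated assumptions: the function `recommend_tags`, its parameters and the sample data are hypothetical, and ranking previously assigned tags by frequency is one possible choice rather than the ranking used by the proposed recommender.

```python
def recommend_tags(resource_id, folksonomy_tags, keywords, ontology_terms, k=5):
    """Return up to k candidate tags for a resource.

    folksonomy_tags: resource id -> {tag: frequency} (previously assigned tags)
    keywords:        resource id -> list of extracted main keywords
    ontology_terms:  resource id -> list of matching ontology terms
    """
    tagged = folksonomy_tags.get(resource_id, {})
    if tagged:
        # Resource already tagged: suggest its most frequent tags first.
        ranked = sorted(tagged, key=tagged.get, reverse=True)
    else:
        # Cold start: fall back to main keywords, then ontology terms.
        ranked = list(keywords.get(resource_id, []))
        for term in ontology_terms.get(resource_id, []):
            if term not in ranked:
                ranked.append(term)
    return ranked[:k]


# Hypothetical data: r1 is already tagged, r2 has never been tagged.
folksonomy_tags = {"r1": {"cancer": 5, "breast": 3, "human": 1}}
keywords = {"r2": ["gene expression", "prognosis"]}
ontology_terms = {"r2": ["breast neoplasms", "prognosis"]}

print(recommend_tags("r1", folksonomy_tags, keywords, ontology_terms, k=2))
print(recommend_tags("r2", folksonomy_tags, keywords, ontology_terms))
```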

5 Evaluation and results

In order to evaluate the performance of the proposed approach, we collected 550 random bio-medical articles “web resources” (Figs. 5, 6) described with their authors’ keywords and annotated with tags from the “folksonomy” CiteULike [52]. We used the Medical Subject Headings (MeSH) as the controlled vocabulary “lightweight ontology”. MeSH terms [53], managed by the U.S. National Library of Medicine, describe bio-medical research items. We used 548 bio-medical articles for training and 2 articles for testing, namely Article A [54] and Article B [55]. We compared the Python implementation of Rapid Automatic Keyword Extraction (RAKE) [56] against the multi-purpose automatic topic indexing tool Maui [42]. Their performances are compared using the standard information retrieval measures [57]: Precision P is the percentage of correct annotations “keywords or terms” among those extracted (9); Recall R is the percentage of correctly extracted annotations “keywords or terms” among all correct ones (10); F-Measure F is the harmonic mean of P and R (11). The manually assigned annotations are taken as the correct annotations.

$$\begin{aligned}&\displaystyle P = \frac{Number\ of \ correct \ extracted\ annotations}{Number\ of \ all \ extracted \ annotations} \end{aligned}$$
(9)
$$\begin{aligned}&\displaystyle R = \frac{Number\ of \ correct \ extracted\ annotations}{Number\ of \ all \ correct \ annotations} \end{aligned}$$
(10)
$$\begin{aligned}&\displaystyle F = \frac{2 \times P \times R }{P + R} \end{aligned}$$
(11)
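
As a worked illustration of Eqs. (9) to (11), the following Python sketch computes P, R and F for one set of extracted annotations. The keyword lists are hypothetical and are not taken from the evaluation corpus.

```python
def precision_recall_f(extracted, correct):
    """Return (P, R, F) as defined in Eqs. (9)-(11)."""
    extracted, correct = set(extracted), set(correct)
    hits = len(extracted & correct)  # correct extracted annotations
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(correct) if correct else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical example: 4 extracted keywords, 5 manually assigned ones.
extracted = ["breast cancer", "gene expression", "microarray", "lighting"]
correct = ["breast cancer", "gene expression", "prognosis", "survival", "signature"]

p, r, f = precision_recall_f(extracted, correct)
print(p, r, f)
```

With two correct keywords out of four extracted and five expected, P = 0.5, R = 0.4 and F is about 0.444.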
Fig. 5 Word cloud of the corpus generated with the statistical software R

Fig. 6 Word cloud of articles A and B generated with the statistical software R

For better performance, we consider ensemble machine learning algorithms to extract content-based main keywords and matching ontology terms. These multi-classifier methods aim to enhance the precision of the model’s predictions. They train multiple models using the same learning algorithm, where a set of weak learners (e.g. one-level decision tree “J48”) is combined to obtain a strong learner (e.g. AdaBoostM1). We trained Maui with two ensemble machine learning classifiers to rank candidates: Maui based on the bagging decision trees classifier (Maui Bag), and Maui based on the boosting classifier AdaBoostM1 using classification trees as single classifiers (Maui Boost).

The highest values of precision P, recall R, and F-measure F are highlighted in bold (see Tables 3, 4). RAKE, Maui Bag and Maui Boost are each used to extract 8 main keywords from the two bio-medical testing articles. The highest values are achieved using the manual annotation “authors’ keywords with relevant tags” with Maui based on the bagging classifier (Maui Bag) (see Table 3). In the cold start, using the manual annotation “authors’ keywords” to train the boosting classifier (AdaBoostM1) of Maui Boost provides better performance. The accuracy of extracting main keywords is improved by training Maui on manually chosen relevant tags added to authors’ keywords, which builds a model that learns the keyword extraction strategy based on the bagging decision trees classifier. RAKE, in contrast, shows limited accuracy because its lack of normalization excludes valid candidates.

Table 3 Comparing performances of keyword extracting tools (main keywords)

The SKOS version of the MeSH terms [58] is used as the lightweight ontology. The highest results are obtained for the fourth category of manual annotation, “authors’ keywords with relevant tags”, using Maui Bag (see Table 4). The term assignment model matches each candidate term against the MeSH ontology terms. It extracts 10 MeSH terms for each testing bio-medical article.

Table 4 Comparing performances of keyword extracting tools (MeSH Terms)

The evaluation supports the relevance of exploiting relevant folksonomy tags to enrich the manual annotations. By combining two types of manually assigned keywords, “authors’ keywords and relevant tags”, we observe better performance in extracting both MeSH terms and content-based main keywords. These results demonstrate the effectiveness of our proposal, which combines semantic annotation strategies to pertinently describe a web resource.

Therefore, we consider that each web resource is represented by a vector of attributes (12). Each attribute is a pair consisting of a metadata item and its computed score. We define the description of a web resource as follows:

$$\begin{aligned} Description\ Web\ Resource= & {} \{(metadata,score)\} \nonumber \\ metadata= & {} \left\{ \begin{array}{l} Relevant\ Tags \\ Main\ Keywords \\ Ontology\ Terms \end{array}\right. \end{aligned}$$
(12)

A web resource’s metadata are the relevant folksonomy tags, extracted content-based main keywords, and matching ontology terms retrieved from the lightweight ontology.

Description Article A ={(cancer, 0.936); (breast, 1.409); (breast cancer, 0.488); (risk factors, 0.437); (sequence Analysis DNA, 0.199); (signature, 0.0003); (microarray, 0.0003); (human, 0.0003); (breast neoplasms, 0.870); (neoplasms, 0.854); (computational Biology, 0.544); (systems biology, 0.496); (gene expression, 0.309); (classification, 0.293); (network, 0.344); (gene expression profiles, 1.105); (lighting, 0.20)}

Description Article B ={(cancer, 1.004); (breast, 1.342); (breast cancer, 0.565); (signature, 0.804); (microarray, 0.421); (prognosis, 0.351); (human, 0.012); (gene expression, 0.818); (oncogenes, 0.304); (neoplasms, 0.288); (carcinogens, 0.304); (breast neoplasms, 0.860); (survival, 0.345); (hospitals urban, 0.274); (survival analysis, 0.391); (prognostic gene, 0.325); (menopause, 0.287); (classification, 0.003)}

Fig. 7 Vectors describing the two articles A and B based on the relevant tags, main keywords and matching MeSH terms

6 Semantic similarity perspectives and alternatives

The emergent descriptive semantics of a web resource are represented in a Vector Space Model. The similarity between web resources is computed from the similarity of their descriptive vectors. A relevant clustering of web resources can thus be obtained, under the assumptions of similarity theory, by comparing their descriptive vectors.

The semantic similarity between the two vectors describing the two web resources “Article A and Article B” is derived from the scores of their overlapping metadata (see Fig. 7). The comparison (see Table 5) of the two vectors is based on their descriptive metadata, using either their content-based main keywords, or the MeSH terms extracted from the ontology, or both of them together with the relevant tags. The similarity between the two vectors is computed with widely used measures (see Table 5), namely cosine similarity, Euclidean distance, Manhattan distance and Jaccard similarity. For the distance measures, the smaller the distance, the higher the degree of similarity between the web resources’ descriptive vectors. For cosine similarity, the dot product of the two vectors is divided by the product of their norms. In Jaccard similarity, by contrast, the number of common attributes is divided by the number of attributes that exist in at least one of the two resources’ vectors. In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself. If the cosine value is close to zero, the web resources’ vectors are nearly orthogonal and therefore dissimilar.
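
The four measures can be sketched in Python as follows, using the standard vector definitions (cosine as the dot product divided by the product of the norms, Jaccard on the sets of attribute names). The description vectors are stored as dictionaries, a missing attribute is assumed to have score 0, and the two small vectors are a truncated, partly hypothetical excerpt, so the printed values are not those of Table 5.

```python
from math import sqrt


def aligned(a, b):
    """Project two {metadata: score} descriptions onto a shared attribute list."""
    keys = sorted(set(a) | set(b))
    return [a.get(k, 0.0) for k in keys], [b.get(k, 0.0) for k in keys]


def cosine(a, b):
    x, y = aligned(a, b)
    dot = sum(p * q for p, q in zip(x, y))
    norm = sqrt(sum(p * p for p in x)) * sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def euclidean(a, b):
    x, y = aligned(a, b)
    return sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def manhattan(a, b):
    x, y = aligned(a, b)
    return sum(abs(p - q) for p, q in zip(x, y))


def jaccard(a, b):
    """Shared attribute names over attribute names present in either vector."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


# Illustrative excerpt only: three attributes per article.
article_a = {"cancer": 0.936, "breast": 1.409, "risk factors": 0.437}
article_b = {"cancer": 1.004, "breast": 1.342, "prognosis": 0.351}

print(cosine(article_a, article_b), euclidean(article_a, article_b))
print(manhattan(article_a, article_b), jaccard(article_a, article_b))
```

Aligning both descriptions on the union of their attributes keeps the distance measures well defined when the two resources share only part of their metadata.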

The larger the distance values and the smaller the cosine similarity between the two vectors, the more clearly their descriptions distinguish the two web resources (i.e. the less likely two distinct web resources are mistakenly grouped into one cluster). Comparing the similarity results of our case study, we notice that the most relevant values are obtained when the web resources’ vectors are described with the three types of metadata “relevant tags, main keywords and extracted MeSH terms”. These results demonstrate the effectiveness of a combined annotation approach for pertinently describing web resources. This emergent semantics of web resources will help in their clustering and organization.

Table 5 Comparing the similarity of the two vectors

However, the web resources’ descriptive metadata carry uncertainty inherited from the folksonomy. The imprecision of the emergent semantics describing the web resources affects their clustering. For instance, we cannot state with certainty that two web resources are strongly related based on their semantic descriptions. Therefore, the semantic similarity can be computed using a fuzzy logic approach that manages this uncertainty. It evaluates the similarity of web resources’ descriptive vectors based on degrees of truth rather than on strictly true or false Boolean logic. The web resources’ comparison perspective will focus on soft computing techniques, mainly fuzzy-based semantic similarity of web resources. The choice of fuzzy-logic-based similarity measurement is motivated by the uncertainty, vagueness and imprecision of the tags describing web resources. Fuzzy logic approximates the human way of thinking and judging. The web resources will not simply be similar or not, but will instead fall into four levels of similarity. The fuzzy rule statements are built on a fuzzy inference system, described as a collection of fuzzy if-then rules that perform logical operations on fuzzy sets (see Table 6). The inputs are the overlap (co-occurrence) of the descriptive metadata and their scores. The output is the similarity of the two web resources’ vectors. We used the Matlab Fuzzy Logic Toolbox with the triangular membership function to illustrate those fuzzy rules (see Fig. 8). For instance, if the overlap of metadata and their scores are high, then the degree of similarity between the two vectors is very high (i.e. the percentage of similarity is between 80 and 100%). In this case, the two compared web resources are highly similar to each other.

Table 6 Fuzzy rules
Fig. 8 Fuzzy surface view
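
The idea of rule-based fuzzy similarity can be sketched in plain Python. This is not the Matlab Fuzzy Logic Toolbox model: the membership breakpoints, the normalization of both inputs to [0, 1], and three of the four rules are assumptions chosen for illustration. Only the rule “high overlap and high score imply very high similarity” is taken from the text. The sketch takes the minimum for the AND of a rule and returns the strongest output level without defuzzification.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a <= b <= c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify(x):
    """Degrees of membership of x in the assumed 'low' and 'high' sets."""
    return {"low": triangular(x, 0.0, 0.0, 0.6),
            "high": triangular(x, 0.4, 1.0, 1.0)}


# Hypothetical rule base: (overlap, score) -> similarity level.
RULES = {
    ("low", "low"): "low",
    ("low", "high"): "medium",
    ("high", "low"): "high",
    ("high", "high"): "very high",  # the rule stated in the text
}


def similarity_level(overlap, score):
    """Fire every rule (AND = min) and return the strongest level."""
    mu_overlap, mu_score = fuzzify(overlap), fuzzify(score)
    strengths = {}
    for (o, s), level in RULES.items():
        fired = min(mu_overlap[o], mu_score[s])
        strengths[level] = max(strengths.get(level, 0.0), fired)
    return max(strengths, key=strengths.get), strengths


level, strengths = similarity_level(0.9, 0.95)
print(level, strengths)
```

A full fuzzy inference system would additionally aggregate the fired output sets and defuzzify them into a similarity percentage.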

The emergent semantic similarity of web resources expresses the intensity of similarity among them. Therefore, web resources are organized based on their expressed descriptive semantics. This fits the Linked Open Data (LOD) initiative, which encourages the organization of shared web resources by expressing their semantics and interlinking them. The effectiveness of recommender systems has been investigated by exploiting Linked Open Data [59]. Our goal is to explore the semantic relatedness of web resources in order to improve the recommendation process. Indeed, an effective classification and clustering of web resources will enhance a semantic-based context-aware recommender system by suggesting similar items that fit users’ preferences.

7 Conclusion and future works

The advent of the collaborative social web has produced an extended set of web resources. It has drawn attention to the importance of extracting only the relevant descriptive information of web resources. Indeed, to achieve an optimal organization of the growing number of shared web resources, it is essential to pertinently retrieve their relevant semantic descriptors. This paper presents a combined semantic annotation approach to pertinently describe web resources by overcoming folksonomy’s weaknesses. Each web resource is described with its semantic descriptors “metadata”, namely relevant folksonomy tags, content-based main keywords and extracted matching ontology terms. Moreover, the proposal incorporates a recommender system of tags that aims to improve the folksonomy’s quality by solving the cold start problem of tagging and guiding the generation of new tags. The tag recommendations will improve users’ understanding, promote their contribution and enhance the description of the resources. The experimental evaluation has shown relevant results attesting to the effectiveness of our approach. Future work will focus on capturing, describing and exploring the context that arises from the application domains (healthcare, education, tourism). We aim to investigate the potential of using LOD to increase the semantic relatedness of web resources. Our future challenge will be the development of a semantic-based context-aware recommender system of web resources to address the needs of a community of users in a specific domain of interest (health communities of practice, social learning, open university and the valorization of cultural heritage). The recommendations of relevant resources will meet users’ needs, increase their interest and improve their interactions.