1 Introduction

Web usage mining (WUM), also known as web log mining, is the application of data mining techniques to web data in order to extract relevant information and discover useful patterns [1], with the aim of improving the usefulness of various web-based applications.

The process of web usage mining can be broadly divided into four phases: sourcing or collection of data, pre-processing or removal of 'noise', discovery of interesting patterns and, finally, analysis of the discovered patterns [2].

The first phase is simply the sourcing of the data or information that is to be processed from various resources, which in our case will predominantly be the log files obtained from Web servers, Web proxy servers and client browsers [3]. The quality of the source log file is improved in the second phase by removing the extraneous, immaterial data termed 'noise', making the log file ready for further processing such as user and session identification [4,5,6]. In the third phase, various statistical techniques such as 'association', 'classification' and 'clustering' are applied to the pre-processed data to discover interesting arrangements or patterns [7,8,9].

In the last phase, the identified patterns are subjected to various analytical tools and mechanisms [1, 10] to finally extract the 'essence', or knowledge, which has applications in a wide variety of fields such as commerce, improvement of web-based applications, identification of criminals and international security, attracting and retaining customers, and increasing website visits.

The present age is rightly referred to as the age of information and knowledge, as having useful information and knowledge at the right time gives one a huge advantage over others, leading to appropriate and efficient decision making and plan execution.

But in the last two decades, with the advent of internet technology and open-source resources, there has been a humongous surge in the amount of information, leading to 'information overload'. It becomes a challenging and time-consuming task to sift through this huge volume of data to extract relevant information on a topic. To overcome this issue, certain techniques have been developed that help us retrieve relevant results efficiently and accurately from the web. The plethora of data mining techniques, methodologies and algorithms that are applied to web data and web logs to extract relevant data and discover useful patterns, with the aim of improving the usefulness of various web-based applications, is known as web usage mining.

An ontology is an explicit formal specification of the terms in a domain and the relations among them [11]. Commonly, it is defined to consist of abstract concepts and relationships only; in some rare cases, ontologies are defined to include instances of concepts and relationships as well [11].

In this paper, it is shown how web log data can help to continuously update and improve the knowledge base of an existing ontology. Web mining techniques can be applied to web log files to obtain suggestions for improving an existing ontology, and some research has already been done in this field [11,12,13,14,15,16]. In this work, the researcher has used Protégé 5.0 (for ontology construction) and the Weka tool (for data mining algorithms). The thrust of this paper is facilitating information retrieval through a novel ontology management approach based on web log data.

2 Proposed methodology

The proposed methodology for the novel ontology improvement approach is described in Fig. 1.

Fig. 1 Steps of proposed methodology for ontology improvement

In Step 1 of this methodology, the researcher has implemented each phase of pre-processing: data cleaning, user identification and session identification. In the session identification phase, the researcher has used the proposed Semantic-Time-Referrer based algorithm [17]. In parallel, the researcher has also constructed a new string instrument ontology in the music domain using Protégé 5.0 (regarded as the best and most commonly used ontology editor [18]), intended to enhance information retrieval, as illustrated in Step 1(a). In the next step (Step 2), the researcher has extracted two new features from the pre-processed log file and has implemented some web usage mining algorithms (Step 3) in order to extract suggestions for classes, concepts and relationships (Step 4) to update the knowledge base that was built in Step 1(a). The algorithm corresponding to Fig. 1 is given below.

3 Ontology construction

An ontology includes machine-interpretable definitions of the basic concepts of a specific domain and the relations among those concepts and entities. Ontologies are applied in fields such as e-science, medicine, the organization of complex and semi-structured information, military/government applications and the Semantic Web. Ontology data models retrieve information semantically, and Protégé 5.0 is regarded as the best tool to create ontologies easily, quickly and efficiently for any domain [19]. In this paper, the domain-specific ontology is constructed using the key concepts and words related to string instruments in the domain of music, using Protégé 5.0 [18]. The OWLviz visualization of the string_instrument ontology is shown in Fig. 2.

Fig. 2 OWLviz visualization of string_instrument ontology
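The ontology itself was authored interactively in Protégé 5.0. Purely as an illustrative sketch of a comparable class hierarchy, and not the actual string_instrument ontology, a programmatic version could be built with the owlready2 Python library; the class names, relation and IRI below are hypothetical.

```python
from owlready2 import get_ontology, Thing, ObjectProperty

# Hypothetical IRI; the actual ontology was constructed in Protégé 5.0.
onto = get_ontology("http://example.org/string_instrument.owl")

with onto:
    class StringInstrument(Thing): pass          # top-level domain concept
    class Guitar(StringInstrument): pass
    class ElectricGuitar(Guitar): pass
    class AcousticGuitar(Guitar): pass
    class Accessory(Thing): pass

    class has_accessory(ObjectProperty):         # example relation between concepts
        domain = [StringInstrument]
        range = [Accessory]

onto.save(file="string_instrument.owl", format="rdfxml")
```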

For the maintenance of this ontology, it needs to be updated from time to time. Hence, the researcher has proposed a new approach in which the log data of various websites about string instruments is analyzed in order to retrieve new classes and concepts, as well as relations between new and existing classes. During the analysis phase, the log file covering 363 days of the year 2016 from a guitar-selling website has been used. After ontology construction, the researcher has, in parallel, performed pre-processing on the 'guitar' log file, which is discussed in the next section.

4 Pre-processing

In this step, the focus has been on enhancing the accuracy and quality of the log data to facilitate further analysis. The input 'guitar' log file is in the Combined Log Format, collected from a website related to teaching, learning and selling guitars and some other string instruments.

4.1 Data cleaning

The focus of data cleaning has been on improving the quality of the sourced log file by removing the irrelevant data termed 'noise', making the log file ready for further processing such as segregating it into 'sessions' and, finally, user identification. For cleaning, the following steps have been followed (a minimal sketch of these filters is given after the list):

  • Failed and corrupted requests have been removed.

  • Requests originated by web robots have been removed.

  • Requests made by methods other than GET have been removed.

  • Requests in which the transferred bytes are nearly zero have been removed.
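A minimal sketch of these cleaning filters, assuming the log has already been parsed into a pandas DataFrame with columns named status, user_agent, method and bytes (the column names are assumptions, not the exact fields of the original log), is:

```python
import pandas as pd

def clean_log(df: pd.DataFrame, min_bytes: int = 1) -> pd.DataFrame:
    """Apply the four cleaning filters described above (assumed column names)."""
    # 1. Remove failed and corrupted requests (keep only 2xx status codes).
    df = df[(df["status"] >= 200) & (df["status"] < 300)]
    # 2. Remove requests originated by web robots (simple user-agent heuristic).
    robot_pattern = r"bot|crawler|spider|slurp"
    df = df[~df["user_agent"].str.contains(robot_pattern, case=False, na=False)]
    # 3. Keep only requests made with the GET method.
    df = df[df["method"] == "GET"]
    # 4. Remove requests whose transferred bytes are (nearly) zero.
    df = df[df["bytes"] >= min_bytes]
    return df
```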

Table 1 summarizes the log file details; the requests in which the transferred bytes were nearly zero have been removed. The number of entries in the original log file was 24,945 and the number of entries in the cleaned log is 23,626; 707 irrelevant URLs were also removed.

Table 1 Details of log files before data cleaning

To identify the unique users, 'time constraints' have been used together with the IP address and agent fields, i.e. if the access time from the same IP address and agent field is very long (it crosses the threshold value), the proposed model automatically creates a new session. The threshold value is determined on the basis of the average access time of all unique users identified in the log file; the average access time has been taken as 2 h and 25 min. As shown in Table 2, the total number of entries in the cleaned log is 23,626. During this process, these 23,626 entries have been grouped into 12,334 unique users.

Table 2 User identification details for guitar log file

4.2 User and session extraction

In this phase, the identity of each 'unique' visitor or user has been established and his navigation pattern (which pages of the website have been accessed) is extracted using the IP address. User identification means identifying who accesses the website and, more precisely, which pages are accessed.

These 12,334 identified users have been used to extract the sessions. Appropriate session identification is a very important step in the pre-processing of log data. The Semantic-Time-Referrer method [17] is used to find the number of sessions. From 23,626 entries and 12,334 users, 12,760 sessions have been extracted. Before applying the data discovery techniques of web mining on these 12,760 sessions in order to extract relevant information, feature extraction is required; this is discussed in the next section of this paper.
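A minimal sketch of the time-threshold part of user and session identification is given below; it does not reproduce the full Semantic-Time-Referrer algorithm of [17], and the column names (ip, user_agent, timestamp) are assumptions.

```python
import pandas as pd

SESSION_TIMEOUT = pd.Timedelta(hours=2, minutes=25)  # average access time used as threshold

def identify_users_and_sessions(df: pd.DataFrame) -> pd.DataFrame:
    """Group entries into users by (IP, agent) and split sessions on the time threshold."""
    df = df.sort_values("timestamp").copy()
    # A user is approximated by the combination of IP address and user agent.
    df["user_id"] = df.groupby(["ip", "user_agent"]).ngroup()
    # Within each user, start a new session when the gap between requests
    # exceeds the threshold (or at the user's first request).
    gap = df.groupby("user_id")["timestamp"].diff()
    df["new_session"] = gap.isna() | (gap > SESSION_TIMEOUT)
    df["session_id"] = df["new_session"].cumsum()
    return df
```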

5 Feature extraction

Before applying the data discovery techniques, the features first have to be extracted, according to our problem, from the 12,760 sessions identified in the pre-processing phase. Two new feature sets have been extracted, as discussed hereunder.

5.1 File 1: similar page group hits

This feature groups the web pages of the website into classes and the sessions into similar types of visits. It helps us identify which types of pages have been accessed by which types of users. As a result of this feature extraction, the researcher has constructed 'File 1' for further analysis, which is discussed in Sect. 6.
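As an illustration only, the grouping of pages into classes could be implemented by mapping URL paths to class labels; the labels follow Sect. 6.1, while the path patterns below are assumptions about the site's structure.

```python
import re

# Hypothetical URL patterns; the real mapping depends on the site's structure.
PAGE_CLASS_PATTERNS = [
    (r"^/(index|home)", "E_HE"),               # home and first English page
    (r"^/(sitemap|contact|mail)", "E1_M"),
    (r"^/(software|courses)", "E2_SC"),
    (r"^/(audio|video|trial|download)", "E3_DD"),
    (r"^/(store|shop)", "E4_OS"),
    (r"^/(facebook|twitter|social)", "E5_SN"),
    (r"^/(es|fr|nl)/", "Olang"),                # pages in the other languages
]

def page_class(url_path: str) -> str:
    """Return the page class for a URL path, or 'Other' if none matches."""
    for pattern, label in PAGE_CLASS_PATTERNS:
        if re.search(pattern, url_path):
            return label
    return "Other"
```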

5.2 File 2: all keywords used to search in queries

For this feature, the researcher has retrieved only those sessions from the main guitar log which contain a search engine in their referrer field and has created a new log, namely the 'query' log. There are 563 transactions of search queries, and out of these, 417 and 423 are the unique users and sessions, respectively. From this data, the file for query analysis has been created. This file contains all queries of the query log. This feature lists the keywords which have been used in the search queries related to the guitar or string instruments. The list is made session-wise. A sample is shown in Table 3, which gives the first four entries in the query log. An Excel file has been made by splitting the query terms into cells and assigning them values. If fewer than five keywords have been searched, the remaining attributes have no value. The resulting file (File 2) of this feature includes only those sessions which have at least one searched keyword.

Table 3 A sample file of query log in each session
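A minimal sketch of how the search keywords can be extracted from the referrer field is given below; the list of search engines and their query-string parameters is an assumption and would need to match the engines actually present in the log.

```python
from urllib.parse import urlparse, parse_qs

# Common search engines and the name of their query-string parameter (assumed list).
SEARCH_ENGINES = {"google": "q", "bing": "q", "yahoo": "p", "ask": "q"}

def extract_query_terms(referrer: str):
    """Return the list of search keywords in a referrer URL, or [] if none."""
    parsed = urlparse(referrer)
    for engine, param in SEARCH_ENGINES.items():
        if engine in parsed.netloc:
            query = parse_qs(parsed.query).get(param, [""])[0]
            return query.lower().split()
    return []

# Example:
# extract_query_terms("https://www.google.com/search?q=multi+instrument+tuner")
# -> ['multi', 'instrument', 'tuner']
```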

In the next section, the web mining algorithms are applied to these two files, namely 'File 1' and 'File 2'.

6 Experiments and results of the log analysis

Two files, namely 'File 1' and 'File 2', have been created for the two new feature sets extracted in the previous section; further web mining techniques have been implemented on them to discover interesting and useful information.

6.1 Clustering results: on File 1

During the data cleaning phase, the guitar log file collected from the website, with 24,945 entries, has been cleaned. After removing the irrelevant data, the cleaned log file contains 23,626 entries, as shown in Table 1. The user and session identification phases grouped the 23,626 entries into 12,334 users and 12,760 sessions, respectively. The distribution of sessions and page hits over the different languages is shown in Table 4. The table shows that the English-language webpages of the guitar website have been accessed more frequently than those of any other language. Hence, the English webpages have been divided into more than one class, whereas the webpages of the other languages have been considered as a single class.

Table 4 Linguistic distribution of the hits

The results shown in the confusion Table 5 are obtained by applying the crossed clustering method to seven classes of pages and five classes of visits. The pages of the English language have been divided into 6 classes (E_HE, E1_M, E2_SC, E3_DD, E4_OS and E5_SN). E_HE contains the home page and the first page in English. E1_M contains pages such as the sitemap, contact and mailing address. E2_SC contains the webpages relating to the software and courses. E3_DD contains the audio, video, free trial and download pages. E4_OS contains the pages of the online store. E5_SN contains the pages related to social networks. Olang contains all the webpages of the other languages. The website is available in four languages (Spanish, Dutch, French and English).

Table 5 Confusion table

The percentage of clicks on the pages of the English language is 94.79%, with just 5.20% on all the other language pages. From these statistics, one thing is very clear: people using this website prefer the English-language webpages for learning about or buying a guitar or its accessories.

As mentioned earlier, the visits are divided into 5 classes corresponding to continents. V_Asia contains all the visits or sessions from Asian users. Of its 6991 clicks, only 92 (1.31%) are on the other-language webpages, whereas 98.68% of its clicks are on the English webpages.

Regarding the statistics about the visitors, 16.87% of the visits are from Australia, South America and Africa, and 83.97% of the visits are from Europe, Asia and North America. From the analysis results, it has also been found that the maximum number of visits is from Europe (30.83%), followed by Asia (29.83%) and North America (22.70%), as shown in Fig. 3. Considering the statistics country-wise, the maximum number of visits is from France. On further analysis, it is found that the website has been developed and maintained by an eminent French guitarist, so it can be safely concluded that this could be the strong reason for the highest number of clicks from France. The software and the courses available on this website are written by Amar Guerfi, who has been a famous guitar player since the late 1970s. Also, the maximum number of files has been downloaded by the European visitors. These visitors have mostly visited the home page, with very few clicks on the other pages; Asian visitors are in second place in terms of visits. The least number of visits is from Australia. The maximum number of clicks on the software and courses pages is from Asia (47.48%). The clustering analysis of the EM algorithm between the visits and the visited pages is also shown in Fig. 4.

Fig. 3 Clustering analysis of page and visit classes

Fig. 4 Clustering analysis of EM algorithm between visits and visited pages
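The clustering itself was run with Weka's EM implementation. As a hedged illustration of the same idea, not the authors' exact procedure, a session-by-page-class count matrix could be clustered with scikit-learn's Gaussian-mixture EM and cross-tabulated against the dominant page class, analogously to the confusion table (Table 5); the data below is a random placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical session feature matrix: one row per session,
# one column per page class (E_HE, E1_M, E2_SC, E3_DD, E4_OS, E5_SN, Olang).
rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(12760, 7)).astype(float)  # placeholder data

# Five visit classes, mirroring the five continent-based visit groups.
em = GaussianMixture(n_components=5, covariance_type="diag", random_state=0)
visit_class = em.fit_predict(X)

# Cross-tabulate visit classes against the dominant page class of each session,
# which gives a table analogous to the confusion table (Table 5).
dominant_page_class = X.argmax(axis=1)
confusion = np.zeros((5, 7), dtype=int)
for v, p in zip(visit_class, dominant_page_class):
    confusion[v, p] += 1
print(confusion)
```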

Thus, it can be concluded that this guitar website receives most of its visitors on the English webpages, and of these, the maximum number is on the home page of the website. Also, the guitar log file contains visitors from 177 different countries, of which the maximum number of visitors is from France, the USA and India.

6.2 N-gram method: on File 2

In this section, the various attributes of the query log file are discussed, along with the results obtained from the analysis, which are presented as a statistical description of the queries and user search sessions. Text analysis has also been performed on the queries submitted by the users, to identify the most commonly used queries and words specific to the domain, and the possible relations between these words and terms.

The cleaned 'guitar' log dataset consists of 23,626 transactions, as shown in Table 6, of which 786 transactions are search queries; of these 786, 222 are empty queries, leaving only 563 queries for further analysis. Of the 563 transactions of search queries, there are 417 unique users and 423 unique sessions. All the queries have been broadly classified into three groups: unique, repeated and blank queries. Queries whose string of terms does not match the words of any other query, or which are modifications of previous queries, have been grouped under unique queries. Queries may also be repeated when users subsequently visit or view the result pages. The blank or empty queries are those without any terms. During this analysis, it has been found that 222 queries do not contain any keyword; these queries have been considered empty queries. Table 7 shows that out of the 786 queries, 145 (18.45%) are unique queries, 419 (53.30%) are repeated queries and 222 (28.24%) are empty queries. The mean of the queries in the log is 1.66 (with a median of 1).

Table 6 Statistics on queries
Table 7 Statistics on query terms
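As an illustration, a simplified sketch for labelling the extracted queries as empty, unique or repeated is shown below; it treats every non-repeated, non-empty query string as unique, whereas the paper also counts modified versions of earlier queries as unique.

```python
import pandas as pd

def classify_queries(queries: pd.Series) -> pd.Series:
    """Label each query string as 'empty', 'unique' or 'repeated'."""
    normalized = queries.fillna("").str.strip().str.lower()
    counts = normalized.value_counts()
    labels = pd.Series("unique", index=queries.index)
    labels[normalized == ""] = "empty"
    labels[(normalized != "") & normalized.map(counts).gt(1)] = "repeated"
    return labels

# Example:
# q = pd.Series(["guitar online", "guitar online", "", "learn electric guitar"])
# classify_queries(q).tolist() -> ['repeated', 'repeated', 'empty', 'unique']
```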

Further, the distribution of the number of queries per session has been analyzed in detail. As shown in Fig. 5, 74.46% (i.e. 315) of the sessions contain only one search query, 20.56% (i.e. 87) contain two, 3.07% (i.e. 13) contain three, 1.18% (i.e. 5) contain four and 0.71% (i.e. 3) contain five search queries. From the results, it can be safely concluded that, in terms of the number of queries submitted, the distribution is skewed towards the lower end. As 95.03% of the users submitted only one or two queries, it is possible that the users were very clear about what they were looking for and hence were able to formulate their queries well and obtained the relevant result from their first query, so that they did not need to resubmit it.

Fig. 5 Query distribution among sessions

An in-depth analysis of the terms used in the search queries can reveal very useful information, such as how users formulate their queries. The statistics are discussed below and shown in Table 7: there are 1341 query terms, of which 162 are unique and 1179 are repeated. The longest query contains 10 terms, with a frequency of 1. The median number of terms per query is 3 and the mean is 3.16. The distribution of the number of terms per query is shown in Fig. 6. When the queries are classified according to the number of terms used, the results are as follows: 6.73% of the queries contain a single term, 20.39% contain two terms and 46.45% contain three terms, and in 93.43% of cases the queries have five or fewer terms. From these results it can be concluded that users generally prefer short queries.

Fig. 6 Distribution of terms per query

To study the terms and their relationships with each other, the n-gram algorithm has been applied to the query log. The researcher has created 5 files, namely 1-gram, 2-gram, 3-gram, 4-gram and 5-gram. From this analysis, suggestions to update the existing string instrument ontology have been obtained, which have helped to create new classes and relationships or update existing ones.
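A minimal sketch of how the five n-gram frequency files can be generated from the query strings is:

```python
from collections import Counter

def ngram_counts(queries, n):
    """Count n-grams of consecutive terms over all queries."""
    counts = Counter()
    for query in queries:
        terms = query.lower().split()
        for i in range(len(terms) - n + 1):
            counts[" ".join(terms[i:i + n])] += 1
    return counts

# Build the five frequency files (1-gram ... 5-gram) used in the analysis.
queries = ["multi instrument tuner", "learn electric guitar",
           "online metronome for guitar free"]          # toy examples only
for n in range(1, 6):
    top = ngram_counts(queries, n).most_common(10)
    print(f"{n}-gram:", top)
```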

This analysis will help us to update the following:

  • To add new leaf concepts in a hierarchy.

  • To add a sub-tree of concepts in the hierarchy.

  • To add a new relationship between existing or new concepts.

These files show the single terms, term pairs, three-word terms, four-word terms and five-word terms with their frequencies. The sample files in this paper show the 10 most frequently used single terms, term pairs, three-word terms, four-word terms and five-word terms with their frequencies.

In the end, these individual terms have been compared with the data set of the guitar ontology. If there is a term used by many of the users which is not present in the ontology, then the researcher finds its frequency in the 2-gram, 3-gram, 4-gram and 5-gram files, checks how and in which context it has been used, and tries to find the correlation of that term with other terms.

A sample of the 1-gram file is shown in Table 8a. The word "guitar" has been used 370 times (27.48%) in the queries; "online" has been used 247 times (18.35%), and so on. The most frequently used 2-gram terms are shown in Table 8b.

Table 8 Top most frequently used 1 and 2 gram query terms with frequency

The results are broadly similar to the 1-gram results. The top two 1-gram terms have been used together most frequently in the 2-gram file (i.e. 120 times in queries). Similarly, the words "multi" and "instrument" from the 1-gram file have been used together in the bi-gram file (66 times). The tri-gram and tetra-gram tables of the 10 most frequently used terms are shown in Table 9. The phrase "multi instrument tuner" has the maximum frequency (66 times), and "learn electric guitar" has the second-highest frequency, i.e. 42 times (see Table 9a). In the tetra-gram table, "online metronome for guitar" has a frequency of 15 and "metronome for guitar free" has a frequency of 13, as shown in Table 9b. The 5-gram terms are shown in Table 10, where "online metronome for guitar free" is used 13 times in queries.

Table 9 Top 10 most frequently used 3 and 4 gram query terms with frequency
Table 10 Top 10 most frequently used 5-gram of query terms with frequency

The terms multi and instrument have been used 66 times (100%) individually and together in the 1-gram and 2-gram files, respectively. That means that whenever these words have been used, they have been used together. The term tuner has been used more than 70% of the time with multi and instrument in the tri-gram file; in all likelihood, tuner will be used with multi instrument. Therefore new classes 'multi instrument tuner' and 'tuner' need to be added.

The word metronome has been used 48 times in the 1-gram file, and in the 2-gram, 3-gram, 4-gram and 5-gram files it has been used most often with either guitar or online, or with both (see Table 11). Table 12 below shows the frequency with which the word "guitar" has been used with other words in the search queries.

Table 11 Analysis of metronome concept
Table 12 Analysis of guitar concept

Complete queries have also been analyzed and their frequencies have been calculated. The ten queries used most frequently by the users are shown in Table 13.

Table 13 10 most frequently used search queries

From Table 13 it is evident that the queries "guitar online" and "multi instrument tuner" have been requested the greatest number of times. Most people are more interested in searching for the electric guitar than any other type of guitar. People are also interested in the metronome instrument used for the guitar. As a result, the researcher has retrieved the following suggestions. The terms metronome, multi, instrument and tuner do not exist in our ontology. After studying the suggestions, the researcher learned that a metronome is a practice tool that produces a steady pulse (or beat) to help musicians play rhythms accurately. The metronome can be used for the piano, drums, etc. The terms multi and instrument have been used together. The term multi instrument has also been used for guitar accessories such as multi-instrument gig-bags, multi-instrument cases and multi-instrument tuners.

The terms metronome and multi instrument do not exist in the present ontology, as shown in Fig. 2. Therefore these terms can be added to the ontology by creating new concepts. As per the suggestions, the researcher has added 'metronome', 'multi-instrument-tuner' and 'tuner' to the ontology. The updated ontology is shown in Fig. 8.

6.3 Apriori algorithm: on File 2

The association rules generated by the Apriori algorithm have been analyzed to find the interests of the visitors. Therefore, after the n-gram method, the researcher has applied the Apriori algorithm to the query log in File 2, which contains all the queries made by users in each of the sessions. The rules have been obtained on the complete query log with minimum support thresholds.

The results of the Weka Apriori program [W2] using the query log are very similar to those obtained with the n-gram method, as shown in Fig. 7. It also shows similar types of rules, which can be interpreted as follows. The first rule is: if the term 'instrument' (66 times) has been used, then 'multi' (66 times) has also been used with it; the confidence of this rule is 100%. Rule 9 says that if the terms 'multi' and 'instrument' have been used together, then the confidence of using the third term 'tuner' is 100%. Rule 14 says that the confidence of using 'learn', 'electric' and 'guitar' together is 100%. Similarly, the last rule is that the term 'online' is used with the term 'guitar' with a confidence level of 94%.

Fig. 7 Output from the WEKA Apriori program using the query log
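The association rules were mined with Weka's Apriori implementation. Purely as an alternative sketch, the same kind of term-association rules could be mined in Python with the mlxtend library, treating the terms of each session's queries as one transaction; the toy transactions below are illustrative only.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction is the set of terms searched within one session (toy data).
transactions = [
    ["multi", "instrument", "tuner"],
    ["learn", "electric", "guitar"],
    ["guitar", "online"],
    ["online", "metronome", "for", "guitar", "free"],
]

# One-hot encode the transactions, then mine frequent itemsets and rules.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```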

7 Ontology improvement

This work has suggested three new classes, 'metronome', 'multi-instrument-tuner' and 'tuner', to be incorporated into the ontology. The OWLviz visualization of the updated ontology is shown in Fig. 8. After studying these suggested concepts, the researcher added a new class string_instrument_tuner to the ontology; therefore multi_instrument_tuner has become a subclass of this class. The OWLviz visualization of the new class string_instrument_tuner is shown in Fig. 9. The object properties for this class are is_used_to_tune and tuned_by.

Fig. 8 Updated ontology

Fig. 9 OWLviz visualization of string_instrument_tuner class

The usage of these new object properties is shown in Fig. 10. The new term 'tuner' has also been added.

Fig. 10 Usage of is_used_to_tune object property
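The ontology update itself was carried out in Protégé 5.0. As an illustrative sketch only, the same additions could be made programmatically with owlready2; the parent classes, property ranges and file paths below are assumptions rather than the structure of the actual string_instrument ontology.

```python
from owlready2 import get_ontology, Thing, ObjectProperty

onto = get_ontology("file://string_instrument.owl").load()  # hypothetical path

with onto:
    # New concepts suggested by the n-gram and Apriori analysis of the query log.
    class Metronome(Thing): pass
    class StringInstrumentTuner(Thing): pass
    class MultiInstrumentTuner(StringInstrumentTuner): pass
    class Tuner(Thing): pass

    # Object properties relating tuners to the instruments they tune.
    class is_used_to_tune(ObjectProperty):
        domain = [StringInstrumentTuner]
        range = [Thing]               # assumed range; the real ontology is more specific

    class tuned_by(ObjectProperty):
        inverse_property = is_used_to_tune

onto.save(file="string_instrument_updated.owl", format="rdfxml")
```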

8 Contributions

  • A novel method to update the general ontology from log data.

  • New feature selection for log analysis.

  • Shows the relationship between the Semantic Web and web usage mining.

Hence the proposed methodology is computationally simple and easy to deploy.

9 Conclusion and future work

Constructing an ontology and continuously improving it requires integrating knowledge and updating it from varied sources and, in the case of the Semantic Web, specifically from web content belonging to a particular domain. In this study, the researcher has attempted to show the potential impact and use of web usage mining for updating an ontology. The researcher illustrated such an impact on the string instrument ontology in the musical domain by considering the log of an online guitar-selling website maintained by Amar Guerfi from France. The researcher first constructed a new string instrument ontology from scratch using the Protégé 5.0 ontology editor and then showed how the knowledge discovered from the analysis of a specific type of log file data (the referrer field) in this domain can be immensely useful for updating this ontology from time to time. To demonstrate this, clustering (EM), association rule (Apriori) and sequential pattern (n-gram) mining algorithms in particular have been applied to the 'guitar' log file of the online guitar-selling website. The original 'guitar' log file contains 24,945 transactions; after cleaning, 23,626 transactions remain, with 12,334 unique users and 12,760 sessions, respectively. On this cleaned log file the researcher has applied clustering by grouping pages and visits into 7 and 5 classes, respectively, and obtained some golden nuggets: (1) the percentage of clicks on the pages of the English language is 94.79%; (2) the maximum number of visits is from Europe (30.83%, mostly from France); (3) the maximum number of downloads is by European visitors (35.54%); (4) the maximum number of clicks on the software and courses pages is from Asia (47.48%, with the maximum from India). The reasons for these results were discussed earlier, in the clustering analysis phase.

In the second experiment, the researcher extracted only those sessions from the cleaned log file which contain a query from a search engine in their referrer field and arrived at 786 transactions, out of which 222 were empty queries. After removing those transactions, the researcher is left with 563 queries, 417 unique users and 423 unique sessions. The n-gram and Apriori algorithms have been applied to this data set to obtain suggestions for ontology improvement. As a result, the researcher has concluded that the terms metronome, multi instrument and tuner do not exist in the string_instrument ontology shown in Fig. 2. After adding these concepts, the updated ontology is shown in Fig. 8.

The objective of the research has been to accelerate and improve the ontology development process by semi-automatically generating a hierarchical ontology. This work can be further extended to build the Semantic Web from the generated string ontology, which would help in refining web searches on generic search engines in the music domain.