1 Introduction

Since the inception of web technology, there has been a need to understand usage patterns, and this need has been addressed by several studies. Over time, changes in technology have rendered many of the earlier published methods obsolete. The essence of new research is therefore to make the developed tool meaningful and to base the mining on logged data rather than on user-submitted credentials. When a user connects to a server, the request supplies the IP address, user ID, timestamp, request method, status code and several other fields. An access protocol such as the Hypertext Transfer Protocol (HTTP) is responsible for content access. A visitor may reach a site over HTTP from any of several devices, either by typing a Uniform Resource Locator (URL) or by following a page link. Several off-the-shelf software packages can provide web browsing statistics, and such tools are commonly used to gain an overview of server traffic. As the volume of traffic increases, so does the complexity of data mining. Increased traffic also creates opportunities for malicious attacks and therefore calls for more advanced web log mining techniques [1].
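As an illustration of the fields mentioned above, the following minimal sketch parses a single access-log entry in the Common Log Format into a record suitable for mining; the regular expression, field names and sample line are assumptions for illustration and are not part of the framework described in this paper.

```python
import re
from datetime import datetime

# Assumed Common Log Format: host ident user [timestamp] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line: str) -> dict:
    """Split one access-log entry into the fields used for usage mining."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Unrecognised log line: {line!r}")
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

# Example usage with a synthetic entry
sample = '10.0.0.1 - alice [10/Jun/2022:14:32:01 +0000] "GET /index.html HTTP/1.1" 200 5120'
print(parse_log_line(sample)["ip"])   # -> 10.0.0.1
```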

During data mining, weblog data are collected from the server, the client, or other sources, and the collected data usually contain heterogeneous content. Visit histories gathered from several sources describe users' browsing patterns over the period of use, and may cover several types of users and their corresponding visit patterns [2]. A server weblog alone does not provide sufficient insight into client-side visit behaviour, since it only reflects the pages that were accessed. Weblog data therefore need pre-processing and cleaning before further analysis. The essential component of pattern discovery and pattern analysis is the identification of association rules; many research works address this topic with improved techniques for data cleaning.

Web-based applications are growing at a remarkable pace as data exchange hubs and as a means of meeting day-to-day needs in all spheres of life. Web mining is the extraction of meaningful information from the logs generated during web surfing [3]. Weblog mining is generally divided into content-based, usage-based and structure-based mining. The literature survey for the current work is limited to web mining based on page visits, dwelling time and unique IPs recorded in weblogs. When a user accesses a web server, a log of visited pages is generated. Analysis of these web logs reveals the user's access history and the content accessed. The information gathered from the visit history can be used to plan e-commerce, m-commerce and other e-services such as banking, property renting and bill payment [4]. Accurate usage data can attract new users, help retain current customers, suggest new services and assess the impact of promotional alerts. Marketing teams most often use this weblog mining information to create potential user profiles from navigation history, viewing time and page content [5].

The objective of the current work is to present a novel online dynamic session identification framework based on weblog traces. The selected approach performs session detection with a user-defined schema covering unique IPs, unique URLs, the number of sessions and the average session length. The overall intent is to develop a scalable user and session identification paradigm for weblog data, with the expected outcome resting on a scalable association rule mining technique. The mining objectives of the current work are as follows:

  • Distinguish one pattern from another.

  • Facilitate proper choice of the target segment of web content.

  • Facilitate effective exploitation of the identified segments.

  • Crystallize the visit pattern of the target visitors.

  • Make the mining effort more efficient than published work.

  • Spot rarely used segments and reduce their impact on the overall mining.

  • Bring benefits not only to usage pattern mining but also to infrastructure scaling decisions.

The remainder of the paper presents the literature survey, the proposed framework, implementation details, the results obtained and their analysis. A comparison of the current results with results published by other researchers is also included.

In the next section, a detailed literature survey of web usage mining is presented. The objective is to identify gaps in existing implementations that can be addressed through further investigation.

2 Literature survey

The use of data exploration methods over weblogs is one of the focus areas of research on information aggregation [6]. That work provided a detailed approach for discovering patterns in web log data; the authors used a transformation and interpretation technique to extract information from user activities. The published work on i-Miner offers an optimized solution using a fuzzy clustering system to capture web access details [7]. It modelled and represented fuzzy rules as chromosomal structures and used an efficient hierarchical distribution structure to obtain the best results with the lowest RMSE.

Another study organises web pages into a two-dimensional map using Kohonen's self-organizing map [8]. The organisation of the web pages depends only on the navigation behaviour of the users rather than on page content. The generated map serves as a visual analysis tool that helps webmasters better understand the characteristics and navigation habits of the users visiting their pages. A further study [9] looks at various data processing methods; the authors explored the limits of many web usage mining techniques applicable at various phases, and also studied and compared the Apriori, FP-growth and Single-scan algorithms.

Another approach focuses on the analysis of the real usage patterns of web site visitors. The objective is to extract data from the weblogs and then apply sophisticated algorithms for predictive modelling [10]. To extract relevant information from massive amounts of web traffic, the author adopted the soft computing paradigm and used the Fuzzy Clustering Method and Self-Organizing Map clustering approaches to improve the trend analysis. This produced the best results and a high correlation coefficient.

To assess the performance of clustering algorithms, the authors of [11] conducted a rigorous investigation of partitioning-based and hierarchical clustering algorithms. Extensive experiments were carried out to investigate these clustering strategies, and internal and external validity indices were used in detailed trials to evaluate the effectiveness of the two algorithms. Based on their research, they found that the K-means algorithm produces more promising results than the hierarchical approach. To detect navigation-related usability difficulties and improve usability, other authors developed a cost-effective ideal user interactive path (IUIP) model [12]. They also compared real usage patterns with the corresponding IUIP models, and assessed the applicability and effectiveness of the method.

The next study [13] added a time and referrer component to a session reconstruction algorithm in order to generate actual sessions and avoid excessively long sessions. The authors provided a framework for semantic-based session creation using a heuristic approach to achieve better results. A recent study presented a new graph-based approach that clusters data by constructing a graph from the data and applying Markov Stability [14]. This approach was evaluated and tested for robustness and performance against other graph-based approaches, and was also compared with other clustering algorithms by measuring partition quality with Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI).

Another method uses weblog analysis to generate trends and track user behavior [15]. Based on the web structure, each URL in the web log data is parsed into tokens, and sessions from various users are aggregated with the hierarchical agglomerative clustering technique to investigate navigation patterns. To extract and evaluate user behavior patterns, the authors of [16] proposed the weblog analysis using PySpark (WAP) algorithm; for efficient cluster processing, the large weblog data were partitioned and distributed over numerous parallel nodes. Another study explored and developed a utility for log files using Big Data platforms, where the gathered data were used to classify the websites visited [17]. That study constructed a web log analysis tool using the Hadoop approach and, in addition, developed an improved method for analyzing terms and phrases found in Google searches.

A scalable weblog mining method uses a tree-based clustering algorithm; that work intended to explore suitable elements from the knowledge base and predict users' browsing patterns [18]. An improved approach was implemented for creating web browsing logs, which permits validating and enhancing applications at low cost [19]. In another article, the authors segregated the web server log file, selected new data patterns, and analyzed the content with tree-based classification algorithms [20]. Web usage exploration can be divided into the following phases: (i) data discovery, (ii) usage pattern analysis, and (iii) pattern presentation and visualization [21].

Going over these publications, we could see that a high-performing framework for web mining is required. Different authors have taken different approaches, but the survey provided the necessary direction for our research. Based on it, we could understand the importance of user sessions and of capturing every detail. One gap that emerged is the dependency of existing frameworks on data size; the current work aims to bridge that gap. Hence one of the focus areas is to study how the mining outcome depends on the size of the weblog. Another area added to the investigation is the optimization of the weblogs to remove unwanted data. The literature survey also shows that different researchers have used different development environments; the current work is implemented on a standard, state-of-the-art development platform.

3 Proposed framework

Figure 1 shows the dynamic session identification framework proposed in the present work. The overall process of the framework implementation involves five stages.

  • Identify Data: Knowing and accessing the data is the critical step for high-quality web usage mining. Knowledge of the business goal provides insight into finding the data source. The data are generally weblogs with millions of entries, and it is essential that they are aligned with the goal of the investigation.

  • Prepare Data: This step involves removing unwanted data, making formatting consistent and removing redundant data. It is very important because the final analysis is done on the cleaned data; unwanted data or wrongly formatted data types may lead to wrong analysis.

  • Prepare Model: A model is evolved by analyzing the trends and patterns in the data. Its essential purpose is to fit new data and provide a prediction of future trends. Businesses can derive insight into related activities and the impact of changes by using the developed model. The models can be descriptive or predictive depending on the use, and can be further refined with new data trends.

  • Evaluate Performance: Performance evaluation can be done using several data analysis algorithms, such as Decision Tree, Random Forest, Naive Bayes, AdaBoost or a multilayer perceptron neural network. However, we used a simple statistical Sum of Squared Errors (SSE) clustering method.

  • Publish Result: Once the performance is evaluated, the result is communicated and compared with previously published results. The parameters considered for measurement are the Mean Absolute Percentage Error (MAPE) and the hourly Root Mean Square Error (RMSE); a sketch of how such metrics can be computed is given after this list.
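As a minimal sketch, the two evaluation metrics named above could be computed from predicted and observed hourly values as shown below; the function names, the use of NumPy and the synthetic numbers are illustrative assumptions, not the exact implementation used in this work.

```python
import numpy as np

def mape(actual, predicted) -> float:
    """Mean Absolute Percentage Error, in percent; assumes no zero actual values."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

def rmse(actual, predicted) -> float:
    """Root Mean Square Error over hourly values."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Example usage with synthetic hourly session counts
observed  = np.array([120, 95, 140, 180, 160])
estimated = np.array([110, 100, 150, 170, 165])
print(f"MAPE = {mape(observed, estimated):.1f}%, RMSE = {rmse(observed, estimated):.3f}")
```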

Fig. 1
figure 1

Dynamic session identification framework

4 Implementation

Table 1 shows the details of the weblogs used. The files Wblog1, Wblog3, Wblog4 and Wblog5 are used for training, and Wblog2 is used for performance measurement of the algorithm. The proposed model is tested on cleaned weblog data; cleaning is done using conventional data cleaning processes.
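As a hedged illustration of what such conventional cleaning may involve, the sketch below filters requests for static resources, failed responses and obvious robot traffic out of parsed log records (using the record fields assumed in the parsing sketch in Sect. 1); the specific filter rules are assumptions and not the exact rules applied in this work.

```python
STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico")
ROBOT_MARKERS = ("robots.txt",)

def clean_records(records):
    """Keep only successful page requests that are likely human page views."""
    cleaned = []
    for rec in records:
        url = rec["url"].lower()
        if url.endswith(STATIC_EXTENSIONS):                  # drop images, scripts, style sheets
            continue
        if any(marker in url for marker in ROBOT_MARKERS):   # drop crawler probes
            continue
        if not (200 <= rec["status"] < 400):                 # drop failed or error responses
            continue
        cleaned.append(rec)
    return cleaned
```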

Table 1 Data source and pre cleaning and post cleaning parameters

The data are sourced from the University of California, Irvine (UCI), USA [22] and from Kaggle.com [23], a platform that provides datasets for data science work.

Figure 2 shows the steps to run the developed algorithm. Once the data has been cleaned, the algorithm will run on the predefined parameters for analysis and segmentation reporting.

Fig. 2
figure 2

Algorithm running steps

In the present work, a Python-based application is implemented to build a knowledge base. The algorithm then uses the improved knowledge base to perform the analysis described in Algorithm 1.

figure a
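As an illustrative, non-authoritative companion to Algorithm 1, user sessions could be identified from cleaned records by grouping requests by IP address and splitting each user's requests into sessions whenever the gap between consecutive requests exceeds an inactivity timeout; the 30-minute threshold and the function below are assumptions, not the exact procedure of Algorithm 1.

```python
from collections import defaultdict
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # assumed inactivity threshold

def identify_sessions(records):
    """Group cleaned log records by IP and split them into timeout-bounded sessions."""
    by_ip = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["time"]):
        by_ip[rec["ip"]].append(rec)

    sessions = []
    for ip, visits in by_ip.items():
        current = [visits[0]]
        for prev, curr in zip(visits, visits[1:]):
            if curr["time"] - prev["time"] > SESSION_TIMEOUT:
                sessions.append({"ip": ip, "requests": current})
                current = []
            current.append(curr)
        sessions.append({"ip": ip, "requests": current})
    return sessions
```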

Figure 3 shows the progress bar of the developed Python application. In the current work, SSE is used as a metric that helps to choose the appropriate number of divisions for segmentation. Clustering is a mathematical approach for finding the "best fit" for each identified cluster (segment); it involves multiple iterations to bring the members of a segment into proximity. If the respondents matched the segment scores exactly, the SSE would be zero: no error, a perfect match.

Fig. 3
figure 3

Application interface showing data cleaning status

With real-world content, however, this is difficult to achieve. The investigation therefore continued with segmentation at a lower SSE, because a low SSE indicates a large number of similar users in the identified segment, whereas a higher SSE indicates that the users within the segment have sizeable differences in browsing pattern.
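As a minimal sketch, assuming the segmentation is done with a standard k-means-style clustering over per-user features (for example page count and average dwell time), the SSE for each candidate number of segments could be computed as below; scikit-learn, the feature choice and the synthetic data are illustrative assumptions rather than the exact tooling used here.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse_per_k(features: np.ndarray, max_k: int = 8) -> dict:
    """Return the within-cluster Sum of Squared Errors for each candidate segment count."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
        for k in range(2, max_k + 1)
    }

# Example usage with synthetic per-user features: [pages visited, average dwell time]
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[10, 30], scale=3, size=(100, 2))
cluster_b = rng.normal(loc=[40, 5], scale=3, size=(100, 2))
users = np.vstack([cluster_a, cluster_b])
print(sse_per_k(users, max_k=5))   # SSE drops as k grows; the "elbow" guides the choice of k
```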

5 Results

The following subsections detail the results obtained for data cleaning and for the segmentation performance of the developed framework, based on the predefined metrics.

5.1 Application cleaning performance

Table 2 shows the cleaning performance on raw weblog data. According to the data, the optimization ranges from a minimum of 4% to a maximum of 27%. Figure 4 shows the optimization achieved during cleaning. This optimization is essential because it reduces the volume of unwanted data and hence the processing time during segmentation.

Table 2 Data source and pre cleaning and post cleaning parameters
Fig. 4
figure 4

Data cleaning performance

5.2 Selection of data for segmentation

Figure 5 shows the regression analysis of the average session. It shows that Wblog2's average session is the best fit; hence Wblog2 was chosen for clustering. The analysis yields the average session prediction given by Eq. 1.

$$y = 14.472 + 11.538x$$
(1)
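A minimal sketch of how such a first-order fit could be reproduced with ordinary least squares is shown below; the data points are hypothetical placeholders, not the actual average-session values from Wblog2.

```python
import numpy as np

# Hypothetical (x, average-session) pairs standing in for the Wblog2 values
x = np.array([1, 2, 3, 4, 5], dtype=float)
avg_session = np.array([26.5, 37.2, 49.0, 60.8, 72.5])

slope, intercept = np.polyfit(x, avg_session, deg=1)   # first-order least-squares fit
print(f"y = {intercept:.3f} + {slope:.3f}x")           # the paper reports y = 14.472 + 11.538x

predicted = intercept + slope * 6                      # prediction for the next point
print(f"Predicted average session at x = 6: {predicted:.2f}")
```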
Fig. 5
figure 5

Selection of testing data from average session

5.3 Developed model performance

There is a clear possibility for the developed framework to adopt a web mining technology that helps solve critical data issues. A modelling framework using web mining is essential to move away from manual processes and gain more accurate insight. Kaggle Datasets hosts many different file types, including CSV files; the datasets available as CSV files can be listed by selecting [File types] > [CSV] from the drop-down menu near the top of the page. Similarly, the UC Irvine Machine Learning Repository provides datasets. Table 3 presents the performance indicators derived using data from UCI [22] and Kaggle [23].

Table 3 Key performance indicators

Weblogs of different sizes were chosen to analyze how the data exploration varies with the volume of the weblog.

Dashboards are available in a number of existing data analysis tools and can be used to address a variety of problems; such platforms provide a big-picture view of usage patterns and much other relevant information. During the current research work, however, we developed a Python tool for statistical analysis. Table 4 shows the key measures obtained using the developed tool; the focus is on measuring only the relevant information, and a sketch of how such measures could be derived is given below.
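As a hedged illustration, the key measures named in Sect. 1 (unique IPs, unique URLs, number of sessions and average session length) could be summarised from the sessions produced by the earlier session-identification sketch as follows; the dictionary keys and session structure are assumptions for illustration.

```python
def key_measures(sessions):
    """Summarise unique IPs, unique URLs, session count and average session length (seconds)."""
    unique_ips = {s["ip"] for s in sessions}
    unique_urls = {req["url"] for s in sessions for req in s["requests"]}
    durations = [
        (s["requests"][-1]["time"] - s["requests"][0]["time"]).total_seconds()
        for s in sessions
    ]
    return {
        "unique_ips": len(unique_ips),
        "unique_urls": len(unique_urls),
        "num_sessions": len(sessions),
        "avg_session_length": sum(durations) / len(durations) if durations else 0.0,
    }
```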

Table 4 Key measure

The results obtained using the above parameters are plotted in the following figures.

6 Analysis of result

Based on the available data, Eq. 1 gives an intercept of 14.472 and a coefficient of 11.538. This analysis provides a prediction of the average session time for a weblog from the same web server.

This analysis supports planning a server's scaling capacity: as the number of users increases, so does the requirement for computing resources. Scaling can be planned with any commercial software based on the findings of Eq. 1.

From Fig. 6, MAPE is initially inconsistent with the log's session patterns, with high fluctuations (28%) at low volume. We therefore conclude that more data points stabilize the framework. Further, since the MAPE is 11% for 1 million records and the RMSE of approximately 0.05 is low, we have a stable paradigm for accurate rule-based mining.

Fig. 6
figure 6

MAPE for file Weblogs

From Fig. 7, it is evident that the errors computed from the model and the deviation are initially not aligned, but by the end of the period they are nearly aligned. Therefore, we can follow the mean ± 1 standard deviation rule after carefully factoring in all the segments and the IP and URL trends.

Fig. 7
figure 7

Optimization vs. deviation: a absolute, b cumulative absolute

7 Comparison analysis

The performance on the identical data set from the 10th and 17th of June is now compared. Table 5 provides the details of the segments; each segment has 62 entries.

Table 5 Data set segmentation for 10th and 17th June

SgCL: allocated segment for cluster; #: number of entries. Clustering is done as per Table 5 and plotted in Fig. 8.

Fig. 8
figure 8

Segmentation for 10th and 17th June

Figure 9 shows the centroids of the clustering, and Fig. 10 shows the SSE by segment number. The obtained minimum MAPE is 7%. For a weblog with 1 million log entries, MAPE is 11%, which is significant. The framework achieved an hourly RMSE of 0.0500, which is better than the published 0.0639 [10].

Fig. 9
figure 9

Centroid analysis for 10th and 17th June segments

Fig. 10
figure 10

SSE per segment (Seg)

8 Conclusion

The authors presented a method for obtaining high-quality data throughout the preprocessing phase. The algorithm carried out a cleaning procedure, and the cleaned data are used to uniquely identify users, which aids in the discovery of user sessions. In terms of identification, the obtained minimum MAPE is 7%; for a weblog with 1 million log entries, MAPE is 11%, which is significant. The framework achieved an hourly RMSE of 0.0500, which is better than the published ANN paradigm (0.0639) [10].

9 Future scope

The proposed algorithm is efficient and can be developed into a tool for applying exhaustive rules during the preprocessing step. This would aid in the automation of visitor behaviour analysis based on the number of visitors, bandwidth consumed and their interests, in order to forecast future e-service patterns [24].