1 Introduction

The Linked Open Data (LOD) cloud provides a global information space with a wealth of structured facts, offering, for example, geo-location facts and cross-domain information (e.g., DBpedia, YAGO [32] and Wikidata). Currently, it is estimated that more than thirty billion facts have been published on the LOD cloud [33]. The data in the LOD cloud are represented using the Resource Description Framework (RDF), and the SPARQL query language is used to query them. The rapid expansion of LOD use in academia and industry has made efficient data retrieval one of its major challenges. Although LOD datasets support SPARQL queries through publicly available interfaces, a central problem is the lack of reliability of these endpoints due to network instability and latency. The typical solution is therefore to dump the data locally and maintain endpoints to process them. Recent investigations [3, 5, 10, 13, 14, 17] have shown that the content of LOD is dynamic and continuously evolving over time. The data stored at local endpoints are therefore not up-to-date and require constant updates, so hosting accurate endpoints requires expensive infrastructure support.

In recent years, many efforts have been made to circumvent the problem of effectively querying LOD [24, 31, 35]. Among these, caching [22, 23] is the most popular technique to reduce query time by serving requests from a cache (so-called cache hits) [28]. In the literature, two types of caching have been proposed: client-side and server-side caching. In client-side caching, requests are served immediately from the nearest cache to reduce latency and network traffic; however, client-side caching is not fully explored [24] and is still in an early stage of research. Server-side caching is not flexible enough to support different querying patterns, and its design usually depends on the underlying database. As a cache has limited space, it is important to fill it with valuable content by replacing unnecessary content. Many cache replacement techniques have been developed for relational databases, such as LRU [8] and LFU [18]. However, the underlying structure of LOD differs from relational databases, and caching algorithms designed for relational databases are not fully applicable in LOD scenarios [21]. To the best of our knowledge, there is very limited work addressing the problem of efficiently querying and retrieving data from the LOD cloud [22, 29, 36,37,38].

In this paper, we propose an Adaptive Cache Replacement strategy (ACR) to accelerate overall query processing. ACR works as a proxy between the querying agents and the SPARQL endpoint. We adopt client-side caching because it is a domain-independent approach that does not require knowledge of the underlying LOD. Typically, the queries issued by end users are repetitive and follow similar patterns that differ only in specific elements. The major challenge of this task is to find similar queries, as two queries may be structurally similar yet differ in content. To tackle this problem, we propose a bottom-up matching approach: for structural similarity, we first compute the distance between the triple patterns and prefetch the results of similar queries to place in the cache. Because a cache has limited space, it is advantageous to fill it with frequently accessed data. In this work, our cache replacement utilizes exponential smoothing forecasting to estimate the access frequency of the cached data and replaces content based on these frequencies. More specifically, we propose a full-record replacement strategy in which, at every new query, the hit frequency of the accessed triples is calculated using exponential smoothing and the cache is updated to retain the most frequently accessed queries. The motivation behind adaptive cache replacement is to improve querying efficiency and reduce the burden on the SPARQL endpoint: repeated queries are cached locally and their results are returned to the user immediately, while the results of predicted potential queries are prefetched and less valuable queries are evicted from the cache.

We now summarize the key contributions of this paper.

  • We propose a client-side caching strategy that works as a proxy between the querying agent and the SPARQL endpoint. To accelerate query answering, our approach can be deployed either within the SPARQL endpoint or within the querying agent to reduce the burden on the endpoint.

  • Our approach introduces a distance-based query similarity metric that considers both content-wise and structural similarities for a more accurate comparison between queries.

  • We propose an exploratory prefetching strategy that retrieves content likely to be requested by future queries, by identifying the concepts of previously issued queries and issuing a single query for all required content. This reduces transmission overhead and improves the hit rate and query time by retrieving content for future queries at once.

  • We propose a frequency-based cache replacement method that ranks each query according to its estimated access frequency; the most frequently accessed queries are kept in the cache. Our work thus benefits triple stores in replacing cache content.

  • A comprehensive evaluation on real-world LOD datasets showcases the effectiveness of our approach, which outperforms state-of-the-art approaches such as Least Recently Used (LRU), Least Frequently Used (LFU), and SPARQL Query Caching (SQC) [24] in terms of higher hit rate, shorter query time, and lower overhead.

The remainder of this paper is organized as follows. Section 2 reviews the background of representing and querying LOD and discusses related work in comparison to the proposed approach. Section 3 describes the main phases of the approach for finding similar-structured queries and performing cache replacement. Section 4 evaluates the proposed approach, and Sect. 5 concludes this article.

2 Background

In this section, we briefly discuss the background needed to understand LOD, including how LOD is represented and queried. We also give an overview of related work in the areas of semantic caching and query suggestion and highlight the differences between the proposed approach and the state of the art.

2.1 Data representation

The Semantic Web [2] extends the current Web and underpins the Web of Data [1]. As described by Tim Berners-Lee [2], the Semantic Web enables machines to search, interpret and reuse data. LOD is another important concept in the Semantic Web, enabling machines to browse linked datasets such as DBpedia. LOD is collaboratively built from web corpora and represents knowledge in structured form. Facts are stored inside a Knowledge Base (KB), and an inference engine is used to assert conditions based on these facts. There are a number of publicly available KBs, such as Freebase [4], DBpedia [20] and YAGO [32]. KBs are further categorized into curated and open KBs. In a curated KB, factual information is represented in the form of entities and relationships (e.g., https://en.wikipedia.org). In an open KB, the facts are automatically collected from web pages and are often linked together, e.g., by using LOD. With the current evolution toward Web 3.0, LOD enables data to be linked between sources, as shown in Fig. 1. The LOD cloud contains almost 570 datasets from different domains that are interlinked with each other.

Fig. 1 The LOD diagram containing interlinked data from multiple domains

Data representation in curated KBs follows a predefined schema; entities and relationships are modeled using the Resource Description Framework (RDF). RDF is considered the standard representation for a curated KB, where relationships are represented as sets of triples, i.e., (subject, predicate, object). These sets of triples are a fundamental part of a KB and are also known as a knowledge graph. The knowledge graph allows data to be shared and further linked to other datasets in the LOD cloud. In an open KB, facts are also represented as triples but do not follow a strict schema, and the data are available in various formats, such as N-Triples and the Turtle format.

Fig. 2 An example SPARQL query

2.2 SPARQL

SPARQL is a widely used graph-based query language, standardized by the W3C, for retrieving and manipulating data stored in RDF triple stores. SPARQL offers four disjoint query forms: SELECT, CONSTRUCT, ASK and DESCRIBE; SELECT is the most widely used form for extracting values from endpoints. In the example SPARQL query illustrated in Fig. 2, the query pattern contains a SELECT statement that limits the projection to certain variables used in the query, such as ?philosopher1 and ?philosopher2; the query uses UNION and OPTIONAL as basic operations for modifying the content of the SPARQL query. The LOD cloud also provides SPARQL endpoints for its datasets. However, querying SPARQL endpoints is troublesome due to network instability: the connection to these endpoints can be temporarily lost, which affects query efficiency. Moreover, the endpoints do not provide any information about dataset modifications, so long-running applications that use a cache must resubmit their queries to keep the local cache up-to-date.
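
As a concrete illustration (not taken from the paper), the following Python sketch issues a SELECT query with UNION and OPTIONAL, mirroring the structure of Fig. 2, against the public DBpedia endpoint via the SPARQLWrapper library; the literals and variable names are illustrative and endpoint availability is assumed.

```python
# Hypothetical example: a Fig. 2-style SELECT query with UNION and
# OPTIONAL, sent to the public DBpedia endpoint with SPARQLWrapper.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?philosopher1 ?philosopher2 WHERE {
      { ?philosopher1 foaf:name "Auguste Comte"@en . }
      UNION
      { ?philosopher2 foaf:givenName "Jean-Baptiste"@en . }
      OPTIONAL { ?philosopher1 ?relationshipWith ?philosopher2 . }
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding)
```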

2.3 Related work

A number of works have dealt with issues related to querying LOD. This section presents related work organized under two topics: (1) query suggestion, i.e., techniques to find similar-structured queries, and (2) semantic caching, i.e., maintaining previously accessed data in a cache. We summarize the state-of-the-art approaches and briefly discuss their advantages and disadvantages in Table 1.

Table 1 A comparison of related work

2.3.1 Query suggestion

Recently, query suggestion has been introduced into SPARQL processing, where it plays a vital role in improving overall query processing. Suggestions are made by mining similar queries from logs. Graph Edit Distance (GED) [30] is normally applied to measure the structural similarity between SPARQL queries; however, GED is computationally expensive, and structural similarity alone is insufficient, as two SPARQL queries can be structurally the same yet differ in their results. To overcome this drawback, Shu et al. [31] proposed a content-aware approach that utilizes query containment to estimate whether queries can be answered from the caches. However, this approach is not widely adopted by the Semantic Web community, since containment checking produces very significant overhead. Lorey et al. [23] proposed a query augmentation approach that alters SPARQL queries to detect frequently recurring patterns. The benefit of their approach is that queries can be answered from the cache without accessing the LOD; the major limitation is that it considers only queries requested by the same agent, and the hit rate of the template-based approach is only 39% [15]. In contrast to the aforementioned query suggestion methods, our approach considers both content-wise and structural similarities based on a simple distance score, which results in a higher hit rate, shorter query time, and less spatial overhead.

2.3.2 Semantic caching

Semantic caching was originally proposed for Database Management Systems (DBMS) [7, 24] to reduce the overhead of retrieving data remotely. Godfrey et al. [12] proposed the notion of semantic overlaps and introduced a caching approach for client-server systems. Extending this idea, Dar et al. [7] proposed a semantic region-based caching technique and introduced a distance metric to update the cache such that cold (i.e., less frequently accessed) regions are removed from the cache. Martin et al. [24] propose to selectively invalidate cache objects on updates of the knowledge base by identifying the affected query results; however, their work does not consider query similarity for cache replacement. Yang et al. [35] proposed server-side caching that decomposes a query into its basic graph patterns and caches their intermediate results. To prefetch similar-structured queries, Lehman et al. [19] proposed a supervised machine learning approach that analyzes the user's previously issued queries; their approach filters the range of possible answers and uses a learning technique that requires no prior knowledge of the underlying LOD schema. Nishioka et al. [25] proposed a periodic crawling strategy that predicts whether changes occur in RDF triples. However, neither Lehman et al. [19] nor Nishioka et al. [25] considered system overhead as a performance measure. Recently, Chun et al. [6] proposed a proactive policy for maintaining a local cache that alleviates the expensive job of copying the LOD at idle time. In summary, only a few works deal with the problems related to semantic caching for SPARQL queries. We propose a client-side adaptive strategy that utilizes caching for SPARQL query processing; the goal of our research is to keep track of accessed queries and evict less valuable content from the cache in an overhead-efficient way, regardless of system idle time.

3 Proposed methodology

Web users increase the burden on SPARQL endpoints by repeatedly issuing similar queries, and they suffer high search latency due to the inherently distributed nature of LOD. To alleviate this burden and latency, we propose a query-similarity-based caching method and an access-frequency-based cache replacement policy. In contrast to existing approaches [26, 31], our proposed method is computationally efficient and more versatile, since the similarity measure utilizes a simple distance score and considers content-wise as well as structural similarity. These advantages underpin our performance-effective and overhead-efficient prefetching policy with query templates and offline processing. More specifically, based on our similarity metric, we utilize exploratory prefetching to search and gather content likely to be requested by future queries by identifying the concepts of previously issued queries. Our cache replacement module is adaptive: when the number of queries reaches a predefined limit, the replacement process is triggered, evicting infrequently accessed data from the cache. Our replacement is effective compared to existing approaches [6, 24] in terms of lower overhead and higher hit rates. Figure 3 illustrates the architecture of the proposed client-side cache replacement approach. Our methods are discussed in more detail in the following sections.

Fig. 3 Block diagram of the proposed approach

3.1 SPARQL query similarity

The existing approach [24] relies on the structural similarity of queries to improve the performance of triple stores. We argue that structural similarity, which is based on the ordering of symbols, is not sufficient: two queries may share the same ordering structure yet differ in content.

To overcome this drawback of existing methods, we propose a query similarity metric that considers both content-wise and structural similarity. Consider the two queries illustrated in Fig. 4. Two queries \(Q_{1}\) and \(Q_{2}\) are similar-structured if the ordering of their symbols is the same. To determine the similarity between two query patterns, we first compute the Levenshtein distance between them; Lines 6, 7 and 8 show the triple patterns that exist in each query. The most similar triple patterns can be determined by computing the minimum of the distances \(\varDelta (s_{1}, s_{2})\), \(\varDelta (p_{1}, p_{2})\), and \(\varDelta (o_{1}, o_{2})\). A SPARQL query is composed of a number of different patterns; to find the similarity between queries, we first decompose them into subgraph patterns. The triple pattern distance is the minimum number of edit operations, i.e., insertions, deletions, and substitutions, required to transform one graph into another. We introduce three decomposition functions, AND, UNION and OPTIONAL, which take the input graph and decompose it into three sets of non-empty patterns. As an example, consider the SPARQL query in Fig. 2, which contains the following graph patterns, represented as \(Q_{AND}\), \(Q_{OPTIONAL}\), and \(Q_{UNION}\); if no such triple patterns exist, the result is \(\emptyset \).

Fig. 4 An example of two structurally similar SPARQL queries

$$\begin{aligned} Q_{AND} = \{\, (?philosopher1,\ foaf{:}name,\ ``Auguste\ Comte''),\ (?philosopher1,\ ?relationshipWith,\ {:}Paris) \,\} \end{aligned}$$
(1)
$$\begin{aligned} Q_{OPTIONAL} = \{\, (?philosopher2,\ foaf{:}givenName,\ ``Jean{-}Baptiste\ Say'') \,\} \end{aligned}$$
(2)
$$\begin{aligned} Q_{UNION} = \lbrace Q_{AND}, Q_{OPTIONAL} \rbrace \end{aligned}$$
(3)

More formally, we define QueryDecomposition to deduce which decomposition of the query patterns exists, as shown in Eq. (4).

$$\begin{aligned} QueryDecomposition(Q) :={\left\{ \begin{array}{ll}\varTheta _{UNION}\left( Q\right) , &{} \text {if } \varTheta _{UNION}\left( Q\right) \ne \emptyset \\ \varTheta _{OPTIONAL}\left( Q\right) , &{} \text {if } \varTheta _{OPTIONAL}\left( Q\right) \ne \emptyset \\ \varTheta _{AND}\left( Q\right) , &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(4)
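
The following Python sketch gives one simplified reading of Eq. (4); the group-based query model and the function names are our assumptions, not the authors' implementation.

```python
# A simplified model: a parsed query is a list of (operator, triples)
# groups, with operator in {"AND", "OPTIONAL", "UNION"}.
from typing import List, Tuple

Triple = Tuple[str, str, str]
Group = Tuple[str, List[Triple]]

def theta(query: List[Group], operator: str) -> List[Triple]:
    """Theta_op(Q): all triple patterns that occur under `operator`."""
    return [t for op, triples in query if op == operator for t in triples]

def query_decomposition(query: List[Group]) -> List[Triple]:
    """Eq. (4): prefer UNION patterns, then OPTIONAL, then AND."""
    for operator in ("UNION", "OPTIONAL", "AND"):
        patterns = theta(query, operator)
        if patterns:
            return patterns
    return []  # the empty set when no triple patterns exist
```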

To calculate the similarity between two query patterns, we use the Levenshtein distance, a string metric for assessing the difference between two sequences. Normalized by the maximum string length, the Levenshtein distance [9] of two similar-structured queries lies in the range [0, 1]: the score is 0 for identical strings and reaches 1 for completely different strings. The overall distance of a triple pattern is calculated by aggregating the individual scores of the subjects, predicates, and objects. The general formula for the distance score is defined as:

$$\begin{aligned} DistanceScore(Q_{1}, Q_{2}) :={\left\{ \begin{array}{ll} 0, &{} \text {if } Q_{1} = Q_{2} \\ \frac{Levenshtein \left( Q_{1}, Q_{2} \right) }{ \max (|Q_{1}|, |Q_{2}|)}, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(5)
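
A self-contained Python sketch of Eq. (5) follows; identical strings score 0 and completely different strings score 1, the two boundary cases of the equation.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distance_score(q1: str, q2: str) -> float:
    """Eq. (5): Levenshtein distance normalized to [0, 1]."""
    if q1 == q2:
        return 0.0
    return levenshtein(q1, q2) / max(len(q1), len(q2))

print(distance_score("?philosopher1", "?philosopher2"))  # 1/13 ~ 0.077
```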

By using the distance score in Eq. (5), we can determine the matching between the triples. Consider the triple matching between the two Basic Graph Patterns (BGPs) shown in Fig. 4, where \(L\) denotes the line numbers in Fig. 4a, b. The most similar patterns for \(L_{6}\), \(L_{7}\), \(L_{8}\) in \(BGP_{1}\) are \(L_{7}\), \(L_{8}\), \(L_{6}\) in \(BGP_{2}\), respectively. For example, the edit distance is calculated by aggregating the individual distance scores of the subject, predicate, and object: \(\varDelta (L6_{BGP1},L7_{BGP2})= (0+0+\frac{12}{16})=0.75\). A complete matching is only possible in the case of bipartite graphs where both sides contain the same number of triple patterns, and a maximum matching can be determined in polynomial time. Matching triples is computationally expensive, and existing approaches [31, 37] do not consider this cost while performing the matching between two queries. In contrast, we use a cost threshold to cut off overly expensive matchings and utilize the classical Hungarian Method [39] to solve the maximum matching of triples with minimum cost; this algorithm computes an optimal solution in polynomial time. More specifically, we consider only contents for which the minimum matching cost does not exceed one. Thus, the maximum matching of the triples \(\lbrace (L6_{BGP1},L7_{BGP2}), (L7_{BGP1},L8_{BGP2}) \rbrace \) has a cost of \(\frac{0.75+\infty }{2}\), which shows that \(BGP_{1}\) and \(BGP_{2}\) are unfit to match with each other.
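
The matching step can be sketched with SciPy's Hungarian-method solver; the cost matrix below is illustrative (not the values of Fig. 4), and a large constant stands in for the infinite cost of incompatible pairs so that the assignment always stays feasible.

```python
# Hungarian method (minimum-cost maximum matching) over the pairwise
# triple-pattern distance scores of two BGPs. Values are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

UNFIT = 1e6  # stand-in for an "infinite" cost of incompatible patterns

cost = np.array([           # rows: triples of BGP1; cols: triples of BGP2
    [0.90, 0.25, UNFIT],
    [UNFIT, 0.80, 0.30],
    [0.20, UNFIT, 0.95],
])

rows, cols = linear_sum_assignment(cost)      # optimal assignment
mean_cost = cost[rows, cols].mean()
print(list(zip(rows, cols)), mean_cost)       # [(0,1), (1,2), (2,0)] 0.25
# The BGPs are accepted as a match only if the mean matching cost does
# not exceed the threshold of one used in the text above.
```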

3.2 Query prefetching

Similar queries occur frequently in real-world SPARQL query logs, as reported previously [34]. Query prefetching alleviates the burden on SPARQL endpoints by retrieving the results of likely subsequent queries in advance. In common keyword-based search engines, the user is often not aware of the most suitable keyword to optimally extract information from the resource; over several iterations, the user reformulates the keywords to find the correct answer. Similarly, a LOD user might query for additional details based on the initial results, making incremental changes to the initial queries.

Definition 1

(Query cluster) To identify similar-structured queries, we propose query clusters. Let \(T_{Q} = \left\{ Q_{1}, Q_{2},\ldots ,Q_{n}\right\} \) be a set of SPARQL queries with corresponding query patterns \(\left\{ PQ_{1}, PQ_{2},\ldots ,PQ_{n}\right\} \). A query cluster is defined based on pairwise matching with three constraints between the triple patterns: \(\varDelta (s_{i},s_{j})\le 1\), \(\varDelta (p_{i},p_{j})\le 1\), and \(\varDelta (o_{i},o_{j})\le 1\).

Fig. 5 An example query cluster of similar-structured queries

Given this definition, a query cluster derived with the parameter \(\varDelta _{max}=0\) contains only identical queries. A query cluster therefore consists of those query patterns that are structurally the same, based on the corresponding mapping \(m \subseteq \varTheta \left( PQ_{1}\right) \times \varTheta \left( PQ_{2}\right) \). To represent query patterns as features, we first cluster queries based on their content-wise and structural similarities, as shown in Fig. 5; the distances between each pair of queries are computed and clustered with the k-medoids algorithm [27]. We use this algorithm to cluster the training queries because it operates directly on data points and allows an arbitrary distance function. To calculate the query distance, we utilize the distance score in Eq. (5) and define the similarity score of a query cluster as shown in Eq. (6).

$$\begin{aligned} Similarity Score (T_{Q}, Q_{c})= \frac{ 1 }{1+Distancescore(T_{Q},Q_{c})} \end{aligned}$$
(6)
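
A compact k-medoids sketch is given below; it is our own minimal implementation operating on a precomputed distance matrix built from Eq. (5), standing in for the library version of [27], together with the similarity score of Eq. (6).

```python
import numpy as np

def k_medoids(D: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """Cluster items given a precomputed pairwise distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    labels = np.argmin(D[:, medoids], axis=1)
    for _ in range(iters):
        new_medoids = medoids.copy()
        for c in range(k):                   # re-pick each cluster's medoid
            members = np.where(labels == c)[0]
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
        labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

def similarity_score(distance: float) -> float:
    """Eq. (6): SimilarityScore = 1 / (1 + DistanceScore)."""
    return 1.0 / (1.0 + distance)
```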
Fig. 6 An example of SPARQL query prefetching

Algorithm 1 Central Concept Fetching (CCF)

Furthermore, we introduce exploratory prefetching. More specifically, we identify the query cluster that previously issued queries belong to and construct a query template using all queries in the cluster. We then prefetch all contents that were used to answer the queries in the template, since those queries are more likely to be requested by the same user in the future. The prefetching is completed by issuing one single query that covers all queries in the template. For example, we modify the content of the queries previously issued in Fig. 4 and retrieve all relevant contents that are useful for future queries, as shown in Fig. 6. Instead of issuing many similar-structured queries, prefetching retrieves all the relevant information around the central concept with a single query.
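
As a sketch of the template construction (the exact template format in Fig. 6 is an assumption on our part), the group patterns of a cluster can be merged into one UNION query so that all content the cluster touches is fetched in a single round trip:

```python
def build_prefetch_query(cluster_patterns):
    """Merge the group patterns of a query cluster into one UNION query.

    cluster_patterns: list of SPARQL group-pattern strings.
    """
    union_body = "\n  UNION\n".join(
        "  { " + pattern + " }" for pattern in cluster_patterns
    )
    return "SELECT * WHERE {\n" + union_body + "\n}"

patterns = [
    '?p foaf:name "Auguste Comte" . ?p ?rel :Paris .',
    '?p foaf:name "Jean-Baptiste Say" . ?p ?rel :Paris .',
]
print(build_prefetch_query(patterns))
```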

To extract additional information for a specific resource, we propose Algorithm 1, called Central Concept Fetching (CCF), to generate the central concept from a query, as shown in Fig. 6a. The CCF algorithm first determines the frequency of the subjects in all query patterns (Line 7) and checks whether the subject is already included in S.Count; it then increments the count of the subject and adds it to S.Count (Line 11). After analyzing all triple patterns, the dominant subject gives a good indication of the common theme of the SPARQL queries.
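
Since Algorithm 1 itself is not reproduced here, the following Python sketch reconstructs CCF from its prose description; the S.Count structure becomes a Counter, and the line references are only analogies to the algorithm's Lines 7 and 11.

```python
from collections import Counter

def central_concept(queries):
    """queries: iterable of triple-pattern lists [(s, p, o), ...]."""
    s_count = Counter()                 # plays the role of S.Count
    for triples in queries:
        for s, _p, _o in triples:       # cf. Line 7: scan every pattern
            s_count[s] += 1             # cf. Line 11: increment the count
    # The dominant subject indicates the common theme of the queries.
    return s_count.most_common(1)[0][0] if s_count else None
```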

3.3 Cache replacement

We propose an offline process for cache replacement that calculates the access frequency. Logging every record produces the most accurate result; however, it is computationally expensive. The existing approach [21] utilizes forward scanning to identify record accesses within a time slice \(\left[ t_{n}, t_{n+1}\right] \).

However, this is not an efficient process and decreases system performance, as it requires scanning and storing the entire record. Moreover, the forward scanning approach requires a significant amount of time to classify the records. To optimize this process, we maintain partial records within a specific time period. We parallelize this task by splitting the logs into n consecutive periods and use a hash function to store the frequency estimate for each query, as shown in Fig. 7, where \(Q_{1}\) represents a query and \(R_{1}\) denotes its results. The estimate for each record is calculated by Eq. (7), and the algorithm ranks each query by its access frequency. The access log is stored in a separate hash table. When each of the parallel executions finishes, the most highly accessed queries are retained and infrequently accessed queries are removed from the cache.
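
A sketch of this partitioned bookkeeping is shown below; the log-entry layout and the use of ProcessPoolExecutor are our assumptions about how the n periods can be scanned in parallel.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_period(entries):
    """entries: list of (timestamp, query_key) pairs for one period."""
    return Counter(query_key for _ts, query_key in entries)

def scan_log(periods):
    """periods: the access log split into n consecutive entry lists."""
    total = Counter()                       # merged per-query hit counts
    with ProcessPoolExecutor() as pool:     # one scan task per period
        for partial in pool.map(count_period, periods):
            total.update(partial)
    return total
```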

The overall flow of ACR is described in Algorithm 2, which details how the cache is updated by analyzing the access logs. The ACR algorithm takes previous access logs, calculates the access frequencies, and provides the list of updated cache triples, where \(LA_{t}\) represents the last access time of a triple and \(CA_{t}\) represents the current access time. The algorithm scans the records and updates the frequencies. On a cache hit, it updates the record's \(LA_{t}\); on a cache miss, ACR decides, based on the access frequency, whether the new triples should be added to the cache.

Fig. 7 A working example of the ACR algorithm maintaining the cache and query access frequencies

Algorithm 2 Adaptive Cache Replacement (ACR)

We calculate the frequency of data access using the exponential smoothing technique [11], a method widely used to forecast economic data in financial applications. The traditional approach [22] keeps all accessed queries in the cache. In our work, ACR scores each query according to its estimated frequency, and the queries with the highest frequencies are kept in the cache for future access. The general formula of exponential smoothing is as follows:

$$\begin{aligned} E_t = \alpha x_t + (1- \alpha ) E_{t-1} \end{aligned}$$
(7)

As shown in Eq. (7), \(E_t\) represents an exponentially weighted moving average of the access frequencies up to time t, and \(x_t\) represents the access frequency observed at time t, in discrete time, with the smoothing constant \(\alpha \in \left( 0,1 \right) \). A higher value of \(\alpha \) gives more weight to recent observations. Eq. (7) thus satisfies our requirement of selecting the most highly accessed queries. We further modify Eq. (7) to account for the time of the last hit. In Eq. (8), \(t_{prev}\) represents the time of the last query hit and \(x_{t_{prev}}\) represents the frequency estimate of the previous query at \(t_{prev}\). For example, assume that \(\alpha = 0.05\), \(t=12\), \(t_{prev}=3\), \(x_{t} = 0.6\), and \(x_{t_{prev}} = 0.5\). The value of \(E_{t}\) is then calculated as \(E_{t} = 0.05*0.6 + 0.05(1-0.05)^{12-3}*0.5=0.046\).

$$\begin{aligned} E_t = \alpha x_t + \alpha {(1- \alpha )}^{t-t_{prev}} x_{t_{prev}} \end{aligned}$$
(8)
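
Eq. (8) is small enough to check directly; the following snippet (a sketch, not the authors' code) reproduces the worked example above.

```python
def smoothed_frequency(alpha, t, t_prev, x_t, x_prev):
    """Eq. (8): hit-frequency estimate with decay since the last hit."""
    return alpha * x_t + alpha * (1 - alpha) ** (t - t_prev) * x_prev

# Worked example from the text: alpha=0.05, t=12, t_prev=3.
print(round(smoothed_frequency(0.05, 12, 3, 0.6, 0.5), 3))  # 0.046
```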

For queries that are not similar to the previous ones stored in the cache, the results are served from the LOD, and the ACR algorithm stores their access frequencies using Eq. (8). When the cache becomes full, replacement is based on the access score: the top queries are kept in the cache and less frequent queries are removed.
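
A minimal sketch of the eviction step follows, assuming the cache and the score table are plain dictionaries (the paper keeps them in hash tables); it keeps the capacity highest-scoring queries and drops the rest.

```python
def replace(cache, scores, capacity):
    """cache: dict query -> result; scores: dict query -> E_t (Eq. 8)."""
    if len(cache) <= capacity:
        return cache
    keep = sorted(cache, key=lambda q: scores.get(q, 0.0),
                  reverse=True)[:capacity]
    return {q: cache[q] for q in keep}
```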

4 Experimentation and results

This section demonstrates the effectiveness of the proposed approach through an evaluation on real-world datasets. The major goal of the experiments is to compare the hit rate and overhead of the proposed approach with current state-of-the-art cache replacement approaches.

4.1 Experimental setup

We conducted the experiments on an OpenLink Virtuoso Server 07.10 with a quad-core AMD A8-7650K APU with Radeon R7 graphics, 64-bit Ubuntu 16.04.2 LTS, and 32 GB of RAM. We utilized the DBpedia 3.6 and LinkedGeoData (LGD) query logs provided by the USEWOD 2014 challenge. Each query log contains the requests received by the respective SPARQL endpoint, formatted in the Apache common log format, and records the query sessions used to retrieve data from the endpoint; within a single session, several queries may be issued by the same user over time. Each request includes an IP address, timestamp, query, and userID. We extracted the valid queries from the logs and checked their syntax against the SPARQL 1.1 specification.

The DBpedia 3.6 dataset contains structured information extracted from Wikipedia and published on the LOD cloud. We obtained it directly from the USEWOD query logs for DBpedia 2013, 2014 and 2015, as shown in Table 2. The DBpedia 3.6 KB contains 3.0M general-knowledge entities. We first extracted the textual information from the log to obtain the previously issued queries and then parsed each query using Apache Jena.
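
The paper parses queries with Apache Jena (a Java library); as a Python stand-in, the sketch below extracts query strings from Apache common-log lines and keeps only those that rdflib's SPARQL parser accepts. The log-line regex is an assumption about the exact log layout.

```python
import re
from urllib.parse import unquote_plus
from rdflib.plugins.sparql.parser import parseQuery  # SPARQL 1.1 parser

QUERY_RE = re.compile(r'GET /sparql\?query=([^ "]+)')  # assumed layout

def extract_valid_queries(log_lines):
    for line in log_lines:
        m = QUERY_RE.search(line)
        if not m:
            continue
        query = unquote_plus(m.group(1))
        try:
            parseQuery(query)        # raises on invalid SPARQL syntax
            yield query
        except Exception:
            continue                 # skip malformed queries
```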

Table 2 Size of the query logs used in our evaluation

Fig. 8 The patterns of the queries in a DBpedia and b LinkedGeoData, and c the Lorenz curve for the impact of unique queries on query execution

The LinkedGeoData (LGD) dataset holds geographic sensor information mainly related to OpenStreetMap and is currently available in RDF format. We utilized the LGD 2013 and 2014 data, which consist of more than 10 billion triples. From the available LGD query logs, our evaluation contains both repetitive and unique queries.

In both the DBpedia and LinkedGeoData logs, the majority of the queries are SELECT queries, and within these SELECT queries we identified the occurrences of BGPs. Figure 8 breaks down the query forms SELECT, CONSTRUCT, DESCRIBE and ASK: most queries in both datasets are SELECT queries (95% in DBpedia and 89.3% in LinkedGeoData), the next most common form is ASK (4.4% in DBpedia and 8.4% in LinkedGeoData), followed by CONSTRUCT (1.2% in DBpedia and 2.3% in LinkedGeoData).

Figure 8c shows how much unique queries account for overall query executions. Our aim is to ascertain the impact of unique versus frequently executed queries on the overall execution, analyzed using the DBpedia and LGD query logs. In DBpedia, 70% of the unique queries account for only 30% of the overall executions, which shows that most execution instances involve frequently accessed queries. Similarly, in LGD the impact of unique queries on the overall execution is low, as almost 90% of the unique queries account for only 20% of the total query executions.

4.2 Performance evaluation

In this evaluation, we compare ACR with existing approaches, namely Least Recently Used (LRU) [16], Least Frequently Used (LFU) [18], and SPARQL Query Caching (SQC) [24], and measure the efficiency in terms of average hit rate and space overhead.

We evaluate how well existing cache replacement algorithms improve performance in terms of hit rate and overhead. We therefore compare ACR with three well-known cache replacement approaches: (1) LRU: the least recently used items are removed from the cache to provide space for new items; this approach is simple to implement, especially when the objects are uniform. (2) LFU: the least frequently used resources are removed from the cache and replaced with new resources; however, LFU considers neither the size of the objects nor CPU and memory utilization. (3) SQC: SPARQL Query Caching [24] improves the performance of triple stores by selectively invalidating cache objects; it eliminates cache objects that do not carry the predefined timestamp.

Figure 9a shows the hit rates achieved by the compared approaches. This experiment feeds 3M access log entries to the ACR algorithm, which ranks the access frequencies of the queries using the exponential smoothing technique. ACR outperforms the existing approaches, although LFU remains accurate for small cache sizes. The choice of \(\alpha \) affects the hit rate. We set the value to 0.05 because it yielded the most accurate results; the optimal value of \(\alpha \) is almost certainly inversely proportional to the size of the cache, and perhaps related to the size of the database, so for a smaller cache \(\alpha \) should probably be larger, as shown in Fig. 9b. Upon varying the size of the cache, our proposed approach outperforms the other approaches, as shown in Fig. 9c. On average, our approach outperforms the existing approaches with hit rates of up to 80.65%.

Figure 10a depicts the space overhead of the cache replacement algorithms for varying dataset sizes. We measure the maximum space consumption of each approach based on the maximum number of records that each algorithm stores; the existing approaches consume more space to maintain their records. Figure 10b shows the average time overhead of the proposed ACR technique compared with state-of-the-art solutions. The existing solutions take considerably longer: on average, the hit checking time of our approach is 280 ms, almost 10 times faster than the other approaches.

Fig. 9 Hit rate achieved by ACR compared to the LRU, LFU and SQC algorithms

Fig. 10 Space and time overhead of existing approaches compared to ACR

5 Conclusion

In this paper, we proposed Adaptive Cache Replacement (ACR) to improve SPARQL query processing on the LOD cloud. The ACR algorithm parallelizes the calculation of access frequencies. To find similar queries, ACR utilizes the edit distance to identify clients' similar querying patterns and places the frequently accessed queries in the cache to reduce the burden on SPARQL endpoints. Through experimental evaluation, we found that our approach outperforms state-of-the-art approaches in terms of query response time and space overhead without sacrificing the cache hit rate: on average, we achieve hit rates of 80.66%, which accelerates querying speed by 6.34%. Specifically, our ACR technique classifies the access log with better space efficiency than LFU, LRU, and SQC. In the future, we plan to investigate the effect of prefetching on system performance; this may lead to an improvement of ACR by parallelizing the algorithm to run on a separate machine as an offline process during system idle time.