1 Introduction

A data stream is a massive, unbounded sequence of items continuously received at a rapid rate; it arises in a variety of applications, such as network monitoring, financial monitoring, and Web logging. Substantial analytical studies have been devoted to data streams, including clustering [9], classification [26], and mining frequent patterns [5, 30]. Finding frequent items [6, 11, 12, 15, 17, 18, 21, 23, 25, 28] has received considerable attention among data stream analytical tasks. This problem serves as an important building block for different data stream mining problems, such as mining frequent itemsets [1, 2, 5, 19, 30] and computing the entropy of a data stream [4].

In typical data stream scenarios, the item arrival rate is so high that not all received items can be kept in main memory. Therefore, typical solutions scan every arriving item once (i.e., sequential access) and drop unpromising items (e.g., less frequently occurring items) when main memory becomes full. Complying with these constraints, prior studies mostly focus on answering the data stream problem approximately with an error bound.

Early solutions [11, 12, 15, 16, 18, 23, 25] to this problem were developed under two traditional streaming models, the landmark model (i.e., the frequent items can be any item in the entire stream) and the sliding window model (i.e., the frequent items can only be items in the current window). While the landmark model preserves better data completeness, it ignores the importance of newly arriving items. In many applications, recent data in the stream is more meaningful. For instance, in an athlete ranking system, more recent records typically should be given more weight [16]. One way to address this problem is to use a sliding window model. However, the sliding window model only partially addresses item freshness, since items not occurring in the current window are completely ignored.

To address these issues, answering frequent item queries in time decayed data streams has received substantial attention from the community [7, 8, 13, 20, 24, 32, 33]. Under the time decay model, the weight of a received item decreases over time, and the frequent items are then computed based on the time decayed counts. This model preserves better completeness (i.e., every item is considered) and item freshness (i.e., recent items are more important) than the prior streaming models.

In the landmark and sliding window models, the count of an item always increases by one when the item arrives from the stream. Hence, given an ordered list of the frequent items (maintained by a linked list or a heap structure), we can swap the affected item with its neighbors to maintain the order. Under the time decay model, the weight of every item is updated over time subject to the decay function. Such a huge number of updates makes the frequent item problem challenging, since the ordered list becomes difficult to maintain.

To address this challenge, we propose a new heap structure, named Quasi-heap, which maintains the order of unpromising items in the heap in a lazy manner. Based on the Quasi-heap, two approximation algorithms are studied to solve the frequent item problem on time decayed data streams. We briefly list our main contributions as follows.

  • We propose a new heap structure, named Quasi-heap, to maintain the frequent items based on their time decayed counts.

  • We develop two approximation algorithms, Space Saving with Quasi-heap (based on SS [25]) and Filtered Space Saving with Quasi-heap (based on FSS [16]), to find the frequent items in time decayed data streams. Our improved algorithms answer the frequent item problem with reasonable memory space and a guaranteed error bound. In addition, we theoretically analyze the algorithm complexity.

  • Extensive experiments are conducted to demonstrate the superiority of our algorithms in terms of running time and estimation accuracy. More specifically, our algorithms reduce the response time by up to 80% compared with ordinary heap solutions and provide better estimation quality than prior studies.

A preliminary report of this work was published in [31]. The new content in this manuscript includes (1) a new count-min-min (CMM) sketch with higher estimation accuracy than the well-known count-min sketch; (2) lemmas, theorems, and proofs that were omitted from the preliminary conference version [31]; and (3) an enhanced experimental evaluation that investigates the efficiency of the proposed Quasi-heap, together with several sets of experiments showing the effectiveness of the proposed CMM sketch (the CMM sketch can estimate the count of an item almost error-free).

The remainder of this paper is organized as follows. A survey of related work is presented in Section 2. Section 3 formulates the frequent items problem in time decayed data streams. Section 4 discusses and analyzes the Quasi-heap. Sections 5 and 6 depict two improved algorithms SSQ and FSSQ, respectively. Section 7 gives the theoretical analysis for our CMM sketch. Section 8 evaluates the proposed algorithms, and we conclude this paper in Section 9.

2 Related work

2.1 Landmark model and sliding window model

In the landmark and sliding window models, a great deal of work has been proposed to find the frequent items from a data stream. These works can be classified into two main categories [10], counter-based and sketch-based.

The counter-based algorithms [15, 18, 23, 25] are deterministic algorithms that only monitor a subset of items from the data stream. These algorithms maintain a set of counters to track the frequent items over this subset. Space Saving [25], Lossy Counting [23], and Frequent [15, 18] are representative algorithms in this category.

Another line of work is the sketch-based algorithms, which use an array of counters to estimate the frequency of items. Different from the counter-based algorithms, each item is projected into a set of corresponding sketch entries by hash functions, and the frequency of an item is estimated from the counters it maps to. To minimize the collision probability of the hash functions, we can increase the granularity of the sketch (i.e., use more counters); however, this leads to large memory consumption. CountSketch [6], Count-Min Sketch [11, 12], and FSS [16] are representative algorithms in this category.

However, these works either treat stale and fresh data the same (i.e., the landmark model) or remove stale items by a subjective window length (i.e., the sliding window model). In real-world applications, it is more desirable if the frequent item problem not only considers every arriving item but also treats fresh items as more important than stale items.

2.2 Time decay model

Finding frequent items in a time decayed data stream has recently received remarkable attention from the community [7, 8, 13, 20, 24, 32, 33]. Zhang et al. [32] proposed two 𝜖-approximation algorithms called Frequent-Estimating (FE) and FE with Heap (FEH). FE updates the frequent item result for each item arrival in O(1/𝜖) time using a linked list, and FEH updates the result in O(log(1/𝜖)) time using a heap structure. Chen et al. [8] proposed another 𝜖-approximation algorithm, called the Frequent-item Counting algorithm (FC), which finds the frequent items based on a fading factor. It takes O(1) time to maintain the answer for each arriving item by means of a hash function. Mei and Chen [24] proposed to estimate the frequency of items by multiple hash functions. However, their work did not analyze the memory consumption or the result accuracy.

Recent developments attempt to improve the estimation accuracy by either exploiting the decay function or employing new data structures. Lim et al. [20] proposed a new 𝜖-approximation algorithm, TwMinSwap, which takes O(1/𝜖) time to process each arriving item. The basic idea is to drop the minimum item (with the smallest counter) when memory becomes full, where the counters are updated over time by multiplying the decay rate. The λ-HCount algorithm [7] employs a doubly linked list to record the frequent items and improves the frequency estimation accuracy by using multiple hash functions. The items monitored in the doubly linked list are arranged in descending order of their most recent update time. Since the items are organized in a doubly linked list, the algorithm can relocate an item entry to the end of the list in O(1) time.

All the above algorithms are based on a backward decay function, where the importance of an item decreases over time. The main challenge under the backward decay function is that the weights of the existing items change constantly. To address this, Cormode et al. [13] studied an alternative decay function that is a monotone increasing function of the age of an item (i.e., the difference between the arrival time and the origin time). In this model, the weight of an item is fixed once its age is determined. In other words, the weights of existing items remain stable and the problem of finding the frequent items becomes easier. However, the forward weight of an item can become very large (due to the age) if the system has been running for a long time. One possible solution is to reset the origin time periodically, but this requires extra effort to recompute the frequent items. The effectiveness of the forward decay model on the frequent item problem remains unclear.

For clarity, in this work we focus on finding the frequent items based on the backward decay function as it is widely adopted in prior studies.

3 Definitions and preliminaries

Table 1 summarizes the notation used in the rest of this paper. We use a standard stream model with discrete timestamps labeled ⟨0, 1, 2, 3, ...⟩, where exactly one item a_i arrives at every timestamp. The current data stream D_n is the sequence of items received so far, i.e., D_n = ⟨I_1, I_2, ..., I_n⟩.

Table 1 Summary of notation

While processing a long stream, it is reasonable to treat a recent item as more important than an old item. In this work, we adopt a backward time decay model to gradually diminish the effect of obsolete items.

Definition 1 (Decayed Count of an Item, C_t(a_i) [5])

Given a time decay rate τ (0 < τ ≤ 1), C_t(a_i) is the decayed count of item a_i at time t, i.e.,

$$ {C_{t}}(a_{i}) = {C_{t-1}}(a_{i})\times\tau + {W_{t}}(a_{i}) $$
(1)

where C_1(a_i) = W_1(a_i) and W_t(a_i) indicates the arrival of item a_i at time t: if I_t = a_i, then W_t(a_i) = 1; otherwise, W_t(a_i) = 0.

Based on Definition 1, the decayed count of every item should be updated at each time step. Observe that (1) is monotonically decreasing (i.e., decreased by a factor of τ at each step) until the item is received again from the data stream. Accordingly, at the current time t, the decayed count of an arriving item a_i can be updated by

$$ C_{t}(a_{i}) = C_{ut}(a_{i})\times\tau^{(t-ut)} + 1 $$

provided that the last update time ut of a_i is recorded.
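For illustration, the following minimal C++ sketch applies this lazy update rule to a single counter; the structure and names (DecayedCounter, touch, value) are illustrative assumptions rather than the implementation evaluated later.

#include <cmath>

// A minimal sketch of lazy decayed counting, assuming integer timestamps and a
// fixed decay rate tau in (0, 1]; the names here are illustrative only.
struct DecayedCounter {
    double cnt = 0.0;   // decayed count C_ut(a_i) as of the last update
    long   ut  = 0;     // last update time

    // Called when item a_i arrives again at time t (t >= ut):
    // C_t(a_i) = C_ut(a_i) * tau^(t - ut) + 1
    void touch(long t, double tau) {
        cnt = cnt * std::pow(tau, static_cast<double>(t - ut)) + 1.0;
        ut  = t;
    }

    // Read the decayed count at time t without recording an arrival.
    double value(long t, double tau) const {
        return cnt * std::pow(tau, static_cast<double>(t - ut));
    }
};

Calling touch only when a_i actually arrives (and value only when the count is needed) avoids updating every counter at every time step, which is exactly the saving exploited throughout this paper.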

Definition 2 below formalizes the frequent item problem studied in this work, which aims at returning the set of highly occurring items subject to a frequency threshold ϕ. Specifically, an item a_i is in the result set if and only if its normalized decayed count is higher than ϕ, where the normalization factor is the total decayed weight of the stream, |D_n| = Σ_{k=0}^{n−1} τ^k = (1−τ^n)/(1−τ), which approaches 1/(1−τ) as n→∞.

Definition 2 (ϕ-Frequent Item)

Given a stream D_n = ⟨I_1, I_2, ..., I_n⟩ and a frequency threshold ϕ (0 < ϕ ≤ 1), if the decayed count C_n(a_i) is higher than ϕ×|D_n|, then a_i is a ϕ-frequent item.

Maintaining the ϕ-frequent items exactly over a long-running data stream requires huge memory: the space complexity is Ω(D) when the number of distinct items in the stream is D [10]. Therefore, typical solutions focus on answering this problem approximately subject to an error bound 𝜖. Formally, the approximation version of the problem is defined as follows.

Definition 3 (𝜖-Approximate Decayed Frequent items)

Given a stream D_n = ⟨I_1, I_2, ..., I_n⟩, the 𝜖-approximate decayed frequent item set contains all items whose decayed counts are higher than (ϕ−𝜖)×|D_n|.

4 Quasi-heap

To find the frequent items in a data stream, a group of counters is used to record the candidate items. We assume that m counters are available (subject to the memory budget) and that these counters are organized into a linked list or a heap structure. To respect the memory budget, prior studies [25] replace the smallest item with the newly arriving item when the memory becomes full.

A heap is a partially sorted complete binary tree that is usually compactly stored in an array, as shown in Figure 1. Although a heap is not completely ordered, it conforms to a sorting principle: every node has a value less than or equal to both of its children (the value of every node can also be greater than or equal to both of its children, but here we only use the heap with smaller values on the upper nodes, i.e., a min-heap). More specifically, a file of keys K_1, K_2, ..., K_m is a heap if K_{⌊j/2⌋} ≤ K_j for 1 ≤ ⌊j/2⌋ < j ≤ m; thus K_1 ≤ K_2, K_1 ≤ K_3, K_2 ≤ K_4, etc. In particular, the smallest key appears on top of the heap, i.e., K_1 = min(K_1, K_2, ..., K_m). An efficient approach to heap creation was suggested by R. W. Floyd [14].
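As a point of reference for the Quasi-heap introduced below, the following minimal C++ sketch maintains such a min-heap in an array (0-indexed, so the children of node i are 2i+1 and 2i+2); Floyd's construction simply applies sift-down to every internal node from the bottom up. This is an illustrative textbook version, not part of our algorithms.

#include <utility>
#include <vector>

// Restore the min-heap order below position i by sifting the key down.
void siftDown(std::vector<double>& K, std::size_t i) {
    const std::size_t m = K.size();
    while (true) {
        std::size_t l = 2 * i + 1, r = 2 * i + 2, smallest = i;
        if (l < m && K[l] < K[smallest]) smallest = l;
        if (r < m && K[r] < K[smallest]) smallest = r;
        if (smallest == i) break;
        std::swap(K[i], K[smallest]);
        i = smallest;
    }
}

// Floyd's bottom-up heap construction: O(m) overall.
void buildMinHeap(std::vector<double>& K) {
    for (std::size_t i = K.size() / 2; i-- > 0; ) siftDown(K, i);
}

The Quasi-heap introduced below follows the same min-heap convention but, as described next, delays this kind of reordering until it is actually needed.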

Figure 1 An example of an ordinary min-heap

In the landmark model, the count value always increases by 1 for each arriving item. This ensures that the update time for each arriving item is constant when using a linked list or heap data structure, since we only need to move a counter between neighboring buckets in the linked list, or between neighboring nodes in the heap, to keep the order correct.

According to Definition 1, the decayed count of an item a_i increases when a_i is received from the stream. Consequently, the position of a_i among the counters must be changed in order to keep the order correct. If we use a typical structure to keep these counters, each update takes O(m) time (for the linked list) or O(log m) time (for the heap structure), which is too time consuming in a data stream environment. Hence, we propose a new data structure, called Quasi-heap, which postpones unpromising sorting operations when the decayed count of an existing item increases. In other words, we allow a certain amount of inconsistency in the Quasi-heap structure. An example of the Quasi-heap is given in Example 1 below.

Example 1 (An example of Quasi-heap)

In Figure 2, we demonstrate the running procedure of the Quasi-heap. Each node consists of the item name and its decayed count. We adopt τ=1 for ease of presentation. Figure 2a shows a Quasi-heap that contains 12 items whose order is identical to that of an ordinary heap. Upon receiving the next sequence ⟨a, b, a, b, e, e⟩, the corresponding counts of a, b, and e are increased. Instead of running heapify to maintain the heap order, we only mark these items as delayed (marked by thick lines in Figure 2b), since they are already in the Quasi-heap. When a new item n is received, we run heapify partially over the delayed nodes, starting from the root. After this process, item c becomes the root node as it is the smallest item in the Quasi-heap (shown in Figure 2c). Then, item c is replaced by item n (cf. Figure 2d), where the estimated count of n is set to 4 (i.e., the count of the evicted item c plus 1), following the convention of other counter-based algorithms [25].

Figure 2 A running example of Quasi-heap

As discussed in Example 1, the main operation, delayedSorting, executes heapify partially on the Quasi-heap so that the minimum node can be properly identified and removed.

Algorithm 1 describes the delayedSorting operation in detail. The information of an item is updated in line 1. If the delayed flag of an item is not marked (line 2), then its count must be the minimum count in its subtree according to Lemma 1 (discussed shortly). If the delayed flag of an item is marked, the order of this item may be inaccurate, so we execute the delayedSorting operation on each of its children (lines 4–5). After the recursive calls, if the count of the root is larger than that of a child, we swap the root with that child (lines 7–8). If the counts are identical, we swap the root with its child only when the child has a larger estimated error than c (the estimated error is decided when the node is inserted into the Quasi-heap; cf. line 9 of Algorithm 2 and Example 1).
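The following C++ sketch captures the recursion and flag handling just described. The node layout, the lazy refresh of decayed counts, and the exact tie-breaking comparison are written out under our own illustrative assumptions (the item identifier is omitted), so this is a sketch of Algorithm 1 rather than a verbatim transcription.

#include <cmath>
#include <initializer_list>
#include <utility>

// Illustrative Quasi-heap node; the real structure also stores the item id.
struct QNode {
    double cnt = 0.0;     // decayed count as of the last update time ut
    long   ut  = 0;       // last update time
    double err = 0.0;     // estimated error assigned when the item was inserted
    bool   delayed = false;
    QNode* left  = nullptr;
    QNode* right = nullptr;
};

// Bring the decayed count of c up to the current time t (cf. Section 3).
static void refresh(QNode* c, long t, double tau) {
    c->cnt *= std::pow(tau, static_cast<double>(t - c->ut));
    c->ut = t;
}

void delayedSorting(QNode* c, long t, double tau) {
    if (c == nullptr) return;
    refresh(c, t, tau);                        // update the item's information
    if (!c->delayed) return;                   // Lemma 1: c is already minimal in its subtree
    delayedSorting(c->left,  t, tau);          // recurse only where the order may be stale
    delayedSorting(c->right, t, tau);
    QNode* child = nullptr;                    // the smaller child (ties: larger error wins)
    for (QNode* ch : {c->left, c->right})
        if (ch && (!child || ch->cnt < child->cnt ||
                   (ch->cnt == child->cnt && ch->err > child->err)))
            child = ch;
    if (child && (child->cnt < c->cnt ||
                  (child->cnt == c->cnt && child->err > c->err))) {
        std::swap(c->cnt, child->cnt);         // swap payloads, keep the tree shape
        std::swap(c->ut,  child->ut);
        std::swap(c->err, child->err);
        child->delayed = true;                 // the pushed-down entry may now be out of order
    }
    c->delayed = false;                        // c now holds the minimum of its subtree
}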


We give the properties of Quasi-heap in the following Lemma 1 and Theorem 1.

Lemma 1

Let p and q be two counters in a Quasi-heap such that p is an ancestor of q. If p.delay = 0, then the decayed count of p is no larger than that of q.

Proof

When the Quasi-heap is not full, it is identical to an ordinary heap, so p.cnt ≤ q.cnt holds due to the heapify.

When the Quasi-heap becomes full, there are two cases. If the item p is not received from the data stream again, then p.cnt ≤ q.cnt still holds, whether or not q has been updated, due to the monotonicity of the decayed count function (cf. (1)). If the item p has been received from the data stream again, the delay flag of p must be set to 1. The delay flag is reset to 0 only after the subtree of p is refined by the heapify (cf. Algorithm 1), so p.cnt ≤ q.cnt still holds. □

According to Lemma 1, we obtain another property: if c.delay = 0, the decayed count of c is no larger than that of any descendant in its subtree, i.e., c has the smallest decayed count in its subtree.

Theorem 1

After executing delayedSorting (i.e., Algorithm 1) on item c, c has the minimum decayed count in its subtree.

Proof

We prove it by induction. Let ht be the height of the subtree with c as its root. When ht = 1, the statement is obviously true.

Suppose the statement is true for ht−1; we now prove that it is also true for ht. If c.delay = 0, then c already has the minimum decayed count in its subtree. If c.delay = 1, Algorithm 1 is called on its two children. After these calls, each child has the minimum decayed count in its own subtree by the induction hypothesis. By comparing and swapping counter c with the smaller of its two children, the item kept in counter c becomes the one with the minimum decayed count in its subtree. □

Time complexity analysis

When the newly arriving item is already in the Quasi-heap, we only update the decayed count of this item and mark its delay flag. Hence, processing an existing item takes O(1) time.

When the newly arriving item is not in the Quasi-heap, we replace the minimum item of the Quasi-heap with this new item. To find the minimum item in the Quasi-heap, we execute Algorithm 1 to ensure the correctness of the order. The cost of Algorithm 1 is O(m), as it may traverse the entire tree in the worst case. However, this case rarely happens on real datasets. In addition, a frequent item is likely to be kept in the Quasi-heap (since its count is high) and is received from the data stream more often than other items. The response time of processing existing items is dramatically reduced from O(log m) to O(1). In our experiments, the Quasi-heap reduces the response time by up to 80% compared with an ordinary heap.

5 Space saving algorithm with Quasi-heap (SSQ)

We study a counter-based algorithm, SSQ (based on the SS algorithm [25]), to find the 𝜖-approximate decayed frequent items. Algorithm 2 depicts SSQ in detail. If the newly arriving item c is already in the Quasi-heap (lines 2–3), we update its statistics and mark its delay flag as 1. Otherwise, we first check whether the Quasi-heap is full. If the Quasi-heap is not full (lines 12–15), we simply execute heapify to maintain the consistency of the Quasi-heap. Otherwise, we run delayedSorting from the root of the Quasi-heap and replace the refined root with the new item c. As in the SS algorithm, the estimated count of the new item c is derived from the count of the evicted item r.
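To make the control flow of Algorithm 2 concrete, the following self-contained C++ sketch combines an array-based Quasi-heap with a hash map from items to heap slots and handles the three cases above (delayedSorting is repeated here in array form so the example stands alone). The class layout, the bookkeeping details, and the choice to mark a freshly inserted root as delayed are illustrative assumptions, not the exact implementation used in our experiments.

#include <cmath>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative SSQ sketch: m counters in an array-based Quasi-heap (root at 0).
struct Entry {
    std::string item;
    double cnt = 0, err = 0;   // estimated decayed count and estimated error
    long   ut  = 0;            // last update time
    bool   delayed = false;
};

class SSQ {
    double tau_;
    std::size_t m_;
    std::vector<Entry> h_;                                   // heap array
    std::unordered_map<std::string, std::size_t> pos_;       // item -> slot

    void refresh(std::size_t i, long t) {
        h_[i].cnt *= std::pow(tau_, static_cast<double>(t - h_[i].ut));
        h_[i].ut = t;
    }
    bool less(std::size_t a, std::size_t b) const {          // order: count, then larger error
        return h_[a].cnt < h_[b].cnt ||
               (h_[a].cnt == h_[b].cnt && h_[a].err > h_[b].err);
    }
    void swapNodes(std::size_t a, std::size_t b) {
        std::swap(h_[a], h_[b]);
        pos_[h_[a].item] = a;
        pos_[h_[b].item] = b;
    }
    void siftUp(std::size_t i, long t) {                     // ordinary heapify while filling
        while (i > 0) {
            std::size_t p = (i - 1) / 2;
            refresh(p, t);
            if (!less(i, p)) break;
            swapNodes(i, p);
            i = p;
        }
    }
    void delayedSorting(std::size_t i, long t) {             // Algorithm 1 on the array
        if (i >= h_.size()) return;
        refresh(i, t);
        if (!h_[i].delayed) return;
        std::size_t l = 2 * i + 1, r = 2 * i + 2, s = i;
        delayedSorting(l, t);
        delayedSorting(r, t);
        if (l < h_.size() && less(l, s)) s = l;
        if (r < h_.size() && less(r, s)) s = r;
        if (s != i) { swapNodes(i, s); h_[s].delayed = true; }
        h_[i].delayed = false;
    }

public:
    SSQ(double tau, std::size_t m) : tau_(tau), m_(m) {}

    void process(const std::string& item, long t) {
        auto it = pos_.find(item);
        if (it != pos_.end()) {                 // lines 2-3: existing item, O(1) lazy update
            refresh(it->second, t);
            h_[it->second].cnt += 1.0;
            h_[it->second].delayed = true;
        } else if (h_.size() < m_) {            // lines 12-15: heap not yet full
            h_.push_back(Entry{item, 1.0, 0.0, t, false});
            pos_[item] = h_.size() - 1;
            siftUp(h_.size() - 1, t);
        } else {                                // heap full: evict the refined minimum
            delayedSorting(0, t);
            double evicted = h_[0].cnt;
            pos_.erase(h_[0].item);
            h_[0] = Entry{item, evicted + 1.0, evicted, t, true};
            pos_[item] = 0;
        }
    }
};

The first branch is the O(1) fast path analyzed in Section 4; the eviction branch is the only place where delayedSorting is invoked.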

We present two properties of our SSQ algorithm, which follow from the properties of the Space-Saving algorithm [25] and the FE algorithm [32] with minor modifications.

Lemma 2

Among all m counters, the minimum counter value μ is no greater than (1−τ^n)/(m(1−τ)).

Proof

The sum of the estimated counts of all m items in the monitored list is no greater than the sum of the decayed counts of all n items in the data stream, i.e., Σ_i c_n(a_i) ≤ (1−τ^n)/(1−τ). Hence

$$ m\mu \leq {\Sigma}_{i}\,c_{n}(a_{i}) \leq (1-\tau^{n})/(1-\tau), $$

so μ ≤ (1−τ^n)/(m(1−τ)). □

Based on Lemma 2, the SSQ algorithm can use bounded space (i.e., setting m = ⌈1/𝜖⌉) to find the 𝜖-approximate frequent items while keeping the estimation error at most 𝜖×|D_n| [25].
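For concreteness, substituting m = ⌈1/𝜖⌉ into Lemma 2 (a routine instantiation rather than a new result) gives

$$ \mu \leq \frac{1-\tau^{n}}{m(1-\tau)} \leq \epsilon\cdot\frac{1-\tau^{n}}{1-\tau} = \epsilon\times|D_{n}|, $$

so, as in [25], the overestimation of any monitored item is bounded by μ ≤ 𝜖×|D_n|.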

Theorem 2 (No False Negative)

Any item a_i whose decayed count C_n(a_i) is greater than μ (the minimum counter value in the Quasi-heap) is present in the Quasi-heap.

Proof

We prove the theorem by contradiction. Assume that at the current time t, an item a_i with decayed count C_t(a_i) > μ is not in the Quasi-heap. Then a_i must have been evicted at some point in the past; suppose it was last evicted at time unit t′. When a_i was evicted, its decayed count was \(C_{t^{\prime}}(a_{i}) = C_{t}(a_{i})/\tau^{t-t^{\prime}}\), which is larger than \(\mu/\tau^{t-t^{\prime}}\) (by the assumption C_t(a_i) > μ). Let \(\mu_{t^{\prime}}\) be the minimum counter value at time unit t′; then \(\mu_{t^{\prime}}\) is no larger than \(\mu/\tau^{t-t^{\prime}}\). Therefore \(C_{t^{\prime}}(a_{i}) = C_{t}(a_{i})/\tau^{t-t^{\prime}} > \mu/\tau^{t-t^{\prime}} \geq \mu_{t^{\prime}}\), i.e., \(C_{t^{\prime}}(a_{i}) > \mu_{t^{\prime}}\). This means that the count of item a_i was greater than the minimum counter value when it was evicted at time unit t′, which contradicts the fact that the SSQ algorithm evicts the item with the minimum counter value. □

6 Filtered space saving algorithm with quasi-heap (FSSQ)

In this section, we propose a sketch-based algorithm, the Filtered Space Saving algorithm with Quasi-heap (FSSQ), based on the Filtered Space Saving algorithm [16], to find the frequent items. The FSSQ algorithm employs two data structures: (1) the Quasi-heap and (2) a sketch (i.e., a two-dimensional array with width w and depth d). The sketch data structure used in this work is similar to that of the Count-Min algorithm [10]. The structure is conceptually described in Figure 3. Each entry of the sketch is composed of an estimated count and the time of its last update, denoted as (cnt, ut). To update the sketch entries, we need d pairwise-independent hash functions h_1, ..., h_d: {1, 2, ..., D}→{1, 2, ..., w}. FSSQ improves the estimation accuracy because the count of a new item is estimated from d sketch entries instead of from the minimum item in the Quasi-heap (cf. SSQ).

Figure 3 The data structures used in the FSSQ algorithm


Algorithm 3 depicts FSSQ. Its idea is similar to that of Algorithm 2 except for the situation in which a new item is not tracked in the Quasi-heap and the Quasi-heap is full; hence we omit the description of the similar parts. We first run delayedSorting from the root (line 7) in order to find the minimum item. Next, we update the sketch entries corresponding to c (lines 9–11) and estimate its count as the minimum value among its d corresponding sketch entries (line 13). If this estimated minimum count is larger than the count of the root of the Quasi-heap, we replace the root with the new item c (lines 19–20) and update the sketch entries corresponding to the evicted item r (lines 15–17).
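The sketch-side steps of this branch can be illustrated as follows. The cell layout (cnt, ut), the stand-in hash construction (not pairwise-independent), and the reading of "update the sketch entries by the evicted item r" as folding r's count back into its d cells are our own assumptions for illustration; the heap side is handled exactly as in the SSQ sketch of Section 5.

#include <cmath>
#include <functional>
#include <string>
#include <vector>

// Illustrative d x w decayed sketch used by FSSQ; each cell keeps (cnt, ut).
struct Cell { double cnt = 0; long ut = 0; };

struct DecayedSketch {
    std::size_t w, d;
    std::vector<Cell> s;                             // row-major, d rows of w cells
    DecayedSketch(std::size_t w_, std::size_t d_) : w(w_), d(d_), s(w_ * d_) {}

    std::size_t col(std::size_t j, const std::string& x) const {
        return std::hash<std::string>{}(std::to_string(j) + '#' + x) % w;
    }
    Cell& at(std::size_t j, const std::string& x) { return s[j * w + col(j, x)]; }

    void refresh(Cell& c, long t, double tau) {      // lazy decay of one cell
        c.cnt *= std::pow(tau, static_cast<double>(t - c.ut));
        c.ut = t;
    }

    // Lines 9-13: register one arrival of x and return min_j s[j, h_j(x)].
    double addAndEstimate(const std::string& x, long t, double tau) {
        double est = -1.0;
        for (std::size_t j = 0; j < d; ++j) {
            Cell& c = at(j, x);
            refresh(c, t, tau);
            c.cnt += 1.0;
            if (est < 0.0 || c.cnt < est) est = c.cnt;
        }
        return est;                                  // compared against the root of the Quasi-heap
    }

    // Lines 15-17 (one plausible reading): fold the evicted item's count back in.
    void absorbEvicted(const std::string& r, double rCnt, long t, double tau) {
        for (std::size_t j = 0; j < d; ++j) {
            Cell& c = at(j, r);
            refresh(c, t, tau);
            c.cnt += rCnt;
        }
    }
};

If addAndEstimate returns a value larger than the root's count, the root is evicted, absorbEvicted records the evicted item's count, and the new item enters the Quasi-heap with the returned estimate.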

The properties of the FSSQ algorithm are given in Lemmas 3–5 and Theorem 3.

Lemma 3

At any moment, for each item a_i, the minimum count μ in the Quasi-heap is no less than the minimum sketch entry that a_i hashes to, i.e., μ ≥ min_{1≤j≤d} s[j, h_j(a_i)].

Proof

In the initialization phase, all hits in the stream are reflected in the counters of the Quasi-heap, and all entries in the sketch have value 0. Hence, the conclusion is trivially true.

Now consider the situation after the Quasi-heap has become full. For a newly arriving item that is being monitored in the Quasi-heap, its counter in the Quasi-heap increases while all entries in the sketch remain unchanged. Since both sides decay at the same rate and initially μ ≥ min_{1≤j≤d} s[j, h_j(a_i)], the inequality remains true.

If a newly arriving item is not being monitored in the Quasi-heap, the increment of its entries in the sketch may make its minimum entry larger than μ. However, in that case a replacement takes place. Hence, the conclusion still holds. □

Lemma 4

For any item in the Quasi-heap, its overestimation error is no greater than μ.

Proof

For any item a_i in the Quasi-heap, its maximum overestimation error is set to the minimum decayed count among the sketch entries that a_i hashes to at the moment a replacement takes place. This value is no greater than μ according to Lemma 3, so the maximum overestimation error is no greater than μ. □

Lemma 5

Assume w = ⌈e/𝜖⌉ and d = ⌈ln(1/δ)⌉, where e is Euler's number, i.e., the base of the natural logarithm. For any item in the Quasi-heap, its count error is no greater than 𝜖/(1−τ) with probability at least 1−δ.

Proof

We introduce indicator variables I_{i,j,k} as follows.

$$I_{i, j, k} = \left\{\begin{array}{ll} 1 & \quad \text{if } a_{i} \neq a_{k} \wedge h_{j}(a_{i}) = h_{j}(a_{k})\\ 0 & \quad \text{if } otherwise \end{array}\right. $$

By pairwise independence of the hash functions, the expectation of I_{i,j,k} is

$$E(I_{i, j, k}) = Pr[h_{j}(a_{i}) = h_{j}(a_{k})] \leq 1/w \leq \epsilon/e $$

Let X_{i,j} = Σ_{k=1,...,D} (I_{i,j,k} × C_n(a_k)). Since all C_n(a_k) are non-negative, X_{i,j} is a non-negative variable. By construction, s[j, h_j(a_i)] = C_n(a_i) + X_{i,j}. Thereby,

$$min_{1\leq j\leq d} s[j, h_{j}(a_{i})] \geq C_{n}(a_{i}) $$

By pairwise independence of the h_j and linearity of expectation, we observe that

$$\begin{array}{@{}rcl@{}} E(X_{i, j}) &=& E({\Sigma}_{k=1, ..., D}(I_{i, j, k}\times C_{n}(a_{k}))) \\ &=& {\Sigma}_{k=1, ..., D}(E(I_{i, j, k})\times C_{n}(a_{k})) \\ & \leq& \epsilon/e {\Sigma}_{k=1, ..., D}C_{n}(a_{k}) \leq \frac{\epsilon}{e\times (1-\tau)} \end{array} $$

By the Markov inequality,

$$Pr[X_{i, j} > eE(X_{i, j})] < E(X_{i, j})/eE(X_{i, j}) < 1/e $$

By combining these,

$$Pr[\forall j, X_{i, j} > eE(X_{i, j})] < e^{-d} (1\leq j \leq d) $$
$$\begin{array}{@{}rcl@{}} Pr[c_{n}(a_{i})> C_{n}(a_{i})+ \epsilon /(1-\tau)]&=& Pr[\forall j, s[j, h_{j}(a_{i})] > C_{n}(a_{i})+ \epsilon /(1-\tau)] \\ &=& Pr[\forall j, C_{n}(a_{i})+X_{i, j} > C_{n}(a_{i}) + \epsilon /(1-\tau)] \\ &=& Pr[\forall j, X_{i, j} > \epsilon /(1-\tau)] \\ &\leq& Pr[\forall j, X_{i, j} > eE(X_{i, j})] < e^{-d} \leq \delta \end{array} $$

Theorem 3

Any item a_i with C_n(a_i) > μ is present in the Quasi-heap.

Proof

We first show that a_i entered the Quasi-heap. Suppose the last arrival of a_i was t_1 time units ago. At that moment, the sketch estimate of a_i was no less than its decayed count at that time, i.e., \(min_{1\leq j\leq d}s[j, h_{j}(a_{i})] \geq C_{n}(a_{i})/\tau^{t_{1}} > \mu/\tau^{t_{1}}\), which is no less than the minimum counter value at that moment (cf. the proof of Theorem 2). This means that a_i was inserted into the monitored counters t_1 time units ago. It remains to show that a_i has not been evicted since then; this argument is identical to the proof of Theorem 2. □

7 Count-Min-Min (CMM) sketch

In this section, we introduce our count-min-min (CMM) sketch, an improvement on the well-known count-min (CM) sketch. The CMM sketch provides a tighter bound on the frequency estimation error.

In recent years, several sketches [6, 11, 29] have been proposed in the data stream context for large-scale computation. These sketches generally consume reasonable space and offer highly accurate results. Our CMM sketch lies in the same framework and draws inspiration from these previous sketches. The common framework was described in Figure 3 of Section 6.

Two well-known sketches are the count sketch [6] and the count-min sketch [11]. Both use multiple hash functions to define a projection from incoming items to a set of array counters. Each item is hashed by the hash functions into one or more positions, which index the counters to update. The count sketch uses the mean or median of these estimates to obtain an estimated count, whereas the count-min sketch uses the minimum count of all corresponding counters to estimate the count of an item.

In the count-min sketch, an array of d×w counters (each initialized to zero) is maintained along with d hash functions. When a new item c arrives, it is projected by the d hash functions into d corresponding counters in the array, each of which is incremented, as shown in Figure 4. A query for the frequency of any item in the data stream reports the minimum count of its d corresponding counters.
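A minimal CM sketch in C++ following this description is shown below (without time decay, for simplicity); the hash construction is a simple stand-in and is not pairwise-independent.

#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Plain (non-decayed) count-min sketch with d rows of w counters.
class CMSketch {
    std::size_t w_, d_;
    std::vector<long> c_;
    std::size_t col(std::size_t j, const std::string& x) const {
        return std::hash<std::string>{}(std::to_string(j) + '#' + x) % w_;
    }
public:
    CMSketch(std::size_t w, std::size_t d) : w_(w), d_(d), c_(w * d, 0) {}

    void update(const std::string& x) {              // increment all d cells of x
        for (std::size_t j = 0; j < d_; ++j) ++c_[j * w_ + col(j, x)];
    }
    long query(const std::string& x) const {         // report the minimum of the d cells
        long m = c_[col(0, x)];
        for (std::size_t j = 1; j < d_; ++j)
            m = std::min(m, c_[j * w_ + col(j, x)]);
        return m;
    }
};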

Figure 4 The update in the CM sketch

According to Lemma 5, the counts estimated by the count-min sketch are overestimates. Now, when a new item c arrives, we increase only the minimum count among its d corresponding counters, rather than each of them. Even so, it is still guaranteed that the frequency estimate of every item is an overestimate. We call the improved count-min sketch the count-min-min (CMM) sketch, shown in Figure 5. We argue that this improves the accuracy of the frequency estimation.

Figure 5 The update in the CMM sketch

Note that we increase all of the corresponding counters when more than one of them shares the minimum value.
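Relative to the CM sketch sketched above, only the update rule changes; the following illustrative variant increments just the cells that currently hold the minimum among the item's d cells (including all of them on ties), while the query is unchanged.

#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Illustrative CMM sketch: conservative update of the d x w counter array.
class CMMSketch {
    std::size_t w_, d_;
    std::vector<long> c_;
    std::size_t col(std::size_t j, const std::string& x) const {
        return std::hash<std::string>{}(std::to_string(j) + '#' + x) % w_;
    }
public:
    CMMSketch(std::size_t w, std::size_t d) : w_(w), d_(d), c_(w * d, 0) {}

    void update(const std::string& x) {
        long mn = c_[col(0, x)];                      // current minimum over x's cells
        for (std::size_t j = 1; j < d_; ++j)
            mn = std::min(mn, c_[j * w_ + col(j, x)]);
        for (std::size_t j = 0; j < d_; ++j) {        // bump only the minimal cells (ties too)
            long& cell = c_[j * w_ + col(j, x)];
            if (cell == mn) ++cell;
        }
    }
    long query(const std::string& x) const {          // unchanged: minimum of the d cells
        long m = c_[col(0, x)];
        for (std::size_t j = 1; j < d_; ++j)
            m = std::min(m, c_[j * w_ + col(j, x)]);
        return m;
    }
};

Because every arrival of an item increments all of its currently minimal cells, the minimum over its d cells grows by at least one per arrival, so the query never underestimates, in line with the guarantee restated in Lemma 6.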

On the theoretical side, Lemma 6 below shows that the counts estimated by the CMM sketch are still overestimates and that the estimation error bound is tighter.

Lemma 6

In the CMM sketch, the estimate c_n(a_i) has the following guarantees: C_n(a_i) ≤ c_n(a_i), and with probability at least 1−δ, \(c_{n}(a_{i}) \leq C_{n}(a_{i}) + \frac{\epsilon}{(1-\tau)\times \ln(1/\delta)}\).

Proof

We introduce indicator variables I_{i,j,k} as follows.

$$I_{i, j, k} = \left\{\begin{array}{ll} 1 & \quad \text{if } a_{i} \neq a_{k} \wedge h_{j}(a_{i}) = h_{j}(a_{k}) \wedge s[j, h_{j}(a_{k})]= min_{1\leq j\leq d} s[j, h_{j}(a_{k})] \\ 0 & \quad \text{if } otherwise \end{array}\right. $$

By pairwise independence of the hash functions, the expectation of I_{i,j,k} is

$$E(I_{i, j, k}) = Pr[h_{j}(a_{i}) = h_{j}(a_{k})\wedge s[j, h_{j}(a_{k})]= min_{1\leq j\leq d} s[j, h_{j}(a_{k})]] \leq \frac{1}{wd} \leq \frac{\epsilon}{eln(1/\delta)} $$

Let X_{i,j} = Σ_{k=1,...,D} (I_{i,j,k} × C_n(a_k)). Since all C_n(a_k) are non-negative, X_{i,j} is a non-negative variable. By construction, s[j, h_j(a_i)] = C_n(a_i) + X_{i,j}. Thereby,

$$min_{1\leq j\leq d} s[j, h_{j}(a_{i})] \geq C_{n}(a_{i}) $$

By pairwise independence of the h_j and linearity of expectation, we observe that

$$\begin{array}{@{}rcl@{}} E(X_{i, j}) &=& E({\Sigma}_{k=1, ..., D}(I_{i, j, k}\times C_{n}(a_{k}))) \\ &=& {\Sigma}_{k=1, ..., D}(E(I_{i, j, k})\times C_{n}(a_{k})) \\ & \leq& \frac{\epsilon}{eln(1/\delta)} \times {\Sigma}_{k=1, ..., D}C_{n}(a_{k}) \\ & \leq& \frac{\epsilon}{eln(1/\delta)\times (1-\tau)} \end{array} $$

By the Markov inequality,

$$Pr[X_{i, j} > eE(X_{i, j})] < E(X_{i, j})/eE(X_{i, j}) < 1/e $$

By combining these,

$$Pr[\forall j, X_{i, j} > eE(X_{i, j})] < e^{-d} (1\leq j \leq d) $$
$$\begin{array}{@{}rcl@{}} Pr[c_{n}(a_{i})> C_{n}(a_{i})+ \frac{\epsilon}{(1-\tau)\times ln(1/\delta)}] &=& Pr\left[\forall j, s[j, h_{j}(a_{i})] > C_{n}(a_{i})+ \frac{\epsilon}{(1-\tau)\times ln(1/\delta)}\right] \\ &=& Pr\left[\forall j, C_{n}(a_{i})+X_{i, j} > C_{n}(a_{i}) + \frac{\epsilon}{(1-\tau)\times ln(1/\delta)}\right] \\ &=& Pr\left[\forall j, X_{i, j} > \frac{\epsilon}{(1-\tau)\times ln(1/\delta)}\right] \\ &\leq& Pr[\forall j, X_{i, j} > eE(X_{i, j})] < e^{-d} \leq \delta \end{array} $$

8 Experimental study

In this section, we empirically evaluate the efficiency of SSQ and FSSQ using both real and synthetic datasets. We compare our proposed solutions with the state-of-the-art solutions TwMinSwap [20] (counter-based) and λ-HCount [7] (sketch-based). Most existing works arrange the items in the monitored list according to their update time and remove the item with the oldest update time when the monitored list is full; all implemented algorithms in our experiments follow this convention. All methods were implemented in C++ and compiled with the Microsoft Visual Studio 2012 compiler. All experiments were conducted on a 3.20 GHz Pentium PC with 8 GB main memory running Windows 7 Professional Edition.

In the sequel, we first present the experimental setup in Section 8.1 and report the experimental results and our findings in Section 8.2. Section 8.3 compares the Quasi-heap with an ordinary heap to investigate the performance gains from the Quasi-heap, and Section 8.4 presents several sets of experiments evaluating the effectiveness of our proposed CMM sketch.

8.1 Experimental settings

We employed both synthetic and real datasets in our experiments. The synthetic datasets are generated from Zipfian distributions. Table 2 shows the parameters and their values used in the experiments. (The Zipfian parameter is chosen so that the data is not overly skewed, which would make frequent items very easy to distinguish, while still guaranteeing that the number of frequent items above the threshold is not too small.) We also used two real datasets, Kosarak and Retail, which are widely used in data stream research [22, 27]. The Kosarak dataset is an anonymized click-stream of a Hungarian online news portal. It consists of transactions, each of which contains several items expressed as integers. The Retail dataset contains retail market basket data from an anonymous Belgian store [3]. In our experiments, we process every single item in sequential order. The detailed statistics of the real datasets are listed in Table 3.

Table 2 Parameters
Table 3 Statistics of the real data streams

For fairness, we used common subroutines for similar tasks (e.g., hash tables) to increase comparability and allocated an identical memory budget to all algorithms. Following the FSS algorithm proposed by Homem and Carvalho [16], m_2, the number of counters in the sketch-based algorithms, is almost half of m_1, the number of counters in the counter-based algorithms. The budgeted memory for all algorithms is the same, i.e., 40×m_1 = 40×m_2 + 16×d×w (40 bytes for every counter, 16 bytes for a unit in the sketch structure, where d is the depth and w the width of the sketch). A large number of experiments also show that hash collisions become very rare when d is 4. In all experiments, the default value of m_1 is 800. We verify the performance of the algorithms with respect to:

  • Time: Each algorithm is run 20 times and the average response time is reported.

  • Precision: The fraction of the items identified by the algorithm that are actually frequent.

  • Recall: The fraction of the actual frequent items that the algorithm identifies. In all algorithms, all the frequent items are detected due to the overestimated counts (cf. Theorem 2).

8.2 Experimental results

To verify the scalability of the algorithms, we varied one parameter in each set of experiments while setting the other parameters to their default values. In our experiments, we report the response time and precision of the algorithms as functions of the stream size (n), data skew (z), frequency threshold (ϕ), space consumed, and decay rate (τ) on the three datasets.

Performance Overview

In terms of response time, SSQ outperforms the state-of-the-art solutions, thanks to the proposed Quasi-heap data structure, which saves up to 80% of the response time. In terms of precision, FSSQ performs best among all methods, owing to the sketch structure, which yields a tighter bound on the count estimation error.

Figure 6 shows the response time and precision when varying the stream size. Notably, the response time of all methods increases as the stream size grows. We find that the counter-based algorithms are faster than the sketch-based algorithms; however, they are less accurate than the sketch-based algorithms, as discussed in Section 6.

Figure 6 Varying stream size on the Synthetic dataset

The response time decreases when the data becomes more skewed (cf. Figure 7a). This is because the cost of processing an existing item in the Quasi-heap is lower than that of processing a new item, and the probability of receiving an existing item becomes higher when the data is more skewed. In other words, skewness simplifies the problem, as there are fewer frequent item candidates. Therefore, the response time of all methods decreases as the data becomes more skewed.

Figure 7 Varying data skew on the Synthetic dataset

Figures 8 and 9 show the experiments conducted for ϕ and τ, respectively. Due to space limits, each figure reports only two sets of experiments, since all methods perform similarly on the three datasets. For instance, SSQ is 35–40% faster than TwMinSwap on average across the tested values of the frequency threshold ϕ and the decay rate τ on all three datasets. In Figure 9, we observe that the response time and precision remain almost unchanged as the decay parameter changes. There should not be much dependence on this parameter, since the underlying problem is the same; only the input weights are implicitly modified.

Figure 8 Varying frequency threshold on the Synthetic dataset (a and b) and the Kosarak dataset (c and d)

Figure 9 Varying decay rate on the Kosarak dataset (a and b) and the Retail dataset (c and d)

Figure 10 shows that the precision increases significantly when more space is available, and that the sketch-based algorithms are more accurate than the counter-based algorithms. For example, in Figure 10b, SSQ and TwMinSwap find 86% of the true frequent items when the space is set to 40,000 bytes, while FSSQ and λ-HCount find 99% of the true frequent items with the same memory budget.

Figure 10 Varying memory space on the Synthetic dataset (a and b) and the Retail dataset (c and d)

8.3 Effectiveness of Quasi-heap

We also performed an experiment to investigate the advantage of the Quasi-heap (SSQ) over the ordinary heap (SS). We report the number of comparisons performed in the heap and the response time.

Figure 11a shows the performance of the Quasi-heap versus the ordinary heap. For example, when the stream size is 10^6 and the data skew is 1.5, the number of comparisons in the Quasi-heap is 278K, which is 35% smaller than the 429K in the ordinary heap. The advantage of the Quasi-heap becomes more significant as the data becomes more skewed. Figure 11c shows that the number of comparisons in the Quasi-heap is only 2K when the data skew is set to 3.0, which is 118 times smaller than the 236K in the ordinary heap. From Figure 11b and d, we find that the Quasi-heap reduces the response time by up to 80%, since a huge number of unpromising comparisons are delayed by the delayedSorting method.

Figure 11 The number of comparisons when varying data skew in the ordinary heap and the Quasi-heap, respectively

8.4 Effectiveness of CMM sketch

We theoretically analyzed the estimation errors of our improved count-min-min (CMM) sketch and the count-min (CM) sketch in Lemma 6 and Lemma 5, respectively. Next, we verify them with a set of experiments. For a direct comparison, we use the CM sketch and the CMM sketch to estimate the count of an item. The sketch depth is set to d=4 and the width to w=2/ϕ, based on the analysis of the CM sketch [11].

Firstly, we define the estimation error rate of these two sketches by the following equation:

$$ EER = \frac{Estimated~count~by~sketch - True~count}{True~count}. $$
(2)

Next, we define the reduced error rate by our CMM sketch, compared to CM sketch by the following equation:

$$ EER^{\prime} = \frac{Estimated~count~by~CM - Estimated~count~by~CMM}{Estimated~count~by~CM}. $$
(3)

We used the CM sketch and the CMM sketch to estimate the count of an item; the estimated counts and error rates on the Retail and Kosarak datasets are shown in Tables 4 and 5, respectively. Notably, the counts estimated by our CMM sketch are always lower than those estimated by the CM sketch. For example, in Table 4, the true count of item '1' is 266; the count estimated by the CM sketch is 3695, giving an estimation error rate EER = (3695−266)/266 = 1289%, a vast overestimate. The count estimated by our CMM sketch is only 272, giving EER = (272−266)/266 = 2.25%, which is far more accurate and acceptable in a data stream environment. The estimation error is reduced by EER′ = (3695−272)/3695 = 92.64%, a significant improvement. Moreover, the estimation error rate EER of our CMM sketch sometimes reaches 0.

Table 4 Estimation error of CM sketch and CMM sketch on Retail
Table 5 Estimation error of CM sketch and CMM sketch on Kosarak

In summary, our CMM sketch can estimate the count of an item with very little error, and often with no error at all.

9 Conclusion and future work

In this paper, we focused on the problem of finding frequent items in data streams under a time decay model. To reduce the maintenance cost of an ordinary heap, we proposed a Quasi-heap data structure with a delayed sorting operation and developed two algorithms based on it. To improve the estimation accuracy, we proposed a count-min-min (CMM) sketch based on the well-known count-min sketch. We extensively evaluated our methods on three datasets. Our algorithms with the Quasi-heap reduce the response time by up to 80% compared with ordinary heap solutions, and the proposed CMM sketch can estimate the count of an item almost error-free.

In the future, we intend to study how to extend the proposed approaches to a distributed environment to handle larger-scale data streams, for cases in which a single machine is no longer capable of managing the volume of data and computation.