
1 Introduction

Today, more than ever, there is a critical need for organizations to share data within and across organizational boundaries so that analysts, decision makers and control systems can make effective decisions. However, for analysts and decision makers to produce accurate analyses, make effective decisions and take actions, data must be trustworthy. Therefore, it is critical that data trustworthiness issues, which also include data quality, provenance and lineage, be investigated for organizational data sharing, situation assessment, multi-sensor data integration and numerous other functions to support decision makers and analysts. Almost all application domains that we may think of require the ability to assess data trustworthiness; notable examples include sensor networks (Lim, Moon, & Bertino, 2010; Lim, Ghinita, Bertino, & Kantarcioglu, 2012), social networks (Dai, Rao, Truta, & Bertino, 2012), location-based applications (Dai, Rao, Ghinita, & Bertino, 2011), critical infrastructures, e-health, and peer marking for massive open online courses (MOOCs).

The problem of providing trustworthy data to users and applications is an inherently difficult problem that requires articulated solutions combining different methods and techniques, ranging from iterative filtering (IF) algorithms (Laureti, Moret, Zhang, & Yu, 2006) to semantic integrity and ontology-based reasoning to digital signature techniques, to name just a few. It is, however, important to note that technology has made it possible to collect data from many different, possibly independent, sources. The advent of the Internet of Things (IoT) will further push such capabilities. The availability of multiple observations and data pertaining to the same event or phenomenon in both the cyber space and the physical space represents an important opportunity for methodologies, referred to as data aggregation methodologies, that aim at assessing data trustworthiness by comparing and aggregating such multiple observations. Such methodologies can also include the use of IF algorithms, resulting in iterative data aggregation methodologies. However, a major problem of data aggregation methodologies is that data items representing such observations are often inconsistent. Such inconsistencies arise because of errors, such as human and application errors or sensor calibration errors, or may be the result of deliberate attacks by malicious parties aiming at injecting deceiving information.

The use of provenance techniques may help in addressing such a problem. Provenance tracing makes it possible to trace back the source of a data item and the path that the data item followed in a given system in order to reach the intended recipient. Such provenance information can be used as a factor for assessing data trustworthiness in that it allows one to assign different weights to data items based on the source. An approach that combines IF with provenance has been proposed by Lim et al. (2010) in the context of sensor networks. Such an approach is efficient and effective and has been widely extended. However, a major drawback of such an approach is that it is not robust against collusion attacks. A collusion attack is one in which multiple malicious parties cooperate in order to inject deceiving information. Under such an attack, the data aggregation methodology will assess data as trustworthy when it is not.

The problem of designing data aggregation methodologies that are robust against collusion attacks has been recently addressed by a novel IF methodology by Rezvani, Ignjatovic, Bertino, and Jha (2015). Such a methodology is applicable to both numerical and non-numerical data and, compared with the “classical” IF algorithms of Laureti et al. (2006), Yu, Zhang, Laureti, and Moret (2006) and De Kerchove and Van Dooren (2007, 2008, 2010), greatly improves the numerical stability of data aggregation as well as the robustness against collusion attacks.

In this paper we provide a survey of IF methodologies for assessing data trustworthiness and introduce a research roadmap to guide future research. In what follows, we first survey the methodology by Lim et al. (2010), Laureti et al. (2006), Yu et al. (2006) and De Kerchove and Van Dooren (2007, 2008, 2010) to introduce the basic concepts and IF with provenance. We then show a collusion attack against such methodology and survey the IF methodology by Rezvani et al. (2015). Experimental results show that this methodology is highly effective against collusion attacks. We then discuss relevant research directions and finally outline a few conclusions.

2 Provenance-Based Data Trustworthiness Assessment

A cyclic and provenance-aware trust computation framework was proposed by Lim et al. (2010) in the context of sensor networks. The proposed framework is based on a heuristic that the more trustworthy data a sensor node reports, the higher the node’s trust score is. Moreover, the trustworthiness of a data item depends on the trust scores of the nodes which passed it towards the server node. The nodes through which a data item has been passed in the sensor network represent the provenance of such data item. By taking into account such interdependency relationship between the trustworthiness of data items and sensor nodes, a cyclic trust computation has been proposed in which the trust scores evolve gradually. This framework which we briefly review now can be employed as an online trust computation method. In what follows, we first introduce the network model underlying this framework, and the relevant notions of provenance. We then describe the cyclic framework, and finally report results from the experimental evaluation in Lim et al. (2010).

2.1 Background Notions

A sensor network is represented by m sensor nodes \( n_i \), \( i=1,\dots, m \), with identifier i for node \( n_i \). In such a network, all sensor nodes are responsible for monitoring one event (i.e., nodes report multiple independent observations for one event). The sensor network is modeled as a graph G(N, E), where \( N=\left\{{n}_1,{n}_2,\dots, {n}_m\right\} \) is the set of nodes and \( E=\left\{{e}_{i,j}\right\} \) denotes the set of edges, with \( e_{i,j} \) an edge connecting nodes \( n_i \) and \( n_j \). Figure 1a shows an example of a sensor network. As one can see in this figure, network nodes in N can be categorized into three types according to their roles in the network: a terminal, an intermediate, or a server node.

Fig. 1

Sensor network and data provenance examples. (a) Sensor network example. (b) Simple path example. (c) Tree path example

Definition 1 (Lim et al. (2010)).

A terminal node is a sensing node which generates a data item and sends it to one or more intermediate or server nodes (black filled nodes in Fig. 1a). An intermediate node receives data items from one or more terminal or intermediate nodes and passes them to another intermediate or a server node; it may also perform an aggregation function over the received data items and send the aggregate value to an intermediate or a server node (gray filled nodes in Fig. 1a). A server node (or base station) receives data items and evaluates continuous queries based on those items (white nodes in Fig. 1a).

Without loss of generality, it is assumed that there is only one server node in the network, denoted by \( n_s \). Moreover, a data item d is represented by a single numeric value \( v_d \).

In data management, the provenance concept represents the path of provisioning a data item. The provenance of a data item d, denoted by \( p_d \), records where and how the data item d has been generated and how it has been passed through the sensor network towards the server \( n_s \).

Definition 2 (Lim et al. (2010)).

The provenance \( p_d \) of a data item d is a rooted tree satisfying the following properties: (1) \( p_d \) is a subgraph of the sensor network G(N, E); (2) the root node of \( p_d \) is the server node \( n_s \); and (3) for two nodes \( n_i \) and \( n_j \) of \( p_d \), \( n_i \) is a child of \( n_j \) if and only if \( n_i \) passes the data item d to \( n_j \) through a direct link.

According to the tree nature of the data provenance, intermediate nodes are categorized into two categories: simple and aggregate.

  • A simple node is an intermediate node having only one child. For example, in Fig. 1b every intermediate node is a simple node. Accordingly, a data provenance with only simple nodes can be represented by a simple path and this type of provenance is called a simple provenance.

  • An aggregate node is an intermediate node with more than one child node. Figure 1c shows an intermediate node \( n_i \) which is an aggregate node and generates a new data item d by aggregating multiple data items \( \left[{d}_1,{d}_2,{d}_3,{d}_4\right] \) received from nodes \( \left[{n}_1,{n}_2,{n}_3,{n}_4\right] \) and passes d to the server \( n_s \). A data provenance with at least one aggregate node is represented as a tree rather than a simple path and this provenance is called an aggregate provenance.

As an example of such a sensor network, we can assume that a number of different sensors are distributed in a battlefield to collect the enemy locations (Tang et al., 2010). The sensors continuously watch the areas day and night to detect approaching enemies and send alarms to a server node. Moreover, the sensors use a multihop routing scheme in which each sensor may forward the data of other sensors towards the server node.

2.2 Cyclic Trust Computation Framework

The main idea behind the trust computation approach by Lim et al. (2010) is to model the interdependency relationship between the trustworthiness of data items and their corresponding network nodes (as shown in Fig. 2). As one can see in this figure, the trust scores are assigned to both data items and network nodes, in an interdependent manner. The trust score of a data item is partially measured by the trust scores of the network nodes within its provenance. On the other hand, the trust score of a network node depends on the trustworthiness of data items that are generated by or passed through the node.

Fig. 2

Interdependency between data and node trust scores

Figure 3 shows how the cyclic framework proposed in Lim et al. (2010) uses this interdependency to compute the trust scores of data items and network nodes. As shown in the figure, there are three different types of trust scores, current, intermediate, and next, for every data item and network node. The dashed line has separated the trust computation modules for data items and network nodes; the solid lines are traversed from one computation module to the next one.

Fig. 3

A cyclic framework for computing trust scores

For a set of data items received for the same event in the current window, the methodology by Lim et al. (2010) computes the current and intermediate trust scores for each data item in the first and second steps, respectively. The current trust score for a data item depends on the current trust scores of the nodes in its provenance, while its intermediate trust score is computed based on the latest set of data items reported for the same event in the current streaming window. In the third step, the next trust score for each data item is computed by aggregating the current and intermediate trust scores of the data item.

As shown on the left side of Fig. 3, the intermediate trust score for each network node is calculated based on the trust scores of its related data items (step 4). After that, the next trust score for a network node is obtained by combining its current and intermediate trust scores (step 5). Finally, the next trust scores in the current streaming window are copied to the current scores in the next window (step 6). Note that the cyclic trust computation process needs initial trust scores for sensor nodes, which are set to one for all nodes at the very beginning of the process.

Computing Node Trustworthiness As we described, the current trust score of a network node n, denoted by \( s_n \), is equal to the next trust score obtained in the previous streaming window for that node. Thus, one needs to compute its intermediate and next trust score in the current window, denoted by \( {\hat{s}}_n \) and \( {\bar{s}}_n \), respectively.

The intermediate trust score of a network node n is computed based on the trustworthiness of its corresponding data items, that is, the set of data items that are generated by or passed through the node during the current streaming window, denoted by \( D_n \). The intermediate trust score \( {\hat{s}}_n \) is simply computed as the average of the trustworthiness of its related data items, as follows:

$$\displaystyle{ \hat{s}_{n} = \frac{\sum \nolimits _{d\in D_{n}}\bar{s}_{d}} {\left \vert D_{n}\right \vert }, }$$
(1)

where \( \left|{D}_n\right| \) is the number of data items in the set \( D_n \), and \( {\bar{s}}_d \) indicates the current trust score of data item d obtained in the first step of the proposed trust computation framework (see ➀ in Fig. 3).

As we described, the next trust score of a network node is computed by the aggregation of its current and intermediate trust scores (see ➄ in Fig. 3). These trust scores are aggregated using a weighted sum as follows:

$$ {\bar{s}}_n={c}_n{s}_n+\left(1-{c}_n\right){\hat{s}}_n $$
(2)

where \( c_n \), \( 0\le c_n\le 1 \), is a constant which represents the relative impacts of trustworthiness from the current streaming window versus the previous one. In other words, if \( c_n \) is small, the trust scores of network nodes can change fast; if \( c_n \) is large, the trust scores will change more slowly from one window to the next.
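As a minimal sketch, Eqs. (1) and (2) can be written in Python as follows. The function names and the plain-list representation of the scores for \( D_n \) are illustrative choices, not part of the framework of Lim et al. (2010).

```python
from typing import Sequence


def intermediate_node_score(data_scores: Sequence[float]) -> float:
    """Eq. (1): average of the trust scores of the data items in D_n."""
    if not data_scores:
        raise ValueError("D_n must contain at least one data item")
    return sum(data_scores) / len(data_scores)


def next_node_score(current: float, intermediate: float, c_n: float) -> float:
    """Eq. (2): weighted sum of the current and intermediate node scores."""
    if not 0.0 <= c_n <= 1.0:
        raise ValueError("c_n must lie in [0, 1]")
    return c_n * current + (1.0 - c_n) * intermediate
```

For instance, a node with current score 1 whose data items score 0.2, 0.4 and 0.6 has an intermediate score of about 0.4; with \( c_n = 0.5 \) its next score is about 0.7, while \( c_n = 1 \) would leave the score unchanged.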

Computing Data Trustworthiness The trustworthiness of a data item d depends on its value \( v_d \) and provenance \( p_d \). Moreover, there are three trust scores for a data item d: the current, the intermediate, and the next scores, denoted by \( s_d \), \( {\hat{s}}_d \), and \( {\bar{s}}_d \), respectively.

Current Trust Score \( s_d \) The current trust score of a data item d is obtained by aggregating the current trust scores of the nodes within its provenance. In the proposed approach, the minimum of the current scores of the nodes in \( p_d \) is used as the current trust score. The rationale is that the trustworthiness of a data item is dominated by the least trustworthy node among all nodes through which the data item has passed.

If the data item d has a simple provenance, the current trust score \( s_d \) is simply computed as the minimum of the current trust scores of the nodes in \( p_d \). However, when the data item has an aggregate provenance, the nodes with more than one child in \( p_d \) must be taken into account. To address this problem, the average of the current trust scores of the child nodes is used as their aggregate score. Therefore, these child nodes can be considered as a single child node with a trust score equal to the average of the original child nodes. Using this method, an aggregate provenance is reduced to a simple provenance for the trust computation.
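The reduction described above can be sketched as a short recursion over the provenance tree. The class `ProvNode` and the function `current_data_score` are hypothetical names introduced for illustration. The sketch also assumes that, when the children of an aggregate node have subtrees of their own, each child is first reduced to a single score before averaging; the text above only specifies the averaging of the child nodes themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProvNode:
    """A node of the provenance tree p_d; the root is the server node."""
    node_id: str
    children: List["ProvNode"] = field(default_factory=list)


def current_data_score(root: ProvNode, node_scores: Dict[str, float]) -> float:
    """Current trust score s_d: minimum along the (collapsed) provenance."""
    own = node_scores[root.node_id]
    if not root.children:
        return own  # terminal node: nothing below it
    # Collapse all children into one virtual child carrying their average score
    # (assumption: each child's subtree is reduced recursively first).
    child_scores = [current_data_score(c, node_scores) for c in root.children]
    collapsed = sum(child_scores) / len(child_scores)
    return min(own, collapsed)
```

On a simple provenance every node has one child, so the recursion returns the minimum score on the path. For a tree shaped like Fig. 1c, with illustrative scores 0.8, 0.6, 1.0 and 0.4 for the four terminal nodes, 0.9 for the aggregate node and 1.0 for the server, the children collapse to about 0.7, which is also the resulting score.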

Intermediate Trust Score \( {\hat{s}}_d \) An intermediate trust score of data item d, denoted by \( {\hat{s}}_d \), is computed based on its data value similarity and its provenance similarity with other data items reported for the same event. It is assumed that D is the set of data items reported for the same event as d.

In order to compute the value similarity for a data item d with value \( v_d \), the proposed approach uses the assumption that the data values in D are normally distributed with mean μ and variance \( \sigma^2 \). Therefore, the cumulative probability of the normal distribution is employed to compute the similarity of the data value \( v_d \) with the other values within D. Basically, the computation gives high trust scores to the values close to the mean. Thus, writing f for the probability density function of this normal distribution, the initial \( {\hat{s}}_d \) is computed as follows:

$$ {\hat{s}}_d=2\int_{v_d}^{\infty }f(x)\,dx $$
(3)

As shown in Fig. 4a, the shaded area represents the trust score \( {\hat{s}}_d \) obtained from Eq. (3). So far, this intermediate trust score considers only the data value similarity. Thus, the computation must be adjusted to reflect the provenance similarity of the data item as well. The impact of provenance similarity on the trust score is determined from some intuitive observations, listed in Table 1. For example, different provenances of similar data values may increase the trustworthiness of data items. Accordingly, a normalized adjustable similarity value is defined for the similarities of the provenance of a data item d with all other data items in D, denoted by \( {\bar{\rho}}_d \). More details can be found in a previous work on provenance-based trustworthiness assessment (Lim et al., 2010).

Fig. 4

Computing the intermediate trust score \( {\hat{s}}_d \). (a) Intermediate trust score. (b) Intermediate trust score adjusted with provenance

Table 1 Impact of provenance similarity on adjusting \( {\hat{s}}_d \)

The adjusted similarity value \( {\bar{\rho}}_d \) reflects the impact of the provenance \( p_d \) on the trust computation of the data item d. Thus, it is used to adjust the data value \( v_d \) to a new value \( {\bar{v}}_d \) as follows:

$$ {\bar{v}}_d= \min \left\{{v}_d-{\bar{\rho}}_d\left({c}_p\cdot \sigma \right),\mu \right\} $$
(4)

where \( c_p \) is a constant value greater than 0.

Now, the data value \( v_d \) in Eq. (3) is replaced by \( {\bar{v}}_d \) to adjust the intermediate trust computation for data item d. Thus,

$$ {\hat{s}}_d=2\int_{{\bar{v}}_d}^{\infty }f(x)\,dx=1-\int_{2\mu -{\bar{v}}_d}^{{\bar{v}}_d}f(x)\,dx $$
(5)

Figure 4b shows how the adjusted similarity value \( {\bar{\rho}}_d \) affects the value similarity computation.
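For a normal density f, the integral in Eqs. (3) and (5) has a closed form in terms of the complementary error function: \( 2\int_v^{\infty} f(x)\,dx = \mathrm{erfc}\left((v-\mu)/(\sigma\sqrt{2})\right) \) for \( v\ge \mu \). The sketch below evaluates this expression for a given (possibly already adjusted) value. The function name is hypothetical, and the use of \( |v-\mu| \) to treat values below the mean symmetrically is an assumption of this sketch, not something stated in the text; the provenance adjustment of Eq. (4) is deliberately left out.

```python
import math


def two_sided_tail_score(v: float, mu: float, sigma: float) -> float:
    """Value-similarity score 2 * P(X >= v) for X ~ N(mu, sigma^2), v >= mu.

    Assumption of this sketch: values below the mean are handled by symmetry,
    i.e. the score depends only on |v - mu|.
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    z = abs(v - mu) / (sigma * math.sqrt(2.0))
    return math.erfc(z)
```

A value equal to the mean receives the maximum score of 1, and the score decays as the value moves away from the mean; one standard deviation away it is roughly 0.317.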

Next Trust Score \( {\bar{s}}_d \) After computing the current and intermediate trust scores for a data item d, a weighted sum of these two trust values is used to compute the next trust score of the data item, denoted by \( {\bar{s}}_d \) (see ➁ in Fig. 3). Thus,

$$ {\bar{s}}_d={c}_d{s}_d+\left(1-{c}_d\right){\hat{s}}_d $$
(6)

where \( c_d \) is a constant, \( 0\le {c}_d\le 1 \), which defines how fast the data trustworthiness evolves as the cycle is repeated.

2.3 Experimental Evaluation

In this section, we briefly summarize the evaluation results from Lim et al. (2010) concerning the effectiveness of the proposed trust computation approach. The experiments were conducted by simulating sensor networks and generating synthetic data. To observe the impact of provenance similarity, an interleaving factor was defined as the interval between the leaf nodes assigned to generate data items in the simulated sensor network. In order to evaluate the effectiveness of the proposed solution, Lim et al. (2010) simulated the injection of false data items into the network and investigated how the proposed cyclic approach reflects this situation in the computation of the trust scores.

Figure 5a (from Lim et al., 2010) shows that when the false data items are injected, the trust scores change rapidly for smaller interleaving factors. This can be explained by the principle that different values with similar provenances rapidly reduce the trust scores (see Table 1). On the other hand, one can see in Fig. 5b that when the correct data items are injected again, the trust scores are increased more rapidly for larger interleaving factors. The reason is that similar values with different provenances result in a large positive effect (see Table 1).

Fig. 5

Change of the trust scores for false data items. (a) With false data items, (b) with trustworthy data items

2.4 Summary

This concludes our brief summary of the cyclic trust computation framework proposed in Lim et al. (2010). In Lim et al. (2012) the authors have proposed a game-theoretical defence strategy to protect sensor nodes from attacks and to guarantee a higher level of trustworthiness for sensed data. However, such an approach can be compromised by collusive (collaborative) attacks which target the sample mean and variance of the data. In Sect. 4 we demonstrate this and then propose a safer solution based on IF algorithms.

3 IF Algorithms of Laureti et al. and De Kerchove et al.

A relevant class of algorithms for the assessment of information trustworthiness is represented by the iterative filtering (IF) algorithms. Pioneering algorithms of this kind were first proposed by Laureti, Moret, Zhang and Yu in their papers appearing in 2006 (Laureti et al., 2006; Yu et al., 2006). Their work motivated the subsequent work of C. De Kerchove and P. Van Dooren (De Kerchove & Van Dooren, 2007, 2008, 2010). Independently, Ignjatovic rediscovered IF algorithms in 2007 (published in 2008; Ignjatovic, Foo, & Lee, 2008) and later introduced other novel algorithms in Lee et al. (2009), Lee, Rodrigues, Kazai, Ignjatovic, and Milic-Frayling (2009), Ignjatovic, Lee, Compton, Cutay, and Guo (2009), and Chou, Ignjatovic, and Hu (2013).

The aims of IF-based data aggregation methodologies should be

  1. to provide an aggregate value with a provably minimal variance due to stochastic errors of the sources;

  2. to ensure robustness against non-stochastic errors ranging from hardware faults to collusion attacks from some of the sources, with provable estimates of the level of robustness in terms of the fraction of misbehaving sources.

Moreover, such methodologies should be applicable to both numerical and non-numerical data.

We now explain the essence of IF algorithms using an example of a conference Chair. While such a problem is clearly not among the most pressing ones in the area of data aggregation, its familiarity to the reader makes it a very convenient example to explain both the challenges and our methods.

Let us assume that you are the Chair of a conference, and your referees have done their job: each paper has been reviewed by several referees, every referee has reviewed several papers, and you have received the scores. However, you suspect that some of the referees might have been unreasonably harsh with their marks; some others might have been sloppy, barely having looked at the papers and thus likely to have made large random errors. Worse, you are worried that some of your referees might have colluded in order to promote the papers of their friends and trash the papers of those against whom they might hold grudges. How should you aggregate the referees’ scores and decide which papers to accept in the fairest possible way?

To analyze such a problem, let us assume that there are R referees marking P submitted papers, and, for the sake of simplicity of the formulation, let us assume an unusual situation in which each referee marks every single paper. We denote by M(r, p) the mark given by a referee r to a paper p. The main feature shared by most IF algorithms is that they simultaneously produce approximations of the final aggregate values \( \overrightarrow{\mu}=\left\langle \mu (p):1\le p\le P\right\rangle \) (in the present case marks of papers) as well as trustworthiness ranks for the sources \( \overrightarrow{\tau}=\left\langle \tau (r):1\le r\le R\right\rangle \) (in this case referees), in a single iterative procedure.

An IF algorithm would typically start by giving all referees the same initial trustworthiness \( {\tau}^{(0)}(r)=1 \) and obtain the initial approximation of the aggregate mark for each paper p as the simple mean of the marks of all referees, \( {\mu}^{(0)}(p)=\sum_{r=1}^{R}M\left(r,p\right)/R \). Now, in turn, each referee can be judged on how accurate her marks are, by computing how close her marks are to such an initial approximation of the aggregate marks \( {\overrightarrow{\mu}}^{(0)} \). Thus, we compute for each referee r the Euclidean distance \( {d}^{(0)}(r)=\sqrt{\sum_{p=1}^{P}{\left(M\left(r,p\right)-{\mu}^{(0)}(p)\right)}^2} \) between her marks \( \langle M\left(r,p\right) : 1\le p\le P\rangle \) and the aggregate values \( {\overrightarrow{\mu}}^{(0)}=\left\langle {\mu}^{(0)}(p):1\le p\le P\right\rangle \).

Since the trustworthiness of each referee should be inversely related to her distance (or deviation) \( {d}^{(0)}(r) \), we pick a monotonically decreasing penalty function F(d) and define the new estimate of trustworthiness of referee r as \( {\tau}^{(1)}(r)=F\left({d}^{(0)}(r)\right) \). In the next round of iteration we obtain a new estimate \( {\overrightarrow{\mu}}^{(1)} \) of the marks of papers as a weighted average of the marks of all referees, with the marks of a referee r taken with a weight \( {w}^{(1)}(r) \) proportional to the referee’s trustworthiness \( {\tau}^{(1)}(r) \). In this way the outliers will be penalized, because their distance to the coarse, initial approximation \( {\overrightarrow{\mu}}^{(0)} \) of the aggregate marks will be the largest and thus their trustworthiness and corresponding weight the smallest (but no outlier is ever completely excluded!). This process is iterated until it has, hopefully, converged, i.e., for a given precision threshold \( \varepsilon \),

while \( \sqrt{\sum_{1\le p\le P}{\left({\mu}^{(n+1)}(p)-{\mu}^{(n)}(p)\right)}^2}>\varepsilon \) repeat:

$$ {d}^{(n)}(r)=\sqrt{\sum_{1\le p\le P}{\left(M(r,p)-{\mu}^{(n)}(p)\right)}^2}; \qquad \text{- computing the distance between } r\text{'s marks and the estimate } {\overrightarrow{\mu}}^{(n)} $$
(7)
$$ {\tau}^{(n+1)}(r)=F\left({d}^{(n)}(r)\right); \qquad \text{- computing the new trustworthiness of } r $$
(8)
$$ {w}^{(n+1)}(r)=\frac{{\tau}^{(n+1)}(r)}{\sum_{1\le {r}^{\prime}\le R}{\tau}^{(n+1)}({r}^{\prime})}; \qquad \text{- computing } r\text{'s weight by normalising } r\text{'s trustworthiness} $$
(9)
$$ {\mu}^{(n+1)}(p)=\sum_{1\le r\le R}{w}^{(n+1)}(r)\,M(r,p); \qquad \text{- computing the new estimate of the marks } \overrightarrow{\mu} $$
(10)

When such an iteration terminates after, say, t rounds, we get not only the aggregate values of the marks of papers \( \mu (p)={\mu}^{(t)}(p) \) but also an estimate of the trustworthiness of the referees \( \tau (r)={\tau}^{(t)}(r) \). As we will see, choosing “the best” function F(x) which provides an inverse relationship between distances and trustworthiness ranks is a tricky problem; the most commonly used functions are:

$$ \begin{array}{ccc}(i)\kern0.75em F\left(d(r)\right)=\frac{1}{d^2(r)};& (ii)\kern0.75em F\left(d(r)\right)={e}^{-d(r)};& (iii)\kern0.75em F\left(d(r)\right)=1-k\cdot d(r),\end{array} $$

where k appearing in the third function is allowed to be different for each round of iteration, and is chosen so that if \( {r}^{\prime } \) is the referee with the largest (square of the) distance \( {d}^{(n)}\left({r}^{\prime}\right) \), then \( F\left({d}^{(n)}\left({r}^{\prime}\right)\right)=0 \). We now briefly discuss the performance of the above algorithm with the first, reciprocal penalty function; other choices suffer from their own problems.
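The iteration of Eqs. (7)-(10) can be sketched in a few lines of Python. The names `iterative_filtering`, `reciprocal_penalty` and `exponential_penalty`, the `max_rounds` safeguard and the list-of-lists layout `marks[r][p]` for M(r, p) are assumptions made for illustration. Note that the reciprocal penalty is undefined at d = 0, so the sketch simply lets Python raise an error in that case rather than patching the function.

```python
import math
from typing import Callable, List, Tuple


def reciprocal_penalty(d: float) -> float:
    """Penalty (i): F(d) = 1/d^2; raises ZeroDivisionError at d == 0."""
    return 1.0 / (d * d)


def exponential_penalty(d: float) -> float:
    """Penalty (ii): F(d) = exp(-d)."""
    return math.exp(-d)


def iterative_filtering(
    marks: List[List[float]],
    penalty: Callable[[float], float],
    eps: float = 1e-9,
    max_rounds: int = 1000,  # safeguard added in this sketch only
) -> Tuple[List[float], List[float]]:
    """Return (aggregate marks mu, trustworthiness tau) for marks[r][p]."""
    R, P = len(marks), len(marks[0])
    # Initial approximation: simple mean over all referees.
    mu = [sum(marks[r][p] for r in range(R)) / R for p in range(P)]
    tau = [1.0] * R
    for _ in range(max_rounds):
        # Eq. (7): distance of each referee from the current estimate.
        d = [math.sqrt(sum((marks[r][p] - mu[p]) ** 2 for p in range(P)))
             for r in range(R)]
        # Eq. (8): new trustworthiness; Eq. (9): normalised weights.
        tau = [penalty(x) for x in d]
        total = sum(tau)
        w = [t / total for t in tau]
        # Eq. (10): new weighted estimate of the marks.
        new_mu = [sum(w[r] * marks[r][p] for r in range(R)) for p in range(P)]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_mu, mu)))
        mu = new_mu
        if change <= eps:
            break
    return mu, tau
```

With three referees who roughly agree on marks near (5, 6, 7) and a fourth who reports (9, 9, 9), running the sketch with the exponential penalty assigns the fourth referee the lowest trustworthiness and moves the aggregate marks from the simple mean back towards the consensus.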

If (in a simulation experiment) each referee produces true marks plus some independent Gaussian noise with no bias and with variance v r , then the performance of the above algorithm depends on the distribution of the variances v r of the referees. For some distributions the algorithm produces an unbiased estimate of the true values with a variance which is remarkably low and essentially equal to the lowest possible variance as dictated by Information Theory, reaching the Cramer-Rao lower bound (CRLB). Note that in such a case the Maximum Likelihood Estimator (MLE) also reaches the CRLB; however, unlike the MLE, the above algorithm does not require prior knowledge of the variances of the referees; in fact, this particular form of the algorithm with the reciprocal function can be seen as alternating between estimations of variances of the referees (step 7) and applications of MLE with such estimated approximate variances (step 10).

4 Collusion Attacks

Although the above IF algorithm exhibits better robustness compared to the simple averaging techniques, for some distributions of variances the performance of this algorithm is very bad, with the algorithm producing an estimate of the true marks equal to the marks assigned by one of the referees. The reason for such a behavior is that the penalty function \( F(d)=1/{d}^2 \) has a pole at d = 0, and thus the marks of referees act as attractors for the iterative procedure: if in the process of iteration the estimated marks get sufficiently close to the marks of any particular referee, the iterative procedure converges in only a few additional steps to the marks provided by that particular referee.

Worse, as shown in Rezvani, Ignjatovic, Bertino, and Jha (2013) and Rezvani et al. (2015), such behavior makes the algorithm extremely vulnerable to a collusion attack. Assume that there are R referees among whom C are colluders. The colluders first do their best to estimate the true marks \( t_p \); then C − 1 of them report heavily skewed marks \( s_p \) while the last colluder reports values \( \left(\left(R-C+1\right){t}_p+\left(C-1\right){s}_p\right)/\left(R-1\right) \) as his marks. In such a case the first iteration of the procedure, which takes the mean of all marks, is very likely to produce aggregate marks very close to the marks proposed by the last attacker, causing the algorithm to quickly converge to the marks of the last attacker, whose marks are still considerably skewed.
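The set-up of this attack can be illustrated with a small, noise-free sketch. It assumes, for simplicity, that every honest referee reports exactly the true marks and that the colluders know them; the function names are hypothetical. The last colluder's mark follows the expression given above.

```python
import math
from typing import List


def last_colluder_mark(t_p: float, s_p: float, R: int, C: int) -> float:
    """Mark reported by the last colluder, using the expression in the text."""
    return ((R - C + 1) * t_p + (C - 1) * s_p) / (R - 1)


def build_marks(true_marks: List[float], skewed_marks: List[float],
                R: int, C: int) -> List[List[float]]:
    """R - C honest rows, C - 1 skewed rows, then the last colluder's row."""
    honest = [list(true_marks) for _ in range(R - C)]  # assumption: no noise
    skewed = [list(skewed_marks) for _ in range(C - 1)]
    last = [last_colluder_mark(t, s, R, C)
            for t, s in zip(true_marks, skewed_marks)]
    return honest + skewed + [last]


def distances_to_mean(marks: List[List[float]]) -> List[float]:
    """Euclidean distance of every referee from the simple mean of all marks."""
    R = len(marks)
    mean = [sum(col) / R for col in zip(*marks)]
    return [math.sqrt(sum((m - mu) ** 2 for m, mu in zip(row, mean)))
            for row in marks]
```

For example, with R = 10, C = 3, true marks (5, 6) and skewed marks (9, 9), the last colluder turns out to be nearer to the initial mean than any honest referee, so the first trust update already favours his marks.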

5 Data Aggregation with Protection from Collusions

In order to overcome such instability of the above IF algorithm and make it applicable to compressive sensing in wireless sensor networks in the presence of sensor faults, Chou et al. (2013) proposed modifying the penalty function by adding a small regularisation constant a > 0, defining \( F(d)=1/\left({d}^2+a\right) \). While this does make the algorithm more robust, it also has a serious drawback: if a is sufficiently large to make the algorithm stable, then the values returned by the algorithm might not differ significantly from the simple mean of the marks of all sources.
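The trade-off just described can be seen directly in the normalised weights. The following sketch, with hypothetical function names, computes the weights produced by the regularised penalty for a fixed set of distances.

```python
from typing import List


def regularised_penalty(d: float, a: float) -> float:
    """F(d) = 1 / (d^2 + a) with regularisation constant a > 0."""
    if a <= 0.0:
        raise ValueError("a must be positive")
    return 1.0 / (d * d + a)


def normalised_weights(distances: List[float], a: float) -> List[float]:
    """Weights proportional to the regularised penalty of each distance."""
    tau = [regularised_penalty(d, a) for d in distances]
    total = sum(tau)
    return [t / total for t in tau]
```

For distances 0, 1 and 3, a small constant such as a = 0.01 gives almost all the weight to the source at distance 0, whereas a large constant such as a = 100 yields nearly equal weights, so the weighted average approaches the simple mean.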

In trying to solve this problem in a more satisfactory manner, Rezvani et al. (2015) have proposed a better way to provide an initial approximation \( {\overrightarrow{\mu}}^{(0)} \). Clearly, without knowing the true values, the algorithm cannot determine the error of each source; however, denoting again the true value of item p (in our example the true mark of a paper p) as \( t_p \), we have that for every pair of sources \( r_1, r_2 \) (in the above example referees),

$$ \begin{array}{rcl}\displaystyle \sum_{1\le p\le P}\frac{{(M({r}_1,p)-M({r}_2,p))}^2}{P}& =& \displaystyle \sum_{1\le p\le P}\frac{{((M({r}_1,p)-{t}_p)-(M({r}_2,p)-{t}_p))}^2}{P}\\ {}& =& \displaystyle \sum_{1\le p\le P}\frac{{(M({r}_1,p)-{t}_p)}^2}{P}+\sum_{1\le p\le P}\frac{{(M({r}_2,p)-{t}_p)}^2}{P}\\ {}& & \displaystyle {}-2\sum_{1\le p\le P}\frac{(M({r}_1,p)-{t}_p)(M({r}_2,p)-{t}_p)}{P}.\end{array} $$
(11)

The first two terms after the second equality are estimators for the variances \( {v}_{r_1} \) and \( {v}_{r_2} \), and, assuming that the errors of the sources are reasonably uncorrelated, the last (cross) term should be small. In this way we obtain \( \frac{1}{P}\sum_{1\le p\le P}{\left(M({r}_1,p)-M({r}_2,p)\right)}^2\approx {v}_{r_1}+{v}_{r_2} \), which results in \( R\left(R-1\right)/2 \) equations in R variables \( {v}_1,{v}_2,\dots, {v}_R \) that can be solved in the least squares sense. We can now take as the initial approximation \( {\overrightarrow{\mu}}^{(0)} \) of the marks the MLE estimate with the obtained approximations of the variances \( v_r \), i.e.,

$$ {\mu}^{(0)}(p)=\frac{\displaystyle \sum_{1\le r\le R}\frac{M\left(r,p\right)}{v_r}}{\displaystyle \sum_{1\le r\le R}\frac{1}{v_r}}. $$
(12)
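The initialisation just described can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of Rezvani et al. (2015): the function name `initial_approximation`, the array layout (row r of `M` holds the values reported by source r) and the small positive `floor` applied to the least-squares solution are ours. The floor is needed because an unconstrained least-squares solution is not guaranteed to be positive. At least three sources are assumed, so that the pairwise system determines the variances.

```python
import numpy as np
from itertools import combinations

def initial_approximation(M, floor=1e-9):
    """Return (v, mu0) for a matrix M of shape (R, P) of reported values."""
    M = np.asarray(M, dtype=float)
    R, P = M.shape
    pairs = list(combinations(range(R), 2))   # R(R-1)/2 pairs of sources
    A = np.zeros((len(pairs), R))
    b = np.empty(len(pairs))
    for row, (r1, r2) in enumerate(pairs):
        A[row, [r1, r2]] = 1.0                # equation: v[r1] + v[r2] = b[row]
        b[row] = np.mean((M[r1] - M[r2]) ** 2)
    v, *_ = np.linalg.lstsq(A, b, rcond=None) # least-squares variance estimates
    v = np.maximum(v, floor)                  # added safeguard (our assumption)
    w = 1.0 / v
    mu0 = (w[:, None] * M).sum(axis=0) / w.sum()   # Eq. (12)
    return v, mu0
```

On simulated data with independent Gaussian errors, the recovered variances can be compared with the ones used to generate the noise, and `mu0` with the simple mean of the rows of `M`.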

Remarkably, experiments have demonstrated that, even when the errors are significantly correlated, such initial value dramatically improves the stability of the algorithm without any sacrifice in performance. It also improves its robustness against a collusion attack, because the attackers have no way of estimating the variances of other referees (Rezvani et al., 2015). However, in general, the above algorithm can have several fixed points (de Kerchove & Van Dooren, 2010); for that reason, since it does not provide a unique solution, it is not suitable for a real life deployment. Moreover, the algorithm has another serious drawback: it is not applicable to non-numerical data because it crucially depends on using a distance function, d(r).

For that reason the present authors have looked for IF algorithms which are both provably convergent and also applicable to non-numerical data. This was partly addressed by Allahbakhsh and Ignjatovic (2015), Allahbakhsh et al. (2015), Allahbakhsh, Ignjatovic, Benatallah, and Motahari-Nezhad (2013) by altering the main feature of the previously introduced IF algorithms, namely by separating the process of assessment of the trustworthiness of the sources from the actual data aggregation process. We explain the main idea using a Q&A website example.

At a typical Q&A website each question is open for new answers for a certain period of time, say 30 days, before the question is closed; users are allowed to vote for the best answer to a particular question for an additional period of time, say 10 days, before the votes are counted and the best answer is declared. In general, there are other, concurrently open questions on the same topic and, as can easily be observed on such websites, users with the same interest tend to vote for the best answer to a number of questions in the same field, open during the past 30 days or so. For that reason, the following policy of such a social website would not be very restrictive: only the votes of members who are “active” at the time are taken into account, and a member is considered active if he or she has cast a vote for the best answer to a certain number of questions Q > 1 which were recently closed. This gives an opportunity to make vote aggregation significantly more robust by deciding simultaneously which are the best answers to all questions which have been recently closed, using the following algorithm proposed in Allahbakhsh and Ignjatovic (2015), Allahbakhsh et al. (2015), Allahbakhsh et al. (2013) by the present CI and his student.

Assume that there are Q recently closed questions; for each question \( {q}_i \) we have a corresponding list \( {\varLambda}_i \) of \( {n}_i \) answers, \( {\varLambda}_i=\langle a\left(i,1\right),a\left(i,2\right),\dots, a\left(i,{n}_i\right)\rangle \). We also assume that there are V voters \( {v}_1,{v}_2,\dots, {v}_V \). Again, for simplicity of presentation, we assume that each voter has chosen a best answer for every question; for a sparse pattern of votes all quantities involved can be appropriately normalized, according to the total number of questions for which each voter has participated in choosing the best answer, see Allahbakhsh and Ignjatovic (2015), Allahbakhsh et al. (2013), Allahbakhsh et al. (2015). The algorithm for vote aggregation is again iterative, and it simultaneously evaluates the ratings ρ(i, k) of all answers to each question posed in the given interval of time as well as the trustworthiness τ(m) of each voter \( {v}_m \) who participated in voting during that period of time, in the following manner:

Let p be a real number, p ≥ 1, and let us denote by \( m\to i,k \) the fact that voter \( {v}_m \) has voted for the answer a(i, k) as the best answer to question \( {q}_i \). In the initial round of iteration, for each question \( {q}_i \) and all of its answers \( a\left(i,k\right),\ 1\le k\le {n}_i, \) we simply count the number ν(i, k) of votes which a(i, k) has received. We now obtain the initial ranks of answers as the normalized number of votes, \( {\rho}^{(0)}\left(i,k\right)=\nu \left(i,k\right)/\sqrt{{\displaystyle \sum_{1\le j\le {n}_i}}\nu {\left(i,j\right)}^2} \); thus, for all answers a(i, k) to a question \( {q}_i \) we have \( {\displaystyle \sum_{1\le k\le {n}_i}}{\rho}^{(0)}{\left(i,k\right)}^2=1 \). We are now again in a position to judge for every voter \( {v}_m \) how good their choices are, namely, to what degree their voting is in agreement with the community sentiment, and to assign to them an initial trustworthiness \( {\tau}^{(0)}(m)={\displaystyle \sum_{1\le i\le Q}}\left\{{\rho}^{(0)}\left(i,k\right):m\to i,k\right\}, \) which is simply the sum of the normalized numbers of votes received by all the answers which they voted for. Clearly, a voter \( {v}_m \) will get a large initial trustworthiness only if they have chosen answers which many other community members have also chosen. In the next round of iteration of our vote aggregation procedure not every vote has an equal value; instead, its value depends on the trustworthiness of the voter. Thus, at every consecutive stage of iteration n + 1 we have:

$$ \begin{array}{l}{\tau}^{(n+1)}(m)={\displaystyle \sum_{1\le i\le Q}}\left\{{\rho}^{(n)}(i,k):m\to i,k\right\};\\ {}\hbox{-}\ \mathrm{computing}\ \mathrm{the}\ \mathrm{trustworthiness}\ \mathrm{of}\ \mathrm{voter}\ {v}_m\\ {}{\rho}^{(n+1)}(i,k)=\frac{{\displaystyle \sum_{m\,:\,m\to i,k}}{\left({\tau}^{(n+1)}(m)\right)}^p}{\sqrt{{\displaystyle \sum_{1\le j\le {n}_i}}{\left({\displaystyle \sum_{m\,:\,m\to i,j}}{\left({\tau}^{(n+1)}(m)\right)}^p\right)}^2}};\\ {}\hbox{-}\ \mathrm{computing}\ \mathrm{the}\ \mathrm{new}\ \mathrm{rank}\ \mathrm{of}\ \mathrm{answer}\ a(i,k)\ \mathrm{to}\ \mathrm{question}\ {q}_i\end{array} $$

iterating until \( {\displaystyle {\sum}_{1\le m\le V}{\left({\tau}^{\left(n+1\right)}(m)-{\tau}^{(n)}(m)\right)}^2}<\varepsilon. \) We note that the purpose of the denominator in the expression for \( {\rho}^{\left(n+1\right)}\left(i,k\right) \) is a normalization which keeps the iteration stable and allows an elegant convergence proof by ensuring that at every stage of iteration \( {\displaystyle {\sum}_{1\le k\le {n}_i}{\rho}^{(n)}}{\left(i,k\right)}^2=1 \), see Allahbakhsh and Ignjatovic (2015). The parameter p controls filtering; the larger the value of p the more the algorithm is robust against collusion attacks, but larger values also increasingly marginalize honest voters who do not vote entirely in accordance with the prevailing sentiment of the community.
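The iteration can be sketched in a few lines. The sketch below is only an illustration under stated assumptions, not the code of Allahbakhsh et al.: the name `aggregate_votes`, the encoding of the votes as an integer array (`votes[m, i]` is the index k of the answer that voter m chose for question i) and the defaults for `p`, `eps` and `max_iter` are ours. Every voter is assumed to have voted on every question, as in the text.

```python
import numpy as np

def aggregate_votes(votes, n_answers, p=2.0, eps=1e-12, max_iter=1000):
    """Return (rho, tau): per-question answer ranks and voter trustworthiness."""
    votes = np.asarray(votes)
    V, Q = votes.shape

    def normalise(scores):
        # Scale each question's scores so that their squares sum to 1.
        return [s / np.sqrt((s ** 2).sum()) for s in scores]

    # Initial ranks: normalised vote counts.
    rho = normalise([np.bincount(votes[:, i], minlength=n_answers[i]).astype(float)
                     for i in range(Q)])
    tau = np.zeros(V)
    for _ in range(max_iter):
        # Trustworthiness: sum of the ranks of the answers each voter chose.
        new_tau = sum(rho[i][votes[:, i]] for i in range(Q))
        # New ranks: votes weighted by tau**p, then normalised per question.
        rho = normalise([np.bincount(votes[:, i], weights=new_tau ** p,
                                     minlength=n_answers[i]) for i in range(Q)])
        converged = ((new_tau - tau) ** 2).sum() < eps
        tau = new_tau
        if converged:
            break
    return rho, tau

# Hypothetical example: 6 mostly agreeing voters and 3 voters who vote differently.
votes = np.array([[0, 0, 0, 0]] * 5 + [[0, 1, 0, 0]] + [[1, 2, 1, 2]] * 3)
rho, tau = aggregate_votes(votes, n_answers=[3, 3, 3, 3])
```

In this toy example the three voters who disagree with the majority on every question end up with a lower trustworthiness than the other six, and their votes therefore contribute less to the final ranks.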

With such a vote aggregation procedure the colluding voters must vote for the best answer for a significant number of other questions posed during the same period of time, and they cannot vote randomly, but must vote in accordance with the prevailing sentiment of the community, in order to receive sufficient trustworthiness. Only then can they vote differently from other voters for the answer to the question they are attacking, and hope that they can prevail over the honest voters. While this does not preclude entirely collusion attacks, it obviously makes them harder to execute.

Also note that in this case the data (the choice of the best answer) is not only non-numerical but also does not have any natural ordering. However, the same algorithm is applicable to numerical choices with values which are integers in a limited range, as well as to ordered choices. For example, customer feedback is usually in the range of one to five “stars”, and the same applies to movie ranking. Market analysts’ recommendations are an example of non-numerical but ordered choices (strong_buy < buy < neutral < sell < strong_sell). After such an iterative procedure has converged and the ranks ρ(i, k) of all choices have been determined, in the case of numerical data one can form a weighted average of such numerical choices, with weights obtained from the ranks; in the case of ordered choices it can be left to the user to choose the particular numerical values for the ordered alternatives to reflect the user’s preferences, and then obtain the aggregate value as the corresponding weighted average.

Allahbakhsh et al. proved that the above algorithm always converges, and extensive tests not only on simulated data but also on real data, such as the publicly available movie rating dataset MovieLens, have shown that in terms of robustness against large collusion attacks such an algorithm outperforms the previous IF algorithms, see Allahbakhsh et al. (2015), Allahbakhsh et al. (2013).

Moreover, for cases where we can also rely on historic data, or in a case of a refereeing process where each referee can declare his level of competence for each paper, such additional information can be included into the iterative procedure of such an algorithm in a way that preserves the proof of convergence (Allahbakhsh et al., 2013).

The continuous case, such as the aggregation of sensor measurements, appears to be a significantly harder problem. An aggregation algorithm must be robust against collusion attacks without sacrificing its performance when the sources have only stochastic errors. In fact, even in the presence of a collusion attack, if the fraction of the colluding sources is reasonably small, the algorithm should provide output values which are close to the optimal MLE estimate based on the data obtained from the sources with stochastic errors only. Rezvani et al. have designed an algorithm which, in extensive tests, appears to meet these requirements (Rezvani, Ignjatovic, Bertino, & Jha, 2014a). This algorithm is based on the idea of propagating the credibility \( \mathrm{cr}(r) \) of one source to another source. It again takes the simple mean as the initial approximation \( {\mu}^{(0)}(p) \) of the aggregate values and assigns equal initial variance estimates \( {v}^{(0)}(r)=\frac{1}{\left(P-1\right)R}{\displaystyle \sum_{s=1}^R}{\displaystyle \sum_{p=1}^P}{\left(M(s,p)-{\mu}^{(0)}(p)\right)}^2 \) to all sources; we then repeat until convergence:

$$ \begin{array}{c}{\mathrm{cr}}^{(n+1)}(r)={\left({\displaystyle \prod_{j=1}^R}\frac{ \exp \left(-\frac{\frac{1}{P-1}{\displaystyle \sum_{1\le p\le P}}{\left(M(r,p)-{\mu}^{(n)}(p)\right)}^2}{2{v}^{(n)}(j)}\right)}{\sqrt{2\pi {v}^{(n)}(j)}}\right)}^{\frac{1}{R}};\\ {}\hbox{-}\ \mathrm{computing}\ \mathrm{the}\ \mathrm{credibility}\ \mathrm{of}\ \mathrm{source}\ r\end{array} $$
(13)
$$ \begin{array}{ccc}{\mu}^{(n+1)}(p)={\displaystyle \sum_{i=1}^R}\frac{{\mathrm{cr}}^{(n+1)}(i)}{{\displaystyle \sum_{k=1}^R}{\mathrm{cr}}^{(n+1)}(k)}\,M(i,p);& & \hbox{-}\ \mathrm{computing}\ \mathrm{the}\ \mathrm{new}\ \mathrm{aggregate}\ \mathrm{values}\end{array} $$
(14)
$$ \begin{array}{ccc}{v}^{(n+1)}(r)=\frac{1}{P-1}{\displaystyle \sum_{k=1}^P}{\left(M(r,k)-{\mu}^{(n+1)}(k)\right)}^2;& & \hbox{-}\ \mathrm{computing}\ \mathrm{the}\ \mathrm{new}\ \mathrm{variance}\ \mathrm{of}\ \mathrm{source}\ r\end{array} $$
(15)

Thus, at each stage of the iteration, the credibility of the values supplied by a source r is assessed by estimating the likelihood that the values supplied by r might have been obtained by every other source. The credibility is defined as the geometric mean of all of these likelihoods; see Eq. (13). The heuristic underlying this methodology is that the stability of such an algorithm should come from the smoothing property of taking a mean of all of these likelihoods. The geometric mean was chosen in the hope of being able to prove rigorously that, in the case of purely stochastic, normally distributed, unbiased errors, the algorithm converges to the MLE estimate which could have been obtained if the non-colluding sources and their exact variances were known a priori; this would clearly ensure that our algorithm has the minimal possible variance, equal to the CRLB. Figure 6 shows a typical result obtained with 25 sources; 20 sources are “honest”, providing the true mark \( {t}_p \) of item p plus normally and independently distributed unbiased noise with randomly chosen variances between 1 and 5. The remaining 5 sources collude, with the first 4 reporting skewed values \( {s}_p=3{t}_p \) and the fifth colluder reporting the mean \( \left(\left(R-C+1\right){t}_p+\left(C-1\right)\kern0.3em {s}_p\right)/\left(R-1\right) \).
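The following sketch transcribes the iteration of Eqs. (13)-(15) as printed above and applies it to a simulated scenario of the kind just described. It is not the authors' Mathematica code and is not claimed to reproduce Fig. 6. The names `credibility_propagation` and `simulate`, the stopping rule, the variance `floor`, the number of items and the range of the true values are all our assumptions. The credibilities are computed in the log domain, which is algebraically the same as Eq. (13) but avoids numerical underflow.

```python
import numpy as np

def credibility_propagation(M, tol=1e-10, max_iter=500, floor=1e-12):
    """Iterate Eqs. (13)-(15) on M of shape (R, P); return (mu, w, v)."""
    M = np.asarray(M, dtype=float)
    R, P = M.shape
    mu = M.mean(axis=0)                                    # simple mean
    v = np.full(R, ((M - mu) ** 2).sum() / ((P - 1) * R))  # equal initial variances
    for _ in range(max_iter):
        # Mean squared deviation of each source from the current aggregate.
        D = ((M - mu) ** 2).sum(axis=1) / (P - 1)
        # Logarithm of Eq. (13): average over j of the log of each factor.
        log_cr = -D * np.mean(1.0 / (2.0 * v)) - np.mean(0.5 * np.log(2.0 * np.pi * v))
        w = np.exp(log_cr - log_cr.max())   # shift before exp to avoid underflow
        w /= w.sum()                        # normalised credibilities
        new_mu = w @ M                      # Eq. (14)
        # Eq. (15); the floor is an added safeguard, not part of the text.
        v = np.maximum(((M - new_mu) ** 2).sum(axis=1) / (P - 1), floor)
        change = np.abs(new_mu - mu).max()
        mu = new_mu
        if change < tol:
            break
    return mu, w, v

def simulate(seed=0, P=400):
    """20 honest sources and 5 colluders, following the description in the text."""
    rng = np.random.default_rng(seed)
    R, C = 25, 5
    t = rng.uniform(0.0, 10.0, P)          # assumed range of the true values
    var = rng.uniform(1.0, 5.0, R - C)     # variances of the honest sources
    honest = t + rng.normal(0.0, 1.0, (R - C, P)) * np.sqrt(var)[:, None]
    s = 3.0 * t                            # skewed values of the first 4 colluders
    skewed = np.tile(s, (C - 1, 1))
    last = ((R - C + 1) * t + (C - 1) * s) / (R - 1)   # fifth colluder
    return t, np.vstack([honest, skewed, last])

t, M = simulate()
mu, w, v = credibility_propagation(M)
rms = lambda x: float(np.sqrt(np.mean((x - t) ** 2)))
print("RMS error, simple mean:            ", rms(M.mean(axis=0)))
print("RMS error, credibility propagation:", rms(mu))
```

In this setting the four sources reporting skewed values should receive negligible credibility, so the aggregate should stay much closer to the true values than the simple mean does.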

Fig. 6 Reciprocals of normalized variances of sources, estimated using: IF with \( F(d)=1/ {d}^2 \) (filled circle), IF with \( F(d)=1-k\kern0.3em d \) (filled square), credibility propagation (filled diamond), normalized reciprocals of the true variances (filled triangle). Also shown are the corresponding RMS values of errors of the aggregate values (discrete values are joined by lines for better visual representation)

As can be seen from Fig. 6, the weights obtained by the IF algorithm with the reciprocal penalty function \( F(d)=1/{d}^2 \) (filled circle) are all essentially zero except for the weight of the last attacker, which is 1 (out of range on the graph); the weights obtained by the IF algorithm with the affine penalty function \( F(d)=1-k\kern0.3em d \) (filled square) are 0 for all attackers except the last one, but all other, nonzero weights are essentially equal, thus resulting in the simple mean of all honest sources and the last attacker. Finally, the weights produced by the algorithm based on credibility propagation (filled diamond) are almost indistinguishable from the (normalized) reciprocals of the true variances of the “honest” sources (filled triangle), which in this case represent the optimal weights resulting in an estimator with the smallest possible variance. The RMS values of errors shown in the legend of Fig. 6 demonstrate the superiority of the credibility propagation algorithm. In fact, several IF algorithms (more than a dozen of them) were implemented and tested, and in all cases the algorithm by Rezvani et al. had the lowest RMS error, only slightly higher than the CRLB, even in the presence of a collusion attack. The Mathematica code which produced the above results is available online at http://www.cse.unsw.edu.au/~ignjat/IF.nb.

In addition, Rezvani et al. have applied ideas of the provenance of data (Lim et al., 2010) to design an iterative algorithm for computing the risk of flows and hosts in a computer network (Rezvani, Ignjatovic, Bertino, & Jha, 2014b; Rezvani, Ignjatovic, & Jha, 2013; Rezvani, Sekulic, Ignjatovic, Bertino, & Jha, 2014). For the iterative risk assessment algorithm introduced in Rezvani et al. (2014b), they were able to prove its convergence and also to obtain sharp analytic estimates of its performance (Rezvani et al., 2014). Future research will aim to integrate the idea of provenance of data with IF algorithms in a single (possibly nested) iterative procedure. Such an integration should be done in a way which preserves the convergence proof of the resulting algorithm.

6 Research Roadmap

In many real-life distributed systems such as social networks, rating systems, participatory sensing networks and WSNs, the trustworthiness of participants has a significant role in the decision-making processes. While we believe that past results have demonstrated the potential of our IF algorithms as a robust trust framework for these distributed systems, achieving this objective requires much wider research efforts.

Most IF algorithms are still “ad hoc” solutions which do not have a unified mathematical foundation. For example, in the discrete case we still lack an algorithm which, for domains which are integers (for example one to five star ratings), takes into account the proximity of votes, rather than just the coincidence of votes. This is clearly unsatisfactory: if a number of voters give a five star ranking to a movie, then a voter who gives it four stars should get some credit from them, and certainly more credit than a voter who gives the same movie only three stars. However, in the algorithms by Allahbakhsh and Ignjatovic (2015), Allahbakhsh et al. (2015) both such dissenting voters get no credit from the voters giving the movie five stars. Moreover, the degree of such credibility propagation from a voter to the voters who propose similar but not equal scores should depend on the estimated variances of the voters. It is also crucial that domain knowledge be incorporated into the data trustworthiness methodologies. For example, in a sensor network, a sensor that has been deployed for a long time may be considered less trustworthy than recently deployed sensors. Metrics and methodologies from the area of data quality should also be considered here (Reznik & Bertino, 2013).

In some distributed systems such as participatory sensing networks, preserving the privacy and anonymity of participants is mandatory (Wang, Cheng, Mohapatra, & Abdelzaher, 2013). Clearly, if the participatory networks fully anonymize the reported data, it is difficult to accurately estimate the trustworthiness of participants using the current state of our IF algorithms. Decentralization of our trust computation approach could improve the privacy of participants (Hasan, Brunie, Bertino, & Shang, 2013). Thus, proposing a decentralized, privacy-preserving IF algorithm for robust trust computation is an interesting open research area.

The tremendous volume of data generated by recent technological advances, referred to as Big Data, can be used to support data-driven decision-making. Moreover, interconnected Big Data contains a large amount of redundancy which can be used to validate data trustworthiness (Labrinidis & Jagadish, 2012). An interesting research direction is to scale the IF algorithms to Big Data in order to extract hidden relationships within this redundancy.

We will investigate applications of our IF algorithms other than just data aggregation or ranking. One such application was already implemented and tested as a part of an Honors Thesis project (D’Souza, 2011), where it was used to produce a novel recommender system. Taking as an example movie ranking, our algorithm aggregates ratings of movies provided by users, and, as we have explained, besides producing robust ratings of movies it also produces weights for users which reflect to what degree their ratings agree with the prevailing “community sentiment” ranks, as produced by our IF algorithm. We now use the observation that if two users have similar tastes, their weights must also be similar, because their movie ratings, being close to each other, must also be at a similar distance to the community sentiment ranks. Thus, to make recommendations for a particular user, we can restrict our attention only to users whose weights are close to the weight of that particular user.
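The neighbour-selection step can be sketched as follows. The details of the recommender in D’Souza (2011) are not given here, so everything in this sketch is an assumption made for illustration: the names `neighbours_by_weight` and `recommend`, the choice of the k closest users rather than a distance threshold, and the rule of recommending the unseen item with the highest mean rating among those users.

```python
import numpy as np

def neighbours_by_weight(weights, user, k=3):
    """Indices of the k users whose IF weights are closest to that of `user`."""
    weights = np.asarray(weights, dtype=float)
    gap = np.abs(weights - weights[user])
    gap[user] = np.inf                      # never pick the user themselves
    return np.argsort(gap)[:k]

def recommend(ratings, weights, user, k=3):
    """ratings has shape (users, items), with np.nan for unrated items."""
    ratings = np.asarray(ratings, dtype=float)
    peers = neighbours_by_weight(weights, user, k)
    counts = (~np.isnan(ratings[peers])).sum(axis=0)
    totals = np.nansum(ratings[peers], axis=0)
    # Mean peer rating per item; items no peer has rated are excluded.
    scores = np.where(counts > 0, totals / np.maximum(counts, 1), -np.inf)
    scores[~np.isnan(ratings[user])] = -np.inf   # skip items already rated
    return int(np.argmax(scores))

# Hypothetical weights (as produced by an IF algorithm) and ratings.
weights = [0.90, 0.85, 0.20, 0.88, 0.25]
ratings = [[5, np.nan, np.nan],
           [5, 4, 2],
           [1, 1, 5],
           [4, 5, np.nan],
           [2, np.nan, 5]]
print(recommend(ratings, weights, user=0, k=2))
```

Closeness of weights is only a necessary condition for similar tastes, so in practice this filter would presumably be combined with a direct comparison of the users' ratings.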

In conclusion, we believe that the IF algorithms have demonstrated a promising potential for providing robust trust assessment methods for inconsistent information. Moreover, such algorithms provide a robust aggregate of such inconsistent information and can thus play a critical role in WSNs as a method of resolving a number of important problems, such as secure routing, fault tolerance, false data detection, compromised node detection, cluster head election, and outlier detection. They are also applicable to social networks, web services, and many other fields which involve inconsistent information.