1 Introduction

Many studies in artificial intelligence face the problem of data coherence diagnosis [14]. Many models and decision-making systems include expert knowledge and expert estimates. Expert methods are used in situations where selection, justification and impact assessment cannot be made using exact calculations [5]. Researchers can use the information received from experts only if they can present it in a form suitable for further research. Therefore, it is necessary either to formalize the experts’ knowledge or to evaluate its coherence and reliability [6, 7]. Such problems arise in studies focused on troubleshooting power systems [4, 8], the diagnosis of security problems and anomalies in distributed systems [1, 9], and in other areas. In all these cases, the coherence of the data is extremely important.

In many fields of sociological, psychological and marketing research, we face the problem of estimating the rate or frequency of risky behavior on the basis of respondents’ self-reports about their behavior. The behavior rate has to be estimated from the responses to a questionnaire or the results of an interview [8, 10].

An approach to estimating the risky behavior rate, based on Bayesian belief networks and on interview data about the last episodes of a respondent’s behavior, is proposed in [10–12].

The initial model [11] was based on data about the three latest episodes of respondents’ risky behavior and on the minimum and maximum intervals between the episodes. These data were usually obtained from questionnaires or interviews [13]. Respondents could give false answers (not corresponding to actual behavior) to make a positive impression or because of memory-related issues: episodes of risky behavior could have happened a long time ago and, hence, be hard to remember. For example, risky sexual behavior data (information needed for decision-making in various areas, including education, medicine and public health) were very often under-reported because of their very private nature, and such data were often subject to significant social desirability bias. Sometimes respondents answering a question could make mistakes or be confused [14, 15].

As an example, consider the following scenario. During an interview conducted on Monday, the respondent replied that the last behavior episode was on the last Monday, the previous one was on the last Wednesday and the third most recent episode was a month ago. At the same time, the respondent defined the minimum interval between episodes as a “week”. Hence, the provided data were incoherent, because the interval between the last episode and the previous one was shorter than the stated minimum.
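The check behind this scenario can be sketched in a few lines of Python. The day offsets below are one possible reading of "last Monday", "last Wednesday" and "a month ago" and are an assumption, as is the helper name `incoherent_gaps`; the sketch only illustrates the comparison of reported gaps with the reported minimum.

```python
# Days before the interview for the three latest episodes
# (assumed encoding of the respondent's answers, not given in the text).
days_ago = [7, 12, 30]
min_interval = 7  # the reported minimum interval of a "week", in days


def incoherent_gaps(days_ago, min_interval):
    """Return the gaps between consecutive episodes shorter than the minimum."""
    gaps = [later - earlier for earlier, later in zip(days_ago, days_ago[1:])]
    return [gap for gap in gaps if gap < min_interval]


print(incoherent_gaps(days_ago, min_interval))  # [5]
```

A non-empty result signals that at least one reported interval contradicts the reported minimum.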

Note that this is a very simplified example of the problem; such inconsistencies can obviously be identified easily. However, more complex situations are possible because of the sampling variables included in the model.

Thus, applications that use data obtained from respondents often face the problem of incoherent data. Therefore, it is important to have tools to diagnose such situations.

In this paper we describe a modified model that addresses this problem. Software is provided to make working with the model more convenient. We also discuss an extended example of the model’s use.

2 Model Description

Figure 1 shows a generalized risky behavior model \( M = (G(V,L),{\mathbf{P}}) \) as a Bayesian belief network [16, 17]. The model structure is represented by the directed graph \( G(V,L) \), where \( V = \{ t_{01} ,t_{12} ,t_{23} ,t_{\hbox{min} } ,t_{\hbox{max} } ,\lambda ,n\} \) is the set of nodes and \( L = \{ (u,v):u,v \in V\} \) is the set of directed links between nodes. In other words, Fig. 1 shows the random elements included in the model and the relations between them. We used GeNIe 2.0 [18] to create the Bayesian belief network and to run the probabilistic reasoning algorithms. All figures were also constructed in GeNIe 2.0.

Fig. 1 Risky behavior model on the basis of data about episodes

In Fig. 1, Rate is a random variable representing the behavior rate \( \lambda \), and \( t_{i,j} \) are random variables characterizing the lengths of the intervals between the \( i \)th and \( j \)th episodes counted from the end. Under the assumption that behavior was a Poisson random process, the random variables \( t_{i,j} \) were exponentially distributed. Additional information was obtained by including the minimum and maximum intervals between episodes (\( t_{\hbox{min} } \) and \( t_{\hbox{max} } \) respectively).
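As a brief worked step, the probability that an exponentially distributed interval falls into a discretization interval follows from integrating the exponential density. Here \( a \) and \( b \) denote generic interval bounds, introduced only for this illustration:

$$ P\left( a \le t_{i,j} < b \mid \lambda \right) = \int_{a}^{b} \lambda e^{-\lambda t} \, dt = e^{-a\lambda} - e^{-b\lambda}. $$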

We specified the conditional probabilities \( {\mathbf{P}} = \left\{ {P(t_{j,j + 1} |\uplambda),P(t_{01} |\uplambda),P(t_{\hbox{min} } |n,\uplambda),P(t_{\hbox{max} } |n,\uplambda,t_{\hbox{min} } ),P(n|\uplambda),P(\uplambda)} \right\} \), which correspond to the edges between conditionally dependent nodes, as follows:

$$ \begin{aligned}
P\left( t_{j,j + 1}^{(l_{j})} \mid \uplambda^{(i)} \right) & = e^{-a\uplambda^{(i)}} - e^{-b\uplambda^{(i)}}, \quad j = 0,1,2, \quad t_{j,j + 1}^{(l_{j})} = \left[ a;b \right); \\
P\left( t_{\hbox{min}}^{(l_{3})} \mid n,\uplambda^{(i)} \right) & = e^{-an\uplambda^{(i)}} - e^{-bn\uplambda^{(i)}}, \quad t_{\hbox{min}}^{(l_{3})} = \left[ a;b \right); \\
p\left( n \mid \uplambda^{(i)} \right) & = \frac{\left( \uplambda^{(i)} T \right)^{n}}{n!} e^{-\uplambda^{(i)} T}; \\
p\left( t_{\hbox{max}}^{(l_{4})} \mid n,\uplambda^{(i)},t_{\hbox{min}}^{(l_{3})} \right) & = e^{(n - 1)\uplambda^{(i)} t_{\hbox{min}}^{(l_{3})}} \left( \left( e^{-\uplambda^{(i)} t_{\hbox{min}}^{(l_{3})}} - e^{-\uplambda^{(i)} b} \right)^{n - 1} - \left( e^{-\uplambda^{(i)} t_{\hbox{min}}^{(l_{3})}} - e^{-\uplambda^{(i)} a} \right)^{n - 1} \right), \quad t_{\hbox{max}}^{(l_{4})} = \left[ a;b \right).
\end{aligned} $$
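The conditional probabilities above can be evaluated numerically with a short sketch. The function names are hypothetical, \( T \) is read as the length of the observation period and \( n \) as the number of episodes in it (the text does not define them explicitly), and \( t_{\hbox{min}} \) is passed to the last function as a single representative number rather than as an interval, which is a simplifying assumption.

```python
import math


def p_interval(a, b, lam):
    """P(t in [a; b) | lam) for an exponentially distributed interval."""
    return math.exp(-a * lam) - math.exp(-b * lam)


def p_tmin(a, b, n, lam):
    """P(t_min in [a; b) | n, lam), following the second formula above."""
    return math.exp(-a * n * lam) - math.exp(-b * n * lam)


def p_n(n, lam, T):
    """Poisson probability of n episodes over a period of length T."""
    return (lam * T) ** n / math.factorial(n) * math.exp(-lam * T)


def p_tmax(a, b, n, lam, t_min):
    """P(t_max in [a; b) | n, lam, t_min), with t_min given as one number."""
    base = math.exp(-lam * t_min)
    upper = (base - math.exp(-lam * b)) ** (n - 1)
    lower = (base - math.exp(-lam * a)) ** (n - 1)
    return math.exp((n - 1) * lam * t_min) * (upper - lower)
```

Summing `p_interval` over any partition of \( [0; +\infty) \) gives one, which is a quick sanity check for a chosen discretization.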

3 Model Extension

Figure 2 shows the extended risky behavior model. The added nodes make it possible to assess the coherence of the data given by a respondent.

Fig. 2 Extended risky behavior model on the basis of data about episodes

The nodes \( c_{{t_{12,\hbox{min} } }} \) and \( c_{{t_{23,\hbox{min} } }} \) represent the coherence of an interval \( t_{ij} \) with the minimal interval \( t_{\hbox{min} } \); the nodes \( c_{{t_{01,\hbox{max} } }} ,c_{{t_{12,\hbox{max} } }} \) and \( c_{{t_{23,\hbox{max} } }} \) represent the coherence of \( t_{ij} \) with the maximal interval \( t_{\hbox{max} } \). We did not consider \( c_{{t_{01,\hbox{min} } }} \), because \( t_{01} \) represents the interval between a risky behavior episode and the moment of the interview, which is not an observed behavior episode.

In particular, for a node representing the degree of coherence with the minimum interval, the coherence rate \( c_{{t_{ij,\hbox{min} } }} \) could take three values: \( t_{ij} \) and \( t_{\hbox{min} } \) were coherent \( (c_{{t_{ij,\hbox{min} } }}^{ + } ) \), incoherent \( (c_{{t_{ij,\hbox{min} } }}^{ - } ) \) or undefined \( (c_{{t_{ij,\hbox{min} } }}^{?} ) \). We assumed that the rate \( c_{{t_{ij,\hbox{min} } }} \) was undefined when both \( t_{ij} \) and \( t_{\hbox{min} } \) belonged to the same interval, i.e. if \( t_{ij} \in \left[ {a;b} \right) \) and \( t_{\hbox{min} } \in \left[ {a;b} \right) \), we could not determine precisely whether the value \( t_{\hbox{min} } \) was smaller than \( t_{ij} \) or not.

We specified conditional probabilities of the extended model as follows:

$$ P\left( {c_{{t_{ij,\hbox{min} } }}^{(s)} |t_{ij} , t_{\hbox{min} } } \right) = \left\{ {\begin{array}{*{20}l} {\upalpha^{(s)} ,} \hfill & {t_{ij} > t_{min} ;} \hfill \\ {\upbeta^{(s)} ,} \hfill & {t_{ij} < t_{min} ;} \hfill \\ {1 -\upalpha^{(s)} -\upbeta^{(s)} ,} \hfill & {t_{ij} = t_{\hbox{min} } ;} \hfill \\ \end{array} } \right. $$

where \( s \in \left\{ { + , - ,?} \right\},\ \upalpha^{(s)} ,\upbeta^{(s)} \in \left[ {0;1} \right],\ \sum\nolimits_{s} {\upalpha^{(s)} } = 1,\ \sum\nolimits_{s} {\upbeta^{(s)} } = 1,\ \upalpha^{(s)} +\upbeta^{(s)} \le 1 \).

Similarly, we estimated the coherence of the realizations of the random variables \( t_{ij} \), which correspond to the intervals between the last episodes, with the realization of the random variable \( t_{\hbox{max} } \) (nodes \( c_{{t_{01,\hbox{max} } }} ,c_{{t_{12,\hbox{max} } }} \) and \( c_{{t_{23,\hbox{max} } }} \)):

$$ p\left( {c_{{t_{ij,\hbox{max} } }}^{(s)} |t_{ij} , t_{\hbox{max} } } \right) = \left\{ {\begin{array}{*{20}l} {\upalpha^{(s)} ,} \hfill & {t_{ij} < t_{\hbox{max} } ;} \hfill \\ {\upbeta^{(s)} ,} \hfill & {t_{ij} > t_{\hbox{max} } ;} \hfill \\ {1 -\upalpha^{(s)} -\upbeta^{(s)} ,} \hfill & {t_{ij} = t_{\hbox{max} } ;} \hfill \\ \end{array} } \right. $$

where \( s \in \left\{ { + , - ,?} \right\},\ \upalpha^{(s)} ,\upbeta^{(s)} \in \left[ {0;1} \right],\ \sum\nolimits_{s} {\upalpha^{(s)} } = 1,\ \sum\nolimits_{s} {\upbeta^{(s)} } = 1,\ \upalpha^{(s)} +\upbeta^{(s)} \le 1 \).
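The two conditional tables for the coherence nodes can be sketched as one small function over discretized values. The sketch compares interval indices instead of raw values, so equal indices stand for the case where both values fall into the same interval. The function name, the `kind` flag and the deterministic values of `ALPHA` and `BETA` are illustrative assumptions.

```python
def coherence_cpt(i_t, i_bound, alpha, beta, kind="min"):
    """Distribution over coherence states for interval indices i_t and i_bound.

    kind="min": alpha applies when t_ij lies in a later interval than t_min.
    kind="max": alpha applies when t_ij lies in an earlier interval than t_max.
    """
    if kind not in ("min", "max"):
        raise ValueError("kind must be 'min' or 'max'")
    if i_t == i_bound:
        # Same interval: the remaining probability mass for each state.
        return {s: 1 - alpha[s] - beta[s] for s in alpha}
    first_case = i_t > i_bound if kind == "min" else i_t < i_bound
    return dict(alpha) if first_case else dict(beta)


# Illustrative deterministic setting: coherent with certainty in the first
# case, incoherent with certainty in the second case.
ALPHA = {"+": 1.0, "-": 0.0, "?": 0.0}
BETA = {"+": 0.0, "-": 1.0, "?": 0.0}
```

With these values the same-interval case returns probability one for the undefined state, since \( 1 - \upalpha^{(?)} - \upbeta^{(?)} = 1 \).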

To estimate the respondent’s reliability \( r \), we added a node connected to all five new nodes characterizing the pairwise coherence.

To simplify the formulae for conditional probabilities let \( c = \left( {c_{{t_{12,\hbox{min} } }} ,c_{{t_{23,\hbox{min} } }} ,c_{{t_{01,\hbox{max} } }} ,c_{{t_{12,\hbox{max} } }} ,c_{{t_{23,\hbox{max} } }} } \right) \), \( c^{ + } = \left( {c^{ + }_{{t_{12,\hbox{min} } }} ,c^{ + }_{{t_{23,\hbox{min} } }} ,c^{ + }_{{t_{01,\hbox{max} } }} ,c^{ + }_{{t_{12,\hbox{max} } }} ,c^{ + }_{{t_{23,\hbox{max} } }} } \right) \), \( c^{ - } = \left( {c^{ - }_{{t_{12,\hbox{min} } }} ,c^{ - }_{{t_{23,\hbox{min} } }} ,c^{ - }_{{t_{01,\hbox{max} } }} ,c^{ - }_{{t_{12,\hbox{max} } }} ,c^{ - }_{{t_{23,\hbox{max} } }} } \right) \), \( c^{?} = \left( {c^{?}_{{t_{12,\hbox{min} } }} ,c^{?}_{{t_{23,\hbox{min} } }} ,c^{?}_{{t_{01,\hbox{max} } }} ,c^{?}_{{t_{12,\hbox{max} } }} ,c^{?}_{{t_{23,\hbox{max} } }} } \right) \).

Then \( p\left( {\varvec{r}^{ + } |c} \right) = \frac{{\sum {c^{ + } } }}{\sum c },p\left( {\varvec{r}^{ - } |c} \right) = \frac{{\sum {c^{ - } } }}{\sum c } \) and \( p\left( {\varvec{r}^{?} |c} \right) = \frac{{\sum {c^{?} } }}{\sum c } \).
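One way to read these ratios is as the share of the five coherence nodes that are in each state. The sketch below follows that reading, which is an assumption about the notation, and further assumes that every coherence node is observed in exactly one state; the function name `reliability` is hypothetical.

```python
from collections import Counter


def reliability(states):
    """Share of coherence nodes in each state: '+', '-' or '?'."""
    counts = Counter(states)
    total = len(states)
    return {s: counts[s] / total for s in ("+", "-", "?")}


# States of (c_12min, c_23min, c_01max, c_12max, c_23max) for a made-up case.
print(reliability(["-", "+", "+", "+", "?"]))
# {'+': 0.6, '-': 0.2, '?': 0.2}
```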

4 Implementation

We developed software that supplements the risky behavior model with the diagnosis nodes described above and makes the model more convenient to work with. The software was written in C# using the SMILE library [18]. First, the user defines the model: sets the intervals for \( t_{ij} ,t_{\hbox{min} } \) and \( t_{\hbox{max} } \), sets \( \upalpha^{(s)} ,\upbeta^{(s)} \) for \( s \in \left\{ { + , - ,?} \right\} \), and adds the diagnosis nodes to the model, either all at once or step by step (Fig. 3). After that, respondents’ data can be entered into the model. The data can be entered manually, in which case the results are shown in the same window (Fig. 4), or loaded from an MS Excel file, in which case the results are saved to a separate file.

Fig. 3 Setting intervals and adding diagnosis windows

Fig. 4 Manual input window

5 Example

Let the range of \( t_{ij} \) be divided into the following disjoint intervals: \( t^{\left( 1 \right)} = \left( {0;0.1} \right),\) \( t^{\left( 2 \right)} = \left[ {0.1;1} \right),\) \(t^{\left( 3 \right)} = \left[ {1;7} \right),\) \(t^{\left( 4 \right)} = \left[ {7;30} \right),\) \(t^{\left( 5 \right)} = \left[ {30;180} \right),\) \(t^{\left( 6 \right)} = \left[ {180; + \infty } \right)\). For clarity, we take the same partition for \( t_{\hbox{min} } \) and \( t_{\hbox{max} } \).
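Mapping a reported interval length to the index of its interval in this partition can be sketched with a binary search over the inner boundaries. The names `BOUNDS` and `interval_index` are hypothetical, and the sketch takes no position on the time unit, which the partition leaves unstated.

```python
import bisect

# Inner boundaries of the six intervals t^(1), ..., t^(6) listed above.
BOUNDS = [0.1, 1, 7, 30, 180]


def interval_index(x):
    """Return k such that x falls into t^(k), for k = 1..6."""
    if x <= 0:
        raise ValueError("interval length must be positive")
    # bisect_right places a boundary value into the interval it opens,
    # matching the half-open intervals [a; b).
    return bisect.bisect_right(BOUNDS, x) + 1
```

Two values that receive the same index cannot be ordered after discretization, which is the situation in which the coherence rate is undefined.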

We assumed that the coherence probability was zero if the data provided by the respondent contradicted each other and one if there were no contradictions.

We considered an example with data given by ten respondents. The data are presented in Table 1: the first column contains the respondent’s id, and the other columns contain the respondent’s evidence about the last risky behavior episodes \( (t_{01} ,t_{12} ,t_{23} ) \) and about the minimal and maximal intervals (\( t_{\hbox{min} } \) and \( t_{\hbox{max} } \)).

Table 1 Respondents’ data

Let us take a closer look at the second respondent’s data, in particular at the coherence estimate for the interval \( t_{12} \) and the minimal interval \( t_{\hbox{min} } \). In this case the data are incoherent. The posterior distribution of the coherence random variable is shown in Fig. 5.

Fig. 5 Example of incoherent data (GeNIe)

After all the coherence estimates were obtained, we estimated the respondent’s reliability. The reliability estimate for the second respondent is presented in Fig. 6.

Fig. 6 Reliability estimation for the second respondent (GeNIe)

Whether the second respondent’s data should be kept or excluded from the sample depends on the specific research problem. If we want to use only data without any contradictions or uncertainties (all the data are coherent), then we take into account only the data from respondents 1, 4, 5, 7 and 10. Figure 7 shows the reliability estimate for such coherent data, which has the maximal degree of reliability.

Fig. 7 Reliability estimation of respondent’s coherent data (GeNIe)

6 Conclusions

We proposed a method of data coherence diagnosis for the risky behavior model with data obtained from respondents. Software was developed and described to make the method more convenient to use. An example of the use of the method was also provided.

More general cases can also be considered, such as other coherence rate distributions (not only coherent, incoherent and undefined), different partitions or unequal intervals.

This coherence diagnosis can be useful for eliminating unreliable respondents’ data from the sample. The respondent reliability rate can be used as an analogue of a lie scale in psychological tests.