1 Introduction

Stroke denotes an interruption of the blood supply to the brain: when a blood vessel is blocked, an ischaemic stroke occurs; when it bursts, a hemorrhagic one follows. Being a major cause of mortality, this disease is closely monitored with the main purpose of preventing it, since, once diagnosed, it becomes less hazardous and more treatable when compared with similar conditions [1]. However, there are several factors associated with stroke that carry a higher probability of occurrence and may lead to such an event. Some of these risk factors can be avoided or controlled, like high blood pressure [1, 2], cigarette smoking [2, 3], diabetes mellitus [4, 5], high blood cholesterol [6, 7], or the absence of physical activity [8, 9].

Besides these, there are risk factors that cannot be controlled, such as age (older people have a greater tendency to stroke [2, 10]), gender (stroke is more common in men than in women), a previous stroke episode (the mere fact of having suffered one represents an increased risk that cannot be controlled by any means [2, 11]), or ethnicity [10, 11], among others. This work emphasizes the prediction of a given event, according to a historical dataset, under a Case Based Reasoning (CBR) approach to computing [12, 13]. Indeed, CBR provides the ability to solve new problems by reusing knowledge acquired from past experiences [12], i.e., CBR is used especially when similar cases have similar terms and solutions, even when they have different backgrounds [13]. Its use may be found in different arenas, namely in The Law, Online Dispute Resolution [14, 15] or Medicine [16, 17], just to name a few.

It must also be highlighted that, up to the present, CBR systems have been unable to deal with incomplete, self-contradictory, or even unknown information. As a matter of fact, the approach to CBR presented in this work is a generic one and focuses on such a setting. The first step to be tackled is the construction of the Case Base. Thus, normalization and optimization phases were introduced, and clustering methods were used to distinguish and aggregate collections of historical data, in order to reduce the search space, thereby speeding up the retrieve stage and all associated computational processes.

The article develops along five sections. In the first one a brief introduction to the problem is made. Then the proposed approach to knowledge representation and reasoning is introduced. In the third and fourth sections a case study is assumed and a solution to the problem is presented. Finally, in the last section the most relevant conclusions are described and possible directions for future work are outlined.

2 Knowledge Representation and Reasoning

Many approaches to knowledge representation and reasoning have been proposed using the Logic Programming (LP) paradigm, namely in the area of Model Theory [18, 19] and Proof Theory [20, 21]. In this work the proof theoretical approach is followed, in terms of an extension to LP. An Extended Logic Program is a finite set of clauses in the form:

$$ \begin{array}{l} \left\{ \right. \\ \neg p \leftarrow not\;p,\;not\;exception_{p} \\ p \leftarrow p_{1} , \ldots ,p_{n} ,\;not\;q_{1} , \ldots ,not\;q_{m} \\ ?\left( {p_{1} , \ldots ,p_{n} ,\;not\;q_{1} , \ldots ,not\;q_{m} } \right)\;\left( {n,m \ge 0} \right) \\ exception_{{p_{1} }} \\ \vdots \\ exception_{{p_{j} }} \;\left( {0 \le j \le k} \right),\;being\;k\;an\;integer \\ \left. \right\}\,:\!\!\!:\,scoring_{value} \end{array} $$

where “?” is a domain atom denoting falsity, the \( p_{i} \), \( q_{j} \), and \( p \) are classical ground literals, i.e., either positive atoms or atoms preceded by the classical negation sign [20]. Under this formalism, every program is associated with a set of abducibles [18, 19], given here in the form of exceptions to the extensions of the predicates that make the program. The term \( scoring_{value} \) stands for the relative weight of the extension of a specific \( predicate \) with respect to the extensions of its peers that make the overall program.

In order to evaluate the knowledge that stems from a logic program, an assessment of the Quality-of-Information (QoI), given by a truth-value in the interval [0, 1], is set, also in dynamic environments aiming at decision-making purposes [22, 23]. Indeed, the objective is to build a quantification process of QoI and to measure one’s Degree of Confidence (DoC) that the argument values or attributes of the terms that make the extension of a given predicate, with relation to their domains, fit into a given interval [24]. Thus, the universe of discourse is engendered according to the information presented in the extensions of a given set of predicates, according to productions of the type:

$$ predicate_{i} - \mathop {\bigcup }\limits_{1 \le j \le m} clause_{j} \left( {\left( {QoI_{{x_{1} }} ,DoC_{{x_{1} }} } \right), \cdots ,\left( {QoI_{{x_{m} }} ,DoC_{{x_{m} }} } \right)} \right)\,:\!\!\!:\,QoI_{i}\,:\!\!\!:\,DoC_{i} $$
(1)

where ⋃ and m stand, respectively, for set union and the cardinality of the extension of \( predicate_{i} \). \( QoI_{i} \) and \( DoC_{i} \) stand for themselves [24].
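As an illustration (with hypothetical values), a predicate whose extension is given by a single clause with one exactly known argument and one interval-valued argument might generate the production:

$$ predicate_{1} - clause_{1} \left( {\left( {1,1} \right),\left( {1,0.96} \right)} \right)\,:\!\!\!:\,1\,:\!\!\!:\,0.98 $$

assuming here that \( QoI_{1} \) and \( DoC_{1} \) are obtained by averaging the pairs of the clause’s arguments.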

3 A Case Study

As a case study, consider a database given in terms of the extensions of the relations (or tables) depicted in Fig. 1, which stand for a situation where one has to manage information about stroke predisposing detection. The tables include features obtained by both objective and subjective methods, i.e., the physicians fill in the tables related to the Stroke Predisposing one while executing the health check. Some items may be populated at the clinic, while others are perceived through additional exams.

Fig. 1. A fragment of the knowledge base for Stroke Predisposing Diagnosis.

Under this scenario some incomplete and/or default data is also available. For instance, the Triglycerides value in case 2 is unknown, while its Risk Factors value ranges in the interval [0, 1]. In the Previous Stroke Episode column, 0 (zero) and 1 (one) denote, respectively, nonoccurrence and occurrence. In the Lifestyle Habits and Risk Factors tables, 0 (zero) and 1 (one) denote, respectively, yes and no. The values presented in the Lifestyle Habits and Risk Factors columns of the Stroke Predisposing table are the sums of the corresponding table values, ranging in the intervals [0, 6] and [0, 4], respectively. The Description column stands for free text fields that allow for the registration of relevant patient features.
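To make these encodings concrete, the sketch below shows how a raw case in the spirit of Fig. 1 might be represented, using None for unknown values and a pair for interval ones. The field names and the exactly known values are hypothetical; only the unknown Triglycerides and the [0, 1] Risk Factors interval follow case 2 as described above.

```python
# A raw case in the spirit of case 2 of Fig. 1 (hypothetical field names/values).
# None encodes an unknown value; a (low, high) tuple encodes an interval one.
raw_case_2 = {
    "age": 58,                        # hypothetical
    "previous_stroke_episodes": 1,    # 0 = nonoccurrence, 1 = occurrence
    "systolic_blood_pressure": 135,   # hypothetical
    "cholesterol_ldl": 121,           # hypothetical
    "cholesterol_hdl": 47,            # hypothetical
    "triglycerides": None,            # unknown in case 2
    "lifestyle_habits": 3,            # sum of six yes/no entries, in [0, 6]
    "risk_factors": (0, 1),           # interval value over the domain [0, 4]
}
```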

Applying the rewritten algorithm presented in [24] to all the fields that make the knowledge base for Stroke Predisposing (Fig. 1), excluding from such a process the Description one, and looking at the DoC values obtained in this manner, it is possible to set the arguments of the predicate referred to below, which also denotes the objective function with respect to the problem under analysis.

$$ \begin{aligned} stroke{:}\;&Age,\,Previous\,Stroke\,Episodes,\,Systolic\,Blood\,Pressure,\,Cholesterol_{LDL} , \\ &Cholesterol_{HDL} ,\,Triglycerides,\,Lifestyle\,Habits,\,Risk\,Factors \to \left\{ {0,1} \right\} \end{aligned} $$

where 0 (zero) and 1 (one) denote, respectively, the truth values false and true.

Exemplifying the application of the rewritten algorithm presented in [24], in relation to the term that presents the feature vector Age = 69, Previous Stroke Episodes = 1, Systolic Blood Pressure = ⊥ (unknown), \( Cholesterol_{LDL} \) = 131, \( Cholesterol_{HDL} \) = 49, Triglycerides = 200, Lifestyle Habits = 4, Risk Factors = [1, 2], one may compute a (QoI, DoC) pair for each argument.
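A minimal sketch of how such a term might be evaluated is given below, assuming the DoC formulation of [24], i.e., \( DoC = \sqrt{1 - \Delta l^{2}} \), where Δl is the length of the attribute’s interval after normalization to [0, 1], with DoC = 1 for exactly known values and DoC = 0 for unknown ones. The domain bounds used for illustration are assumptions, except for the stated Risk Factors domain [0, 4].

```python
import math

def doc(value, lo, hi):
    """Degree of Confidence of one attribute, per the formulation assumed from [24].

    value -- an exact number, a (low, high) interval, or None (unknown)
    lo, hi -- the bounds of the attribute's domain
    """
    if value is None:               # unknown: the interval spans the whole domain
        return 0.0
    if isinstance(value, tuple):    # interval value: normalize its length to [0, 1]
        delta = (value[1] - value[0]) / (hi - lo)
        return math.sqrt(1 - delta ** 2)
    return 1.0                      # exactly known value

print(round(doc((1, 2), 0, 4), 2))  # Risk Factors = [1, 2] over [0, 4] -> 0.97
print(doc(None, 50, 250))           # unknown Systolic Blood Pressure -> 0.0
print(doc(69, 0, 120))              # Age = 69 (hypothetical domain) -> 1.0
```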

It is now possible to represent the normalized case repository in graphic form, showing each case in the Cartesian plane in terms of its QoI and DoC (Fig. 2). Furthermore, the retrieval stage can be improved by reducing the search space, using data mining techniques, like clustering, in order to obtain different groups and to identify the one(s) that are closest to the New Case, which is represented as a square in Fig. 2.

Fig. 2. A case’s set split into clusters.
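The clustering method itself is not fixed by the approach; a possible realization of this search space reduction, sketched here with k-means from scikit-learn (an assumption, not the paper’s stated choice), groups the normalized cases by their (QoI, DoC) coordinates and retains only the cluster closest to the new case:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each normalized case summarized by its (QoI, DoC) pair (hypothetical values).
cases = np.array([[1.0, 0.98], [1.0, 0.87], [0.95, 0.62],
                  [0.90, 0.58], [1.0, 1.0], [0.92, 0.55]])
new_case = np.array([[1.0, 0.98]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cases)
label = kmeans.predict(new_case)[0]          # cluster closest to the new case
retrieved = cases[kmeans.labels_ == label]   # reduced search space for retrieval
print(retrieved)
```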

4 Case Based Reasoning

The CBR methodology for problem solving stands for the act of finding and justifying the solution to a given problem based on the consideration of similar past ones, by reprocessing and/or adapting their data or knowledge [12]. In CBR, the cases are stored in a Case-Base, and those that are similar (or close) to a new one are used in the problem solving process. The typical CBR cycle presents the mechanism that should be followed to have a consistent model. In fact, it is an iterative process, since the solution must be tested and adapted while the result of applying it remains inconclusive. In the final stage the case is learned and the knowledge base is updated with the new case [12, 13]. Despite promising results, current CBR systems are neither complete nor adaptable enough for all domains. In some cases, the user is required to follow the similarity method defined by the system, even if it does not fit his/her needs [25]. Moreover, other problems may be highlighted. On the one hand, existing CBR systems have limitations related to the capability of dealing with unknown, incomplete and self-contradictory information. On the other hand, an important feature that is often discarded is the ability to compare strings. In some domains strings are important to describe a situation, a problem or even an event [12, 25].

Contrasting with other problem solving methodologies (e.g., those that use Decision Trees or Artificial Neural Networks), relatively little work is done offline. Undeniably, in almost all situations, the work is performed at query time. The main difference between this new approach and the typical CBR one lies in the fact that not only do all the cases have their arguments set in the interval [0, 1], but it also allows for the handling of incomplete, unknown, or even self-contradictory data or knowledge [25]. The classic CBR cycle was changed in order to include a normalization phase aiming to enhance the retrieve process (Fig. 3). The Case-Base will be given in terms of triples that follow the pattern:

$$ Case = \left\{ { < Raw_{case} , Normalized_{case} , Description_{case} > } \right\} $$

where \( Raw_{case} \) and \( Normalized_{case} \) stand for themselves, and \( Description_{case} \) is made of a set of strings, or even free text, which may be analyzed with string similarity algorithms.
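A direct transcription of this pattern might read as follows (a sketch; the concrete field types are assumptions):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Case:
    raw: dict                              # attributes as acquired (may hold None or intervals)
    normalized: List[Tuple[float, float]]  # the (QoI, DoC) pairs after normalization
    description: str                       # free text, compared later by string similarity

case_base: List[Case] = []                 # the Case-Base is a collection of such triples
```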

Fig. 3. The extended CBR cycle [25].

When confronted with a new case (Fig. 4), the system is able to retrieve all cases that meet such a structure and to optimize such a population, i.e., it considers the DoC value of each case’s attributes, or of their optimized counterparts, when analysing similarities among them. Thus, under the occurrence of a new case, the goal is to find similar cases in the Case-Base. Having this in mind, the reductive algorithm given in [24] is applied to the new case, with the result:

$$ \underbrace {{stroke_{new} \left( {\left( {1,1} \right),\left( {1,1} \right),\left( {1,1} \right),\left( {1,1} \right), \left( {1,1} \right),\left( {1,1} \right),\left( {1,1} \right),\left( {1,0.87} \right)} \right)\,:\!\!\!:\,1\,:\!\!\!:\,0.98}}_{new\,case} $$
Fig. 4. The new case characteristics and description.

After the normalization process, the new case is compared with every case retrieved from the cluster using a similarity function, sim, given in terms of the average of the modulus of the arithmetic difference between the arguments of each case of the retrieved cluster and those of their counterparts in the problem (since Description stands for free text, its analysis is excluded at this stage). For the first retrieved case one gets \( stroke_{new \to 1}^{DoC} = 0.14 \), where \( stroke_{new \to 1}^{DoC} \) denotes the dissimilarity between \( stroke_{new}^{DoC} \) and \( stroke_{1}^{DoC} \). It was assumed that every attribute has equal weight. Thus, the similarity for \( stroke_{new \to 1}^{DoC} \) is \( 1 - 0.14 = 0.86 \). With respect to QoI the procedure is similar, returning \( stroke_{new \to 1}^{QoI} = 1 \).
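Sketched in code (the DoC vector of the retrieved case 1 is hypothetical, chosen so that the dissimilarity reproduces the 0.14 above):

```python
def doc_similarity(new_docs, case_docs):
    """1 minus the average absolute difference of the DoC arguments
    (equal weights; the Description attribute is excluded)."""
    dissim = sum(abs(a - b) for a, b in zip(new_docs, case_docs)) / len(new_docs)
    return 1 - dissim

new_docs = [1, 1, 1, 1, 1, 1, 1, 0.87]              # from stroke_new above
case1_docs = [1, 1, 0.97, 1, 0.58, 1, 0.60, 0.60]   # hypothetical retrieved case
print(round(doc_similarity(new_docs, case1_docs), 2))  # -> 0.86
```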

Descriptions will be compared using String Similarity Algorithms in order to get a similarity measure between them. It is then necessary to compare the description of the new case with the descriptions of the cases stored in the repository (in this study the strategy used was the Dice Coefficient [26]):

$$ stroke_{new \to 1}^{Description} = 0.78 $$
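A common formulation of the Dice Coefficient operates on character bigrams; the sketch below follows that convention (an assumption, as [26] may define it over other units) and illustrates how two overlapping free-text descriptions yield a score between 0 and 1:

```python
def dice_coefficient(a: str, b: str) -> float:
    """Dice Coefficient over character bigram sets: 2|X ∩ Y| / (|X| + |Y|)."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a.lower()), bigrams(b.lower())
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

# Hypothetical free-text descriptions of the new case and of a stored one:
print(dice_coefficient("hypertension and smoking habits",
                       "hypertension and past smoking habits"))
```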

With these values we are able to get the final similarity function, sim:

$$ sim\_stroke_{new \to 1} = \frac{0.86 + 1 + 0.78}{3} = 0.88 $$

These procedures should be applied to the remaining cases of the retrieved cluster in order to obtain the most similar ones, which may stand for the possible solutions to the problem.

5 Conclusions

In order to address the CBR cycle both theoretically and practically, the Decision Support System presented in this work to assess stroke predisposing risk is centred on a formal framework based on Logic Programming for Knowledge Representation and Reasoning, complemented with a CBR approach to computing that caters for the handling of incomplete, unknown, or even self-contradictory data or knowledge. Under this approach the cases’ retrieval and optimization phases were heightened, and the time spent on those tasks was shortened by 18.7% when compared with existing systems, with an accuracy of around 89%. The proposed method also allows for the analysis of free text attributes using String Similarity Algorithms, which fills a gap present in almost all CBR software tools. Additionally, under this approach, the user may define the weights of the cases’ attributes on-the-fly, allowing him/her to choose the most appropriate strategy to address the problem (i.e., it gives the user the possibility to narrow the search space for similar cases at runtime).