1 Introduction

The global economic meltdown of the late 2000s exposed many organisations around the world whose financial indicators were all on a downward trend. As companies begin their slow recovery, they are increasingly looking for ways to reduce the risk associated with their business. This has led to a number of advanced products and techniques that aim to help organisations reduce risk or make better decisions. As a matter of fact, when the quality of the possible investments decreases or the risk associated with investments increases, being able to fully understand the risks faced and reduce them while avoiding bad investments can make the difference between dying, surviving and expanding.

Nowadays organisations have access to a quantity of data and information that was not available 20 years ago, and the current trend suggests that the amount of information will only keep growing. In addition, almost everything is now online, and huge amounts of information can be retrieved in seconds. It is also becoming much easier to store and maintain large amounts of data. Hence, financial organisations are moving towards data-driven models which try to predict the future by looking at the past.

Many organisations around the world still use statistical regression models which capture only information that can be refined into mathematical models to generate two outputs (0/1 or Good/Bad). Statistical regression analysis includes many techniques (linear, multiple, logistic) for modelling and analysing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. One of the simplest and most popular modelling methods is linear regression, which is also the most widely used technique in finance. For example, the capital asset pricing model uses linear regression (Cohen et al. 2003), as does the concept of “Beta” for analysing and quantifying the systematic risk of an investment (Levinson 2006). Linear regression is also often used in financial time series modelling (Cohen et al. 2003). In addition, it is an important empirical tool in economics, where it is used, for example, to predict consumption spending (Deaton 1992), fixed investment spending, inventory investment, purchases of a country’s exports (Krugman and Obstfeld 1988), spending on imports (Krugman and Obstfeld 1988), the demand to hold liquid assets (Laidler 1993), labour demand (Ehrenberg and Smith 2008) and labour supply (Ehrenberg and Smith 2008). Logistic regression is a variant of nonlinear regression that is appropriate when the target (dependent) variable has only two possible values (e.g., live/die, buy/don’t-buy, infected/not-infected). However, regression techniques in general are often considered black box models which cannot be easily understood and analysed by the normal user.
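To make the two-outcome setting above concrete, the following minimal sketch (our own illustration on synthetic data, with hypothetical feature meanings, not the paper's datasets) fits a logistic regression that maps applicant features to a Good/Bad probability:

```python
import numpy as np

# Minimal logistic regression via gradient descent on synthetic data.
# Features and data are hypothetical, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # e.g. scaled income, debt ratio, age
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + rng.normal(size=200) > 0).astype(float)  # 1 = Good, 0 = Bad

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid gives P(Good | x)
    w -= lr * (X.T @ (p - y)) / len(y)        # gradient of the log-loss
    b -= lr * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print(f"training accuracy: {np.mean((p > 0.5) == y):.2f}")
```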

Some advanced machine learning and artificial intelligence techniques have been applied in the financial domain. For example, Support Vector Machines (SVMs) have been applied in Kim (2003) to forecast financial time series and in Kim and Sohn (2010) to effectively manage governmental funds to small and medium enterprises by identifying those likely to default. Another machine learning technique is Neural Networks (NNs), which have been applied successfully in a large number of financial applications such as Giacomini (2003), Lawrence (1997) and Kwong (2001). However, the drawback of such advanced machine learning techniques is that, although they can give good prediction accuracies, they produce black box models which are very difficult for a financial analyst to understand and analyse, while it is now becoming a common requirement to have an explanation of the reasoning behind a given financial decision.

There are a number of reasons why models that we can understand are important; the main reason is trust. No matter how sophisticated our economy has become, all transactions still come down to trust: we have to trust the person that we are trading with. This requires transparency, so that we can see what the other party is doing. This need for transparency is reflected in legislation that forces financial institutions to disclose the reasoning behind their financial decisions and models.

There exist various white box transparent models; one of these is the decision tree. Decision trees are well suited to modelling target variables with binary values, but, unlike logistic regression, they can also model variables with more than two discrete values, and they handle variable interactions. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. Decision trees can provide an explanation for the output class chosen. Various works have been reported using decision trees in financial applications, such as Garcia-Almanza (2008) and Garcia-Almanza and Tsang (2008).

Fuzzy Logic Systems (FLSs) provide white box models which can be easily analysed and understood by the layman user. However, FLSs suffer from the curse of dimensionality, which causes an FLS-based system to generate a large number of rules in order to give good model accuracy. Most recently, type-2 FLSs, which are capable of handling high uncertainty levels, have been employed for the generation of classification models (Sanz et al. 2010, 2011). However, the existing type-2 fuzzy classification systems are not suited for the financial domain: they generate big rule bases, and they assume that all the possible rules are represented in the existing models, which is impossible for systems with a big number of inputs, where the generated model will only cover a small subset of the search space. Furthermore, FLSs have a high number of parameters to tune, which can take considerable time to choose in an optimal way.

In this paper, we present a genetic type-2 FLS for modelling and prediction in financial applications. The proposed system avoids the drawbacks of the existing type-2 fuzzy classification systems in that it is able to carry out prediction based on a relatively small pre-specified rule base size even if the incoming data vector does not match any rules in the FLS rule base. The proposed type-2 FLS aims to increase the understandability of the generated model by achieving the best performance possible with a limited and summarised number of rules, in order to achieve simplicity and comprehensibility for the user. We have carried out various evaluations, and we present in this paper results from two distinctive financial domains: one for the prediction of good/bad customers in a real-world lending application, and the other for the prediction of arbitrage opportunities in the stock markets. The proposed system was able to use the generated summarised models for prediction within financial applications. The proposed genetic type-2 FLS outperformed white box models like the Evolving Decision Rule (EDR) procedure (a white box model based on Genetic Programming (GP) (Garcia-Almanza and Tsang 2008) and decision trees) and gave a comparable performance to black box models like neural networks, while providing a white box model which is easy to understand and analyse by the lay user.

In Sect. 2, we will present a brief overview of type-2 FLSs. Section 3 will present an overview of fuzzy classification systems. Section 4 will present the proposed genetic type-2 fuzzy based modelling and prediction system for financial applications. Section 5 will present the experiments and the achieved results. Finally, Sect. 6 will present the conclusions and future work.

2 Brief overview of type-2 fuzzy logic systems

In recent years, type-2 FLSs have grown in popularity due to their ability to handle high levels of uncertainty. Type-2 FLSs employ type-2 fuzzy sets, as shown in Fig. 1, where a type-2 fuzzy set is characterised by a fuzzy Membership Function (MF), i.e. the membership value (or membership grade) for each element of this set is a fuzzy set in [0, 1], unlike a type-1 fuzzy set where the membership grade is a crisp number in [0, 1] (Hagras 2004).

Fig. 1 A type-2 fuzzy set

The membership functions of type-2 fuzzy sets are three dimensional and include a Footprint Of Uncertainty (FOU) (shaded in grey in Fig. 1). It is the new third dimension of type-2 fuzzy sets and the FOU that provide additional degrees of freedom, making it possible to directly model and handle uncertainties (Hagras 2004; Mendel 2001). Interval type-2 FLSs use interval type-2 fuzzy sets (such as the type-2 fuzzy set shown in Fig. 1) to represent the inputs and/or outputs of the FLS. In interval type-2 fuzzy sets, all the third-dimension values are equal to one. The use of an interval type-2 FLS helps to simplify the computation when compared to a general type-2 FLS.

The proposed system in this paper is a type-2 fuzzy classification system, and hence it does not follow the structure of the type-2 FLSs reported in Hagras (2004) and Mendel (2001); the classification process is summarised in the following section.

An interval type-2 fuzzy set denoted \(\tilde{A}\) is written as follows:

$$\begin{aligned} \tilde{A}=\int \limits _{x\in X} \,\int \limits _{u\in \left[ {\underline{\mu }}_{\tilde{A}} (x),\,\bar{\mu }_{\tilde{A}} (x) \right] } {1}/{(x,u)} \end{aligned}$$
(1)

\(\bar{\mu }_{\tilde{A}} (x)\) and \({\underline{\mu }}_{\tilde{A}} (x)\) represent the upper and lower membership functions, respectively, of the interval type-2 fuzzy set \(\tilde{A}\). The upper membership function is associated with the upper bound of the footprint of uncertainty \(FOU({\tilde{A}})\) of a type-2 membership function, and the lower membership function is associated with the lower bound of \(FOU({\tilde{A}})\) (Hagras 2004).
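As an illustration of these definitions, the sketch below (a minimal example of our own, not the authors' implementation) represents an interval type-2 fuzzy set by a pair of triangular upper and lower membership functions and returns the membership interval \([{\underline{\mu }}_{\tilde{A}}(x), \bar{\mu }_{\tilde{A}}(x)]\) for a crisp input:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def it2_membership(x, upper, lower, lower_height=0.8):
    """Return (lower, upper) membership bounds of an interval type-2 set.

    `upper` and `lower` are (a, b, c) triangle parameters; the lower MF is
    additionally capped at `lower_height` so it stays inside the FOU.
    """
    mu_upper = tri(x, *upper)
    mu_lower = min(tri(x, *lower), lower_height, mu_upper)
    return mu_lower, mu_upper

# Example: a "Medium" set whose FOU is the area between the two triangles.
print(it2_membership(4.0, upper=(0.0, 5.0, 10.0), lower=(1.0, 5.0, 9.0)))
```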

3 Brief overview on fuzzy classification systems

In fuzzy logic classification systems, for a given c-class pattern classification problem with \(n\) attributes (or features), a given rule in the FLS rule base could be written as follows:

$$\begin{aligned}&\hbox {Rule}\,R^j:\hbox { If}\; x_1 \; {\hbox {is}}\; A_1^j \,\, \hbox {and} \ldots \hbox {and}\;x_n \,{\hbox {is}}\,A_n^j \,\hbox {then Class}\,C_j \nonumber \\&\quad \hbox {with}\,\, CF_j ,j=1,2,\ldots ,N \end{aligned}$$
(2)

where \(x_1, \ldots ,x_n\) represent the n-dimensional pattern vector, \(A_i^j \) is the fuzzy set representing the linguistic label for the antecedent pattern \(i\), \(C_j\) is a consequent class (which could be one of the possible \(c\) classes), \(N\) is the number of fuzzy IF-Then rules in the FLS rule base and \(CF_j\) is the certainty grade (i.e., rule weight) of rule \(j\). Assuming each input pattern is represented by \(K\) fuzzy sets and given that we have \(n\) input patterns, the number of rules needed to cover the whole search space is \(K^n\). In the arbitrage application presented in this paper, we have seven inputs where each input is represented by five fuzzy sets; hence the number of rules needed to cover the whole search space for this application is 5\(^{7}\) \(=\) 78,125 rules. Each rule includes all the available input patterns, where each pattern is represented by one of the available fuzzy sets; “don’t care” conditions are not used for any input feature. In our future work, we will introduce “don’t care” conditions, as this will help to increase the interpretability of the rules, as explained in Ishibuchi et al. (1999). In our given applications (as in the vast majority of financial applications), we do not have enough data to generate this huge number of rules. Hence, there will be various cases where the incoming input vector does not fire any rule in the FLS rule base.
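A quick sketch (illustrative only) shows how fast the complete rule space grows: enumerating every antecedent combination for seven inputs with five linguistic labels each reproduces the \(5^7\) count above.

```python
from itertools import product

labels = ["VeryLow", "Low", "Medium", "High", "VeryHigh"]  # K = 5 fuzzy sets
n_inputs = 7

# Every possible rule antecedent is one element of the Cartesian product.
all_antecedents = product(labels, repeat=n_inputs)
print(len(labels) ** n_inputs)   # 78125 rules to cover the whole search space
print(next(all_antecedents))     # e.g. ('VeryLow', 'VeryLow', ..., 'VeryLow')
```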

In the design of a fuzzy rule-based system, there exist two conflicting objectives: error minimisation and comprehensibility maximisation. The trade-off between these two objectives has been discussed in several studies (Casillas et al. 2003a, b). Several type-1 fuzzy classification systems have been reported in the literature, such as Ishibuchi (2001a, b), Ishibuchi and Yamamoto (2004, 2005, 2006), Shigeo (1995), Ahmad and Jahormi (2007), Wang (2003) and Mansoori et al. (2006). However, in the vast majority of these papers the data was quite easy to partition, and if an input pattern did not match any of the previously labelled decision areas, the input was discarded. In financial applications this cannot be done: if a new pattern that has never been seen before is presented, a decision needs to be made anyway, and discarding a given pattern a priori cannot be the solution. A technique to resolve this problem was proposed in Garcia-Almanza and Tsang (2006, 2008); this technique keeps in a rule repository all the rules for the minority class in unbalanced data sets. All the inputs that do not match any rule in the repository are considered to belong to the majority class. This technique can work on unbalanced data sets but might not work in all cases.

Most recently, type-2 FLSs capable of handling high uncertainty levels have been employed for the generation of classification models (Sanz et al. 2010, 2011). However, the existing type-2 fuzzy classification systems are not suited for the financial domain: they generate big rule bases and assume that all the possible rules are represented in the existing models, which is impossible for problems with a big number of input variables, where the generated model will only cover a small subset of the search space. In this paper, we present a type-2 FLS for the modelling and prediction of financial applications. The proposed system avoids the drawbacks of the existing type-2 fuzzy classification systems in that it is able to carry out prediction based on a pre-specified rule base size even if the incoming data vector does not match any rules in the FLS rule base.

4 The proposed genetic type-2 fuzzy modelling and prediction system for financial applications

In fuzzy logic systems, the choice of appropriate parameters for the fuzzy sets poses a major challenge to the design of an FLS. By simply changing the fuzzy set parameters, it is possible to change the behaviour of a fuzzy logic system; for example, in the field of managing risk in financial systems, it is possible to build riskier or more risk-averse fuzzy systems by changing the parameters of the fuzzy sets so that the FLS passes more or fewer customers. It is extremely difficult, though, to find the optimal configuration using a simple manual or heuristic approach, because of the number of variables to be optimised and the interactions between these variables. In our work, Genetic Algorithms (GAs) were used to tune the parameters of the type-2 fuzzy sets of the FLS.

The GA uses a population where each chromosome describes a fuzzy set space, in other words the size and position of each membership function for each input. The GA starts by randomly producing an initial population, and then at each generation it evolves the previous population. An instance of the FLS is created from each individual of the population, and each instance generates a different fitness value. Using the fitness values, the best individuals are selected and the crossover and mutation operators are applied to produce the new population for the next iteration. As shown in Fig. 2, the steps followed by the proposed genetic type-2 fuzzy system can be summarised as:

  1. Initialise randomly the first generation.

  2. Build a rule base for each parameter configuration of the type-2 fuzzy sets as provided by a given chromosome. As a matter of fact, each chromosome describes the fuzzy membership function configuration, and this, in conjunction with the training data, is used to build the rule base (the rule base generation process is discussed in Sect. 4.2).

  3. Evaluate the classification ability of the generated type-2 FLS and produce the fitness value for each individual.

  4. If an individual reaches the desired fitness value or the maximum number of iterations is reached, the algorithm terminates.

  5. The GA uses the population and the fitness values to evolve and produce a new population of type-2 fuzzy sets.

  6. Go to step 2.

Fig. 2 An overview of the proposed genetic type-2 fuzzy logic system

4.1 The GA operation

4.1.1 The GA fitness function

The GA tries to find the best membership function configuration by optimising the fitness function. Our work has focused on classification problems, where the aim is to identify the correct class for a given input. In order to evaluate the performance we use the Receiver Operating Characteristic (ROC) curve (Swets 1996). To explain how the ROC curve works, we first briefly introduce the measures computed from a confusion matrix. A confusion matrix displays the data about actual and predicted classifications made by a classifier (Kohavi and Provost 1998). This information is used in supervised learning to determine the performance of classifiers. Given an instance and two classes (positive and negative), there are four possible results: the instance is positive and it is classified as positive (True Positive (TP)); the instance is negative and it is classified as positive (False Positive (FP)); the instance is positive and it is classified as negative (False Negative (FN)); the instance is negative and it is classified as negative (True Negative (TN)). Figure 3 summarises the confusion matrix for a two-class problem.

Fig. 3 Confusion matrix for a two-class problem

Hence, True Positive (TP) is the number of correct predictions in positive cases; False Positive (FP) is the number of incorrect predictions where the instance was classified as positive when it is actually negative; False Negative (FN) is the number of incorrect predictions where the instance was classified as negative when it is actually positive; and True Negative (TN) is the number of correct negative predictions.

The ROC curve explains the performance of a classifier by plotting two measures.

  • Recall, which is also called sensitivity or true positive rate, is defined as the proportion of positive cases that were correctly identified (Kohavi and Provost 1998); it is determined by the formula:

    $$\begin{aligned} { Recall}_{ positive} =\frac{ TP}{{ TP}~+~{ FN}} \end{aligned}$$
    (3)

    Recall is calculated on the positive class only (Swets 1996), though it is possible to extend Eq. (3) to the negative class as well, as shown in Eq. (4) below.

    $$\begin{aligned} { Recall}_{ negative} =\frac{ TN}{{ TN}~+~{ FP}} \end{aligned}$$
    (4)
  • False positive rate is the proportion of negative cases that were wrongly predicted as positive. It is determined by the formula:

$$\begin{aligned} \qquad { False}\,{ Positive}\,{ Rate}_{ positive} =\frac{ FP}{{ FP}~+~{ TN}} \end{aligned}$$
(5)

The false positive rate is by definition calculated on the false positive value of the confusion matrix (Swets 1996). However, in the same way that \({ Recall}_{ positive}\) was extended to \({ Recall}_{ negative}\) by calculating it for the negative class, it is possible to extend the false positive rate by considering its symmetric version on the negative class, as shown in Eq. (6) below; this measure is also known as the False Negative Rate.

$$\begin{aligned} { False}\,{ Positive}\,{ Rate}_{{ negative}}&= { False \; Negative\;Rate}\nonumber \\&= \frac{ FN}{{ FN}~+~{ TP}} \end{aligned}$$
(6)

The point is that on a two-class problem it is possible to calculate the recall on both classes, so that there will be \({ recall}_{ positive}\) and \({ recall}_{ negative}\), as well as all the other measures. It is interesting to note that the false positive rate for both classes can be obtained from the recalls using the following formulas:

$$\begin{aligned} { False}\; { Positive}\; { Rate}_{ positive}&= 1- { recall}_{ negative}\end{aligned}$$
(7)
$$\begin{aligned} { False}\; { Positive}\; { Rate}_{ negative}&= 1- { recall}_{ positive} \end{aligned}$$
(8)

This conclusion is important for us because it means we can consider only the recall as a measure in the fitness function. As a matter of fact, in order to produce a classifier that optimises the curve on a ROC graph, the classifier can simply optimise the average of the recall over all classes. Hence, in our GA the fitness function is the average recall over all classes. In order to produce different points on the ROC curve, representing riskier or more risk-averse classifiers, we use weights to favour the recall of some classes, and hence to position the classifier at different points on the graph.

$$\begin{aligned} { Fitness\;score}=\frac{\mathop \sum \nolimits _{i=1}^N { Recall}_i *w_i }{N} \end{aligned}$$
(9)

where \(N\) is the number of classes for the problem and \(w_i\), \(i=1,\ldots,N\), are the class weights \(w=\{w_1 ,\ldots ,w_N\}\).
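To make Eq. (9) concrete, here is a minimal sketch (our own illustration; the class labels, weights and predictions are made up) that computes the per-class recalls from predictions and combines them into the weighted fitness score:

```python
import numpy as np

def fitness_score(y_true, y_pred, weights):
    """Weighted average recall over all classes, as in Eq. (9)."""
    classes = np.unique(y_true)
    recalls = []
    for c in classes:
        mask = y_true == c                          # instances whose actual class is c
        recalls.append(np.mean(y_pred[mask] == c))  # TP_c / (TP_c + FN_c)
    return float(np.dot(recalls, weights) / len(classes))

# Hypothetical two-class example: equal weights give the plain average recall;
# raising one weight pushes the GA towards classifiers that favour that class,
# i.e. towards a different point on the ROC curve.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
print(fitness_score(y_true, y_pred, weights=[1.0, 1.0]))  # (0.75 + 0.5) / 2
```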

4.1.2 Employing genetic algorithms to determine the type-2 fuzzy sets parameters

This section explains how a chromosome is translated into the fuzzy set space, describing the size and position of the membership functions. The GA implementation of Shakya (2004) was used, encoding the chromosomes as real numbers. In order to use this implementation, we needed to provide the algorithm with the following parameters:

  • Solution length: the length of a single chromosome; in our implementation this is the number of parameters describing the fuzzy set space that need to be tuned (Kassem 2012).

  • Min/Max Range: the minimum and maximum number that can be generated in a gene (Kassem 2012).

  • Fitness Function: the objective function that the GA tries to optimise. This function takes a chromosome (a possible solution) as input and returns a fitness score (Kassem 2012).

  • Population Size: the number of individuals within a population (Kassem 2012).

  • Crossover Rate: every time a pair of parents is chosen from the population produced by the selection process, a random number is generated; if this number is less than the crossover rate then crossover is performed on the parents, otherwise the parents are copied without alteration as the offspring (Kassem 2012).

  • Mutation Rate: for every gene within a chromosome a random number is generated; if that number is less than the mutation rate then mutation is performed on that gene, otherwise the gene is left unaltered (Kassem 2012).

  • Maximum Generation: the maximum number of generations; if this is reached, the algorithm is forced to terminate (Kassem 2012).

  • Elite Solutions: the number of elite solutions that are copied from one generation to the next (Kassem 2012).

Once the above parameters have been chosen for the GA, the algorithm starts by randomly creating an initial generation of individuals, where each gene of an individual represents a membership function parameter for the FLS.

Each input of the FLS is represented by five type-2 fuzzy sets, which require 17 parameters, as shown in Fig. 4. Each of the 17 parameters represented in the chromosome does not represent an absolute coordinate in the universe of discourse, but the relative distance from the previous parameter. The translation of a chromosome into a fuzzy set space is explained in more detail later. To fully build the fuzzy sets needed by the system, the total number of parameters (genes) to be optimised is found as follows:

$$\begin{aligned} { Number}\;{ of}\;{ parameters}=17 \times F \end{aligned}$$
(10)

where \(F\) is the number of inputs (or features). For a dataset with 7 input features, the total number of parameters to tune is \(17 \times 7 = 119\), creating a chromosome composed of 119 genes. The fuzzy partitions derived from the chromosome are specified under the constraint that the upper membership function of a given fuzzy set starts at the same point as the right-hand vertex of the previous membership function. For this reason, adjacent upper membership functions always intersect at the membership value of 0.5, and the sum of the upper membership values is equal to 1.

Fig. 4 The 17 parameters to be tuned for the type-2 fuzzy sets of each input

The used GA parameters are listed in Table 1.

Table 1 The GA parameters

Considering a single input, 17 parameters are needed to shape its interval type-2 fuzzy sets (as shown in Fig. 4). Each gene contains a percentage representing the distance between parameter \(i\) and parameter \(i-1\). Take for example the segment of the chromosome shown in Fig. 5, which contains all the parameters needed to build a fuzzy set space for one input. The sum of all genes within this segment is 150, so \({ gene}_5\) (21) is equivalent to 21/150 = 14 %, meaning that the distance between \(V_4\) and \(V_5\) is 14 % of the total universe of discourse. As another example, \({ gene}_1\) (6) is equivalent to 4 %; this means that the distance between \(V_1\) and the starting point of the fuzzy set space is 4 % of the total universe of discourse. Considering that in this example the universe of discourse starts at zero and ends at 50, as shown in Fig. 8, the core of the first membership function ends at the value 2; if the starting point of the universe of discourse had been \(S\), the core would have ended at \(S + 2\). Figure 6 shows the equivalent percentages for each of the genes shown in Fig. 5. After these percentages are derived, they are applied to the universe of discourse in order to determine the distance for each part of the membership functions. As mentioned, in the example considered in Fig. 8 the universe of discourse starts at zero and ends at 50; the equivalent distances for the membership functions are shown in Fig. 7. Taking \({ gene}_2\) (2 %), 2 % of 50 is 1, which is the second component in Fig. 7. As mentioned earlier, these components are the distances between the parameters; hence the distance between the second and first parameters is 1, and the value is \(2+1=3\) from the beginning of the type-2 fuzzy sets' universe of discourse. The final values of the parameters and the distances are shown in Fig. 8.

Fig. 5 A segment of a chromosome

Fig. 6 Percentages derived from the chromosome

Fig. 7 Distances derived from the percentages

Fig. 8 Final values of the derived membership functions

In order to build a fuzzy set space for 11 variables, 11 different chromosome segments are selected and used to build the fuzzy set space, requiring a chromosome of size \(11*17=187\) genes.

4.2 Rule generation in the proposed genetic type-2 FLS

The previous section showed how to employ genetic algorithms to learn the parameters of the type-2 fuzzy sets. This subsection shows how the rules of the type-2 FLS are modelled, taking as input a dataset and the fuzzy sets whose parameters were optimised by the GA; this is called the modelling phase. In the modelling phase, the rule base of the type-2 fuzzy classification system is constructed from the existing training dataset. Once the model has been built, the FLS can be used to predict new inputs; this is called the prediction phase. In the prediction phase, the generated rule base is used to classify the incoming input vectors. Figure 9 shows an overview of the modelling and prediction phases.

Fig. 9 An overview of the modelling and prediction phases

4.2.1 The modeling phase

The modeling phase operates according to the following steps (as shown in Fig. 9):

Step 1: Raw rule extraction: For a fixed input–output pair \(({x^{(t)},C^{(t)}})\) in the dataset, \(t=1,\ldots,T\) (where \(T\) is the total number of training instances available for the modelling phase), compute the upper and lower membership values \({\bar{\mu }}_{A_{s}^{q}}, \,{\underline{\mu }}_{A_{s}^{q}}\) for each antecedent fuzzy set \(q=1,\ldots,K\) (where \(K\) is the total number of fuzzy sets representing the input pattern \(s\), \(s=1,\ldots,n\)). Generate all the rules combining the matched fuzzy sets \({A_{s}^{q}}\) (i.e. those with \({\bar{\mu }}_{A_{s}^{q}} > 0\) or \({\underline{\mu }}_{A_{s}^{q}}> 0\)) for all \(s=1,\ldots,n\). Thus the rules generated by \(({x^{(t)},C^{(t)}})\) will have different antecedents and the same consequent class \(C^{(t)}\), and each of the raw rules extracted from \(({x^{(t)},C^{(t)}})\) can be written as follows:

$$\begin{aligned}&R^j:{ If}\;x_1 \;{ is}\;\varvec{{\tilde{A}}_1^{qjt}}\;{ and}\;\ldots \;{ and}\;x_n \;{ is}\;{\varvec{{\tilde{A}}_n^{qjt}}}\;{ then\; Class }\;C_t,\nonumber \\&\quad t=1,2,\ldots ,T \end{aligned}$$
(11)

For each generated rule, we calculate the firing strength \(F^t\), which measures the strength with which the point \(x^{(t)}\) belongs to the fuzzy region covered by the rule. \(F^t\) is defined in terms of the lower and upper bounds of the firing strength, \({\underline{f^{(t)}}}\) and \({\overline{f^{(t)}}}\), which are calculated as follows:

$$\begin{aligned} {\overline{f^{jt}}} ({{x}}^{(t)})&= {\overline{{\mu }_{A^{qjt}_{1}}}} ({{x}}_{1})*\cdots *{\overline{{\mu }_{A^{qjt}_n}}} {({x}}_{n})\end{aligned}$$
(12)
$$\begin{aligned} {\underline{f^{jt}}}({{x}^{(t)}})&= {\underline{\mu _{A^{qjt}_1}}}({{x}_1})*\cdots *{\underline{\mu _{A^{qjt}_n}}}({{x}_n}) \end{aligned}$$
(13)

The * denotes the minimum or product t-norm. Step 1 is repeated for all the data points \(t\) from 1 to \(T\) to obtain the generated rules in the form of Eq. (11).
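As an illustration of Eqs. (12) and (13), the following sketch (a minimal example with made-up membership intervals, using the product t-norm) computes a rule's firing interval from the interval memberships of its antecedents:

```python
from math import prod

def firing_interval(memberships):
    """Lower/upper firing strength of a rule via the product t-norm.

    `memberships` is a list of (mu_lower, mu_upper) pairs, one per input,
    i.e. the interval membership of each antecedent fuzzy set (Eqs. 12-13).
    """
    lower = prod(mu_lo for mu_lo, _ in memberships)
    upper = prod(mu_up for _, mu_up in memberships)
    return lower, upper

# Hypothetical rule with three antecedents matched by an input vector:
print(firing_interval([(0.4, 0.6), (0.7, 0.9), (0.5, 0.8)]))  # (0.14, 0.432)
```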

Financial data is usually highly imbalanced (for example, in a lending application it is expected that the majority of people will be good customers and a minority will be bad customers, and usually the interesting class is the minority class). Hence, we present a new approach called “weighted scaled dominance”, which is an extension of the “scaled dominance” of our previous work and the “weighted confidence” introduced by Ishibuchi and Yamamoto (2005). This method tries to handle imbalanced data by giving minority classes a fair chance when competing with the majority class. In order to compute the scaled dominance for a given rule having a consequent class \(C_j\), we divide the firing strength of this rule by the summation of the firing strengths of all the rules which have \(C_j\) as the consequent class. This counteracts the imbalance of the data towards a given class. We scale the firing strength by scaling the upper and lower bounds of the firing strengths as follows:

$$\begin{aligned} \overline{fs^{jt}} =\frac{\overline{f^{jt}}}{\mathop \sum \nolimits _{j\in { Class}\,C_j } \overline{f^{jt}}}\end{aligned}$$
(14)
$$\begin{aligned} \underline{fs^{jt}}=\frac{\underline{f^{jt}}}{\mathop \sum \nolimits _{j\in { Class}\,C_j } \underline{f^{jt}}} \end{aligned}$$
(15)

Step 2: Scaled support and scaled confidence calculation: Many of the generated rules will share the same antecedents but have different consequents. To resolve this conflict, we calculate the scaled confidence and scaled support, which are computed by grouping the rules that have the same antecedents and conflicting classes. For \(m\) given rules having the same antecedents and conflicting classes, the scaled confidence \(({{\tilde{A}}_q \Rightarrow C_q})\) (defined by its upper bound \(\overline{c}\) and lower bound \(\underline{c}\); it is scaled as it involves the scaled firing strengths mentioned in the step above) that class \(C_q\) is the consequent class for the antecedents \({\tilde{A}}_q\) can be written as follows:

$$\begin{aligned} {\bar{c}}({{\tilde{A}}_q \Rightarrow C_q})&= \frac{\mathop \sum \nolimits _{x_s \in Class C_q} \overline{fs^{jt}} ({x_s})}{\mathop \sum \nolimits _{j=1}^m \overline{fs^{jt}} ({x_s})}\end{aligned}$$
(16)
$$\begin{aligned} {\underline{c}}({\tilde{A}_q \Rightarrow C_q})&= \frac{\mathop \sum \nolimits _{x_s \in Class\, C_q} \underline{fs^{jt}}({x_s})}{\mathop \sum \nolimits _{j=1}^m \underline{fs^{jt}}({x_s})} \end{aligned}$$
(17)

The scaled confidence can be viewed as measuring the validity of \(({\tilde{A}_q \Rightarrow C_q})\); the confidence can be viewed as a numerical approximation of the conditional probability (Ishibuchi 2001b). The scaled support (defined by its upper bound \(\bar{s}\) and lower bound \(\underline{s}\); it is scaled as it involves the scaled firing strengths mentioned in the step above) is written as follows:

$$\begin{aligned} \bar{s}({\tilde{A}_q \Rightarrow C_q})&= \frac{\mathop \sum \nolimits _{x_s \in Class C_q} \overline{fs^{jt}} ({x_s})}{m}\end{aligned}$$
(18)
$$\begin{aligned} {\underline{s}}({\tilde{A}_q \Rightarrow C_q})&= \frac{\mathop \sum \nolimits _{x_s \in Class\, C_q} \underline{fs^{jt}}({x_s})}{m} \end{aligned}$$
(19)

The support can be viewed as measuring the coverage of training patterns by \(({\tilde{A}_q \Rightarrow C_q})\). The scaled dominance (defined by its upper bound \(\bar{d}\) and lower bound \(\underline{d}\)) can now be calculated by multiplying the scaled support and scaled confidence of the rule as follows:

$$\begin{aligned} \bar{d}({\tilde{A}_q \Rightarrow C_q})&= \bar{c}({\tilde{A}_q \Rightarrow C_q})\cdot \bar{s}({\tilde{A}_q \Rightarrow C_q})\end{aligned}$$
(20)
$$\begin{aligned} {\underline{d}}({\tilde{A}_q \Rightarrow C_q})&= {\underline{c}}({\tilde{A}_q \Rightarrow C_q })\cdot {\underline{s}}({\tilde{A}_q \Rightarrow C_q}) \end{aligned}$$
(21)

The “weighted scaled dominance” (defined by its upper bound \(\overline{wd}\) and lower bound \(\underline{wd})\) is calculated as follows:

$$\begin{aligned} \overline{wd} ({\tilde{A}_q \Rightarrow C_q })&= \bar{d}({\tilde{A}_q \Rightarrow C_q})-\overline{d_{ave}}\end{aligned}$$
(22)
$$\begin{aligned} \underline{wd}({\tilde{A}_q \Rightarrow C_q })&= {\underline{d}}({\tilde{A}_q \Rightarrow C_q })-\underline{d_{ave}} \end{aligned}$$
(23)

where \(\underline{d_{ave}}\) and \(\overline{d_{ave}}\) are the bounds of the average dominance over fuzzy rules with the same antecedent \(\tilde{A}_q\) but different consequent classes.

For rules that share the same antecedents but have different consequent classes, we replace these rules by a single rule having the same antecedents and the consequent class corresponding to the rule that gives the highest average weighted scaled dominance value \((\overline{wd} +\underline{wd})/2\).
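Putting Eqs. (14)–(23) together, the conflict resolution step might be sketched as follows (our own paraphrase with made-up scaled firing strengths; not the authors' code):

```python
def resolve_conflict(group):
    """Pick the consequent class for rules sharing the same antecedents.

    `group` maps each candidate class to a list of (lower, upper) scaled
    firing strengths of the training rules voting for that class.
    """
    m = sum(len(v) for v in group.values())          # conflicting rules
    tot_lo = sum(lo for v in group.values() for lo, _ in v)
    tot_up = sum(up for v in group.values() for _, up in v)
    dom = {}
    for cls, strengths in group.items():
        s_lo = sum(lo for lo, _ in strengths)
        s_up = sum(up for _, up in strengths)
        c_lo, c_up = s_lo / tot_lo, s_up / tot_up    # scaled confidence (16)-(17)
        sup_lo, sup_up = s_lo / m, s_up / m          # scaled support (18)-(19)
        dom[cls] = (c_lo * sup_lo, c_up * sup_up)    # scaled dominance (20)-(21)
    ave_lo = sum(d[0] for d in dom.values()) / len(dom)
    ave_up = sum(d[1] for d in dom.values()) / len(dom)
    # weighted scaled dominance (22)-(23), averaged over its two bounds
    wd = {c: ((d[0] - ave_lo) + (d[1] - ave_up)) / 2 for c, d in dom.items()}
    return max(wd, key=wd.get)

# Hypothetical conflict: two rules vote Good, one votes Bad.
print(resolve_conflict({"Good": [(0.2, 0.4), (0.1, 0.3)], "Bad": [(0.3, 0.5)]}))
```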

In Sanz et al. (2010), the rule generation system keeps only the rule with the highest firing strength; in our method, however, we keep all the rules generated by the given input patterns, which allows covering a bigger area of the decision space.

Step 3: Rule selection: Fuzzy based classification methods generate a large number of rules, which can cause major problems in financial applications where the users need to understand the system. Hence, in our method we reduce the rule base to a relatively small pre-specified number of rules, generating a summarised model which can be easily read, understood and analysed by the user. In this step, we select only the top \(Y\) rules per class (\(Y\) is pre-specified by the given financial application), i.e. the rules with the highest average weighted scaled dominance values. This selection is useful because rules with low weighted scaled dominance may not actually be relevant and could introduce errors; it also helps to keep the classification system balanced between the majority and minority classes. By the end of this step the modelling phase is finished, and we have \(X= nC\cdot Y\) rules (with \(nC\) the number of classes) ready to classify and predict incoming patterns, as discussed below in the prediction phase.

4.2.2 Prediction phase

When an input pattern is introduced to the generated model, two cases can occur: in the first case, the input \(x^{(p)}\) matches at least one of the X rules in the generated model, and we follow the process explained in case 1 below; if \(x^{(p)}\) does not match any of the existing X rules, we follow the process explained in case 2.

4.2.2.1 Case 1: The input matches one of the existing rules

If the incoming input \(x^{(p)}\) matches any of the existing X rules, we calculate the firing strengths of the matched rules according to Eqs. (12) and (13), which results in \({\overline{f^j}} ({x^{(p)}})\) and \({\underline{f^j}}( {x^{(p)}})\). In this case, the predicted class is determined by calculating a vote for each class as follows:

$$\begin{aligned} {\bar{Z}}Class_h ({x^{(p)}})&= \frac{\mathop \sum \nolimits _{j\in h} \overline{f^j} (x^{(p)})*\overline{wd} ({A_q \rightarrow C_q})}{\max j\in h(\overline{f^j} (x^{( p)})*\overline{wd} ( {A_q \rightarrow C_q }))}\nonumber \\ \end{aligned}$$
(24)
$$\begin{aligned} \underline{Z}Class_h ( {x^{(p)}})&= \frac{\mathop \sum \nolimits _{j\in h} \underline{f^j}(x^{( p)})*\underline{wd}( {A_q \rightarrow C_q })}{\max j\in h(\underline{f^j}(x^{( p)})*\underline{wd}( {A_q \rightarrow C_q }))}\nonumber \\ \end{aligned}$$
(25)

In the above equations, \(\max j\in \hbox {h}(\overline{f^j} (x^{( p)})\, *\, \overline{wd} ( {A_q \rightarrow C_q }))\) and \(\max j\in \hbox {h}( {\underline{f^j}( {x^{( p)}})*\underline{wd}( {A_q \rightarrow C_q })})\) represent taking the maximum of the product of the upper and lower firing strengths, respectively, with the weighted scaled dominance among the \(Y\) rules selected for each class (see Step 3). The total vote strength is then calculated as:

$$\begin{aligned} ZClass_h =\frac{\overline{Z} Class_h ({x^{(p)}})+\underline{Z}Class_h ({x^{(p)}})}{2} \end{aligned}$$
(26)

The class with the highest \(ZClass_h\) will be the class predicted for the incoming input vector \(x^{(p)}\).
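A minimal sketch of the case 1 vote of Eqs. (24)–(26) follows (the rule list and numbers are hypothetical):

```python
def class_vote(matched):
    """Vote strength per class from matched rules (Eqs. 24-26).

    `matched` maps each class to (f_lo, f_up, wd_lo, wd_up) tuples for its
    matched rules: firing bounds and weighted scaled dominance bounds.
    Assumes at least one strictly positive term per class.
    """
    votes = {}
    for cls, rules in matched.items():
        lo_terms = [f_lo * wd_lo for f_lo, _, wd_lo, _ in rules]
        up_terms = [f_up * wd_up for _, f_up, _, wd_up in rules]
        z_lo = sum(lo_terms) / max(lo_terms)         # Eq. (25)
        z_up = sum(up_terms) / max(up_terms)         # Eq. (24)
        votes[cls] = (z_lo + z_up) / 2               # Eq. (26)
    return max(votes, key=votes.get)

print(class_vote({
    "Arbitrage":   [(0.3, 0.5, 0.2, 0.4), (0.6, 0.8, 0.1, 0.3)],
    "NoArbitrage": [(0.4, 0.6, 0.05, 0.1)],
}))
```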

4.2.2.2 Case 2: The input does not match any of the existing rules

If the incoming input vector \({x^{(p)}}\) does not match any of the existing X rules, we need to decide the output class for the input. The first step is to build all the rules that can be generated from the given input, using the matched fuzzy sets. Suppose we have a classification problem with two inputs \(x_1\) and \(x_2\), and suppose that a given input \({x^{(p)}}\) matches four different fuzzy sets overall, as shown in Fig. 10. Let \(MR({x^{(p)}})\) be the set of rules obtained by combining the matched fuzzy sets. In the example shown in Fig. 10, the four matching fuzzy sets generate four different rules: \(\mathrm{R_1} =\left\{ { Small,Medium} \right\}\), \(\mathrm{R_2} =\left\{ { Small,Large} \right\}\), \(\mathrm{R_3}=\left\{ { Medium,Medium} \right\}\), \(\mathrm{R_4} =\left\{ { Medium,Large} \right\}\). Each rule will have an associated firing strength but no output class.

Fig. 10 An example to illustrate the similarity measure

The following step is to find the closest rule in the rule base for each rule in \(MR({x^{(p)}})\). In order to do this, we need to calculate the similarity (or distance) between each of the fuzzy rules generated by \({x^{(p)}}\) and each of the X rules stored in the rule base. Let “\(k\)” be the number of rules generated from the input \({x^{(p)}}\) (\(k = 4\) in the example shown in Fig. 10). Let the linguistic labels that fit \({x^{(p)}}\) be written as \(v_{ inputr} =({v_{ input1r} ,v_{ input2r} ,\ldots ,v_{ inputnr}})\), where \(r\) is the index of the \(r\)-th rule generated from the input, and let the linguistic labels corresponding to a given rule in the rule base be \(v_j =({v_{j1} ,v_{j2} ,\ldots ,v_{jn}})\). Each of these linguistic labels (Low, Medium, etc.) can be decoded into an integer. Hence the similarity between a rule generated by \({x^{(p)}}\) and a given rule in the rule base can be calculated from the distance between the two vectors as follows:

$$\begin{aligned}&{ Similarity}_{{ input}\,r\leftrightarrow j} =\left( 1-\left| {\frac{v_{ input1r} -v_{j1} }{V_1 }} \right| \right) \nonumber \\&\quad *\left( 1-\left| {\frac{v_{ input2r} -v_{j2} }{V_2 }} \right| \right) *\cdots *\left( 1-\left| {\frac{v_{ inputnr} -v_{jn} }{V_n }} \right| \right) \nonumber \\ \end{aligned}$$
(27)

where \(V_1, \ldots, V_n\) represent the number of linguistic labels representing each variable. At this point, each rule in the rule base has a similarity associated with the \(r\)-th rule generated from the input. For each rule in \(MR({x^{(p)}})\), the most similar rule in the rule base, according to Eq. (27), is found to decide on the output class. There will be “\(k\)” rules (the most similar rules to the \(k\) rules in \(MR({x^{(p)}})\)) selected to decide the output class for the input \({x^{(p)}}\). The predicted class is determined by a vote for each class as follows:

$$\begin{aligned}&{\bar{Z}}{ Class}_{h} ({x^{(p)}})=\frac{\mathop \sum \nolimits _{j\in \hbox {h}} \overline{wd} ({A_q \rightarrow C_q})*\overline{f^j} (x^{(p)})}{\max j\in \hbox {h}(\overline{f^j} ({x^{(p)}})*\overline{wd} ({A_q \rightarrow C_q }))}\nonumber \\ \end{aligned}$$
(28)
$$\begin{aligned}&{\underline{Z}}Class_h ( {x^{(p)}})=\frac{\mathop \sum \nolimits _{j\in \hbox {h}} {\underline{wd}}({A_q \rightarrow C_q})*{\underline{f^j}}(x^{(p)})}{\max j\in {\hbox {h}}({\underline{f^j}}({x^{(p)}})*{\underline{wd}}( {A_q \rightarrow C_q}))}\nonumber \\ \end{aligned}$$
(29)

where \({\underline{f^j}}({x^{(p)}})\) and \({\overline{f^j}} ({x^{(p)}})\) are the lower and upper firing strengths of the most similar rule in the rule base, and \({\overline{wd}} ({A_q \rightarrow C_q})\) and \({\underline{wd}}({A_q \rightarrow C_q})\) are the upper and lower bounds of the weighted scaled dominance of the most similar rule to the rule considered in \(MR({x^{(p)}})\). In the above equations, \(\max \;j\in h({\overline{f^j}} (x^{(p)})*{\overline{wd}} ({A_q \rightarrow C_q}))\) and \(\max \;j\in h({{\underline{f^j}}({x^{(p)}})*{\underline{wd}}({A_q \rightarrow C_q})})\) represent taking the maximum of the product of the upper and lower firing strengths, respectively, with the weighted scaled dominance among the most similar rules to the “\(k\)” rules; this measure is used to scale the lower and upper voting strengths of each class. The total vote strength is then calculated as:

$$\begin{aligned} { ZClass}_{h} =\frac{{\bar{Z}}{ Class}_{h} ({x^{(p)}})+{\underline{Z}}{ Class}_h ( {x^{(p)}})}{2} \end{aligned}$$
(30)

The class with the highest \({ ZClass}_{h}\) will be the class associated with \({x^{(p)}}\).
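To illustrate the similarity step of Eq. (27), the following sketch (our own minimal example with integer-coded labels) scores a rule generated from the input against stored rules and picks the most similar one:

```python
def rule_similarity(r_input, r_base, n_labels):
    """Similarity of Eq. (27) between two rules given as label indices.

    `r_input` and `r_base` are tuples of integer-coded linguistic labels
    (e.g. Small=0, Medium=1, Large=2); `n_labels` gives V_i per variable.
    """
    sim = 1.0
    for vi, vj, V in zip(r_input, r_base, n_labels):
        sim *= 1.0 - abs(vi - vj) / V
    return sim

# Hypothetical: two inputs with three labels each; the generated rule
# {Small, Medium} is compared against two stored rules.
generated = (0, 1)
rule_base = {(0, 2): "Class 1", (1, 2): "Class 0"}
best = max(rule_base, key=lambda r: rule_similarity(generated, r, (3, 3)))
print(best, "->", rule_base[best])  # the most similar stored rule's class
```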

5 Evaluations and results

The classification ability of the proposed system has been tested on two different datasets. The first set of experiments is based on data used for spotting arbitrage opportunities in the London International Financial Futures Exchange (LIFFE) market (Garcia-Almanza and Tsang 2006).

The second is a credit approval dataset, obtained from a real-world credit reference agency, identifying good and bad customers, where good customers are profitable customers and bad customers are non-profitable customers.

5.1 Performance on arbitrage dataset

We have tested the proposed genetic type-2 FLS on modelling and predicting arbitrage opportunities. Computers today are able to spot stock misalignments in the market in milliseconds, which allows making almost risk-free profits. There are two main challenges in this type of operation. Firstly, arbitrage situations do not occur very often. Secondly, the operator must act ahead of others, so the competition is reduced to how fast a computer is and how fast its connection to the stock exchange is. Garcia-Almanza and Tsang (2006) showed that arbitrage opportunities do not appear instantaneously; there are patterns in the market which can be recognised 10 min ahead.

The proposed system is trained to identify arbitrage opportunities ahead of time (Garcia-Almanza 2008). The data reported in this paper was further developed in Garcia-Almanza (2008) and Garcia-Almanza and Tsang (2008) in order to identify arbitrage situations by analysing option and futures prices in the London International Financial Futures Exchange (LIFFE) market. The pre-processed data comprised 1,641 instances, of which only 401 represent arbitrage opportunities, the rest representing non-arbitrage opportunities. The data was split into 2/3 for modelling and 1/3 for testing.

According to Garcia-Almanza and Tsang (2008), the information from the option and future prices in the LIFFE market was manipulated, selected and reduced to just seven features, which are described in Table 2.

Table 2 The seven input features (variables) for the arbitrage data set (Garcia-Almanza and Tsang 2008)

We have compared the proposed genetic type-2 FLS approach with one of the most powerful white box modelling and prediction systems for spotting arbitrage opportunities, the Evolving Decision Rule (EDR) procedure (Garcia-Almanza and Tsang 2008). The EDR method evolves a set of decision rules using Genetic Programming (GP) and receives feedback from a key element called the repository, a structure whose objective is to store a set of rules. The resulting rules are used to create a range of classifications that allows the user to choose the best trade-off between misclassification and false alarm costs.

We have also compared the proposed genetic type-2 FLS approach against Neural Networks, which were found to give a better performance than any other black box model available for this data set.

The proposed genetic type-2 FLS aims to fulfil two objectives: the first is to achieve good results on both the recall and the false positive rate; the second is to use a small number of rules to model and predict the arbitrage opportunities, thus presenting a white box model which can be easily understood and analysed by the lay user. The ideal classifier has a recall (true positive rate) of 1 and a False Positive Rate (FPR) of 0, so that the area under its ROC curve equals 1. Thus, the more predictive a given model is, the higher its ROC curve lies and the closer it approaches the ideal classifier; in general, this means having the highest recall and the lowest false positive rate possible, so that the area under the model's ROC curve approaches one.

Moving along the Receiver Operating Characteristic (ROC) curve (which plots the true positive rate vs. the false positive rate) means trading off the FPR against the recall.

In the following evaluations, we employed the proposed genetic type-2 FLS with different fuzzy set space configurations in order to move along the ROC curve; to do so, the fitness function of the GA was weighted using different weights in Eq. (9). Figure 11 shows the ROC curve obtained over the testing data by the proposed genetic type-2 FLS, plotted against the ROC curves obtained by the EDR procedure (Garcia-Almanza and Tsang 2008) and the Neural Networks respectively. From Fig. 11, it is clear that the proposed genetic type-2 FLS gives a better ROC curve than the EDR procedure and the Neural Networks, while presenting the user with a small number of rules which summarise the model and explain the system behaviour to the lay user in an understandable and comprehensible way. Figure 11 shows the results obtained when employing the proposed genetic type-2 FLS with only 200 and 40 rules. The best results are obtained using 200 rules; the genetic type-2 FLS with just 40 rules has slightly worse performance than with 200 rules, but still produces much better results than the EDR procedure and slightly better performance than the Neural Networks. The 40 or 200 rules were selected by taking those with the highest average weighted scaled dominance.

Fig. 11 ROC graph over testing data for the arbitrage prediction comparing EDR, NN and genetic type-2 FLSs

In order to compare the classifiers on their overall behaviour (including the whole range of riskier and more risk-averse classifiers), the Area Under the Curve (AUC) technique has been used; the ROC AUC statistic is commonly used by the machine learning community for model comparison (Hanley and McNeil 1983). The AUC, when using normalised units, is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’) (Fawcett 2006). Table 3 summarises the AUC results for the classifiers used: the genetic type-2 FLS with 200 rules gives the best AUC of 0.9849, followed by an AUC of 0.9755 for the genetic type-2 FLS with 40 rules, both better than the Neural Networks classifier with an AUC of 0.9607 and the EDR with an AUC of 0.8039.

Table 3 AUC results for the arbitrage data

The best average recall obtained on this dataset is 96.22 %, with the genetic type-2 FLS with 200 rules. The best average recall of the genetic FLS with 40 rules was 94.64 %, the best average recall obtained by the neural network was 94.04 %, and the EDR best average recall was around 78 %.

These results show that the proposed genetic type-2 FLS gives a better performance when compared to a white box model like the EDR procedure. The proposed genetic type-2 FLS also gave better performance when compared to black box models, which shows that it can achieve a similar or even better performance than black box models while providing a transparent white box model with a summarised number of rules which are easy to understand and analyse by the lay user.

It should be emphasised that all the inputs for the arbitrage data were continuous, which allowed the FLS to give its best results. However, in the data employed in the next subsection the inputs are both continuous and categorical/discrete (like gender), which does not allow the FLS to give its best performance; nevertheless, the FLS is able to give results comparable to black box models while providing a white box model.

5.2 Performance on credit approval dataset

With the economic crisis of recent years, prime and subprime credit requests have continuously expanded. Unfortunately, with the increase in the number of people asking for credit, the number of people not being able to repay increased as well. The ability to find and decline bad credit requests is crucial for lenders and provides huge savings, and a lot of effort is nowadays put into finding the best techniques to reduce the risk in the lending market. The proposed system has been tested using data gathered from a credit lender. The dataset includes the information that a customer would provide when asking for credit. It is composed of 123,116 records and 10 features, divided into 3/4 for the training set (92,397 records) and 1/4 for the testing set (30,720 records); 40 % of the training set is used during the GA tuning as a validation set. The system needs to be able to find bad credit requests, but it must also not decline too many requests: a simple approach to avoid all bad credit requests would be simply to accept no requests at all, but of course this extreme scenario is not feasible because it does not match the business model of credit lenders. The data is extremely unbalanced, with 98.54 % of customers belonging to class 0 (Good Customers) and 1.46 % belonging to class 1 (Bad Customers). For this reason, different types of our classifier have been built, providing a more risk-averse or riskier approach, thus accepting more or fewer requests. It is worth mentioning that this data is quite different from the arbitrage data set of the previous subsection: it is very noisy and sparse, and hence the prediction accuracy of any prediction system will be rather limited.

The genetic type-2 FLS was compared to neural networks, which were found to be the best black box model for this data set. Figure 12 shows the ROC curve over the testing data of the proposed genetic type-2 FLS plotted against the Neural Networks ROC curve. Table 4 summarises the best average recall and the area under the curve for both techniques.

Fig. 12 Credit approval dataset ROC curve over testing data

Table 4 Credit approval AUC and best average recall

The best performance on this dataset is obtained by the neural network, mainly because half of the features in this dataset are categorical, and the FLS currently loses performance when dealing with these kinds of features. However, it can be seen that over this unbalanced noisy data, the proposed genetic type-2 FLS produced results comparable to black box models like Neural Networks, while producing a white box model that can be easily understood and analysed by the lay user.

5.3 Comparison between the proposed weighted scaled dominance and other measures employed in fuzzy classification systems

In this paper, we have presented a new measure called weighted scaled dominance, which extends the “scaled dominance” measure introduced in our previous work and the weighted confidence measure introduced by Ishibuchi and Yamamoto (2005). The aim of this new measure is to give more weight to associations that do not occur very often in the dataset, especially those affiliated with the minority classes and hence described in the decision space by few samples. In most cases the minority classes are the relevant classes to identify, and the scarcity of samples makes the problem more challenging. There exist in the literature other measures that aim to give more importance to infrequent but still important associations; one of these is the lift. The lift can be defined as the combined support of the consequent and the antecedent of a rule over the support of the antecedent multiplied by the support of the consequent (Tufféry 2011).

$$\begin{aligned} l({{\tilde{A}}_q \Rightarrow C_q})=\frac{s({\tilde{A}}_q \Rightarrow C_q )}{s({{\tilde{A}}_q})*s(C_q)} \end{aligned}$$
(31)

Equation (31) can be rewritten using the definition of confidence in Eqs. (16) and (17) as follows:

$$\begin{aligned} l({{\tilde{A}}_q \Rightarrow C_q})=\frac{c({{\tilde{A}}_q \Rightarrow C_q })}{s({C_q })} \end{aligned}$$
(32)

The numerator is the confidence of the rule, and the denominator represents the support of the consequent class. Another important measure is the weighted dominance, which is obtained simply by the multiplication of the weighted confidence and weighted support explained in Ishibuchi and Yamamoto (2004, 2005). These metrics are defined as follows:

$$\begin{aligned} wc({{\tilde{A}}_q \Rightarrow C_q})&= c( {{\tilde{A}}_q \Rightarrow C_q })-c_{ave}\end{aligned}$$
(33)
$$\begin{aligned} ws({\tilde{A}_q \Rightarrow C_q})&= s( {\tilde{A}_q \Rightarrow C_q })-s_{ave}\end{aligned}$$
(34)
$$\begin{aligned} wd({\tilde{A}_q \Rightarrow C_q})&= wc({\tilde{A}_q \Rightarrow C_q })*ws({\tilde{A}_q \Rightarrow C_q}) \end{aligned}$$
(35)

where \(c_{ave}\) and \(s_{ave}\) are the average confidence and support over fuzzy rules with the same antecedent \(\tilde{A}_q\) but different consequent classes (Ishibuchi and Yamamoto 2004, 2005).

We compared the results obtained by the different metrics using the same fuzzy sets. We have compared the metrics over various data sets, but due to space limitations we report only the results achieved over complicated noisy data from a banking credit evaluation system (different from the data sets employed above). Figure 13 summarises the results obtained with the different data mining measures while varying the size of the rule bases used for classification.

Fig. 13 Comparison between the proposed weighted scaled dominance and other measures employed in fuzzy classification systems

As shown in Fig. 13, the best result (best average recall) was obtained employing the suggested weighted scaled dominance measure described in Eqs. (22) and (23), or, in general, whenever the scaling procedure is used in any technique. The scaling procedure is applied by employing the scaled firing strength described in Eqs. (14)–(15). When scaling is not used, the next best results are obtained with the lift, described by Eq. (32). This underlines the point that infrequent associations can also be very important, which is especially true in imbalanced datasets. The weighted dominance, described by Eqs. (33)–(35), did not get good results without the scaling procedure; as a matter of fact, this measure does not use the scaled firing strength as the weighted scaled dominance does. As can be seen, the comparisons were conducted with different pre-specified rule base sizes, the rule selection being performed by selecting only those rules with the highest value of the metric in question.

5.4 Evaluation of the performance of the proposed similarity technique

The quality of a fuzzy logic system is positively correlated with the quality of its rule base. But what happens if an input does not match any rule in the rule base? In the majority of fuzzy classification systems, there are two main approaches to handle the situation where an incoming input does not match any rule in the FLS rule base. The first approach is to reject the input and give no prediction for it, hence excluding it from the confusion matrix and from the calculation of the recall and false positive rate. The second approach is to build a default rule that fires every time an input does not fire any rule in the rule base.

The first approach is an unacceptable solution for the financial domain, where the prediction system should always be able to provide a prediction. The second approach could be acceptable for highly unbalanced datasets when the system has only two output classes, but overall it is not a strong solution, since it does not improve the quality of the classifier and runs into problems when there is a large number of output classes. In fact, on a two-class problem this approach will by definition always produce the same average recall on the inputs not matching rules in the rule base, regardless of the class chosen as the default. Consider for example a dataset where 1,000 inputs do not match any rule in the rule base, and suppose that 700 of these cases actually belong to class 1 while 300 actually belong to class 0. If we create a default rule mapping all of these 1,000 unmatched inputs to class 1, we obtain the confusion matrix in Table 5.

Table 5 Confusion matrix for the example (all 1,000 unmatched inputs predicted as class 1)

                    Predicted class 1    Predicted class 0
Actual class 1      700                  0
Actual class 0      300                  0

From the example, even though the dataset is unbalanced and we chose the default class that made the most sense, the achieved average recall is only 50 %: regardless of the default class chosen, the recall on that class will by definition be 100 % while the recall on the other class will be 0 %, so the average recall on the inputs not matching rules in the rule base will always be 50 %. Hence, the contribution of the default rule to the classifier (on the average recall measure) will always be the same minimal 50 %, regardless of the output class chosen.
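
This 50 % bound can be verified mechanically; the following minimal sketch recomputes the average recall for the confusion matrix of Table 5 (the dictionary layout is our own illustration).

```python
def average_recall(confusion):
    """confusion[actual][predicted] -> count; returns the mean per-class recall."""
    recalls = []
    for actual, row in confusion.items():
        total = sum(row.values())
        recalls.append(row.get(actual, 0) / total if total else 0.0)
    return sum(recalls) / len(recalls)

confusion = {1: {1: 700, 0: 0},   # 700 class-1 inputs, all predicted as class 1
             0: {1: 300, 0: 0}}   # 300 class-0 inputs, all predicted as class 1
print(average_recall(confusion))  # 0.5, whichever class the default rule picks
```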

Our proposed technique, on the other hand, aims to find the most similar rules in the rule base (as discussed in Sect. 4.2.2.2) and to choose the output class using the weighted scaled dominance approach. This section shows the results obtained by the proposed similarity approach on both the arbitrage and credit approval datasets mentioned above.

On the arbitrage data, we tried various pre-specified rule base sizes of 10, 20, 30 and 40 rules. The number of cases where the inputs did not match any rule in the rule base was 97, 62, 56 and 34 for rule bases of 10, 20, 30 and 40 rules respectively. Figure 14 compares, on testing data, the average recall obtained when using the similarity technique and the default rule technique. As shown in Fig. 14, when employing the similarity technique with just 5 rules per class (10 rules in the rule base), there were 97 cases of inputs not matching rules in the rule base, and the achieved average recall on these cases was 63.9 % (compared to 50 % when using any default rule approach). The similarity method gives better results as the number of rules in the rule base increases, because the decision space is better represented by the rules and the similarity technique can thus find more appropriate similar rules. With a rule base of 40 rules, the similarity technique achieved an average recall of 100 % on the cases where the inputs did not match rules in the rule base, i.e. it correctly classified all 34 such cases.

Fig. 14 Comparison between the similarity and default rule technique on arbitrage data set

The similarity measure was also tested on the credit approval dataset, which presents more complex features: the dataset contains unordered categorical features on which it is difficult to define an ordered relationship (for example, credit card type). For these features a default distance of \(1/(L-1)\) was used, where \(L\) is the total number of labels of the feature, and the similarity formula in Eq. (27) was changed accordingly. The similarity technique was tested on this dataset with rule bases of 50, 100, 150, 200, 250, 300 and 350 rules. The credit approval data is a bigger and highly unbalanced dataset (98.54 % of the samples belong to class 0 and 1.46 % to class 1), and the number of cases where the inputs did not match any rule in the rule base was 2,426, 748, 429, 319, 188, 125 and 92 for rule bases of 50, 100, 150, 200, 250, 300 and 350 rules respectively. Figure 15 shows the average recall comparison between the similarity and the default rule techniques. It can be seen that the similarity technique again gives better average recall (compared to the default rule) on the cases where the inputs do not match any rule in the rule base.

Fig. 15 Comparison between the similarity and default rule technique on credit approval data set
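
As an illustration of the distance used for the unordered categorical features mentioned above, the following sketch assigns distance 0 to identical labels and the default distance \(1/(L-1)\) to any pair of distinct labels; since Eq. (27) is not reproduced here, the exact way this distance plugs into the similarity formula is an assumption.

```python
def categorical_distance(a, b, num_labels):
    """Default distance for unordered categorical features: 0 for identical
    labels, 1/(L - 1) for any pair of distinct labels, where L is the total
    number of labels the feature can take."""
    if a == b:
        return 0.0
    return 1.0 / (num_labels - 1)

# e.g. a feature with 4 card types: any two different types are 1/3 apart
print(categorical_distance("visa", "amex", num_labels=4))  # 0.333...
```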

6 Conclusions and future work

The global economic meltdown of the late 2000s exposed many organisations around the world, which drove the need to build robust frameworks for predicting and assessing risks in financial applications. In the current economic situation, transparency has become an important factor, as there is a need to fully understand and analyse a given financial model. In this paper, we have presented a genetic type-2 FLS capable of generating summarised models from a pre-specified number of linguistic rules, which enables the user to understand the generated financial model, thus producing a transparent model that is easy to read and analyse. The proposed system was tested on two different datasets.

We have shown how the proposed genetic system enables learning of the various parameters of the input type-2 fuzzy sets, which cannot be easily designed or tuned manually.

We have presented two novel measures. The first is a data mining measure called weighted scaled dominance; its performance was compared against classic data mining metrics and it was shown to outperform other widely used measures. The second is a similarity measure, a technique used to provide a classification even when an input does not match any rule in the rule base. This technique was implemented to avoid the commonly used approach of discarding any inputs that do not match a rule in the rule base. We have also shown the improvement of the proposed similarity measure over the default rule: for inputs not matching rules in the rule base, the proposed similarity measure results in a considerable uplift in the average recall when compared to the default rule, which in any case cannot be easily used when the number of output classes is more than two.

We have performed several evaluations in two distinct financial domains: the prediction of good/bad customers in a real-world financial lending application, and the prediction of arbitrage opportunities in the stock markets. The proposed genetic type-2 FLS outperformed white box models like the Evolving Decision Rule (EDR) procedure (a white box model based on Genetic Programming (GP) and decision trees) and gave a performance comparable to black box models like neural networks, while providing a white box model which is easy to understand and analyse by the lay user.

In financial applications, there is a need for clear, transparent and easy to understand models, which stresses the importance of increasing the interpretability of the given financial model. Hence, in our future work, we aim to also optimise the length of the rules and use “don’t care” conditions to make the genetic type-2 FLS easier to read for the lay user. Ishibuchi and Nojima (2007) discussed the trade-off between interpretability and accuracy in type-1 fuzzy systems and how accuracy can be affected when trying to build interpretable systems. In our future work, we aim to carry out the same analysis as Ishibuchi and Nojima (2007) for type-2 fuzzy systems and investigate how this trade-off affects type-2 systems.