1 Introduction and Related Work

Data mining is the task of finding patterns in a large data collection. Methods of supervised learning find a mapping, called a model, between input objects and a target property. For the classification task the target property is categorical. Without loss of generality, we discuss classification tasks where the target property distinguishes between two classes denoted by the values 0 and 1, respectively. Let D be a set of objects (tuples of reals). Every object is characterized by its values for n given attributes: \(o=\left( o_1,\ldots , o_n\right) \). Furthermore, we assume the existence of a hidden mapping m from D to the classes, that is, \(m: D \rightarrow \{0,1\}\). We explicitly know the mapping only for a subset \(O \subset D\). That is, we are given \(M=\{(o,m(o))|o\in O\}\). Let \(T\!R\subset M\) be the set of training data and \(T\!E=M\setminus T\!R\) be the test data.

Solving a classification problem means constructing a mapping function \(cl:D\rightarrow \{0,1\}\), called a classifier, from given training data \(T\!R\). The classifier should approximate m and should provide a prediction on \(T\!E\) with high accuracy. The accuracy of a classifier is quantified as the fraction of correctly classified objects of all test objects: \(accuracy =|\{(o,m(o))\in T\!E| m(o) = cl(o)\}|/|T\!E|.\) In addition to high accuracy, a classifier should explain to users the connection between object attributes and the corresponding class, see [2]. A very popular classification method is the decision tree (DT), see [1]. The DT is based on rules of Boolean logic and can be seen as a good trade-off between accuracy and explanatory power [3]. That is, in contrast to works like [7, 15] we use logic as a means for explanation.

For finding the class of an object using a DT we navigate from the root to a leaf. Following such a path means checking conjunctively combined conditions on object attribute values. If we regard objects as points in \([0,1]^n\) then every tree node split on an attribute corresponds to one or more hyperplanes parallel to \(n-1\) of the axes.

Fig. 1. (left) Class decision lines for \((x>0.5) \wedge (y>0.5)\) (dashed) and \(x*y>0.5\) (solid); (right) space decomposition by axis-parallel decisions for \(x*y>0.5\)

See Fig. 1 (left) for a two-dimensional case where the dashed line refers to the class separation for \((x>0.5) \wedge (y>0.5)\). For that class decision the attribute values interact conjunctively on the level of Boolean truth values. But what about scenarios where the interaction takes place on object values directly? See for example the solid class separation line in Fig. 1 (left) for \(x*y>0.5\). Let, for example, \(x\in [0,1]\) encode age and \(y\in [0,1]\) continuously encode the BMI of a person. Furthermore, the risk of severe health damage from COVID may increase gradually with the product of age and BMI. In that and similar cases, decision trees based on axis-parallel decisions can only roughly approximate non-parallel decision lines and deteriorate into a staircase of splits, see Fig. 1 (right). A tighter approximation would lead to even more deterioration.

In contrast to the traditional decision tree based on Boolean decisions we develop a quantum-logic inspired decision tree (QLDT). Instead of combining Boolean values we regard attribute values from [0, 1] as results from quantum measurements and combine them directly by using negation, conjunction and disjunction following the concepts of quantum logic [9, 12]. Different from fuzzy logic, quantum logic based on mutually commuting conditions obeys the rules of a Boolean algebra. Therefore, in contrast to decision trees based on fuzzy logic [5, 6, 10], every logical formula can be represented as a set of disjunctively combined minterms which are themselves conjunctions of positive or negated conditions (disjunctive normal form) [4]. After deriving a logic expression e in disjunctive normal form we generate a QLDT (qldt(e)) from it.

Referring to the solid decision line in Fig. 1 (left) we obtain the quantum logical expression \(x\wedge y\) with \(x,y\in [0,1]\).

The evaluation of a traditional decision tree against an input object differs from the evaluation [qldt(e)] of a QLDT. Starting from the QLDT root we navigate in a parallel manner to all leaves where for each leaf we obtain a leaf-specific evaluation value from [0, 1]. All evaluation values of class-1-leaves are summed up to a class value from [0, 1]. A final threshold \(\tau \) is applied to the class value for a discrete class decision. The class decision can be written as \([qldt(e)] >\tau \) (in the example this yields \(x*y>0.5\)).

For our quantum logic decision tree approach we identify the following advantages:

  1. Quantum logic deals directly with continuous truth values;

  2. In contrast to fuzzy logic our quantum-logic inspired approach obeys the rules of the Boolean algebra [8], for example \([e\wedge \lnot e]=0\) and \([e\wedge e]=[e]\);

  3. Class separation lines are not restricted to be axis-parallel.

In the following sections we develop the QLDT. It is based on the concepts of CQQL. The quantum-logic inspired language CQQL (commuting quantum query language) was introduced in [12, 14].

2 Commuting Quantum Query Language (CQQL)

Syntactically, a CQQL expression is an expression of propositional logic based on conjunction, disjunction, and negation. We assume n atomic, unary conditions on the n values of an object o. Such a condition expresses gradually whether an input value is a high value, e.g. a high BMI value. Each of the conditions returns a value from [0, 1]. The upper bound 1 is interpreted as true and the lower bound 0 as false. In [11] we prove that a quantum logic expression based on atomic conditions on different attributes forms a Boolean (orthomodular, distributive) lattice.

A CQQL expression e in a specific syntactical normal form (CQQL normal form, see [11, 12, 14]) can be evaluated arithmetically. Each CQQL expression can be transformed into that normal form, see [12, 14].

Let the function atoms(e) return the set of atomic conditions involved in a possibly nested condition e. The CQQL normal form requires that for each conjunction \(e_1\wedge e_2\) and for each disjunction \(e_1\vee e_2\) (but not for the special case of an exclusive disjunction) the atom sets are disjoint: \(atoms(e_1)\cap atoms(e_2)=\emptyset \). If for \(e_1\vee e_2\) the conjunction \(e_1\wedge e_2\) is unsatisfiable in propositional logic then the disjunction is exclusive. We mark each exclusive disjunction by \({\mathop {\vee }\limits ^{.}}\).
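
To make the normal-form requirement concrete, the following sketch checks it for a simple nested-tuple representation of expressions. The representation, the function names (atoms, in_normal_form), and the identification of atomic conditions by attribute indices are ours and not part of the CQQL publications.

```python
# Sketch of atoms(e) and the normal-form check (our own representation, not the
# authors' implementation). Expressions are nested tuples, e.g.
# ("and", ("atom", 0), ("not", ("atom", 1))), where ("atom", j) denotes the
# atomic condition c_{j+1} on attribute j+1.

def atoms(e):
    """Return the set of atomic conditions occurring in the condition e."""
    op = e[0]
    if op == "atom":
        return {e[1]}
    if op == "not":
        return atoms(e[1])
    return atoms(e[1]) | atoms(e[2])            # "and", "or", "xor"

def in_normal_form(e):
    """Check the disjointness requirement of the CQQL normal form."""
    op = e[0]
    if op == "atom":
        return True
    if op == "not":
        return in_normal_form(e[1])
    disjoint = not (atoms(e[1]) & atoms(e[2]))
    # exclusive disjunctions ("xor") are exempt from the disjointness requirement
    return (disjoint or op == "xor") and in_normal_form(e[1]) and in_normal_form(e[2])

print(in_normal_form(("and", ("atom", 0), ("atom", 1))))   # True
print(in_normal_form(("or", ("atom", 0), ("atom", 0))))    # False: shared atom
```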

The evaluation of a CQQL expression e in the required normal form against an object o is written as \([\cdot ]^o.\) For brevity, we drop the object o and write just \([\cdot ]\). In the following recursive definition of [e], we distinguish five cases (a small evaluation sketch in code follows the list):

  1. Atomic condition: If e is an atomic condition then \([e]\in [0,1]\) returns the result from applying the corresponding condition on o.

  2. Negation: \([\lnot e]=1 - [e]\);

  3. Conjunction: \([e_1\wedge e_2]=[e_1]*[e_2]\);

  4. Non-exclusive disjunction: \([e_1\vee e_2]=[e_1]+[e_2]-[e_1]*[e_2]\); and

  5. Exclusive disjunction: \([e_1{\mathop {\vee }\limits ^{.}}e_2]=[e_1]+[e_2]\).
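
Using the tuple representation from the previous sketch, the five rules translate into a small recursive evaluator. This is only an illustrative sketch of ours; it assumes the expression is already in CQQL normal form and that an atomic condition simply returns the corresponding attribute value.

```python
# Recursive evaluation [e]^o following the five rules above (illustrative sketch;
# assumes e is in CQQL normal form and that [c_j]^o is simply the attribute value o_j).

def eval_cqql(e, o):
    op = e[0]
    if op == "atom":                                  # 1. atomic condition
        return o[e[1]]
    if op == "not":                                   # 2. negation
        return 1.0 - eval_cqql(e[1], o)
    if op == "and":                                   # 3. conjunction
        return eval_cqql(e[1], o) * eval_cqql(e[2], o)
    if op == "or":                                    # 4. non-exclusive disjunction
        a, b = eval_cqql(e[1], o), eval_cqql(e[2], o)
        return a + b - a * b
    if op == "xor":                                   # 5. exclusive disjunction
        return eval_cqql(e[1], o) + eval_cqql(e[2], o)
    raise ValueError(f"unknown operator {op!r}")

# Example: [x wedge y] for o = (0.8, 0.7) yields 0.8*0.7 = 0.56, cf. x*y > 0.5 in Sect. 1
print(eval_cqql(("and", ("atom", 0), ("atom", 1)), (0.8, 0.7)))
```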

We now extend the expressive power of a CQQL condition by introducing weighted conjunction \((e_1\wedge _{\theta _1,\theta _2}e_2)\) and weighted disjunction \((e_1\vee _{\theta _1,\theta _2}e_2)\). The work [13] develops the concept of weights in CQQL from quantum mechanics and quantum logic. Weight variables \(\theta _1,\theta _2\) stand for values out of [0, 1]. A weight \([\theta _i]=0\) means that the corresponding argument has no impact and a weight \([\theta _i]=1\) equals the unweighted case (full impact). We regard every weight variable \(\theta _i\) as a 0-ary atomic condition. Before we evaluate a condition with weights we map every weighted conjunction and weighted disjunction in e to an unweighted condition:

$$\begin{aligned} (e_1\wedge _{\theta _1,\theta _2}e_2)\rightarrow & {} ((e_1\vee \lnot \theta _1) \wedge (e_2\vee \lnot \theta _2))\\ (e_1\vee _{\theta _1,\theta _2}e_2)\rightarrow & {} ((e_1\wedge \theta _1) \vee (e_2\wedge \theta _2)). \end{aligned}$$
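
Applying the five evaluation rules to these replacements (the atom sets involved are pairwise disjoint, so the normal form is preserved) gives the following worked evaluation, which we add here for illustration:

$$\begin{aligned}{}[e_1\wedge _{\theta _1,\theta _2}e_2]&=\left( 1-[\theta _1]+[\theta _1][e_1]\right) \cdot \left( 1-[\theta _2]+[\theta _2][e_2]\right) ,\\ [e_1\vee _{\theta _1,\theta _2}e_2]&=[\theta _1][e_1]+[\theta _2][e_2]-[\theta _1][e_1][\theta _2][e_2]. \end{aligned}$$

For \([\theta _1]=[\theta _2]=1\) both formulas reduce to the unweighted conjunction and disjunction, and for \([\theta _i]=0\) the argument \(e_i\) has no impact, as intended.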

For a certain classification problem we want to find a matching CQQL expression e together with a well-chosen output threshold value \(\tau \) for cl: \(cl^\tau _e(o) = th_\tau ([e]^o)\) with \(th_\tau (x) = \left\{ \begin{array}{ll} 1 &{} \text {if }x>\tau \\ 0 &{} \text {otherwise.} \end{array} \right. \)

From the laws of the Boolean algebra we know that every expression e can be expressed in the complete disjunctive normal form, that is, every expression is equivalent to a disjunction of a subset of the \(2^n\) minterms. We implicitly assume for each of the n object attributes exactly one atomic condition \(c_j\) for \(j=1,\ldots ,n\) and for an object \(o=(o_1,\ldots ,o_n)\) the equivalence \(o_j=[c_j]^o\in [0,1].\) The minterm subset relation for any logic expression can be expressed by use of minterm weights \(\theta _i\in \{0,1\}\):

$$\begin{aligned} e = {\mathop {\bigvee }\limits ^{.}}_{i=1}^{2^n}minterm_{i,\theta _i}&\text { mapped to }&{\mathop {\bigvee }\limits ^{.}}_{i=1}^{2^n}minterm_{i} \wedge \theta _i=e \end{aligned}$$
(1)

and \(minterm_i = \bigwedge _{j=1}^n c_{ij}\) with \( c_{ij} = \left\{ \begin{array}{ll}c_j &{} \text {if }(i-1) \& 2^{j-1} >0\\ \lnot c_j &{} \text {otherwise} \end{array}. \right. \)

That is, the value \(i-1\) is considered as a bitcode that identifies a minterm uniquely, and \(j-1\) stands for a bit position. The symbol ‘&’ stands for the bitwise and.

Please note that the disjunction of any two different complete minterms is always exclusive. Thus, e is in CQQL normal form and its evaluation against object o yields

$$\begin{aligned}{}[e]^o = \sum _{i=1}^{2^n} \theta _i \prod _{j=1}^n [c_{ij}]^o. \end{aligned}$$
(2)
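
Equation (2) can be computed directly from the bitcodes of the minterm identifiers. The following sketch (names such as minterm_value and eval_dnf are ours) uses \([c_j]^o=o_j\) and \([\lnot c_j]^o=1-o_j\) as above:

```python
# Evaluate [e]^o = sum_i theta_i * prod_j [c_{ij}]^o  (Eq. 2); bit j-1 of i-1
# decides whether minterm i contains c_j or its negation.
from math import prod

def minterm_value(i, o):
    # [minterm_i]^o with [c_j]^o = o_j and [not c_j]^o = 1 - o_j
    return prod(o[j] if ((i - 1) >> j) & 1 else 1.0 - o[j]
                for j in range(len(o)))

def eval_dnf(theta, o):
    # theta[i-1] in {0, 1} is the weight of minterm i, i = 1, ..., 2^n
    return sum(t * minterm_value(i, o) for i, t in enumerate(theta, start=1))

# Example for n = 2: only minterm 4 (c_1 and c_2) is active, so [e]^o = o_1 * o_2
print(eval_dnf([0, 0, 0, 1], (0.8, 0.5)))   # approximately 0.4
```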

3 Extraction of Minterms and Finding the Output Threshold

Next, we will extract a CQQL expression e in complete disjunctive normal form from training data. We have to find the weight \(\theta _i\) for every minterm i. The starting point is the training set \(TR=\{(x,y)\} = \{(o,m(o))\}\) with the input tuples (objects) \(x=o\in [0,1]^n\) and \(y=m(o)\in \{0,1\}\).

One important requirement for a classifier is high accuracy. Therefore, we maximize the accuracy of expression (1) depending on the minterm weights \(\theta _i\) based on \(TR=\{(x,y)\}\).

Accuracy acc for a continuous evaluation can be measured as the sum over the two correct cases \((y=1)\wedge [e]^x\) and \((y=0)\wedge [\lnot e]^x\) over all pairs \((x,y)\in TR\):

$$\begin{aligned} acc= & {} \sum _{(x,y)\in TR} \left( y*[e]^x + (1-y)(1-[e]^x)\right) \\= & {} \sum \limits _{i=1}^{2^n} \theta _i \sum _{(x,y)\in TR} \left( (2 y -1)\cdot \prod \limits _{j=1}^n [c_{ij}]^{x}\right) + \sum _{(x,y)\in TR} (1-y). \end{aligned}$$

We see, after applying Eq. (2) and some reformulations, that the accuracy depends linearly on each minterm weight \(\theta _i\) for fixed TR-pairs. The first derivative yields a constant gradient with respect to \(\theta _i\):

$$\begin{aligned} \frac{\partial acc}{\partial \theta _i}=\sum _{(x,y)\in TR} (2 y -1)\cdot \prod \limits _{j=1}^n [c_{ij}]^{x}{=\sum \limits _{(x,1)\in TR} \prod \limits _{j=1}^n [c_{ij}]^{x} - \sum \limits _{(x,0)\in TR} \prod \limits _{j=1}^n [c_{ij}]^{x}}. \end{aligned}$$

For maximizing accuracy a minterm weight \(\theta _i\) should have the value 1 if \(\frac{\partial acc}{\partial \theta _i}>0\) and 0 otherwise. In other words, for the decision whether a minterm should be active or not it is sufficient to compare the impact of positive training data \(E_i\) against the impact of the negative training data \(N_i\) with

$$\begin{aligned} E_i=\sum \limits _{(x,1)\in TR} \prod \limits _{j=1}^n [c_{ij}]^{x}~\text {and}~N_i=\sum \limits _{(x,0)\in TR} \prod \limits _{j=1}^n [c_{ij}]^{x}. \end{aligned}$$

Please note that the decision depends on the relative number of positive training objects in TR. Therefore, let \(\gamma _1=|\{(x,1)\in TR\}|\) and \(\gamma _0=|\{(x,0)\in TR\}|\) be the number of positive and negative training objects, respectively. The fraction of negative objects is then given by \(\gamma = \frac{\gamma _0}{\gamma _1+\gamma _0}\). In the unbalanced case (\(\gamma \not =1/2\)) we compensate the effect on the minterm weight decision by:

$$\begin{aligned}{}[\theta _i]=\left\{ \begin{array}{ll} 1 &{} \text { if } \gamma \cdot E_i > (1-\gamma )\cdot N_i \\ 0 &{} otherwise \end{array}\right. . \end{aligned}$$
(3)

Following that minterm decision rule we can decide for every minterm whether it is active or inactive. In case of \(\gamma \cdot E_i\approx (1-\gamma ) \cdot N_i\) the decision is not clear. We call such minterms unstable because adding a single new training object may change the decision. Unstable minterms are less expressive than stable ones and have a low impact on accuracy. We are interested in stable minterms. For measuring stability we compute for every minterm i the ratio \(\rho _i\) of \(\gamma \cdot E_i\) to the sum of \(\gamma \cdot E_i\) and \((1-\gamma )\cdot N_i\). A value for \(\rho _i\) close to 1/2 indicates an unstable minterm i, a value near 1 means a stable active minterm i, and a value near 0 means a stable inactive minterm i.

The question arises: should unstable minterms be active or inactive? We propose to sort all minterms by their values \(\rho _i\) and choose a \(\rho \)-threshold \(\theta _\rho \) from them that provides a good trade-off between accuracy and compactness of expression e. The modified minterm decision rule is:

$$\begin{aligned}{}[\theta _i]=\left\{ \begin{array}{ll} 1 &{} \text { if } \frac{\gamma E_i}{\gamma E_i+(1-\gamma ) N_i} > \theta _\rho \\ 0 &{} otherwise \end{array}\right. . \end{aligned}$$
(4)

After applying our minterm decision rule (4) we obtain the expression:

$$\begin{aligned} e={\mathop {\bigvee }\limits ^{.}}_{i:[\theta _i]=1}\bigwedge _{j=1}^n c_{ij} { \text { with }[e]^x = \sum _{i: [\theta _i]=1} \prod _{j=1}^n [c_{ij}]^x.} \end{aligned}$$
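
A possible implementation of the minterm decision rule (4) accumulates \(E_i\) and \(N_i\) over the training set and compares the resulting ratio \(\rho _i\) with \(\theta _\rho \). The sketch below reuses the minterm_value helper from the sketch after Eq. (2); the names and the small epsilon guarding against division by zero are our own additions:

```python
# Minterm decision rule (4): activate minterm i iff
# gamma*E_i / (gamma*E_i + (1-gamma)*N_i) > theta_rho.
# TR is a list of pairs (x, y) with x in [0,1]^n and y in {0,1}.

def extract_minterm_weights(TR, theta_rho=0.5):
    n = len(TR[0][0])
    gamma_0 = sum(1 for _, y in TR if y == 0)
    gamma = gamma_0 / len(TR)                    # fraction of negative objects
    theta = []
    for i in range(1, 2 ** n + 1):
        E_i = sum(minterm_value(i, x) for x, y in TR if y == 1)
        N_i = sum(minterm_value(i, x) for x, y in TR if y == 0)
        rho_i = gamma * E_i / (gamma * E_i + (1 - gamma) * N_i + 1e-12)  # guard /0
        theta.append(1 if rho_i > theta_rho else 0)
    return theta
```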

Next we have to find the output threshold value \(\tau \) for \(cl^\tau _e(x) = th_\tau ([e]^x).\) Let \(\min _1=\min _{(x,1)\in TR} [e]^x\) be the smallest evaluation result of the positive training objects and \(\max _0=\max _{(x,0)\in TR} [e]^x\) be the highest result of the negative training objects. In case of \(\max _0 < \min _1\), positive objects and negative objects are well separated and we set \(\tau \) to \((\max _0+\min _1)/2\).

Otherwise, we have to choose a value \(\tau \) from the interval \([\min _1,\max _0]\). In order to find a threshold which maximizes discrete accuracy we use the training objects from TR:

$$\begin{aligned} \tau =\mathop {\mathrm {arg\,max}}\limits _{\tiny \begin{array}{c}(x,\_)\in TR\\ \tau _x:=[e]^x\\ \tau _x\in [\min _1,\max _0]\end{array}}accuracy(e,\tau _x,TR) \end{aligned}$$

where \(accuracy(e,\tau _x,TR) = |\{(x,y)\in T\!R| y = cl^{\tau _x}_e(x)\}|/|TR|.\)
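
The threshold search can be realized as a simple scan over the candidate values \([e]^x\) of the training objects. A sketch, reusing eval_dnf from the earlier sketch and assuming both classes occur in TR:

```python
# Choose tau maximizing the discrete accuracy on TR (brute-force scan over the
# evaluation values of the training objects).

def find_threshold(theta, TR):
    vals_1 = [eval_dnf(theta, x) for x, y in TR if y == 1]
    vals_0 = [eval_dnf(theta, x) for x, y in TR if y == 0]
    min_1, max_0 = min(vals_1), max(vals_0)
    if max_0 < min_1:                            # classes already well separated
        return (max_0 + min_1) / 2.0

    def accuracy(tau):
        return sum((1 if eval_dnf(theta, x) > tau else 0) == y
                   for x, y in TR) / len(TR)

    candidates = [v for v in vals_0 + vals_1 if min_1 <= v <= max_0]
    return max(candidates, key=accuracy)
```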

4 Quantum Logic Decision Tree

Our expression e has so far been a disjunction of active minterms. The last step is to generate a quantum-logic inspired decision tree from e. The QLDT is just a compact presentation of the CQQL expression e based on active minterms. The basic idea is to regard the derived minterm weights as training data and to use a traditional DT algorithm for constructing the QLDT. The training data for the decision tree construction are just the bit values 0 and 1 from the binary code \(bitcode(i-1)=x_1\ldots x_n\) of all minterm identifiers i: the bit values of the bitcode are regarded as attribute values and the minterm weight as the target for the DT algorithm:

$$\begin{aligned} TR'= & {} \{(bitcode(i-1),[\theta _i]) \mid i=1,\ldots , 2^n \}. \end{aligned}$$

For example, the following minterm weights for \(n=3\) and \(e=(x_1\wedge x_2\wedge x_3){\mathop {\vee }\limits ^{.}}(\lnot x_1\wedge x_2\wedge x_3)=x_2\wedge x_3\) with evaluation \([e]^x=x_2*x_3\) produce the QLDT shown below (figure a). The leaves correspond to the minterm weights:

i | \(x_1\) | \(x_2\) | \(x_3\) | \([\theta ]\)
1 | 0 | 0 | 0 | 0
2 | 0 | 0 | 1 | 0
3 | 0 | 1 | 0 | 0
4 | 0 | 1 | 1 | 1
5 | 1 | 0 | 0 | 0
6 | 1 | 0 | 1 | 0
7 | 1 | 1 | 0 | 0
8 | 1 | 1 | 1 | 1

figure a (the QLDT generated from the minterm weights above)
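
The construction of \(TR'\) and the subsequent tree induction can be sketched as follows. Any standard DT learner may be used; we use scikit-learn's DecisionTreeClassifier here purely as one possible off-the-shelf choice, which is our assumption and not prescribed by the approach:

```python
# Build TR' (bitcodes of the minterm identifiers -> minterm weights) and hand it
# to a standard DT learner. The bitcode is read as x_1 ... x_n with x_1 as the
# leading digit, as in the table above.
from sklearn.tree import DecisionTreeClassifier, export_text

def build_qldt(theta, n):
    X = [[(i >> (n - 1 - j)) & 1 for j in range(n)] for i in range(2 ** n)]
    return DecisionTreeClassifier(criterion="entropy").fit(X, theta)

# Minterm weights of the running example e = x2 and x3 (minterms 4 and 8 active)
theta = [0, 0, 0, 1, 0, 0, 0, 1]
print(export_text(build_qldt(theta, n=3), feature_names=["x1", "x2", "x3"]))
```

Note that export_text only prints the learned splits; the tree must afterwards be evaluated with the QLDT semantics described below, not with scikit-learn's usual majority-vote prediction.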

But what about the evaluation of a QLDT? The generated decision tree looks like a traditional decision tree. However, its evaluation differs. The evaluation result of a QLDT should be the same as the evaluation result of e in disjunctive normal form:

$$\begin{aligned}{}[e]^x=\left[ {\mathop {\bigvee }\limits ^{.}}_{i:[\theta _i]=1}\bigwedge _{j=1}^n c_{ij}\right] ^x = \sum _{i:[\theta _i]=1}\prod _{j=1}^n [c_{ij}]^x. \end{aligned}$$
(5)

Please note that the disjunction of minterms is always exclusive; that is why its evaluation leads to a simple sum. The same holds for a split node of a QLDT. A QLDT implicitly assigns to every leaf a set of minterms sharing the same path conditions (split attributes). Let \(L_A\) refer to the set of all active leaves where every leaf is represented by the identifier i of one of its assigned minterms and let \(path_i\) be the path from the root to leaf i. Then the evaluation of a QLDT is given by:

$$\begin{aligned}{}[qldt(e)]^x = \sum _{i\in L_A}\prod _{j \in path_i} [c_{ij}]^x=[e]^x=\sum _{i:[\theta _i]=1}\prod _{j=1}^n [c_{ij}]^x. \end{aligned}$$

Actually, the evaluation of a QLDT takes into account only active leaves. The inactive leaves correspond to \(\lnot e\) and are connected to the active leaves by \([e]^x+[\lnot e]^x=1\). Therefore, the inactive leaves are removed from the QLDT without any loss of semantics and as a result the QLDT becomes more compact.

So, we end up with the QLDT classifier:

$$\begin{aligned} cl^\tau _e(x) = th_\tau ([qldt(e)]^x)=\left\{ \begin{array}{ll} 1 &{} \text {if }\sum _{i\in L_A}\prod _{j \in path_i} [c_{ij}]^x > \tau \\ 0 &{} otherwise. \end{array}\right. . \end{aligned}$$
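
Putting the pieces together, an end-to-end usage sketch (reusing the helper functions from the earlier sketches; the tiny training set, the parameter values, and all names are ours) looks as follows:

```python
# End-to-end usage sketch on a tiny synthetic training set.
TR = [((0.9, 0.8), 1), ((0.7, 0.9), 1), ((0.2, 0.3), 0), ((0.4, 0.1), 0)]

theta = extract_minterm_weights(TR, theta_rho=0.7)   # -> [0, 0, 0, 1], i.e. e = c1 and c2
tau = find_threshold(theta, TR)                      # -> 0.345 (classes well separated)

def classify(x):
    return 1 if eval_dnf(theta, x) > tau else 0      # cl^tau_e(x) = th_tau([qldt(e)]^x)

print(classify((0.8, 0.7)))                          # 1, since 0.8*0.7 = 0.56 > 0.345
```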

5 Experiment

Next, we shall apply our approach to an example scenario and compare the resulting traditional decision tree with the generated quantum-logic inspired decision tree.

Fig. 2. Traditional decision tree for blood transfusion with 6 leaves (left) and quantum-logic inspired decision trees with best accuracy (right)

Our experimental dataset is the blood transfusion service center dataset. A classifier needs to be found which predicts whether a person donates blood or not. To make that decision for every person we know recency (months since last donation), frequency (total number of donations), monetary (total blood donated in c.c.), and time (months since first donation). In the balanced case 178 people donated blood and 178 did not. 300 of the 356 people belong to the training set and the remaining ones to the test set. For further processing the attribute values of the four attributes are mapped to the unit cube \([0,1]^4\) using a normalized rank position mapping. For that mapping all values of an attribute are sorted. Then, every value is mapped to its rank position divided by the total number of values. For tied values the rank maximum is taken.
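
The normalized rank position mapping can be realized, for instance, with SciPy's maximum-rank method; the concrete library choice is ours and not prescribed by the paper:

```python
# Normalized rank position mapping: each value is replaced by its maximum rank
# divided by the number of values (ties get the maximum rank), per attribute.
import numpy as np
from scipy.stats import rankdata

def rank_normalize(X):
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    return np.column_stack([rankdata(X[:, j], method="max") / n_rows
                            for j in range(X.shape[1])])

# Example: the attribute values (5, 2, 2, 9) are mapped to (0.75, 0.5, 0.5, 1.0)
print(rank_normalize([[5], [2], [2], [9]]).ravel())
```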

We obtain a traditional decision tree with a highest accuracy of 64%, 6 leaves, and 5 levels, see Fig. 2 (left). The QLDT, see Fig. 2 (right), achieves its highest accuracy of 70% when we choose \(\theta _\rho =0.71\) and \(\tau =0.045\). That QLDT corresponds to the expression

$$\begin{aligned} e:=\lnot t \wedge \lnot r \wedge f \wedge m \end{aligned}$$

and its evaluation is

$$\begin{aligned}{}[e]=[\lnot t] \cdot [\lnot r] \cdot [f] \cdot [m]. \end{aligned}$$

6 Conclusion

Table 1. Differences between Boolean-logic-based decision tree and quantum-logic inspired decision tree

In our paper we suggest a decision tree classifier based on quantum logic. In contrast to Boolean logic, quantum logic can directly deal with continuous data, which is beneficial for many classification scenarios. Unlike fuzzy logic, our quantum logic approach obeys the rules of the Boolean algebra. Thus, Boolean expressions can be transformed accordingly and checking hypotheses against a Boolean expression becomes feasible.

In Table 1 we compare Boolean-logic-based decision trees with our quantum-logic inspired decision tree with respect to several criteria.

In a Boolean-logic-based decision tree every split decision at a node represents a Boolean condition and can be regarded as an axis-parallel decision line within the input space. For a QLDT the restriction to axis-parallel lines does not hold. A QLDT is appropriate in scenarios where the classification decision relies on sums and products rather than on a combination of Boolean values. The input for the threshold-based class decision is the evaluation result of the QLDT.

A QLDT represents syntactically a Boolean expression for one class. Thus, a QLDT classifier is a one-class classifier. Many-class classification problems can be transformed to multiple one-class classifier decisions.

Every inner QLDT node corresponds to exactly one logical expression with two outcomes. Since we drop 0-class leaves, some inner nodes have only one child. In contrast, a split rule in a traditional decision tree leads to two or more children.

For our QLDT we assume attribute values from the unit interval. The normalized rank position mapping maps ordinal and metric attribute values to the unit interval. But what about categorical attributes? We did not discuss this aspect but it can be easily solved by introducing an artificial attribute for each category.

As discussed above, the class decision in a traditional decision tree depends on exactly one leaf, in contrast to the sum of all leaf scores in a QLDT.

The complexity of a Boolean-logic-based decision tree for a number n of attributes corresponds to the number of leaves. In the worst case, that number is exponential. However, in many real-world scenarios the number of leaves is much lower. In contrast, the generation of the QLDT requires computing the weight for all \(2^n\) minterms. Thus, a large number of attributes causes a complexity problem for the QLDT rather than for a Boolean-logic-based decision tree.