1 Introduction

Machine learning (ML) methods have been quite successful in various applications, such as face recognition, weather prediction, image reconstruction, and many more. Security experts use these methods for quick and reliable malware detection [10]. However, many methods, such as deep neural networks, are often considered black boxes, as the reasoning behind their decisions may be unclear [13]. Understanding these decisions is essential; we may need to know why a person is considered at high risk of criminal activity [3] or why a benign file was classified as malicious. Understanding the decisions may not only help data scientists understand their models better but may also be required by a state’s law or regulation. The General Data Protection Regulation (GDPR), introduced by the European Parliament in 2016 and in effect as law since 2018, makes it necessary to understand decision-making based on personal data [14].

However, what does it mean to understand the results of machine learning models? Is it the degree to which we understand the data or the inner workings of the algorithm? As one may expect, the answer is not precisely clear. The current literature offers various approaches that may help decide which part of the machine learning or data mining process should be made more understandable. One could gain a better understanding of machine learning models through interpretability [19], explainability [13], or transparency [24]. Interpretability can help one understand the decision-making of a model. Explainability, which is often used interchangeably with interpretability, offers an explanation of why the model made, or should make, a given decision. Transparency mainly concerns making the data processing and the deployment of the algorithm understandable. Our work focuses on the interpretability of the results of machine learning models applied to malware detection. We may also use the term explainability, in which case we consider it equal to interpretability.

In [19], Miller uses the following definition of interpretability: “the degree to which an observer can understand the cause of a decision”. In a more machine learning-based context, Doshi-Velez and Kim [9] define interpretability as the “ability to explain or to present in understandable terms to a human”. The authors of [5] argue that both definitions could be seen as two different approaches: one that requires a priori interpretable models, and the other that would create explanations to the existing or the future black-box methods.

In our work, we implement two well-known rule-learning algorithms, I-REP and RIPPER, together with the structures necessary to represent a decision list. We discuss possible speed-ups of the RIPPER algorithm and incorporate them into our implementation. Some of the speed-ups were necessary, as we use the EMBER dataset, which contains hundreds of thousands of samples.

For a given machine learning model, we try to interpret its results using decision lists generated by the aforementioned rule-learning algorithms. We first discuss how successfully both algorithms learn rules by exploring their success rates (e.g., accuracy, true positive rate) and by taking into account the complexity of the built decision lists. We then discuss the interpretability of the machine learning results using the Interpretability Entropy (see Definition 14) and the extent to which the predictions of the machine learning methods match the generated decision lists.

Throughout the experiments, we try to better understand the RIPPER algorithm’s performance by changing either its pruning metric or its hyperparameters. We also consider whether the order in which the rules were learned is fixed or whether the positions of the rules can be changed.

The rest of the paper is organized as follows: Sect. 2 presents works related to malware analysis; some of them make use of rule-learning algorithms (e.g., RIPPER) or of approaches that try to explain or interpret the reasoning behind the decisions of the machine learning methods used. Section 3 introduces the theory necessary for rule-learning algorithms. Section 4 describes the specifics of our implementation of the rule-learning algorithms RIPPER and I-REP. Section 5 gives details on the dataset used and on its split, transformation, and feature selection; it also introduces a metric that can be used to measure the interpretability of ML models by decision lists and contains the evaluation of the experiments. Section 6 concludes the paper and outlines future work.

2 Related Works

In this section, we provide an overview of works related to malware research, rule-learning algorithms, and interpretability.

The authors of [11] combined three different methods for malware detection: hash-based approach, Support Vector Machine-based approach, and rule-based approach. Each of the named methods is intended to be used for malware classes with different distributions. Using static analysis, they extracted n-grams based on the content of a Portable Executable file—n-grams are all substrings of a fixed length n [25]. The paper does not discuss the interpretability of the used model; however, it outlines one of its positive outcomes—reduction of space complexity. In their experiments, they reduced the storage cost from 1.8MB (signature-based approach) to 17.9KB (combined approach).

The work [26] compares the RIPPER algorithm (see Sect. 3.2) with other machine learning algorithms in malware detection; the comparison is performed only on previously unseen samples. The paper does not clarify the number of iterations used for RIPPER. The authors used static features extracted from Portable Executable files—used DLLs, DLL function calls, and the number of DLL function calls. They also discuss how malware developers could use the information gathered by the classifiers to modify their malware, for example, by changing resource usage.

To explain the reasoning of their model, the authors of [4] created a tool that not only classifies Android malicious files but also displays the features that contributed the most to the decision. The tool goes by the name DREBIN and uses features extracted through static analysis. These features are mapped to a vector space, and Support Vector Machines are used for the classification. The features contributing to the classification can then be derived from the vector space.

Inspired by deep learning and computer vision, the authors of [17] make use of convolutional neural networks and the Gradient-weighted Class Activation Mapping (Grad-CAM) [29] technique, which uses the gradient information flowing into the final convolutional layer of the CNN. In [17], an APK file is first converted to a grayscale image representation and then used as input for the deep learning models. Heatmaps are generated using Grad-CAM to explain the model results. Subsequently, the heatmaps are averaged for distinct malware families; the authors refer to these as Cumulative Heatmaps. Malware analysts can use them to gain more knowledge about the malware (by observing the areas of the code highlighted by the heatmaps), or they can be used to distinguish between better-performing models.

Similar steps towards interpretability in malware detection were taken in [7]. Portable Executable binaries are transformed to grayscale images, and deep transfer learning is employed for the classification task. The authors try to interpret the results of the models as follows. A binary file is first divided into super-pixels (contiguous regions), and a coefficient is obtained for each region. Positive coefficient values indicate that a region contributes to the classification decision, and negative coefficient values indicate that it does not. The paper, however, does not further explain how malware analysts could use such information.

3 Rule-Based Classification

Several malware detection models based on machine learning techniques, such as neural networks, are considered black boxes because it is difficult (for humans) to determine precisely why a given false positive or false negative occurs. Malware researchers prefer interpretable detection systems, such as rule-based methods, since they can be easily understood and better controlled. The goal is to improve the interpretability of the classification models. In this section, we describe the theoretical background for a rule-based system, specifically, decision rules. Some of the definitions provided in this section are later used in the algorithmic description (see Sect. 3.2), and they provide a high-level view of our implementation (see Sect. 4). Decision rules can be expressed as a set of if-then rules [20]. For example: if a file contains a suspicious function call, then mark the file as malicious.

Definition 1 (Condition)

A condition c is defined as follows,

$$\displaystyle \begin{aligned} c \equiv x \odot h, \end{aligned} $$
(1)

where x is a feature, ⊙ is a relational operator, and h is the value of the feature x.

Usually, the conditions are logically ANDed together, so all tests must be satisfied for the rule to fire [20].

Definition 2 (Rule & Rule Size)

A rule r is defined as follows,

$$\displaystyle \begin{aligned} r \equiv c_1 \land \cdots \land c_m, \end{aligned} $$
(2)

where m is the number of conditions for the rule r. We say that the rule r has a size of m.

However, a conjunction of the conditions is not a necessity, and a single rule may be expressed by a general logical expression [30].

For a given rule, we are interested in its quality, more specifically in its coverage (support) and accuracy (confidence). We say that a rule covers a sample if the sample satisfies the rule’s conditions [23].

Definition 3 (Rule Coverage)

Given a set of samples S, the coverage of a rule r is defined as

$$\displaystyle \begin{aligned} coverage( r, S ) = \{ s\:|\:s \in S, r\:covers\:s \}. \end{aligned} $$
(3)

The following definition allows us to express the coverage of a rule numerically.

Definition 4 (Rule Coverage Size)

Given a set of samples S, we define the coverage size of a rule r as

$$\displaystyle \begin{aligned} coverage\_size( r, S ) = \frac{|coverage(r,S)|}{|S|} . \end{aligned} $$
(4)

Rules are said to be mutually exclusive if no two rules cover the same sample.

Definition 5 (Mutually Exclusive Rules)

Given a set of samples S, we say that the rules \(r_i\) and \(r_j\) are mutually exclusive if

$$\displaystyle \begin{aligned} coverage( r_i, S ) \cap coverage( r_j, S ) = \emptyset, i \neq j, i,j \in \{1,\ldots,n\}, \end{aligned} $$
(5)

where n is the number of rules for S.

Definition 6 (Exhaustive Rules)

Given a set of samples S, we say that the rules \(r_i\) are exhaustive if

$$\displaystyle \begin{aligned} \bigcup_{i=1}^{n} coverage( r_i, S ) = S, i \in \{1,\ldots,n\}, \end{aligned} $$
(6)

where n is the number of rules for S.

However, such rule restrictions are often not required, and we allow the rules to overlap and not cover the whole set. Different problems then arise: some rules may contradict each other, or some of the samples may not be covered at all. Two different schemes can be used to solve this: a decision list or a decision set [20].

In a decision list, the rules are ordered as follows:

$$\displaystyle \begin{aligned} R = [ r_1, r_2, \ldots, r_n ], \end{aligned} $$
(7)

where n is the number of rules for a given list. In other words, the rules are kept in the order in which they were added. The same order is later used for classification, too.

The decision set does not require the rules to be ordered; instead, all rules get to vote on the classification of a given sample. Unfortunately, once a decision set grows very large, it becomes quite hard to understand. Thus, we use a decision list in this work unless stated otherwise.

Definition 7 (Decision List Coverage)

Given a set of samples S, the coverage of the decision list R is defined as

$$\displaystyle \begin{aligned} coverage( R, S ) = coverage( r_{n}, coverage( r_{n-1}, \ldots, coverage( r_1, S ) \ldots ) ), \end{aligned} $$
(8)

where n is the number of rules in R.
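To make the definitions above concrete, the following minimal Python sketch shows one possible representation of conditions, rules, and a decision list. The class and method names are our own illustration and do not correspond to the interface of the C++ implementation described in Sect. 4; here we assume rules predict the positive (malicious) class, with benign as the default.

import operator
from dataclasses import dataclass
from typing import List

_OPS = {"<=": operator.le, ">=": operator.ge}

@dataclass
class Condition:                       # Definition 1: feature x, relational operator, value h
    feature: int
    op: str
    value: float
    def holds(self, sample) -> bool:
        return _OPS[self.op](sample[self.feature], self.value)

@dataclass
class Rule:                            # Definition 2: conjunction of conditions
    conditions: List[Condition]
    def covers(self, sample) -> bool:
        return all(c.holds(sample) for c in self.conditions)
    def coverage(self, samples):       # Definition 3
        return [s for s in samples if self.covers(s)]
    def coverage_size(self, samples):  # Definition 4
        return len(self.coverage(samples)) / len(samples)

class DecisionList:                    # rules kept in the order they were learned (Eq. (7))
    def __init__(self, rules: List[Rule]):
        self.rules = rules
    def coverage(self, samples):       # Definition 7: coverage applied rule after rule
        for rule in self.rules:
            samples = rule.coverage(samples)
        return samples
    def predict(self, sample, default=0):
        # the first rule that covers the sample fires and predicts the positive class;
        # otherwise the default (negative) class is returned
        return 1 if any(r.covers(sample) for r in self.rules) else default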

3.1 From Trees to Rules

In this section, we briefly compare rules with another popular machine learning tool—the decision tree. Decision trees are built from nodes, where each node, except the leaves, tests a feature against a given value (see Definition 1). A leaf node represents a decision, for example, classifying a sample as benign or malicious [30]. Although the idea behind decision trees is quite simple, they may turn out to be quite complex and hard to interpret [6].

Figure 1 illustrates a simple decision tree. However, its appearance may be a little misleading, as it describes a simple disjunction, which can be easily expressed using rules:

$$\displaystyle \begin{aligned} if &\:a \land b\:\:then\:x\:\\ else\:if &\:c \land d\:\:then\:x\:\\ else\:if &\:e\:\:then\:x\:\\ else &\:y \end{aligned} $$
(9)
Fig. 1

Decision tree—describing a simple disjunction

Quinlan [23] designed an algorithm called C4.5rules, which converts a decision tree to a decision list and, after constructing the list, tries to improve it. Unfortunately, this improvement step is expensive: Cohen [8] showed that its complexity is near \(\mathscr {O}(n^3)\), where n is the number of samples.

3.2 Rule-Learning Algorithms

In this section, we briefly discuss one branch of rule-learning algorithms—separate-and-conquer. Separate-and-conquer algorithms first focus on one part of the training set, describe it with a rule, separate the covered samples, and repeat the process on the remainder. In contrast, the divide-and-conquer technique strives to maximize the separation between classes [30].

Incremental Reduced Error Pruning (I-REP) [12] is an algorithm designed by Fürnkranz and Widmer in 1994. It implements two pruning approaches to deal with noisy data: pre-pruning and post-pruning. Pre-pruning ignores some of the training samples during learning so that the final decision list does not describe the training set perfectly. Post-pruning corresponds to removing conditions from a given rule. I-REP’s pruning is driven by the following metric,

$$\displaystyle \begin{aligned} \mathscr{P}_{\mbox{I-REP}}(p,P,n,N) = \frac{p+(N-n)}{P+N}, \end{aligned} $$
(10)

where p (n) is the number of positive (negative) samples covered by the current rule from a total number of P (N) positive (negative) samples in the pruning set. The algorithm is described in Algorithm 1.

Algorithm 1: I-REP

Thanks to its efficiency, I-REP is well-suited for large training sets. However, in 1995, Cohen showed that I-REP does not learn rules well enough and can be outperformed by previously known algorithms, such as C4.5rules [8]. Cohen has addressed specific issues and explained how I-REP could be improved. With these improvements, Cohen designed a new algorithm called Repeated Incremental Pruning to Produce Error Reduction (RIPPER).

Cohen made three modifications—he replaced I-REP’s pruning metric, chose a different approach to stop the rule-learning process, and added an optimization phase for the decision list. The following metric replaced I-REP’s pruning metric,

$$\displaystyle \begin{aligned} \mathscr{P}_{\mbox{RIPPER}}(p,P,n,N) = \frac{p-n}{p+n}. \end{aligned} $$
(11)
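For reference, both pruning metrics can be written as short functions. This is a minimal sketch with our own function names; the degenerate case of a rule covering no samples at all (p + n = 0) is left unhandled.

def irep_prune_metric(p, P, n, N):
    # Eq. (10): fraction of the pruning set that the rule classifies correctly
    return (p + (N - n)) / (P + N)

def ripper_prune_metric(p, P, n, N):
    # Eq. (11): depends only on the samples covered by the rule; P and N are unused
    return (p - n) / (p + n)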

The following definitions are necessary to understand when RIPPER’s rule learning is stopped.

Definition 8 (Rule Description Length)

Given the positive real numbers n and k and a probability p with 0 < p < 1, we define the rule description length as follows,

$$\displaystyle \begin{aligned} \mathscr{S}(n,k,p) = \frac{1}{2}(k\log_2{\frac{1}{p}} + (n-k)\log_2{\frac{1}{1-p}} + \log_2{k}), \end{aligned} $$
(12)

As described by Cohen [8], this encoding allows two parties (a sender and a recipient) to work over a set of n elements: the sender identifies k of the elements to the recipient, and p is known to both in advance. The term \(\log_2{k}\) is the number of bits required to send the number k. The whole quantity is scaled by \(\frac{1}{2}\) to compensate for possible redundancy among the features.

Definition 9 (Decision List Exceptions)

For a given set of samples S with a positive class P and a negative class N, and for a given decision list R, we define the number of exceptions as follows,

$$\displaystyle \begin{aligned} \mathscr{E}(R,S) = \log_2{\binom{TP + FP}{FP}} + \log_2{\binom{TN + FN}{FN}}, \end{aligned} $$
(13)

where TP (TN) is the number of samples correctly classified as P (N), and FP (FN) is the number of samples incorrectly classified as P (N).

Definition 10 (Total Description Length)

For a given set of samples S and a decision list R we define its total description length as follows,

$$\displaystyle \begin{aligned} \mathscr{T}(R,S) = \mathscr{E}(R,S) + \sum_{r \in R} \mathscr{S}\left(n, k_r, \frac{k_r}{n}\right), \end{aligned} $$
(14)

where n is the total number of possible conditions for S and \(k_r\) is the length (number of conditions) of rule r.

Let the minimum description length (MDL) be the smallest total description length (TDL) observed so far for the decision list. Rule learning stops if adding a new rule would increase the TDL to more than 64 bits above the MDL. Since Cohen described the RIPPER algorithm mostly in words, we include our pseudocode in Sect. 4; it may not correspond precisely to the original implementation.
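The stopping criterion itself can be sketched as follows. Note that Eq. (14) above was reconstructed from Cohen’s description: we assume each rule of length k_r is encoded with p = k_r/n, and the function names are illustrative only.

import math

def rule_description_length(n, k, p):
    # Definition 8, Eq. (12); assumes k >= 1 and 0 < p < 1
    return 0.5 * (k * math.log2(1 / p) + (n - k) * math.log2(1 / (1 - p)) + math.log2(k))

def exception_bits(tp, fp, tn, fn):
    # Definition 9, Eq. (13)
    return math.log2(math.comb(tp + fp, fp)) + math.log2(math.comb(tn + fn, fn))

def total_description_length(rule_sizes, n_conditions, tp, fp, tn, fn):
    # Definition 10, Eq. (14): bits for the rules plus bits for the exceptions
    rule_bits = sum(rule_description_length(n_conditions, k, k / n_conditions)
                    for k in rule_sizes)
    return rule_bits + exception_bits(tp, fp, tn, fn)

def should_stop(tdl_with_new_rule, mdl_so_far, budget=64.0):
    # stop rule learning once the TDL exceeds the minimum TDL seen so far by more than 64 bits
    return tdl_with_new_rule > mdl_so_far + budget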

4 Implementation of Rule-Based Classifiers

To efficiently generate decision lists using the well-known algorithms mentioned in Sect. 3, we created our own implementations of rule-based classifiers (RBCs) in C++. Although implementations of the algorithms exist, such as Weka [15] or wittgenstein [21], they are not quick enough to process large amounts of data. We also did not want to lose a key ability of most machine learning tools—quick and easy deployment. Thus, we added Python support to our library using pybind11 [18]. The code has been made publicly available on GitHub. We further discuss some of the implementation details below.

4.1 Decision List

We implemented basic structures that correspond to the definitions in Sect. 3. Namely, those are the condition (see Definition 1), the rule (see Definition 2), and the decision list, often referred to as the ruleset. At this moment, the available operators for the condition are {<=,  >=}. Both operators are intended to be used for numerical features only.

4.2 I-REP

As the original paper for I-REP [12] does not cover dealing with numerical features, we followed Cohen’s [8] suggestions for the algorithm. During the growth phase, the algorithm searches for the best split over the numerical feature values. Rule growth for both I-REP and RIPPER is guided by maximizing FOIL’s gain and stops once no negative samples are left in the growing set. Neither of the papers mentioned above tackles the issue of learning the same rule twice, which may happen if the present feature values are the same for both the positive and the negative growing set. In that case, our implementation stops the growing phase and proceeds to the next step.
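The growth phase repeatedly adds the condition with the highest FOIL’s information gain. A minimal sketch of the gain in its standard formulation (not quoted from [8] or [12]): p0 and n0 are the positive and negative samples covered before adding the condition, p1 and n1 after.

import math

def foil_gain(p0, n0, p1, n1):
    # FOIL's information gain for refining a rule with one additional condition
    if p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

For numerical features, the candidate thresholds are the feature values present in the growing set, and the (feature, operator, threshold) triple maximizing the gain is added to the rule.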

Algorithm 2: I-REP*

Algorithm 3: RIPPER

4.3 RIPPER

RIPPER’s improvements increase its computational complexity. The learning process stops if the MDL increases by more than 64 bits. The computational cost of the TDL mainly lies in the calculation of the exception bits. Naively, we could recalculate the TDL each time; fortunately, we can use memoization to speed up some parts of the calculations.

Rule description lengths can be cached: throughout the iterations of I-REP*, we only need to calculate the description length of one rule at a time. We can do similar steps for the exception bits. In I-REP*, we only need to remember the remaining samples (samples that were not covered by any rule). More work is needed in the optimization phase, as rules depend on the previous ones. We need to compare the coverage of the new ruleset with the old one—the new ruleset is the one by which the previous one was replaced, either the replacement ruleset or the revision ruleset. Let \(r_n\) be a new rule, \(R_n\) a new decision list, \(r_o\) the original rule, \(R_o\) the original decision list, and S the remaining samples that were not covered by any previous rule. We need to check two cases—an increase or a decrease of the falsely, respectively correctly, covered cases.

$$\displaystyle \begin{aligned} S_{n} = coverage( R_{n}, coverage( r_{o}, S ) \setminus coverage( r_{n}, S ) ) \end{aligned} $$
(15)
$$\displaystyle \begin{aligned} S_{o} = coverage( R_{o}, coverage( r_{n}, S ) \setminus coverage( r_{o}, S ) ) \end{aligned} $$
(16)

By calculating the sizes of \(S_n\) and \(S_o\), we do not have to recalculate the TDL; we only need to increase or decrease the numbers of falsely, respectively correctly, covered cases.

The last part of our speed-up lies in using Stirling’s approximation of n!. For our case, we can derive the following formula,

$$\displaystyle \begin{aligned} \log_2{(n!)} \sim \left( n + \frac{1}{2} \right) \log_2{(n)} - n\log_2{e} + \frac{1}{2} \log_2{(2\pi)}. \end{aligned} $$
(17)

Thus, the exception bits with known input arguments can be calculated in \(\mathscr {O}{(1)}\). We include RIPPER’s pseudocode, divided into I-REP* (see Algorithm 2) and RIPPER’s optimization phase (see Algorithm 3).
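A sketch of this constant-time calculation follows; the helper names are ours, and exact values are used for very small n where the approximation breaks down.

import math

_LOG2_E = math.log2(math.e)
_HALF_LOG2_2PI = 0.5 * math.log2(2.0 * math.pi)

def log2_factorial(n):
    # Stirling's approximation of log2(n!), Eq. (17); log2(0!) = log2(1!) = 0 exactly
    if n < 2:
        return 0.0
    return (n + 0.5) * math.log2(n) - n * _LOG2_E + _HALF_LOG2_2PI

def log2_binomial(n, k):
    # log2 of the binomial coefficient "n choose k" in O(1)
    return log2_factorial(n) - log2_factorial(k) - log2_factorial(n - k)

def exception_bits_fast(tp, fp, tn, fn):
    # Definition 9 evaluated with the approximation above
    return log2_binomial(tp + fp, fp) + log2_binomial(tn + fn, fn)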

5 Experiments

This section discusses the dataset and the data split we used throughout the experiments. We briefly summarize the data preprocessing, too. Furthermore, we discuss the interpretability of ML models by means of decision lists. For our experiments, we define what it means for a model to be absolutely or partially interpretable by a decision list and when a model should be considered interpretable. Throughout the experiments, we examine the behavior of the RIPPER algorithm more closely.

All of the experiments were run on a single computer platform with two processors (Intel Xeon Gold 6136 CPU @ 3.00GHz), with 755 GB of RAM running the Ubuntu 20.04 LTS operating system.

5.1 Dataset Description

For our experiments, we used the publicly available dataset called EMBER (Elastic Malware Benchmark for Empowering Researchers) [1]; more specifically, we employed the most up-to-date version from 2018. The authors of the dataset dealt with three significant challenges—legal (releasing binaries of monetized software), labeling (potentially requires expert knowledge), and security (releasing malicious binaries is not safe). Thus, features were extracted using static analysis and incorporated into the dataset, which tackles two of the challenges mentioned above; labeling was achieved using services such as VirusTotal.

The EMBER dataset consists of 1.1M samples, divided into a training set with 900K samples (300K malicious, 300K benign, 300K unlabeled) and a test set with 200K samples (100K malicious, 100K benign). The newer version of the dataset contains the label avclass [28] for malicious samples. This label indicates to which malware family a given malicious sample belongs. Throughout the experiments, we ignored the unlabeled samples.

The dataset is stored using the JSON file format, and for each sample, eight groups of raw features are present:

  • General file information—includes information about the virtual size of the file, the presence of a debug section, the number of symbols, and more.

  • Header information—information extracted from the Common Object File Format (COFF) header, for example, file characteristics, machine type, or information from the Optional header.

  • Imported and exported functions—both raw sections include the names of the imported or exported functions, for example "SHLWAPI.dll":["PathIsUNCW"].

  • Section information—information about each of the present sections, e.g., their name, size, entropy, and more.

The following three groups are independent of the PE file format:

  • Byte and byte-entropy histograms—both sections consist of 256 integer values each, indicating either the number of occurrences or the entropy for each byte.

  • String information—includes information about printable strings, such as their average length, histograms, and more.

5.2 Data Splitting

For our experiments, we merged the training and test sets predefined in the EMBER dataset. Furthermore, we created three disjoint sets as follows: a training set (40% of the samples), a first test set (40% of the samples), and a second test set (20% of the samples). The training set is used to train the various machine learning models. The first test set is used to measure the success rate of the models; moreover, it is used to generate the predictions of the machine learning models on which the rule-based classifiers are later trained. The second test set is used to measure how well the RBCs interpret the ML models, that is, how well the models’ predictions match the RBCs’ predictions.

During the experiments, we used 5-fold cross-validation. First, we partitioned the data into individual folds, each with the set sizes mentioned above—40:40:20. After partitioning, we applied the data transformation techniques (e.g., normalization or PCA). The following steps were all done individually for each of the five folds. We trained each ML model on the training set and evaluated its performance on the first test set (see Table 1) and then on the second test set (see Table 3). RBCs were trained on the predictions of each ML model on the first test set. The performance of the RBCs was evaluated on the first test set (see Table 2) and the second test set (see Table 4). Figure 2 describes the data split visually and also illustrates the workflow used in our experiments.
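The following sketch illustrates one possible way to realize the 40:40:20 split and the per-fold workflow with scikit-learn; it is not the exact code used to construct our folds, and model/rbc are placeholders for any ML model and rule-based classifier.

from sklearn.model_selection import train_test_split

def split_fold(X, y, seed):
    # 40% training set, 40% first test set, 20% second test set (disjoint)
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.4, stratify=y, random_state=seed)
    X_test1, X_test2, y_test1, y_test2 = train_test_split(
        X_rest, y_rest, train_size=2 / 3, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_test1, y_test1), (X_test2, y_test2)

# Workflow for one fold:
#   model.fit(X_train, y_train)
#   y_pred1 = model.predict(X_test1)        # performance reported in Table 1
#   rbc.fit(X_test1, y_pred1)               # RBC trained on the model's predictions
#   agreement = rbc.predict(X_test2) == model.predict(X_test2)   # basis for Table 4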

Fig. 2

Visual description of the working process used in this work. Note that the decision list needs to be generated before extracting its predictions for both test sets. Results for the machine learning algorithms on the first test set can be found in Table 1 and on the second test set in Table 3. Results for the rule-based classifiers trained on the ML predictions are listed in Table 2 for the first test set and in Table 4 for the second test set

Table 1 Well-known ML algorithms and their performance on the EMBER dataset using the first test set
Table 2 Measuring performance of RBCs on ML predictions using the first test set
Table 3 Well-known ML algorithms and their performance on the EMBER dataset using the second test set
Table 4 Testing how well RBCs interpret ML algorithms’ predictions using the second test set

5.3 Feature Transformation and Selection

Even though rule-based classifiers can handle both numerical and categorical features, traditional machine learning algorithms and the implementations available from scikit-learn [22], which we used to train our models, require the features to be numerical only. The authors of the EMBER dataset published code [2] that transforms some of the available raw features into vectorized ones using the hashing trick. We used this code to transform the features and ended up with 2381 new ones.

Before proceeding further, it was necessary to rescale the data, as the behavior of some machine learning algorithms may worsen if the data do not appear to come from a normal distribution [27]. We used the class MinMaxScaler from scikit-learn, which transforms each feature as follows,

$$\displaystyle \begin{aligned} x_{std} = \frac{x - \mathrm{min}(X)}{ \mathrm{max}(X) - \mathrm{min}(X) }, \end{aligned} $$
(18)

where x is the original value and X is the collection of every value in a given feature.

Consequently, we employed two dimensionality reduction methods—Principal Component Analysis (PCA) and Random Forest (RF), both available in scikit-learn. We picked both techniques as they are simple to use and arguably easy to understand. Even though one cannot directly see the original features with PCA, we can use the correlation matrix to determine which features were used to create the new ones. Unlike PCA, Random Forest keeps the original features, thus maintaining higher interpretability. We chose to keep 200 components for PCA, as this resulted in less than 4% information loss. We decided to keep the same number of features for RF, too; by choosing this value, we aim to observe a different behavior of the models. Note that this may later put RF at a disadvantage, as it may use redundant features.
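A sketch of the two preprocessing pipelines, assuming scikit-learn. Selecting the 200 most important features with SelectFromModel is one possible realization of the Random-Forest-based reduction and may differ from our exact procedure; the pipeline names are ours.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Scaling (Eq. (18)) followed by PCA with 200 components
pca_pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=200)),
])

# Scaling followed by keeping the 200 most important original features
rf_pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                               max_features=200, threshold=-np.inf)),
])

# Both pipelines are fitted on the training set only and then applied to the test sets:
#   X_train_t = pca_pipeline.fit_transform(X_train, y_train)
#   X_test1_t = pca_pipeline.transform(X_test1)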

5.4 Evaluation Metrics

To understand how well machine learning algorithms or RBCs perform, we use several different metrics described in this section. We first define terms that are used in the metrics [16].

  • True Positive (TP)—Correctly predicted malicious samples as malicious

  • True Negative (TN)—Correctly predicted benign samples as benign

  • False Positive (FP)—Incorrectly predicted benign samples as malicious

  • False Negative (FN)—Incorrectly predicted malicious samples as benign

Using the terms above, we can calculate the false positive rate (FPR), also referred to as the fall-out rate, the true positive rate (TPR), also known as sensitivity, and the accuracy (ACC).

$$\displaystyle \begin{aligned} FPR \equiv \frac{FP}{FP+TN} \end{aligned} $$
(19)
$$\displaystyle \begin{aligned} TPR \equiv \frac{TP}{TP+FN} \end{aligned} $$
(20)
$$\displaystyle \begin{aligned} ACC \equiv \frac{TP+TN}{TP+TN+FP+FN} \end{aligned} $$
(21)
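For completeness, the three metrics can be computed directly from a confusion matrix (a minimal sketch; the function name is ours):

from sklearn.metrics import confusion_matrix

def rates(y_true, y_pred):
    # Eqs. (19)-(21); assumes binary labels with 1 = malicious, 0 = benign
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return fpr, tpr, acc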

To better distinguish between the individual performances of the RBCs, we use additional metrics. We denote the number of rules in a decision list as DL size and the mean number of conditions per rule in the decision list as ø r size.

5.5 Interpretability of Machine Learning Models

Doshi-Velez’s and Kim’s [9] definition of interpretability suits our needs best. Thus, using their definition, we first define the Human Most Understandable Model (HuMUM).

Definition 11 (Human Most Understandable Model)

We say that a model is most understandable if it has the ability to explain or to present in understandable terms to a human.

Although this definition is subjective and not rigorous, it serves mostly as a naming convention. In our work, we consider decision lists generated by RBCs to be HuMUMs, as they are simple and easily understandable by humans.

We say that a model is absolutely interpretable by Human Most Understandable Model if all of its predictions can be interpreted by HuMUM. That is, HuMUM makes the same predictions as the model would.

Definition 12 (Absolutely Interpretable by HuMUM)

We say that the model f is absolutely interpretable by HuMUM g, if the following holds,

$$\displaystyle \begin{aligned} f(X, y) &= y_{f},\\ g(X,y_{f}) &= y_{g},\\ y_{f} &= y_{g}, \end{aligned} $$
(22)

where X is the training set and y is the label set. Both f and g create new label sets \(y_f\) and \(y_g\).

We say that a model is partially interpretable by Human Most Understandable Model, if some of its predictions can be interpreted by HuMUM. That is, HuMUM matches some of the decisions made by the model.

Definition 13 (Partially Interpretable by HuMUM)

We say that the model f is partially interpretable by HuMUM g, if the following holds,

$$\displaystyle \begin{aligned} f(X,y) &= y_{f},\\ g(X,y_{f}) &= y_{g},\\ y_{f} &\sim y_{g},\\ \end{aligned} $$
(23)

where X is the training set and y is the label set. Both f and g create new label sets \(y_f\) and \(y_g\); \(y_f \sim y_g\) indicates that some of the predictions are equal.

To understand when a model is most interpretable, we define the Interpretability Entropy.

Definition 14 (Interpretability Entropy)

Given the predictions y f of a model f and the predictions y g of HuMUM g, we define Interpretability Entropy as follows,

$$\displaystyle \begin{aligned} \mathscr{H}(T,F) = -\left(\frac{T}{T+F}\right) \log_{2}{\left(\frac{T}{T+F}\right)} - \left(\frac{F}{T+F}\right) \log_{2}{\left(\frac{F}{T+F}\right)}, \end{aligned} $$
(24)

where \(T = \sum {\delta _{y_{f}y_{g}}}\), \(\delta _{y_{f}y_{g}}\) is the Kronecker delta, and \(F = |y_f| - T\).

The main goal of HuMUM is to minimize the Interpretability Entropy, whose values lie in the range [0, 1]. Notice that this does not require a model to be close to absolutely interpretable by HuMUM: even if HuMUM always made the opposite prediction, we would still have full information about the model’s behavior.
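Definition 14 translates directly into code (a minimal sketch; y_f and y_g are the label vectors produced by the model and by HuMUM):

import numpy as np

def interpretability_entropy(y_f, y_g):
    # Definition 14: binary entropy of the agreement between the model and HuMUM
    y_f, y_g = np.asarray(y_f), np.asarray(y_g)
    t = int(np.sum(y_f == y_g))     # T: number of matching predictions (sum of Kronecker deltas)
    f = len(y_f) - t                # F = |y_f| - T
    if t == 0 or f == 0:
        return 0.0                  # full (dis)agreement carries full information: entropy 0
    pt, pf = t / (t + f), f / (t + f)
    return -pt * np.log2(pt) - pf * np.log2(pf)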

5.6 Measuring Performance of RBCs on ML Predictions

Using the EMBER dataset, we applied five machine learning algorithms—Support Vector Machines with a Radial Basis Function kernel (SVM), Random Forest (RF), Gaussian Naïve Bayes (GNB), k-nearest neighbors (KNN), and a Deep Neural Network (DNN) with two hidden layers. As mentioned in Sect. 5.2, 5-fold cross-validation was employed to reduce bias. We tried to fine-tune the hyperparameters of the algorithms for which it was possible. The averaged results are shown in Table 1. At first glance, neither of the dimensionality reduction approaches seems to have significantly better performance. We included RIPPER and I-REP in the table as we want to compare their out-of-the-box performance to their performance when trained on the predictions of the ML algorithms (see below). RIPPER creates very few false positives; however, it is unable to detect enough malicious samples.

We used our RBC implementations and trained them on the predictions of the ML algorithms on the first test set. That is, for each ML algorithm and each of its five sets of predictions on the first test set, RBCs were used to describe the outcomes of that ML algorithm. The results were then averaged and are shown in Table 2.

I-REP’s performance is poor for both accuracy and TPR. Despite this fact, I-REP produces straightforward decision lists in comparison to RIPPER. RIPPER produces very few false positives; however, it struggles to find malicious samples. Notice that RIPPER generally has lower performance when run on models that used RF for dimensionality reduction instead of PCA. This does not apply to I-REP, though.

For the individual cases, the results indicate that the predictions produced by GNB with PCA were reasonably easy for RIPPER to reconstruct. RIPPER, in this case, achieved both a high TPR and a low FPR, and the total number of conditions is the smallest across RIPPER’s decision lists. This could mean that GNB is almost absolutely interpretable by RIPPER. GNB in combination with RF has very similar results, although RIPPER needed approximately 1.5 times more conditions than in the case of PCA. For the other models, RIPPER generally required larger decision lists when PCA was used for dimensionality reduction. The RandomForest model was the second most partially interpretable model by RIPPER for both kinds of dimensionality reduction. For the rest of the models, RIPPER obtained worse results. Models interpreted by RIPPER with RF as dimensionality reduction generally had fewer conditions in total; however, the obtained rules were approximately 1.3 times longer. The model that is most probably the least interpretable is KNN with RF dimensionality reduction.

Table 2 also presents experiments that aim to verify whether RIPPER’s behavior changes based on its learning parameters. RIPPERpr corresponds to RIPPER without pruning, and the subscript in RIPPERk denotes the number of optimization phases. RIPPER does not seem to improve significantly with its optimization phases. We can see that in some cases it does increase its FPR in exchange for a higher TPR. This is not surprising, as its optimization pruning metric is based on accuracy, which could be problematic for imbalanced datasets, since the accuracy metric does not account for class imbalance.

The results indicate that I-REP does not generate good rulesets despite their comprehensibility. This is most probably caused by its pruning metric. RIPPER achieved better results, although its TPR is relatively low for some of the models. Results indicate that RIPPER will probably interpret some of the results better than I-REP. Below we give an example of a rule generated by RIPPER for the RandomForest model in combination with RF dimensionality reduction:

BHist0[0] <= 0.185925 && BHist87[19] <= 0.003879 && EntBHist216[92] <= 0.167021 && section vsizes hashed1[159] <= 0.000000 && section vsizes hashed38[170] <= 0.000000 && imported libs hashed206[189] <= 0.400000 && EXPORT_TABLE_va[191] <= 0.000001 && RESOURCE_TABLE_size[193] <= 0.000000 && CERTIFICATE_TABLE_size[195] <= 0.000001.

The structure of the rule is as follows:

featureName|colNumber?|[colIndex] operator value &&?,

where colNumber is for features that are either hashed or built from histograms. If the feature is hashed, colNumber corresponds to the column of the hash. If the feature is built from byte histograms, colNumber corresponds to that byte, e.g., BHist87 corresponds to the part of byte histogram for byte 0x57. colIndex is used to access different columns (in this case indices can be from {0…199}). && is used as a conjunction. Parts of the rule marked with ? are optional. Feature names were created according to the EMBER [2] source code.

Figure 3 shows the rule-learning process for both RIPPER and I-REP. Rules with high TP coverage are primarily generated at the beginning of the learning process, and I-REP’s rules with high FP coverage are also obtained at the beginning. We can see that RIPPER achieved a few spikes in the covered TP samples for DNN, KNN, and SVM, with smaller spikes for both RF and GNB, too. As a result, RIPPER can find stronger rules in later phases of the learning process. The reason these rules are not found earlier is related to its pruning metric—RIPPER needs to cover a certain number of FP samples before it starts considering better rules. Note that I-REP’s decision list sizes are significantly smaller than RIPPER’s; this corresponds to the results in Table 5. The different sizes are also caused by the fact that I-REP and RIPPER have distinct stopping conditions. I-REP, whether used with PCA or RF, has very similar decision list sizes. The same does not apply to RIPPER, as its stopping condition allowed it to generate many more rules for PCA than for RF.

Fig. 3

Rule Coverage Size Over Time. The graphs show how rules cover different samples over time using the first test set and one of the five cross-validation folds. The y-axis is log-scaled and represents covered samples; the x-axis represents a decision list size. Value − 1 on the y-axis corresponds to no covered samples. (a) True positives—DNN. (b) False positives—DNN. (c) True positives—Gaussian Naïve Bayes. (d) False positives—Gaussian Naïve Bayes. (e) True positives—KNN. (f) False positives—KNN. (g) True positives—RandomForest. (h) False positives—RandomForest. (i) True positives—SVM. (j) False positives—SVM

Table 5 Measuring performance of RBCs with different metrics on ML predictions using the first test set

5.7 Interpreting ML Results Using RBCs

Using the second test set, we measured the performance of the ML algorithms. The results are shown in Table 3. There is no significant drop in performance for any of the algorithms when compared to Table 1.

To find out how well the generated decision lists interpret the ML models, we also measured their performance on the second test set. This time, we did not use the predictions of the ML algorithms; we used the original class labels of the second test set. In Table 4, we replaced the decision list and rule sizes with two columns, TP match and FP match. For both columns, we first create the predictions of the ML algorithms and extract only the TP and FP samples. We then generate the RBC predictions on these samples and calculate how many of the predictions were the same. As seen in both Tables 2 and 4, I-REP’s FPR was relatively high, and training it on the ML algorithms’ predictions did not improve this; in fact, in some cases its FPR was higher than when it was trained on the original labels. RIPPER’s overall accuracy did not change significantly when trained on better models; it did, however, get closer to the results of the ML algorithms it was trained on. The overall results can be misleading, which is why we included the TP and FP matches and the Interpretability Entropy. Our initial proposal that GNB would be the most partially interpretable model holds: RIPPER matched most of its predictions and achieved the lowest Interpretability Entropy among all models. This does not apply to I-REP; even though it matches most of GNB’s predictions, its high FPR makes it less suitable for interpreting machine learning models.

The second most interpretable model by RIPPER according to the Interpretability Entropy was RandomForest. RIPPER obtained similar FPR values but did not match all of the predictions that RandomForest made, which is reflected in the Interpretability Entropy, too. The results of the last three models, namely SVM, DNN, and KNN, seem to be quite hard to interpret by both RIPPER and I-REP. The Interpretability Entropy tells us that RIPPER’s interpretations of these models could still retain some information; in the case of I-REP combined with PCA, there is very little one could gain.

The value of the Interpretability Entropy for the RandomForest model interpreted by RIPPER with PCA could be considered an acceptable limit of what we could still call a strong, partially interpretable model. The results in Table 4 show that RBCs have trouble matching the FP predictions of the original model. However, this is not necessarily a disadvantage; if the FP match is close to zero, we could still interpret the results and even create explanations of why the model made incorrect predictions. An FP match, resp. TP match, close to 50% gives no information whatsoever, as it amounts to guessing rather than interpreting.

5.8 Pruning and Metrics

I-REP and RIPPER both utilize pruning to handle noisy data. Cohen [8] pointed out that I-REP’s inability to converge towards better solutions is mainly caused by its pruning metric, which is based on accuracy. The metric is one of the essential parts of I-REP-like algorithms. Naturally, we may ask whether we can systematically shape the behavior of the pruning metric or whether finding a good one is more of a trial-and-error exercise.

The version of RIPPER proposed by Cohen [8] is capable of handling multiclass problems, which it reduces to a sequence of two-class problems. Thus, we can view pruning metrics as functions of two variables, which fortunately makes them easy to understand by graphing them. Figure 4 shows the metrics used by I-REP and RIPPER, together with other metrics we tried throughout the experiments. We simplified I-REP’s pruning metric, as it can be viewed as a plane for fixed P and N (see Eq. (10)). Here lies the key reason why I-REP tends to make bad decisions when pruning: points with different numbers of malicious and benign samples are often indistinguishable. RIPPER’s pruning metric has good characteristics; the only shortcoming we identified is that it does not differentiate between different numbers of positive samples when no negative samples are present.

Fig. 4

Understanding pruning metrics as 3D graphs. Pruning metrics should have the following properties: If only malicious files are present, pruning metrics should be at their maximum. For benign files only, they should be at their minimum. Otherwise, they need to compromise, and should take into account a lower number of benign files. (a) Simplified IREP’s pruning metric. (b) RIPPER’s pruning metric. (c) RIPPER metric with curvature for malicious samples. (d) RIPPER metric with more curvature for malicious samples

We experimented with the metrics in Fig. 4, and the results are shown in Table 5. Below we assign each metric (additionally, we added a function with a saddle point) to its name:

$$\displaystyle \begin{aligned} \underbrace{\frac{p-n}{\sqrt{p}+\sqrt{n}+1}}_{\text{sqrt}},\:\:\: \underbrace{\frac{p-n}{p + n + 1}+\frac{p}{n+1}}_{\text{impr}},\:\:\: \underbrace{p^2-n^2}_{\text{saddle}}, \end{aligned} $$
(25)

where p is the number of positive samples and n is the number of negative samples in the pruning set. Each of the names (sqrt, impr, saddle) is used in Table 5 and indicates which pruning metric was used. For the experiment, we used the first test set and RIPPER with k set to zero. The results indicate that the alternative metrics did not achieve significantly better performance. RIPPERimpr’s behavior is comparable to RIPPER0, and in some cases it reaches a better FPR; with a decreasing number of rules, we can see a significant decrease in performance. With a higher TPR and FPR, RIPPERsqrt achieves accuracy rates similar to RIPPERimpr, in most cases with more than a 20% decrease in decision list sizes. RIPPERsaddle performs worse than the original pruning metric of I-REP; surprisingly, it generates more rules than I-REP (see Table 2).
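For reference, the alternative metrics from Eq. (25) can be written as drop-in replacements for Eq. (11); this is a minimal sketch with our own function names.

import math

def prune_sqrt(p, n):
    # the "sqrt" metric from Eq. (25)
    return (p - n) / (math.sqrt(p) + math.sqrt(n) + 1)

def prune_impr(p, n):
    # the "impr" metric from Eq. (25)
    return (p - n) / (p + n + 1) + p / (n + 1)

def prune_saddle(p, n):
    # the "saddle" metric from Eq. (25): a function with a saddle point
    return p * p - n * n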

5.9 Does Order of the Rules Matter?

Figure 3 shows a few spikes throughout the learning process of RIPPER. We could potentially shift these spikes so that they occur as soon as possible. As a side effect, we would violate the order in which the rules were learned. Would this, however, change the overall behavior of the model?

Let R be a decision list with rules \(r_1, \ldots, r_n\). We want to swap the rules \(r_i\) and \(r_j\), where i < j. Rule \(r_j\) now covers at least all samples it covered before the swap. It may also cover new samples that were previously covered by \(r_i\) or by some \(r_k\) with i < k < j. This means that the number of TP and FP samples for rule \(r_j\) can remain the same or increase. Samples that were covered by \(r_i\) before the swap and are not covered by \(r_j\) after the swap can still be covered by some \(r_k\). Thus, the number of TP and FP samples covered by \(r_i\) after the swap can remain the same or decrease. The swap does not add any new samples that could be covered and only changes the behavior of the individual rules; the overall behavior of R remains the same.

To sort the rules, we would always need to find a rule with a spike that is larger than the previous ones. Unfortunately, we cannot use fast sorting algorithms, as we always need to update the number of TP samples of the following rules. The sorting itself would require \(\mathscr {O}(n^2)\), where n is the number of rules, and recomputing the coverage of each following rule would require \(\mathscr {O}(n m)\), where m is the number of samples. This means that the sorting would require \(\mathscr {O}(n^3 m)\). Therefore, we decided to sort the rules greedily—only once, by the number of TP samples they cover, as sketched below. The results can be seen in Fig. 5, for which we used the first test set and the predictions of the ML algorithms. We can see that this approach smoothed the TP curves for RIPPER. Some rules generated by I-REP had no TP coverage when reordered; this can be seen in the case of the GNB model with RF dimensionality reduction, and a similar case can be seen for SVM, again with RF.
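One possible reading of this greedy, single-pass reordering, reusing the Rule.coverage method from the sketch in Sect. 3 (an illustration, not our exact procedure; positive_samples stands for the samples the interpreted model labeled malicious):

def reorder_rules(rules, positive_samples):
    # Sort the rules once, in decreasing order of how many positive (TP) samples each covers.
    # As argued above, reordering does not change the decision list's overall behavior.
    return sorted(rules, key=lambda r: len(r.coverage(positive_samples)), reverse=True)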

Fig. 5

Changing Order of the Rules. The graphs show how rule ordering affects covered samples over time using the first test set and one of the five cross-validation folds. The y-axis is log-scaled and represents covered samples; the x-axis represents a decision list size. Value − 1 on the y-axis corresponds to no covered samples. (a) True positives—DNN. (b) False positives—DNN. (c) True positives—Gaussian Naïve Bayes. (d) False positives—Gaussian Naïve Bayes. (e) True positives—KNN. (f) False positives—KNN. (g) True positives—RandomForest. (h) False positives—RandomForest. (i) True positives—SVM. (j) False positives—SVM

Rule ordering could lead to potential speed-ups if used in production, as stronger rules would trigger earlier. Also, it could be used as an additional tool in RIPPER’s optimization phase to achieve new properties.

6 Conclusion and Future Work

The interpretability of machine learning methods could be considered one of the leading research goals in the current era. Many works focus on the essence of interpretability itself, whereas other works focus on the domain of specific models. In this paper, we examined the use of rule-learning algorithms to extract decision lists based on the predictions of machine learning models. We used decision lists as they are one of the most understandable models in machine learning.

In our experiments, we used two rule-learning algorithms: I-REP and RIPPER. I-REP had inferior results, and we discussed the reason for this in Sect. 5.8. RIPPER covered most of the predictions well; however, it could not find appropriate rules that would not increase the MDL metric mentioned in Sect. 3.2. Using Doshi-Velez’s and Kim’s definition of interpretability, we defined Human Most Understandable Model. We defined absolutely and partially interpretable models by HuMUM, together with Interpretability Entropy (see Sect. 5).

We tried to estimate how well RBCs interpret the results produced by the ML models. We first did this by taking into account the accuracies and the true and false positive rates of the RBCs. This gave us a good idea of which ML models could be interpreted by RBCs better than others; for example, by taking all three metrics into account, we correctly assumed that GNB would be more interpretable than KNN. This approach is limited, however, since it cannot state precisely how well the RBCs interpret the ML models. Thus, we inspected the amounts of matched predictions for the ML models, which showed us where the RBCs fail to interpret the ML models. Finally, the Interpretability Entropy allowed us to compare numerically which ML models are more interpretable than others. We conclude that, in the case of the EMBER dataset, the Gaussian Naïve Bayes model can be considered almost absolutely interpretable by HuMUM. The random forest model could be viewed as a possible borderline of what we could still consider interpretable. To increase the measure of interpretability for other models, such as the deep neural network, we need to improve the performance of the rule-based classifiers.

Throughout the experiments, we tried to inspect the behavior of the RIPPER algorithm. We discussed the importance of the rule order in a decision list and showed that changing it does not affect the behavior of the decision list as a whole. We confirmed that the pruning metric plays a significant role in rule learning, and by modifying it, we can achieve either better performance or more comprehensible decision lists.

Although we created our own implementations of rule-based classifiers, they are far from finished. We believe there is still space for speed improvements using memoization. Currently, our implementations run sequentially; we could achieve significant speed-ups by parallelizing certain parts of the implementations, for example, the search for the best condition while growing a rule.

Decision lists generated by rule-learning algorithms could be used as an adversarial tool, too. We could create features given by the conditions and examine when the predictions of an interpreted model differ from the predictions generated by RBCs. This could deepen the understanding of the interpreted model and allow for other methods to be used in its weaker performing parts.