1 Introduction

Nowadays, computer systems are present in almost every aspect of human life. Not only are they used to provide entertainment, manage financial matters, and perform scientific calculations, but they are also a part of health care and rescue services (Porwik et al. 2011). Therefore, the security of computer systems and networks has become a top priority. The rapidly increasing number of computer systems and mobile devices results in a high threat of attacks. Intrusion detection systems (IDSs) have been developed to monitor computer system and network usage patterns in order to detect intruders. Some of these systems work on the principle of detecting abnormal patterns by employing anomaly detection techniques. There are many different kinds of cyber-attacks (Denning 1997; Raiyn et al. 2014); however, masqueraders constitute the greatest threat (Salem et al. 2008). A masquerader in a computer system is an attacker who temporarily takes over the role of an authorized, legitimate user to impersonate this user for malicious purposes. Masqueraders can be either insiders, who use another authorized user's identity, or outsiders, who gain access to the system by obtaining a legitimate user's identity. The worse kind of masquerader is the insider who is a legitimate user but takes advantage of the granted rights for personal interest, harming the system that has given these rights. Insider masqueraders are more difficult to detect because they can observe the habits of the person they intend to impersonate, and an insider attack is less expected due to the trust that has been placed in the employee. In both cases, the first step of a masquerading attack is identity theft (Salem and Stolfo 2012).

One of the most important security measures is to prevent theft of the user's identity. The second is to detect intrusions such as those described above, in both online and off-line data access modes. In most cases, IDSs work in an online data access mode and are therefore real-time systems. Such an IDS is connected to the computer systems to monitor users' activities and processes the data in real time to detect abnormal user behavior. When an intrusion is detected, the IDS generates an alert so that the Security Officers can take action to avoid more serious threats. On the other hand, the task of a system working in an off-line mode is to verify whether the activities of a user recorded in the past are free of masqueraders' attacks. An IDS designed to process data in an off-line mode is used, for example, as a supplementary utility assisting Security Officers—an off-line IDS supporting Security Officers was used by the military as early as 1988 (Smaha 1988). Off-line IDSs are also useful to verify whether a legitimate user with a previously created activity profile was really the one operating the computer system at the time of an attack, if the intrusion was not detected by an online IDS but only later.

Both online and off-line IDSs can work on an open or closed set of data. In the context of intrusion detection, a data set is considered open unless all possible users of the monitored computer systems are known and their data pre-recorded (users' profiles created). By contrast, a closed set of data represents the activities of all possible users of the computer system. In the case of a closed data set, masqueraders have to be insiders, and in such a case the identification of an intruder is theoretically possible. In practice, the closed type of data set is more common due to the necessity of registering all users of the computer systems. But if the intruder is an outsider, the data set becomes open.

A separate question is whether the system should identify the intruder or only detect an attack. If the system runs on an open data set, the identification of an intruder is unlikely.

The problem of detecting masqueraders in computer systems was clearly defined by Schonlau et al. (SEA) (2001). The aim is to develop a method working in an off-line mode with an open set of data—users' activities are recorded, but the intruders are outsiders and there is no information about them. The task is to detect masquerades within the provided data set.

The problem of user’s activity within the computer systems involves great uncertainty. Therefore, the approach presented in this paper employs the idea of fuzzy sets, providing the solution to the problem of uncertainty (Zadeh 1965). Since the first implementation (Mamdani and Assilan 1975), different fuzzy systems, more and less complicated, are widely applied to solve various problems of complex and uncertain nature (Czogała and Łeski 2000; Klir and Folger 1988; Klir and Yuan 1995; Kudłacik 2012; Kudłacik and Porwik 2012; Takagi and Sugeno 1985; Wang 1998; Zimmermann 1985).

Subsequent sections of this paper describe the problem of masquerade detection, review available approaches, introduce the idea of local and fuzzy user profiles, and present the obtained results in comparison with other methods.

2 Related works and critical remarks

The issues of computer system security have been the subject of extensive research by many scientists over the years (Anderson 1980; Lunt and Jagannathan 1988). There have been numerous attempts to detect intrusions in computer systems (Wesołowski and Kudłacik 2013, 2014) and networks (Fiore et al. 2013; Palmieri and Fiore 2010; Palmieri et al. 2014; Salem et al. 2008; Wang et al. 2008), or frauds in telecommunication (Olszewski 2012), as these have also become an important part of human life.

2.1 User profiling

Most of the systems that prevent intrusions in real-world applications are based on users' profiles, as this type of system is considered more effective than the rule-based type. There are many approaches to creating and using profiles. User profiling is a very broad concept, and profiles are created in different ways for different applications. It is important to distinguish profiles that describe a user's needs from those that describe a user's behavior. The first group consists of profiles created to describe the needs or taste of the user. Profiles constructed this way, based on the collaborative filtering approach, are used by recommendation algorithms and systems to suggest items that could be most useful to users considering their past preferences, for example, to suggest a purchase (Gunes et al. 2013). Similar methods are used to automatically create profiles based on personal data extracted from information published on Internet social media. This kind of profiling can compromise the privacy of the user; therefore, various methods of preventing automatic user profiling have been developed (Viejo et al. 2012).

In both cases, such profiles are not useful from the intrusion detection point of view and are therefore not considered in this article.

The second type of user profile tries to answer the question “how does the user behave?” or “what is the user like?”, as these are the key features that allow detecting when somebody tries to impersonate the user. This type of profile is related to intrusion detection systems and is used in intrusion detection methods like the one presented in this article. Such profiles are based on the user's activities in the computer system and can concern network traffic and requested resources, computer mouse data (Pao et al. 2012), keystroke statistics and dynamics (Banerjee and Woodard 2012) or, as in the case of this article, the command names used by the computer user.

2.2 Critical remarks

Unfortunately, despite the fact that intrusion detection methods are widely researched and developed, comparing them in a reliable way is very difficult or, in some cases, even impossible. The reason for these difficulties is the inconsistency in the verification studies and in the presentation of the obtained results. In practice, various reference data sets are used, each differing in size and type of data (Greenberg 1988; Kdd cup 1999; Lane 1999): users' command line calls, system calls (Borah et al. 2011; Estrada et al. 2009) or network traffic data (Corchado et al. 2005; Herrero et al. 2007). Finally, authors' own databases, which are not publicly available, are frequently used (Sodiya et al. 2011; Wespi et al. 1999).

Another major problem that prevents a fair comparison of different intrusion detection methods is the presentation of the results. Various benchmarking criteria are introduced (cost: Maxion and Townsend 2002; efficiency: Borah et al. 2011; effectiveness: Bhukya et al. 2006), and the accuracy is reported using different parameters, such as: the false acceptance ratio (FAR), which indicates the percentage of incorrectly accepted invalid inputs; the false rejection ratio (FRR), which shows the percentage of incorrectly rejected valid inputs; and the equal error rate (EER), which indicates the rate at which FAR and FRR are equal. In some articles, non-standard metrics are used, for example the Hit Rate, defined as a new parameter for the comparison of intrusion detection methods (Maxion and Townsend 2002). Although the FAR and FRR coefficients are standardized, differences in their calculation are present. In some cases, FAR and FRR are calculated globally for the whole set of experiment data, and in other attempts the results are obtained as an average of local coefficients for separate sub-experiments.

Unfortunately, in most cases, only one of the available parameters is used as a determinant of the quality of the proposed method. There are even papers in which none of the common indicators of quality is available (Bartkowiak 2011). Furthermore, it is often not possible to compare the methods in terms of performance because the computational complexity of the algorithms is not determined; in some studies information about the duration of the experiment can be found, but this parameter depends on the hardware, and executing one's own experiments under the same conditions is not possible. Nevertheless, despite all the difficulties, we have made a great effort to make sure that the method presented in this article is reliably compared with the previously developed intrusion detection methods classified as the anomaly detection type (according to the definitions found in Lane 2006).

2.3 Related methods

The rest of this section describes methods of intrusion detection based on the commands used by the users. The new method presented in this paper, based on fuzzy reasoning, was compared to the methods described below because they have been experimentally tested by their authors using the SEA data set, so it was possible to refer to the results of those tests. The experimental results and the comparison with the referenced methods are discussed in the following sections of the article.

The methods proposed by SEA (Schonlau et al. 2001) are taken as the starting point for the described investigations. The method called Uniqueness, presented in Schonlau and Theus (2000), is based on unique commands (used by only one user) and unpopular commands (rare commands used by only a few users in a training data set). Unfortunately, its preparatory phase takes into account the whole set of data, which makes it virtually impossible to use in the real world. Nevertheless, the Uniqueness method is considered the best of the SEA proposals. The Bayes One-step Markov approach is based on one-step transitions from the previous command to the next one, while Hybrid Multistep Markov is a hybrid that uses mostly a multistep Markov chain and occasionally an independence model, where the switching between the two models is made automatically, depending on how unusual the test data are. The Hybrid Multistep Markov of SEA actually originates from the Hybrid High-order Markov Chain Model presented in Ju and Vardi (1999). The Compression method uses the modified Lempel–Ziv compression algorithm implemented in the standard UNIX compress tool to detect masqueraders. The incremental probabilistic action modeling (IPAM) method is based on probabilities of one-step command transitions, and the Sequence-match approach uses a similarity measure between a designated user's profile and the most recent 10 commands in a test. The target of the original experiment was to reach a FAR of 1 %. None of the above methods succeeded. Following the task given by SEA, many scientists have developed various methods intended to compete with the ones presented in Schonlau et al. (2001).

In Coull et al. (2003), a technique based on pair-wise sequence alignment was proposed. The Semi-Global Alignment method is a bioinformatic approach in which the Smith–Waterman semi-global alignment algorithm with a simplistic scoring system was used. A few years later, in Coull and Szymanski (2008), the authors announced modifications of the original method—Sequence Alignment with various scoring systems. However, despite the good results, it can be suspected that the proposed method does not fully realize SEA's objective. From the statement “The lack of detailed information about which commands were intrusions in the test blocks makes a thorough analysis of masquerade detection techniques difficult, at best.” it can be inferred that the authors possibly misinterpreted the data set. Recently, another attempt at masquerade detection has been made with an improved cross semi-global alignment algorithm (Sodiya et al. 2011). Unfortunately, due to the use of the authors' own set of test data, which is not publicly available, a comparison with this method is not possible.

The authors of Maxion and Townsend (2002) introduced the Naive Bayes approach based on naive Bayes classifiers (with and without updating) that serves as a basis for comparison. Yet, it is not clear why, when a single user's profile is created, the profiles of all other users are created as well. After studying the article, we have concluded that possibly the whole data set is taken into consideration while testing a single user, making it impossible to use the method in the real world. In Maxion and Townsend (2002), a new indicator of quality for intrusion detection methods—the cost of the method—is proposed. The same authors in Maxion (2003) discuss a disadvantage of the SEA data set caused by the loss of information due to the truncation of command arguments.

The Naive Bayes method depends on the probabilities of individual commands. However, the authors of Dash et al. (2005) concluded that it is more appropriate to consider expressive groups of commands rather than individual commands and proposed the Episode Based Detection method, a technique of masquerade detection that considers episodes—meaningful subsequences of a long sequence of commands.

An anomaly intrusion detection model based on Principal Component Analysis (PCA) with low overhead and high efficiency is presented in Wang et al. (2004), while the authors of Kim and Cha (2005) proposed a masquerade detection technique based on Support Vector Machines (SVM) that comprises two novel concepts: common commands and a voting engine.

The paper by Bhukya et al. (2006) presents an approach to intrusion detection based on a Hidden Markov Model (HMM) with emission symbol semantics for system calls. The authors state that the HMM is a well-known probabilistic method for modeling sequences of information and an optimal modeling technique for minimizing the false positive rate while maximizing the detection rate. In Lane (2006), a model of intrusion detection based on semi-supervised learning and the partially observable Markov decision process (POMDP) framework is developed. The IDS presented in Lane (2006) eliminates the distinction between misuse and anomaly detection methods while preserving some of the strengths of both. However, to achieve a reliable comparison, only the results in the “pure anomaly detection” mode are taken into consideration.

The Recursive Data Mining approach investigated in Szymanski and Zhang (2004) applies recursive data mining to the masquerade intrusion detection task. Recursive data mining is used to characterize the structure and high-level symbols in users' signatures and the monitored sessions.

One of the recent works in masquerader detection is a fast intrusion detection technique based on non-negative matrix factorization (NMF) (Guan et al. 2009)—a powerful method for reduced representation of data. NMF has already been successfully applied in biometrics (face recognition).

Schonlau et al. (2001) proposed the Uniqueness approach based on unique and unpopular commands. In contrast to SEA, the authors of Wan et al. (2008) explore the other end of the spectrum—the High Frequency Commands—to determine whether they work well as users' signatures.

The methods mentioned above were mostly tested using the SEA data set; hence, it is possible to attempt a reliable comparison. However, the results obtained by the authors of the analyzed methods in some cases strongly depend on the assumptions made about the data set used in the tests. In some cases, the assumptions have changed the task itself or have made it unclear, which renders the methods inapplicable in practice.

3 User’s activity database

To test the method presented in this paper, it was necessary to prepare a set of test data. Thorough research has revealed several data sets that could be used for testing intrusion detection methods. Publicly available data sets of users' commands were proposed by Greenberg (1988), Lane (1999, 2002), Schonlau (2001) and Schonlau et al. (2000). Also available is the data set used at The International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, The International Conference on Knowledge Discovery and Data Mining (Kdd cup 1999), as well as its modified version (Tavallaee et al. 2009). However, these data sets (consisting only of network traffic information) are used to detect intrusions in computer networks and cannot be applied in this study. Another possibility is to generate a testing data set using one of the available generators, such as the one proposed in Chinchani et al. (2004).

Since the aim is to develop methods for use in real computer systems, it appears advisable to use real data for testing purposes. The second important aspect is the ability to compare the results obtained using the method proposed in this paper with other available methods of detecting intrusions. Therefore, tests should be performed using one of the commonly available standard data sets of users' commands, for example the Schonlau data set (SEA data set), which is the most popular data set among the research community. Taking all the above into consideration, SEA is in this case the best choice as a reference data set.

3.1 SEA data set

In the Schonlau experiment, the commands issued by 70 users of the UNIX operating system were collected, and for each user 15,000 commands were recorded. Out of all users, 50 were randomly selected as native (legitimate) users and the remaining 20 became intruders (masqueraders). The commands of each user were divided into 150 blocks of 100 commands each. Next, the 150 blocks were divided into three groups of 50 blocks. The first group (blocks 1–50) consisted of original (non-contaminated) data; the other two groups (blocks 51–150) were contaminated randomly with alien blocks (intrusions) taken from the data sets of the 20 intruders. The non-contaminated blocks created a training data set. A detailed description of the experiment, together with the data set itself and the information about alien blocks, can be found in Schonlau (2000) and Schonlau et al. (2001).

4 A new profile of computer user

A profile of a computer user should contain a set of features corresponding to a specific style of the user's activity in a computer system. The examined features can concern different aspects of work in a computer environment. Some are directly derived from physical computer control devices, like mouse and keyboard data. Others relate to logical usage, such as the types of programs used, chosen program options, storage usage, and applied protocols. Different parameters of generated network traffic can also be helpful or even crucial for some applications—especially because of the information about the user's location.

This section focuses on the method of collecting events related to an examined parameter (feature) of the user's activity. The process eventually allows constructing a fuzzy profile, which represents the activity in a defined period of time.

The crucial parameter of the approach is the time length of the tested user's activity. Generally, it can be either a given period of time \(T\) or the number of analyzed actions \(N\). The \(T\) and \(N\) parameters are naturally related: low values of \(T\) correspond to a low number of analyzed actions (low \(N\)), and large \(N\) corresponds to long periods of analyzed activity (high \(T\)). Increasing the \(N\) or \(T\) parameter moves the analysis from an examination of local activity—a smaller number of user's actions—to a more general one, where a broader view of the user's work is taken into consideration.

This paper proposes two kinds of profiles, for which certain data structures are designed. The first one, the local profile, represents the actual user's activity or a specific window of activity in the past. As the name itself suggests, the value of \(T\) or \(N\) for local profiles should be relatively small.

The second profile, called the fuzzy profile, contains much more general information. In contrast to the first one, the fuzzy profile describes user’s activity examined for a relatively long period of time. In fact, the approach calculates a form of average structure based on a number of local profiles gathered in a given period. The local and fuzzy profiles are described in detail in the following subsections.

The approach described in the following sections works on any countable feature, such as keyboard key sequences, characteristic data sequences retrieved from a pointing device, chosen options, and requested network resources. However, the database used in the examinations contains subsequent command names typed by users of a UNIX system (Schonlau et al. 2001). Therefore, the examined feature of the profiles proposed in the further analysis is based on the number of occurrences of command and program names launched by users in a given period of activity. Considering the data set used, a specific period of the user's activity (\(T\)) cannot be determined (it is not stored). Consequently, the further analysis assumes a given number of commands (\(N\)) understood as some period of the user's actions.

4.1 The local profile

The local profile of activity corresponds to \(N\) subsequent events, which are either directly or indirectly related to the user's actions. The value of the \(N\) parameter has to be relatively small. However, values that are too low will not allow the system to analyze an adequate amount of activity. On the other hand, if the value of \(N\) is too large, the mechanism becomes too general and loses its local nature. Therefore, \(N\) has to be chosen properly and will strongly depend on the examined feature itself.

Fig. 1 An example of the local profile of a given user. Hyphens indicate the number of occurrences \(o_i^j\) for each command \(c_i^j\)

In the case of the database used in this research, subsequent events represent command or program names launched by a user. Therefore, this kind of feature is used in the following analysis.

Let \(C^j\) represent the set of unique command names (unique events) and \(O^j\) the set containing the possible numbers of command occurrences (event occurrences) within the group of \(N\) analyzed samples of the \(j\)th event block. The local profile \(P^j\) of the \(j\)th block can then be described as the following set of pairs \((c^j_i,o^j_i)\):

$$\begin{aligned} P^j = \left\{ (c^j_i,o^j_i) : c^j_i \in C^j,\; o^j_i \in O^j,\; i=1,2,\dots ,L^j\right\} , \quad j=1,2,\ldots ,M, \end{aligned}$$
(1)

where \(c^j_i\) and \(o^j_i\) represent the \(i\)th command name and the number of its occurrences within the analyzed \(j\)th block, respectively, \(L^j\) represents the number of unique commands within the \(j\)th block, and \(M\) designates the maximum number of blocks. Considering the above, it is obvious that

$$\begin{aligned} \forall _{j=1,\ldots ,M} \sum _{i=1}^{L^j} o^j_i = N \end{aligned}$$
(2)

and

$$\begin{aligned} \forall _{j=1,\ldots ,M} \; \forall _{i=1,\ldots ,L^j} \; o^j_i \in \mathbb {N^+}, \end{aligned}$$
(3)

where \(\mathbb {N^+}\) represents the set of natural numbers without zero. An example of the set (1) for the \(j\)th block of a given user is presented in Fig. 1, where hyphens represent the numbers of command occurrences \(o^j_i\).

The structure of the used database (Schonlau et al. 2001) imposes the number of commands contained within the local profiles, because the 15,000 entries stored for each user are divided into blocks of 100 commands each. Therefore, in this case \(N=100\) and \(M=150\).
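Since a local profile is essentially a frequency table of a single block, its construction is straightforward. The following minimal Java sketch (class and method names are illustrative, not taken from the original implementation) builds the set of pairs of Eq. (1) from one block of commands; the hash map container is also what yields the linear complexity discussed in Sect. 6:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A minimal sketch of the local profile: the set of pairs (c_i, o_i) from Eq. (1). */
public class LocalProfile {

    // Maps a command name c_i to its number of occurrences o_i within the block.
    private final Map<String, Integer> occurrences = new HashMap<>();

    /** Builds a local profile from one block of N subsequent commands. */
    public LocalProfile(List<String> commandBlock) {
        for (String command : commandBlock) {
            occurrences.merge(command, 1, Integer::sum); // every o_i >= 1; all o_i sum to N
        }
    }

    /** Returns o_i for the given command, or 0 if the command does not occur in the block. */
    public int occurrenceOf(String command) {
        return occurrences.getOrDefault(command, 0);
    }

    /** Exposes the L pairs (c_i, o_i) of the profile. */
    public Map<String, Integer> entries() {
        return occurrences;
    }
}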

4.2 The fuzzy profile

As mentioned at the beginning of this section, the fuzzy profile is constructed using a configured number of local profiles. There are no restrictions on the order of the chosen profiles. They can be taken one by one as they are stored in the activity history, or they can be chosen arbitrarily without preserving any order. The only recommendation concerns the number of selected local profiles, which should be large enough to represent the user's general activity.

The local profiles of the analyzed user selected for the process should contain blocks of activity taken from different parts of a working day. Moreover, in the case of users performing different tasks periodically, the chosen local profiles should reflect as many such periods as possible to cover the relevant activity to a satisfying extent.

Let \(C\) represent the set of unique command names and \(F\) the set of fuzzified numbers of occurrences for the commands within \(C\). Consequently, the fuzzy profile \(P_F\) can be described as the set of the following \(K\) pairs \((c_i,F_{ci})\):

$$\begin{aligned} P_F = \big \{(c_i,F_{ci}) : c_i \in C,\; F_{ci} \in F,\; i=1,2,\ldots , K \big \}, \end{aligned}$$
(4)

where \(c_i\) represents the \(i\)th command and \(F_{ci}\) stands for its generalized, fuzzy occurrence. The number of commands in the fuzzy profile is at least equal to the number of commands in any local profile used for obtaining \(P_F\), which can be expressed as follows:

$$\begin{aligned} L^j \le K, \quad j=1,2,\ldots , R \end{aligned}$$
(5)

where \(R\) designates the number of local profiles used to obtain the fuzzy profile \(P_F\), and \(L^j\) represents the number of commands within the \(j\)th local profile.

The most important parameters describing the fuzzy occurrence \(F_{ci}\) of the \(c_i\) command are the average number of occurrences \(a_{ci}\) and the largest number of occurrences \(m_{ci}\). These values can be directly used to define the membership function \(\mu _{ci}\) of \(F_{ci}\).

Fig. 2 Membership function \(\mu _{ci}\) of the fuzzy occurrence \(F_{ci}\) for the \(c_i\) command

The average number of occurrence \(a_{ci}\) can be obtained using any mean operator. The proposed approach applies the arithmetic mean, which is described by the following equation:

$$\begin{aligned} a_{ci} = \frac{\sum _{j=1}^{R} o^j_i}{R}, \end{aligned}$$
(6)

where \(o^j_i\) represents the number of occurrences of command \(c_i\) in the \(j\)th local profile. If the \(j\)th local profile does not contain an entry for command \(c_i\), its occurrence is taken as 0.

The largest number of occurrences \(m_{ci}\) is also obtained by iterating over every local profile within the analyzed learning set, which can be described as follows:

$$\begin{aligned} m_{ci} = \max _{j=1}^R o^j_i. \end{aligned}$$
(7)

The trapezoidal form of the membership function \(\mu _{ci}\) used in the proposed approach can be described by the following equation:

$$\begin{aligned} \mu _{ci}(o:\gamma ) = \left\{ \begin{array}{ll} 1 & \quad \text {for} \; 0 < o \le a_{ci} \\ \frac{\gamma m_{ci}-o}{\gamma m_{ci}-a_{ci}} & \quad \text {for} \; a_{ci} < o \le \gamma m_{ci}\\ 0 & \quad \text {for} \; o > \gamma m_{ci} \end{array} \right. ; \quad o \in O_{ci},
\end{aligned}$$
(8)

where \(O_{ci}\) represents the universe of discourse for the occurrence of the \(c_i\) command, and \(\gamma \) (the fuzzification ratio) is a system parameter that allows influencing the level of uncertainty. Sample trapezoidal membership functions defined by (8) using different \(\gamma _1\) and \(\gamma _2\) parameters are shown in Fig. 2.
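Taken together, Eqs. (6)–(8) define everything needed to represent a single fuzzy occurrence. A minimal Java sketch, reusing the LocalProfile sketch from Sect. 4.1 (class and field names are illustrative, not the FUZZLIB API):

import java.util.List;

/** Sketch of a fuzzy occurrence F_ci: the average a_ci, the maximum m_ci, and Eq. (8). */
public class FuzzyOccurrence {

    double a; // average number of occurrences a_ci, Eq. (6)
    double m; // largest number of occurrences m_ci, Eq. (7)

    /** Aggregates the occurrences of one command over the R local profiles of the learning set. */
    public FuzzyOccurrence(String command, List<LocalProfile> learningSet) {
        int sum = 0, max = 0;
        for (LocalProfile p : learningSet) {
            int o = p.occurrenceOf(command); // 0 when the command is absent from a profile
            sum += o;
            max = Math.max(max, o);
        }
        a = (double) sum / learningSet.size();
        m = max;
    }

    /** Trapezoidal membership function of Eq. (8); gamma is the fuzzification ratio. */
    public double membership(double o, double gamma) {
        if (o <= 0 || o > gamma * m) return 0.0;
        if (o <= a) return 1.0;
        return (gamma * m - o) / (gamma * m - a); // linear descent from a_ci down to gamma * m_ci
    }
}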

Nevertheless, the approach is very flexible and allows the system to use other forms of membership functions. For example, the uncertainty can be described using a Gaussian function, as in the following equation:

$$\begin{aligned} \mu _{ci}(o:\gamma ) = \left\{ \begin{array}{ll} 1 & \quad \text {for} \; 0 < o \le a_{ci} \\ e^{-\frac{(o-a_{ci})^2}{2(\gamma m_{ci})^2}} & \quad \text {for} \; a_{ci} < o \\ \end{array} \right. ; \quad o \in O_{ci}.
\end{aligned}$$
(9)

Figure 3 depicts an example of a fuzzy profile in the same form as the local profile in Fig. 1. For each command \(c_i\), the most important values are shown: the average number of occurrences \(a_{ci}\) (indicated by a dot) and the largest number of occurrences \(m_{ci}\) (indicated by a hyphen).

Fig. 3 Sample fuzzy profile. For each command \(c_i\), dots indicate the average numbers of occurrences \(a_{ci}\) and hyphens indicate the largest numbers of occurrences \(m_{ci}\)

4.3 Profile boosting

The fuzzy profile defined in Sect. 4.2 generates very small values of the \(a_{ci}\) parameter for commands occurring rarely. For example, if command \(c_x\) occurs once and in only one of the \(R\) local profiles used to obtain a fuzzy profile, then \(a_{cx} = \frac{1}{R}\). On the one hand, this is simply the nature of rare commands within the proposed solution. On the other hand, the approach should provide a mechanism of additional configuration if it is desired by a user of the system, and that is why the profile boosting process has been designed.

Generally speaking, profile boosting promotes very low \(a_{ci}\) and \(m_{ci}\) values without affecting the other fuzzy occurrences within the fuzzy profile. The difference between profile boosting and using higher values of the \(\gamma \) parameter is that \(\gamma \) affects all fuzzy sets within the profile, which is not always desired. Moreover, the \(\gamma \) parameter has a much stronger influence on higher values of \(m_{ci}\) and does not affect the \(a_{ci}\) parameters at all.

The boosting procedure is configured by only one parameter—the boosting level \(\beta \), which is a natural number (\(\beta \in \mathbb {N^+}\)). The precise behavior of the mechanism is described by the following algorithm:

1. if \((m_{ci}<(\beta +2))\)   then   \(m_{ci}=m_{ci}+\beta \)

2. if \((a_{ci}<\beta )\)   then   \(a_{ci}=\beta \)

3. if \(((m_{ci} - a_{ci})<\frac{1}{2}\beta )\)   then   \(m_{ci}=m_{ci}+\beta \)

The presented procedure is repeated for all commands within the fuzzy profile.
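The three steps translate directly into code. A sketch operating on the FuzzyOccurrence structure from Sect. 4.2 (field names are illustrative):

/** Sketch of the boosting procedure of Sect. 4.3, applied to one fuzzy occurrence. */
static void boost(FuzzyOccurrence f, int beta) {
    if (f.m < beta + 2) {          // step 1: lift very small maxima
        f.m += beta;
    }
    if (f.a < beta) {              // step 2: raise very small averages up to beta
        f.a = beta;
    }
    if (f.m - f.a < 0.5 * beta) {  // step 3: keep a margin between m_ci and a_ci
        f.m += beta;
    }
}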

An example of a boosted fuzzy profile for \(\beta =2\) is presented in Fig. 4. It can be compared with the fuzzy profile previously presented in Fig. 3. The values of the \(a_{ci}\) and \(m_{ci}\) parameters changed by the boosting procedure are marked with circles, whereas the previous values are shown in gray.

Fig. 4 Fuzzy profile from Fig. 3 after the boosting procedure for \(\beta =2\). For each command \(c_i\), dots indicate the average numbers of occurrences \(a_{ci}\) and hyphens indicate the largest numbers of occurrences \(m_{ci}\). Circles indicate the values changed by the boosting process; the values before the changes are grayed

5 Profile verification

Fuzzy profiles, created according to the procedures described in the previous sections, are stored in a profile database and can be used to verify whether or not a given local profile conforms with a chosen fuzzy profile. Therefore, a mechanism calculating the level of conformity is needed.

Let the level of conformity be denoted by \(v\). There are many possible ways to obtain \(v\) for given local and fuzzy profiles. The solution applied in the presented approach is based on the arithmetic mean and can be described by the following equation:

$$\begin{aligned} v = \frac{\sum _{i=1}^L \mu _{ci}(o_i)}{L}, \end{aligned}$$
(10)

where \(o_i\) represents the occurrence of the \(c_i\) command in a given local profile (containing \(L\) commands) and \(\mu _{ci}\) denotes the membership function of the same \(c_i\) command within the fuzzy profile that the local profile is being compared to. If the fuzzy profile does not contain the \(c_i\) command, the membership level for this command equals 0.
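A minimal Java sketch of Eq. (10), reusing the LocalProfile and FuzzyOccurrence sketches above (a fuzzy profile is represented here simply as a map from command names to fuzzy occurrences):

import java.util.Map;

/** Sketch of Eq. (10): the level of conformity v of a local profile with a fuzzy profile. */
static double conformity(LocalProfile local, Map<String, FuzzyOccurrence> fuzzyProfile, double gamma) {
    double sum = 0.0;
    int l = local.entries().size(); // L: the number of commands within the local profile
    for (Map.Entry<String, Integer> e : local.entries().entrySet()) {
        FuzzyOccurrence f = fuzzyProfile.get(e.getKey());
        if (f != null) {                              // a command absent from the fuzzy
            sum += f.membership(e.getValue(), gamma); // profile contributes a membership of 0
        }
    }
    return sum / l; // arithmetic mean over the L commands of the local profile
}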

In general, other mean operators can be applied instead of (10). Denoting by \(\uplus \) any mean operator, the equation takes the following general form:

$$\begin{aligned} v = \biguplus _{i=1}^L \mu _{ci}(o_i). \end{aligned}$$
(11)

The higher the calculated level of conformity \(v\), the greater the certainty that the particular local profile matches the activity of the user described by the fuzzy profile.

5.1 Automatic adjustment of the acceptance threshold

For automatic decision making, the system provides a configurable parameter: the acceptance threshold \(\alpha \). It is used to qualify the obtained level of conformity as one describing an authorized user. A local profile is accepted as matching the chosen fuzzy profile only if \(v\ge \alpha \).

The acceptance threshold \(\alpha \) can be assigned globally to describe the testing conditions for all stored fuzzy profiles. However, the specificity of users' activity varies. Therefore, an acceptance threshold adjusted to this specificity should be obtained for each fuzzy profile separately.

The described approach applies the following solution. The same learning set of \(R\) local profiles used to obtain a particular fuzzy profile is further used to calculate \(R\) levels of conformity \(v_i\) according to (10). The output acceptance threshold \(\alpha _{P_F}\) obtained for the fuzzy profile \(P_F\) is based on the minimum of the calculated \(v_i\) values. The final result can be described by the following equation:

$$\begin{aligned} \alpha _{P_F} = \left( \min _{i=1}^R v_i\right) \left( 1-\delta \right) , \end{aligned}$$
(12)

where \(\delta \) represents the acceptance buffer.

The acceptance buffer \(\delta \) is a global parameter which allows the system to influence the level of acceptance for all fuzzy profiles.
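Under the same assumptions, Eq. (12) translates into a short routine that replays the learning set against its own fuzzy profile:

import java.util.List;
import java.util.Map;

/** Sketch of Eq. (12): an acceptance threshold derived from the learning set itself. */
static double acceptanceThreshold(List<LocalProfile> learningSet,
                                  Map<String, FuzzyOccurrence> fuzzyProfile,
                                  double gamma, double delta) {
    double min = Double.MAX_VALUE;
    for (LocalProfile p : learningSet) {
        min = Math.min(min, conformity(p, fuzzyProfile, gamma)); // v_i of each learning profile
    }
    return min * (1.0 - delta); // delta is the global acceptance buffer
}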

6 Computational complexity

There are two possible approaches to the implementation of the fuzzy profiles mechanism for detecting abnormalities in users' work. The first one concerns the installation of all the described functionality directly on the client machine. In this case, the fuzzy profiles are connected with each user's account and stored locally; the system could send only general information to a kind of monitoring center. Such an approach would be problematic in environments where users work on more than one workstation. Moreover, the profiles describing generalized behavior, being stored on the client machines, are susceptible to manipulation.

The second, more secure approach involves a central analyzing server. In this scenario, the client machines send current reports describing the user's actions in the form of the local profile defined in the previous sections. The server provides a central repository of the fuzzy profiles describing the generalized behavior of all users in the analyzed computer systems. However, the server resources can become overloaded by the central computation supporting a certain number of monitored client machines. Therefore, in this section, a short analysis of the computational complexity of the fuzzy profiles method is presented to assess the scale of the possible application environment.

The complexity of the local profile creation can be described by the following simplified expression using \(O\)-notation:

$$\begin{aligned} O(N), \end{aligned}$$
(13)

where \(N\) represents the number of commands loaded to the local profile.

Adding new commands to the profile structure involves searching through the existing collection of commands. Linear complexity in this case is obtained by using hash tables as the main command container.

The fuzzy profile is based on \(R\) local profiles. Therefore, the computational complexity can be expressed as follows:

$$\begin{aligned} O(R \cdot N + N_{u})= O(R \cdot N), \end{aligned}$$
(14)

where \(N_u\) stands for the number of unique commands within the fuzzy profile of the selected user. The \(N_u\) factor represents the final calculations needed after all training profiles have been analyzed. The product \(R \cdot N\) in Eq. (14) represents the dominating complexity; therefore, \(N_u\) can be omitted in the \(O\)-notation.

The last element of the complexity analysis concerns profile verification, when a given local profile is compared to the stored fuzzy profile. The process can be described by the following expression:

$$\begin{aligned} O(N + N_u) = O(N), \end{aligned}$$
(15)

where \(N\) represents the phase of loading \(N\) commands during the creation of the local profile and \(N_u\) represents the number of unique commands within the created local profile. The number of loaded commands \(N\) is usually significantly higher than the number of unique commands within the local profile. Moreover, assuming normal system usage, the number of unique commands \(N_u\) grows more slowly than \(N\). Therefore, it can be omitted within the \(O\)-notation.

The expressions shown above clearly indicate linear computational complexity with respect to the number of commands and learning profiles, which is the lowest possible result. However, it must be emphasized that parallel computation was not taken into account in the analysis.

Fig. 5 Analysis of time complexity according to the \(N\) parameter (number of commands within the command block)

To verify the analyzed computational complexity, four tests were performed for varying values of the \(N\) parameter. The results are presented in Fig. 5. Each value marked with a black circle represents an average of 480 iterations of one examination. One examination included the following process for all 50 users within the SEA database:

1. building a learning set: local profiles for the first 50 blocks of user's commands

2. calculating the fuzzy profile based on the learning set

3. boosting the fuzzy profile

4. obtaining the acceptance threshold

5. building a test set: local profiles for the next 100 blocks

6. confronting (verifying) the test set with the fuzzy profile.
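Under the assumptions of the sketches in Sects. 4 and 5 (with blocks, beta, gamma and delta assumed in scope, and buildFuzzyProfile a hypothetical helper collecting one FuzzyOccurrence per unique command of the learning set), one examination for a single user corresponds roughly to the following driver:

// Hypothetical end-to-end flow for one user; blocks holds the user's 150 command blocks.
List<LocalProfile> learning = new ArrayList<>();
for (int j = 0; j < 50; j++) {                       // steps 1-4: learning phase
    learning.add(new LocalProfile(blocks.get(j)));
}
Map<String, FuzzyOccurrence> fuzzy = buildFuzzyProfile(learning);
fuzzy.values().forEach(f -> boost(f, beta));
double alpha = acceptanceThreshold(learning, fuzzy, gamma, delta);

for (int j = 50; j < 150; j++) {                     // steps 5-6: testing phase
    LocalProfile test = new LocalProfile(blocks.get(j));
    boolean accepted = conformity(test, fuzzy, gamma) >= alpha;
    // accepted == false flags the block as a possible masquerade
}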

As can be observed, the average processing time of the computations performed on an ordinary personal computer (Intel Core 2 Duo P8400, 2.26 GHz) is around 0.2 s. This means that the analysis of all blocks for a single user (the enumeration shown above) takes less than 5 ms.

Therefore, the fuzzy profile mechanism can be successfully applied in the central profile repository model with one intrusion detection server. The examinations have shown the possibility of real-time analysis of a large number of incoming local profiles performed by one computer.

7 Results obtained

The presented method was implemented in the Java programming language using the FUZZLIB library (Kudłacik 2008, 2010b, a), which provides a set of tools for the development of fuzzy systems.

The application designed for testing the approach allows configuring several parameters, introduced by the equations in the previous sections. All parameters are gathered in Table 1.

Table 1 Parameters of the system

The testing process considered changing two parameters: the acceptance buffer \(\delta \) and the fuzzification ratio \(\gamma \), for approaches using the boosting procedure and those without it.

50 fuzzy profiles were created—one for each user's data file within the database (Schonlau et al. 2001). To obtain the fuzzy profiles, only the first 50 (non-contaminated) blocks of commands were used. The same learning set of blocks was also used to automatically calculate the acceptance threshold—individually for each analyzed user's data.

The remaining 100 blocks of each user were used as a testing set. The local profiles obtained from the testing set were confronted with the fuzzy profile by calculating the level of conformity. The decision to either accept or reject the local profile was made by comparing the obtained level of conformity with the acceptance threshold.

False acceptances and false rejections were counted to obtain the final results: the FAR and FRR ratios (Jain et al. 2007). The FAR ratio represents alien blocks falsely accepted as clean blocks. On the other hand, the FRR ratio represents falsely rejected clean blocks (detected as alien blocks).

The FAR was calculated globally according to the following equation:

$$\begin{aligned} \mathrm{FAR} = \frac{\sum _{i=1}^U \mathrm{FA}_i}{\sum _{i=1}^U A_i} \cdot 100, \end{aligned}$$
(16)

where \(\mathrm{FA}_i\) represents the number of falsely accepted alien blocks for the \(i\)th user, \(A_i\) stands for the number of alien blocks within the data of the \(i\)th user, and \(U\) denotes the number of analyzed users of the computer system.

Analogically, the FRR ratio was obtained as follows:

$$\begin{aligned} \mathrm{FRR} = \frac{\sum _{i=1}^U \mathrm{FR}_i}{\sum _{i=1}^U \mathrm{CL}_i} \cdot 100, \end{aligned}$$
(17)

where \(\mathrm{FR}_i\) represents the number of falsely rejected clean blocks for the \(i\)th user and \(\mathrm{CL}_i\) the number of clean blocks within the data of the \(i\)th user.

The presented ratios can also be obtained as an average of the partial FAR and FRR calculated for each user separately (Coull et al. 2003). In this case, the results can differ. Therefore, during the tests, other versions of FAR and FRR were calculated and marked as FAR\(^\prime \) and FRR\(^\prime \):

$$\begin{aligned}&\mathrm{FAR}^\prime = \frac{1}{U_A}\sum _{j=1}^{U_A} \frac{\mathrm{FA}_j}{A_j} \cdot 100,\end{aligned}$$
(18)
$$\begin{aligned}&\mathrm{FRR}^\prime = \frac{1}{U}\sum _{i=1}^U \frac{\mathrm{FR}_i}{\mathrm{CL}_i} \cdot 100, \end{aligned}$$
(19)

where \(U_A\) represents the number of users whose data contain at least one alien block (\(j\) iterates only over these users).
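The difference between the global ratio of Eq. (16) and the averaged ratio of Eq. (18) can be made concrete with a short sketch (the array names are illustrative; index i runs over the U analyzed users):

/** Sketch of the global FAR of Eq. (16): all alien blocks are pooled over the U users. */
static double farGlobal(int[] falselyAccepted, int[] alienBlocks) {
    int fa = 0, a = 0;
    for (int i = 0; i < falselyAccepted.length; i++) {
        fa += falselyAccepted[i]; // FA_i
        a += alienBlocks[i];      // A_i
    }
    return 100.0 * fa / a;
}

/** Sketch of FAR' of Eq. (18): the mean of per-user ratios over the U_A users with alien blocks. */
static double farAveraged(int[] falselyAccepted, int[] alienBlocks) {
    double sum = 0.0;
    int usersWithAliens = 0; // U_A
    for (int i = 0; i < falselyAccepted.length; i++) {
        if (alienBlocks[i] > 0) {
            sum += 100.0 * falselyAccepted[i] / alienBlocks[i];
            usersWithAliens++;
        }
    }
    return sum / usersWithAliens;
}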

Within the results of other researchers, the hit rate HR (Coull et al. 2003; Coull and Szymanski 2008) or detection ratio DR (Bhukya et al. 2006; Wang et al. 2004) can be found; these are the complement of the FAR defined above:

$$\begin{aligned} \mathrm{HR} = \mathrm{DR} = 100 - \mathrm{FAR}. \end{aligned}$$
(20)

The work of Maxion and Townsend (2002) proposes a different ratio, named after its authors the Maxion–Townsend Score, which is a cost function:

$$\begin{aligned} \mathrm{MTS} = 6 \cdot \mathrm{FRR} + (100 - \mathrm{DR}) = 6 \cdot \mathrm{FRR} + \mathrm{FAR}. \end{aligned}$$
(21)

As can be observed, the MTS cost strongly promotes low values of FRR.

Analogically, the HR\(^\prime \) and MTS\(^\prime \) ratios are obtained using FAR\(^\prime \) and FRR\(^\prime \). Unfortunately, in many papers, there is no precise description of how the results were calculated. Therefore, in this paper, the two versions of the obtained ratios (global and averaged) are presented for the fuzzy profile method, which allows comparing the different approaches.

7.1 Comparison with other methods

The approach presented in this paper was confronted with the results obtained using different methods encountered in the literature. The research allowed us to distinguish two main branches. The first one uses only the first 50 blocks of users' data as the source of learning samples. The second one either uses an updating procedure (Ju and Vardi 1999; Maxion and Townsend 2002; Coull and Szymanski 2008), causing the system to adjust the user's profile with blocks beyond the given learning set, or uses the whole set of commands for initial calculations (Schonlau and Theus 2000). For the purposes of this paper, the methods of the second group are called approaches using the extended learning set.

In our opinion, altering the initial assumptions of the SEA data set or inventing new score ratios makes different methods hard to compare objectively. That is why the fuzzy approach presented in this article does not apply any updating mechanisms, like most of the other encountered solutions. Moreover, we postulate using only the very well-known ratios FAR, FRR, and EER in research (Jain et al. 2007). Convergent examinations enable scientists to confront results in different ranges of FAR and FRR. In contrast to a single fabricated ratio, such an approach leaves much more room for scientific discussion.

Table 2 contains the comparison of the fuzzy profile approach with other methods using the same learning set, while Table 3 compares the approach with solutions using the extended learning set. For reference purposes, all presented methods except the fuzzy profile method were assigned consecutive numbers from 1 to 23.

Unfortunately, in many cases, researchers provide various levels of the obtained FAR and FRR ratios, which causes problems in a direct comparison. Therefore, the presented approach was confronted with each level of the FRR ratio found in other papers. The rows are sorted in increasing order of FRR, and the better results within each row are placed higher.

All ratios marked with a prime, like FRR\(^\prime \), FAR\(^\prime \) etc., are presented in brackets. This concerns the methods no. 4, 9, 13, 22, and 23, because their authors provided the algorithms used to compute the ratios. In the other cases, the methods either use the approaches described by Eqs. (16) and (17) or the algorithms are unknown.

Table 2 Comparison of obtained results ordered by FRR

It can be noticed that the fuzzy technique was able to obtain better results in comparison with methods 1, 2, 5, 6, 7, 8, 9, 11, 12, 16 and 17. Most of the results are significantly better.

Methods no. 13 and 9 are the same. For the values of no. 13, the fuzzy approach obtained worse results. However, the authors of Coull et al. (2003) provided a ROC curve, where other values of FAR and FRR can be found. Thus, in the comparison at the lower value of FRR in row no. 9, the fuzzy method scores better.

The worse results of the fuzzy approach in comparison with methods no. 3, 4, 10, 15 and 18 are not significant (FAR lower by about 3 in the worst cases). However, it must be emphasized that the solution presented in this paper is characterized by a very low computational complexity; thus, it can be applied in real-time analysis of a large number of blocks. Therefore, an overall assessment concerning the application of the solution speaks in favor of the fuzzy profile.

A significant difference in results can be noticed only in the case of method no. 14 (Bhukya et al. 2006). Unfortunately, the solution, and the examination process in particular, is not precisely described in that short paper. The authors provide only one result (HR and FRR), giving the best score with respect to the effectiveness ratio (Bhukya et al. 2006). In contrast to the MTS ratio, the effectiveness ratio used to obtain the results in Bhukya et al. (2006) strongly promotes the HR ratio (\(0.8 \cdot \mathrm{HR} + 0.2 \cdot \mathrm{FRR}\)).

Moreover, it is not clarified whether the initial processing of the method concerns the extended learning set or not. That is why, having no proof, the results are presented in Table 2. A parameter speaking in favor of the fuzzy profile approach is, in this case, the relatively high computational complexity of all methods based on Hidden Markov Models.

Table 3 Comparison with approaches using extended learning set

Table 3 contains the results of the approaches using the extended learning set in comparison with the fuzzy profile method. Although the conditions of the system configuration were unfavorable for the fuzzy approach, it obtained better results in two out of the five analyzed cases.

Table 4 contains the precise values of the system parameters for which the different results of the fuzzy approach were obtained. It can be used for direct application in a monitoring system employing the described solution, or to repeat and verify the presented results. The last table (Table 5) presents the same information for two EER values (equal error rates): one obtained for the procedure using the boosting mechanism and the second without it. The EER represents the state of the system configuration for which FAR \(=\) FRR (Jain et al. 2007). The EER itself can be used as a comparison factor between different verification mechanisms.

The chart presenting the examination for which the lowest EER was obtained is shown in Fig. 6.

It should be emphasized that the proposed solution can be tuned by an appropriate selection of the parameters \(\delta \) and \(\gamma \). In this way, the approach is similar to a biometric system, where the relation between the FAR and FRR coefficients can be precisely selected according to the needs of the application environment.

Table 4 Results and system configuration
Table 5 Equal error rates and system configuration
Fig. 6 FAR and FRR according to the \(\gamma \) parameter for \(\delta =0.05\) without boosting

8 Conclusions

Distinguishing computer users based on the characteristics of their work in computer systems is not a trivial task. The nature of human activity is extremely uncertain and variable. The fuzzy approach presented in this paper was specifically designed to address these two concepts: uncertainty and variability.

The proposed fuzzy profiles of computer users have, in the opinion of the authors, two main advantages. The first and most important one concerns the flexibility of the idea in terms of the features used to create the presented fuzzy structures. The second is the low, linear computational complexity with respect to the size of the analyzed input data, which is very important in the continuous analysis of large amounts of information.

The fuzzy profile allows gathering similar, countable features grouped by the number of their occurrences. Therefore, it represents a form of fuzzy statistics. In the case of the SEA database, the command names were the features, and the numbers of their occurrences within a given block were the parameters of the created fuzzy statistics. The SEA database was chosen because of the availability of many research results concerning different methods of analysis. Therefore, an appropriate environment for comparison exists; there is no such possibility in the case of private databases.

The obtained results show that the fuzzy profiles can distinguish users' activity, even in the case of the limited information stored within the SEA database. Moreover, the approach offers a wide range of possibilities by applying statistics of other features. Future research will focus on contemporary environments based on graphics, pointing devices and keyboards to find the best set of features and parameters. Several valuable examples concerning contemporary user interfaces of operating systems and Internet usage could involve the following values, stored and measured over a given period of time:

  • names of windows chosen by the user and dynamics of pointing device before window selection

  • dynamics of keystrokes for chosen characteristic groups of keys

  • dynamics of pointing device in characteristic screen areas

  • chosen menu options and number of their selections

  • chosen network domains and network resources.

The research has shown that the computational effectiveness speaks in favor of the approach, which has an impact on the practical aspects of the problem. For complex user interfaces, obtaining, processing and storing the data of a user's activity is even more complicated. The situation is different in the case of other solutions, like those based on Hidden Markov Models and Support Vector Machines, which give the best results for the SEA data set. The learning phases of these approaches are particularly complex, making the process of updating very time consuming. In contrast, creating an updated version of the fuzzy profile is natural, because the data analysis is always based on local profiles, both during the learning phase and the verification phase.

The fast analysis of user's activity based on the fuzzy profiles does not use many resources of the client machine, which creates the local profiles and passes them to the central repository for examination. Because of the low computational complexity, the central monitoring server is able to process a large number of incoming local profiles in real time. Moreover, the flexibility of the approach allows using different types of features. Therefore, considering the mentioned advantages, and particularly the demonstrated accuracy of the fuzzy profiles, the method should be considered for practical applications.