1 Introduction

Today, popular forms of biometric authentication include fingerprints and facial recognition. However, such biometric techniques do not resolve all authentication issues. For example, studies show that the elderly are reluctant to use facial recognition and fingerprint recognition for authentication on mobile phones, while young people prefer to type instead of using other ways to authenticate [20]. Therefore, some passive biometrics have recently emerged. In this research, we consider a biometric based on keystroke dynamics. Such techniques are applicable to the authentication problem, and can also potentially play a role in intrusion detection.

Keystroke dynamics are derived from typing behavior. This approach typically relies on features such as the duration of keyboard events, the duration of the “bounce,” the time difference between each character, and so on [38]. Such data can be collected through monitoring keyboard input and recording, for example, the time intervals between each keystroke. However, it is worth noting that a biometric based on keystroke dynamics is unlikely to be powerful enough to serve as a standalone authentication technique, and hence keystroke dynamics generally must be used in conjunction with other types of authentication, such as passwords [30]. In its related role as an intrusion detection system (IDS), keystroke dynamics may be competitive with other approaches [38].

Compared with popular biometric technologies such as fingerprints and iris scans, keystroke dynamics has some advantages. First, in terms of hardware, keystroke features can be gathered through a simple API interface, with the collected data then passed to a model for evaluation. Hence, no additional hardware deployment is involved, which reduces the cost. Second, as alluded to above, keystroke information can be obtained in a more passive and natural manner, which eases the collection burden on users. Third, keystroke dynamics can be used in an ongoing, real-time IDS mode to judge whether current behavior is consistent with a specific user’s previous behavior. In contrast, in a typical username and password authentication scenario, such passive monitoring is not an option. Therefore, keystroke dynamics can serve to enhance security beyond the authentication phase.

Of course, there are also some disadvantages to using keystroke dynamics for authentication. One issue is that if a user has an injured hand or is simply distracted or overly emotional, their typing patterns may not be consistent with the patterns used for training. Another disadvantage is that typing patterns may vary across different keyboards, or even due to new applications or software updates, which indicates that models must be updated regularly. Although such concerns are legitimate, these issues can be mitigated, and hence the utilization of keystroke dynamics is likely to increase in the near future.

In this research, we analyze various keystroke dynamics data and train machine learning and deep learning models to distinguish between users. Features include individual key presses and flight time, among others. Note that for the sake of user privacy, we do not store sequences of actual keystrokes, and hence the text itself is not used for modeling purposes.

We consider a wide variety of learning techniques, including k-nearest neighbors (k-NN), random forests, support vector machines (SVM), convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM) networks, extreme gradient boosting (XGBoost), and multilayer perceptrons (MLP).

Much of the previous research in this field is based on multiclass models trained on relatively small amounts of data per user. There are several inherent problems with such an approach. For example, if a new user is added, or the typing content (e.g., password) is changed, the model needs to be retrained. Furthermore, until recently, most work in this field considered only traditional statistical machine learning methods, with limited use of modern deep learning techniques. In contrast, we focus on modern machine learning and deep learning techniques, and we are able to improve on previous related work.

The remainder of this paper is organized as follows. Section 2 discusses relevant background topics, including introducing the learning techniques considered. We provide a survey of previous work in Sect. 3, while Sect. 4 describes the dataset used in our experiments. Our experimental results are presented in Sect. 5. Lastly, Sect. 6 summarizes our main results and we include a discussion of possible directions for future work.

2 Background

In this section, we discuss keystroke dynamics in general and we introduce the various learning techniques that we use in this research. Previous work in this field is surveyed in Sect. 3, and the dataset is described in Sect. 4.

2.1 Keystroke Dynamics

According to [27], “keystroke dynamics is not what you type, but how you type.” Most previous work on typing biometrics can be divided into either classification based on fixed text or authentication based on free text [30]. For fixed text, the text used to model the typing behavior of a user and to authenticate the user is the same. This approach is usually applied to short text sequences, such as passwords. Classification can be based on various timing features related to the characters typed [10]. Moreover, by combining a password along with a username, such a system can be further strengthened [26]. A comprehensive discussion related to the fixed-text data problem can be found in [30].

As for the free-text case, the text used to model the typing behavior of a user and to authenticate the user is not necessarily the same. This approach is usually applied to long text sequences, and can be viewed as a continuous form of authentication or as an intrusion detection system (IDS). In this paper, we only consider the fixed-text problem.

Previously, many different distance-based methods have been applied to keystroke dynamics. More recently, machine learning techniques have been considered, including support vector machines (SVM), recurrent neural networks (RNN), and so on [38]. The learning techniques evaluated in this paper are introduced below.

2.2 Learning Techniques

For our experiments, we have considered a wide variety of learning techniques. We introduce these learning techniques in this section.

2.2.1 Random Forest

A random forest [14] is a supervised, decision tree-based machine learning method that is often highly effective for classification and regression tasks. This technique consists of a large number of individual decision trees, where each decision tree is based on a subset of the available features, and a subset of the training samples. The subsets used for each decision tree are selected with replacement. A majority vote or averaging of the component decision trees is used to determine the random forest classification.

2.2.2 Support Vector Machine

Support vector machines (SVM) [30] are a powerful class of supervised machine learning techniques. The key idea of an SVM is to construct a hyperplane, so that the data can be divided into categories [34]. The so-called “kernel trick” enables us to efficiently deal with nonlinear transformations of the feature data. As with random forests, SVMs often perform well in practice.

2.2.3 K-Nearest Neighbors

The k-nearest neighbors (k-NN) algorithm [24] is an intuitively simple technique, whereby we classify a sample based on the k nearest samples in the training set. In spite of its simplicity, k-NN often performs well, although overfitting is a concern, especially for small values of k. Both k-NN and random forest are neighborhood-based algorithms, although the neighborhood structure determined by each is significantly different.

2.2.4 T-SNE

The method of t-distributed stochastic neighbor embedding (t-SNE) is a non-linear dimensionality reduction technique that was originally proposed in [35]. It is typically used for data visualization, to reduce the dimensionality of the feature space, and for clustering. In contrast to the more well-known principal component analysis (PCA), t-SNE is better able to capture non-linear relationships in the data.

2.2.5 XGBoost

XGBoost, the name of which is derived from extreme gradient boosting, is a popular technique that has played an important role in a large number of Kaggle competitions. In comparison to the simpler AdaBoost technique, XGBoost has advantages in terms of dealing with outliers and misclassifications.

Data augmentation consists of generating synthetic data based on an existing dataset. Such “fake” data can be used to make up for a lack of data for a given problem. Data augmentation has often proved valuable in practice. We consider data augmentation in our XGBoost experiments.

2.2.6 LSTM and Bi-LSTM

Long short-term memory (LSTM) is a highly specialized recurrent neural network (RNN) architecture that is able to better deal with the vanishing and exploding gradient issues that plague plain “vanilla” RNNs [34]. Consequently, LSTMs generally perform much better over longer sequences as compared to vanilla RNNs.

A bi-directional LSTM (bi-LSTM) combines two LSTMs, one computed in the forward direction and another computed in the backward direction. Bi-LSTMs are well-suited to sequence labeling tasks and have proven to be strong at modeling contextual information in natural language processing (NLP) tasks.

In our LSTM and bi-LSTM experiments, we consider two different encoding methods. In addition to the standard raw feature encoding, we also experiment with one-hot encoding. Assuming that a feature can take on m possible values, a feature value of k has a one-hot representation consisting of a binary vector of length m with a 1 in the kth position and 0 elsewhere. When training, one-hot encoding has a natural interpretation as a vector of probabilities, and hence it is well suited to training involving a softmax output layer, for example.
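The one-hot representation described above can be illustrated with a minimal sketch. The function name `one_hot` is hypothetical, and positions are counted from zero here, which is an assumption about indexing rather than something specified in the text.

```python
def one_hot(k, m):
    """Return a length-m binary vector with a single 1 at index k (0-based)."""
    if not 0 <= k < m:
        raise ValueError(f"value {k} is outside the range 0..{m - 1}")
    return [1 if i == k else 0 for i in range(m)]


# A feature with m = 5 possible values, taking the value k = 2.
print(one_hot(2, 5))  # [0, 0, 1, 0, 0]
```

Since exactly one entry is nonzero, each such vector sums to one, which is what allows it to be read as a degenerate probability distribution when paired with a softmax output.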

We also consider attention mechanisms. The idea of an attention mechanism is intuitively simple—we want to force the model to focus on some specific aspect of the training data. Attention is somewhat related to regularization, in the sense that we reduce the potential for over-reliance on some parts of the training data, which can lead to various pathologies, including overfitting.

2.2.7 Convolutional Neural Network

Convolutional neural networks (CNN) are designed to deal effectively and efficiently with local structure. CNNs have proven their worth in the realm of image analysis. Most CNN architectures include convolutional layers, pooling layers, and a fully-connected output layer.

2.2.8 Multi-Layer Perceptron

The structure of a generic multi-layer perceptron (MLP) includes an input layer, one or more hidden layers, and an output layer. Each node, or neuron, in a hidden layer includes a nonlinear activation function, which is the key to the ability of an MLP to deal with challenging data. To mitigate overfitting, we employ dropout for regularization in our MLP experiments [23].

3 Previous Work

In this section, we first consider distance-based methods. Then we discuss more recent work that relies on various machine learning techniques.

The concept of keystroke dynamics first appeared in the 1970s and was focused on fixed-text data [12]. In subsequent years, Bayesian classifiers based on the mean and variance in time intervals between two or three consecutive key presses were applied to the problem [28]. The results in [28] claim a classification accuracy of 92% on a dataset with 63 users.

Typical of early work in this field are nearest neighbor classifiers based on various distance measures. Initially, Euclidean distance or, equivalently, the L2 norm was used. In contrast to the L2 norm, the L1 norm (i.e., Manhattan distance) makes it easier to determine the contributions made by individual components, and it is more robust to the effect of outliers. In [24], it is shown that among all distance-based techniques, the best performance is obtained from a nearest neighbor classifier that uses a scaled Manhattan distance.
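To make these distance measures concrete, the following is a minimal sketch of the L2, L1, and scaled Manhattan distances between a probe vector and a user template. The scaling used here, namely dividing each component by the mean absolute deviation of that feature over the training samples, is an assumption about how the scaling is defined, and all function names and timing values are hypothetical.

```python
import math


def euclidean(x, y):
    """L2 distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 distance; each term is one feature's contribution."""
    return sum(abs(a - b) for a, b in zip(x, y))


def fit_template(samples):
    """Per-feature mean and mean absolute deviation of a user's training samples."""
    n = len(samples)
    dims = range(len(samples[0]))
    mean = [sum(s[i] for s in samples) / n for i in dims]
    abs_dev = [sum(abs(s[i] - mean[i]) for s in samples) / n for i in dims]
    return mean, abs_dev


def scaled_manhattan(x, mean, abs_dev):
    """L1 distance to the template, with each term scaled by that feature's
    mean absolute deviation (assumed nonzero in this sketch)."""
    return sum(abs(a - m) / d for a, m, d in zip(x, mean, abs_dev))


# Hypothetical training samples for one user (two timing features, in seconds).
train = [[0.10, 0.30], [0.12, 0.50], [0.14, 0.40]]
mean, abs_dev = fit_template(train)
probe = [0.12, 0.80]
print(euclidean(probe, mean), manhattan(probe, mean), scaled_manhattan(probe, mean, abs_dev))
```

The scaling means that a deviation in a feature that the user types very consistently counts for more than the same deviation in a feature that naturally varies.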

Neither the L1 nor the L2 norm deals effectively with statistical properties, and hence statistical-based distance measures have also been considered. For example, Mahalanobis distance has been widely used in keystroke dynamics research [4].

Recently, research in keystroke dynamics has been heavily focused on machine learning techniques. Such research includes k-nearest neighbors (k-NN) [37], K-means clustering [17], random forests [25], fuzzy logic [15], Gaussian mixture models [16], and many other approaches. In the remainder of this section, we discuss some relevant examples of machine learning based research focused on fixed-text keystroke dynamics.

In [36], support vector machines (SVM) are used to extract features from the data that are then used for classification. Another popular machine learning technique that has been used in keystroke dynamics is the hidden Markov model (HMM). An HMM includes a Markov process that is “hidden” in the sense that it can only be indirectly observed [33]. In [7], an HMM is used to learn the time intervals in keystroke dynamics.

A number of neural network architectures have also been applied in keystroke dynamics in recent years [5, 22]. Deep learning techniques have also been successfully applied to classification and have achieved better performance, as compared to previous techniques, such as those considered in [29]. Deep networks usually require a relatively long time to train, and hence Adam optimization and leaky rectified linear unit (leaky ReLU) activation functions are often used to speed up the learning process [23].

In [2], a genetic algorithm known as neuro evolution of augmenting topologies (NEAT) is considered. This algorithm achieves a high accuracy on a custom dataset.

In [8], keystroke dynamics authentication based on fuzzy logic is considered, and an accuracy of 98% is achieved. This model evolves in the sense that it can update keystroke templates when a user login is successful. The research in [21] uses extreme gradient boosting (XGBoost), random forest, multilayer perceptron (MLP), and other machine learning methods to perform multiclass classification on the Carnegie Mellon University (CMU) dataset, which is the same dataset considered in this paper. In [21], a highest accuracy of 93.79% is achieved using XGBoost. However, these authors do not discuss hyperparameter tuning, and thus it may be possible to improve on their results.

As the name suggests, the equal error rate (EER) is the error rate at the operating point where the false acceptance rate (FAR) and false rejection rate (FRR) are equal. The value of the EER serves as an indicator of the performance of a system, enabling the direct comparison of different biometrics: the lower the value of the EER, the better the performance of the system. The EER is easily obtained from an ROC curve.
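As a rough illustration of how the EER can be estimated, the sketch below sweeps a decision threshold over a set of match scores and reports the point where the FAR and FRR are closest. It assumes that higher scores indicate a genuine user and that a sample is accepted when its score is at least the threshold. The function names and the scores are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor scores accepted; FRR: fraction of genuine scores rejected."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Return (EER estimate, threshold) at the threshold where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # When FAR and FRR do not coincide exactly, report their average.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical match scores for genuine and impostor attempts.
genuine = [0.9, 0.8, 0.7, 0.6, 0.3]
impostor = [0.1, 0.2, 0.4, 0.5, 0.65]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```

For these scores, one of five impostor attempts is accepted and one of five genuine attempts is rejected at a threshold of 0.6, so the estimated EER is 0.2.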

The authors of [6] propose using convolutional neural networks (CNN) for authentication based on keystroke dynamics. Their model architecture is very similar to that in [18], with the main ideas deriving from a sentence classification task. They feed time-based feature vectors into the model directly instead of reshaping the vectors into matrices. They also explore the influence of different kernel sizes, different numbers of kernels, and different numbers of neurons in the fully connected layer. Their model is evaluated on an open fixed-text keystroke dataset, and their best equal error rates (EER) are 2.3% and 6.5% with and without data augmentation, respectively.

Time-based features and pressure-based features are considered in [1]. By combining the information of these two kinds of features, the authors achieve good performance. In addition, they deal with typos—when a typo is recognized, the duration of keystroke time between the wrong key and back-space key is ignored, as is the duration between the back-space key and the correct key.

Another study considers deep belief networks (DBN) to extract hidden features, which are then used to tune a pre-trained neural network [9]. The authors of [9] claim that deep learning techniques significantly outperform other algorithms on the CMU fixed-text dataset.

The CMU keystroke dataset is a well-known public fixed-text dataset and has been extensively studied. The use of a common dataset enables research to be directly compared. In [24], the authors introduce this dataset and achieve a baseline result with an EER of 9.6%. There are now many studies that use this same dataset and outperform this baseline result. For example, in [6], the authors obtain an EER of 2.3%, based on a CNN with data augmentation, while in [23], an EER of 3% is attained using a multilayer perceptron (MLP).

As an aside, we note that other keystroke features might be of interest. For example, keystroke acoustics for user authentication are considered in [32]. In this research, a dataset containing 50 users results in an EER of 11%, which shows that acoustical information can be informative. However, an advantage of keystroke dynamics is that such information is easily collected directly from any standard keyboard.

4 Dataset

The Carnegie Mellon University (CMU) fixed-text dataset is used for all experiments considered in this paper. The CMU dataset commonly serves to benchmark techniques in keystroke dynamics research [3, 6, 11, 13, 21, 23, 31]. This dataset includes 51 users’ keystroke dynamics information, where each user typed the password “.tie5Roanl” a total of 400 times, consisting of 50 repetitions over each of 8 sessions. Between sessions, a user had to wait at least one day, so that the day-to-day variation of each subject’s typing was captured [24]. Furthermore, this password was chosen to be representative of a strong 10-character password, as it contains a special symbol, a number, lowercase letters, and a capital letter. Each time this password is typed, 31 time-based features are collected, as listed in Table 1. Note that the Enter key is pressed after typing the 10-character password. Hence, there are 11 keystrokes, consisting of 10 consecutive pairs, which yields 11 hold times, 10 key-down key-down times, and 10 key-up key-down times, for a total of 31 features.

Table 1 Keystroke features in CMU dataset

Individual keystrokes in a sequence can be viewed as words in a sentence, in the sense that we can tie the UD-time and DD-time from two adjacent keystrokes with the duration of the previous keystroke. Following this approach, for each keystroke, we obtain a vector consisting of three features, which we interpret as an 11 × 3 matrix. Thus, our feature “vectors” consist of a sequence of these matrices. We refer to this matrix as the “fixed keystroke dynamics sequence,” which we abbreviate as fixed-KDS.
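The construction of the fixed-KDS can be sketched as follows. This sketch assumes that the 31 features appear in the column order of the public CMU file (hold time, then DD and UD times to the next key, repeated for each key and ending with the hold time for Return), and that the final row, which has no following keystroke, is padded with zeros. The padding choice and the function names are assumptions made for illustration.

```python
# Key names as they appear in the CMU column labels (e.g., DD.period.t).
KEYS = ["period", "t", "i", "e", "five", "Shift.r", "o", "a", "n", "l", "Return"]


def cmu_columns():
    """Build the 31 column names: H, DD, UD for each consecutive pair, then H.Return."""
    cols = []
    for cur, nxt in zip(KEYS, KEYS[1:]):
        cols += [f"H.{cur}", f"DD.{cur}.{nxt}", f"UD.{cur}.{nxt}"]
    cols.append(f"H.{KEYS[-1]}")
    return cols


def to_fixed_kds(row, pad=0.0):
    """Reshape one 31-feature row into an 11 x 3 matrix of (H, DD, UD) triples."""
    if len(row) != 31:
        raise ValueError("expected 31 timing features")
    padded = list(row) + [pad, pad]  # Return has no following key, so pad its DD and UD
    return [padded[i:i + 3] for i in range(0, 33, 3)]


row = [float(i) for i in range(31)]  # placeholder values, not real timings
kds = to_fixed_kds(row)
print(len(cmu_columns()), len(kds), len(kds[0]))  # 31 11 3
```

Each row of the resulting matrix plays the role of one "word," with the hold time of a key tied to the DD and UD times that lead to the next key.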

5 Experiments and Results

This section contains the results of our fixed-text experiments on the CMU dataset. We provide some analysis and discussion of our results.

As mentioned above, in the CMU dataset, the data is arranged as a table with 31 columns, where each row represents the collected information for one typing of the password. For example, one column is H.period, which is the hold time for the “.” key. The hold time is the length of time that the key was depressed. Another example is the column DD.period.t, which is the time interval from when the “.” key was pressed until the “t” key was pressed. The overall table is 20,400 × 31 (51 subjects × 400 repetitions), where each row corresponds to the timing information for a single repetition of the password by a single subject. Figure 1 illustrates the timing relationship between consecutive keystrokes.

Fig. 1 Keystroke dynamics features
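The timing relationships among the three feature groups can be made concrete with a small sketch that derives the hold (H), key-down key-down (DD), and key-up key-down (UD) times from raw press and release timestamps. The event format, the function name, and the timestamps below are all hypothetical.

```python
def timing_features(events):
    """events: list of (key, press_time, release_time) tuples in typing order (seconds)."""
    features = {}
    for (key, down, up), (nxt, nxt_down, _) in zip(events, events[1:]):
        features[f"H.{key}"] = up - down              # how long the key was held
        features[f"DD.{key}.{nxt}"] = nxt_down - down  # press to next press
        features[f"UD.{key}.{nxt}"] = nxt_down - up    # release to next press
    last_key, last_down, last_up = events[-1]
    features[f"H.{last_key}"] = last_up - last_down
    return features


# Hypothetical timestamps; the "i" key is pressed before "t" is released.
events = [("period", 0.00, 0.12), ("t", 0.40, 0.50), ("i", 0.48, 0.56)]
for name, value in timing_features(events).items():
    print(f"{name}: {value:.2f}")
```

By construction, each DD time equals the corresponding hold time plus the UD time. The UD time is negative whenever the next key is pressed before the current key is released, as happens for the second pair in this example.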

5.1 Data Exploration

There are 31 timing features in the CMU dataset, which can be divided into three groups which we denote as DD, UD and H. Here, we analyze the data to determine whether there is any significant difference among these three groups. For this data exploration, we have randomly selected six of the 51 subjects for analysis.

In Fig. 2a, each line graph represents the 400 input feature vectors corresponding to a given subject. From this figure, we observe that most of the feature vectors are fairly consistent in that they follow a similar pattern for a given subject. This indicates that subjects tend to be relatively consistent with respect to this particular feature group. This observation can be seen as a positive indicator of the potential to successfully classify the subjects. However, when the six subjects’ average cases are compared in Fig. 2b, the results show that the subjects have somewhat similar typing patterns.

Fig. 2 Key-down key-down for six subjects (400 keystrokes). (a) Individual. (b) Average

The analogous results for the key-up key-down features are shown in Fig. 3. We observe that this data is similar to key-down key-down data in Fig. 2.

Fig. 3 Key-up key-down for six subjects (400 keystrokes). (a) Individual. (b) Average

In Fig. 4a, we compare the six subjects based on the hold-time feature, and here the differences are more pronounced. In particular, the average cases in Fig. 4b reveal more substantial differences. These results indicate that the hold duration should be a strong feature for distinguishing users.

Fig. 4 Hold time for six subjects (400 iterations). (a) Individual. (b) Average

To further explore the data, we apply t-SNE as a clustering technique to gain insight into how the data is distributed. In this case, we consider a subset consisting of the first seven subjects, using all 400 records for each of these subjects. The results in Fig. 5 show that the subjects can be clustered into different groups. This is again promising, as it indicates that we should have success in distinguishing users.

Fig. 5 T-SNE of features of seven subjects

5.2 Classification Results

In this section, we give our classification results. Here, we experiment with k-NN, random forest, SVM, XGBoost, MLP, CNN, RNN, and LSTM.

5.2.1 K-Nearest Neighbor Experiments

We optimize with respect to three parameters of the k-NN algorithm, namely, the number of neighbors, the weight function used for prediction, and the distance metric. As in all of our parameter tuning experiments, we employ a Bayes model to generate a set of parameters that, with high probability, yields the best result. Table 2 shows the search space for each parameter and the best accuracy achieved. Boldface entries in Table 2 are used to indicate the optimal parameter values.

Table 2 Results for k-NN
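To illustrate the three k-NN parameters being tuned, the following self-contained sketch evaluates every combination of neighbor count, weight function, and distance metric on a held-out set. Note that this uses a plain exhaustive grid search rather than the Bayes model employed in our experiments, and that the toy data, the parameter grid, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

METRICS = {
    "manhattan": lambda x, y: sum(abs(a - b) for a, b in zip(x, y)),
    "euclidean": lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))),
}


def knn_predict(train_X, train_y, x, k, weights, metric):
    """Classify x by a (possibly distance-weighted) vote among its k nearest neighbors."""
    dist = METRICS[metric]
    nearest = sorted((dist(x, t), label) for t, label in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 if weights == "uniform" else 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)


def accuracy(train_X, train_y, val_X, val_y, **params):
    hits = sum(knn_predict(train_X, train_y, x, **params) == y
               for x, y in zip(val_X, val_y))
    return hits / len(val_y)


def grid_search(train_X, train_y, val_X, val_y, grid):
    """Return the first parameter combination with the highest validation accuracy."""
    best_params, best_acc = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        acc = accuracy(train_X, train_y, val_X, val_y, **params)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc


# Toy data: two subjects, two timing features each (seconds).
train_X = [[0.10, 0.20], [0.11, 0.22], [0.12, 0.21],
           [0.30, 0.40], [0.31, 0.42], [0.29, 0.41]]
train_y = ["s1", "s1", "s1", "s2", "s2", "s2"]
val_X, val_y = [[0.105, 0.21], [0.30, 0.405]], ["s1", "s2"]
grid = {"k": [1, 3], "weights": ["uniform", "distance"],
        "metric": ["manhattan", "euclidean"]}
print(grid_search(train_X, train_y, val_X, val_y, grid))
```

An exhaustive search is only practical for small grids such as this one. A model-based search aims to reach comparable parameter values with fewer evaluations.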

5.2.2 Random Forest Experiments

We optimize four parameters of the random forest algorithm, namely, the number of decision trees, the maximum depth of each decision tree, the minimum number of samples in a leaf node, and the minimum number of samples required to split. Again, we make use of different combinations of values of these parameters to build a Bayes model, which generates a set of parameters that will, with high probability, yield the best result. Table 3 shows the range considered for each of these parameters, the optimal values that we found (in boldface), and the best result obtained.

Table 3 Results for random forest

5.2.3 Support Vector Machine Experiments

Here, we consider four parameters of an SVM, namely, the value of the regularization parameter, the kernel function, and the two coefficients of the kernel function. Again, a Bayes model is built to search for the optimal values of these parameters. The search space for each parameter, the optimal values, and the best accuracy are given in Table 4.

Table 4 Results for SVM

5.2.4 XGBoost Experiments

Next, we classify the samples using XGBoost. Here, we consider each of the three feature groups (DD, UD, and H) individually, as well as the combination of all three. The multiclass classification results for the 51 subjects are shown in Table 5 and the model parameters used to achieve these results are given in Table 6.

Table 5 Accuracy of four features for XGBoost
Table 6 Selected parameters for XGBoost

Based on these results, we conduct further experiments with XGBoost. Given the fairly limited size of the training data, we apply a simple data augmentation strategy: we randomly perturb each timing feature based on a range of (−0.02, 0.02). In this experiment, we set the augmentation ratio to two, meaning that the amount of augmented data is two times the amount of original data. We find that this data augmentation provides a slight improvement in the accuracy, as shown in Table 7.

Table 7 Results for XGBoost
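The augmentation strategy described above can be sketched as follows. This sketch assumes that the perturbation is additive noise drawn uniformly from (−0.02, 0.02), that each original sample yields `ratio` perturbed copies carrying the same subject label, and that the original samples are retained alongside the synthetic ones. These details, along with the function name and sample values, are assumptions made for illustration.

```python
import random


def augment(samples, ratio=2, delta=0.02, seed=0):
    """Create `ratio` perturbed copies of each sample by adding uniform noise
    in (-delta, delta) to every timing feature."""
    rng = random.Random(seed)
    augmented = []
    for sample in samples:
        for _ in range(ratio):
            augmented.append([v + rng.uniform(-delta, delta) for v in sample])
    return augmented


# Two hypothetical samples with three timing features each.
original = [[0.12, 0.40, 0.28], [0.10, 0.35, 0.25]]
synthetic = augment(original)
training_set = original + synthetic
print(len(original), len(synthetic), len(training_set))  # 2 4 6
```

With a ratio of two, the synthetic data is twice the size of the original data, so the combined training set under these assumptions is three times the original size.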

5.2.5 Multilayer Perceptron Experiments

Our generic MLP consists of four fully connected layers, in which the numbers of neurons are 512, 256, 144, and 51, respectively. The output of the last layer is fed into a softmax function to calculate the corresponding probability for each class. A rectified linear unit (ReLU) activation function and a batch normalization layer are used in the first and second dense layers. We use the cross entropy loss function for this model; additional parameters are listed in Table 8. This MLP model yields an impressive accuracy of 95.96%.

Table 8 Results for MLP
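The layer structure of this MLP can be illustrated with a forward pass written in NumPy. This is only a shape-level sketch: the weights are random and untrained, batch normalization and dropout are omitted, ReLU is applied after every hidden layer, and an input dimension of 31 is assumed. None of these simplifications should be read as the exact configuration used in our experiments.

```python
import numpy as np

# 31 inputs is an assumption; 512, 256, 144, and 51 are the stated layer widths.
LAYER_SIZES = [31, 512, 256, 144, 51]


def init_params(rng):
    """Random (untrained) weights and zero biases for each fully connected layer."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x, params):
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU on hidden layers
    w, b = params[-1]
    return softmax(h @ w + b)  # one probability per subject


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())


rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.random((4, 31))  # four placeholder samples
probs = forward(x, params)
print(probs.shape, cross_entropy(probs, np.array([0, 1, 2, 3])))
```

The final layer has 51 outputs, one per subject, and the softmax ensures that each row of `probs` sums to one.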

5.2.6 Convolutional Neural Network Experiments

The input for our CNN model is the fixed-KDS data structure, which we discussed in Sect. 4. The architecture of our CNN is based on that of the so-called textCNN in [18], which is used to process sequential text data. The key idea is to apply multiple rectangular kernels, instead of more typical square kernels. Specifically, the width of all kernels is the same as the embedding size for each word, so the output for each convolution is a one-dimension vector. Then multiple max-pooling layers are used to process these vectors to yield one feature for each kernel. Finally, these generated features are concatenated into a one-dimension vector, and multiple fully-connected layers are used to produce the class prediction. Our CNN model is illustrated in Fig. 6.

Fig. 6 Architecture of CNN for free-text datasets

In our keystroke dynamics model, we view each keystroke event as a “word” and each keystroke sequence as a “sentence.” In this way, six different convolution kernels are applied to this sequential data, and continuous max-pooling layers extract the most important feature from each kernel. Then the concatenated vector is fed into three dense layers, with a softmax function used to generate the probability for each class. In addition, a dropout layer is added after the penultimate layer. The cross entropy loss function is used. For these CNN experiments, the best result we obtain is an accuracy of 92.57%.
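The convolution and pooling steps can be sketched in NumPy as follows. Each kernel spans the full width of the 11 × 3 fixed-KDS matrix, so sliding it over the keystrokes produces a one-dimensional vector, and max-pooling reduces that vector to a single feature. The kernel heights, the random weights, and the omission of bias terms and activations are assumptions made for illustration.

```python
import numpy as np


def conv_over_time(kds, kernel):
    """Slide a (height x 3) kernel down an (11 x 3) matrix, giving a 1-D vector."""
    height = kernel.shape[0]
    steps = kds.shape[0] - height + 1
    return np.array([np.sum(kds[i:i + height] * kernel) for i in range(steps)])


def text_cnn_features(kds, kernels):
    """Max-pool each kernel's output to one value and concatenate the results."""
    return np.array([conv_over_time(kds, k).max() for k in kernels])


rng = np.random.default_rng(0)
kds = rng.random((11, 3))  # placeholder fixed-KDS matrix
# Six kernels, as in the text; the heights (2, 3, and 4 keystrokes) are hypothetical.
kernels = [rng.normal(size=(h, 3)) for h in (2, 2, 3, 3, 4, 4)]
features = text_cnn_features(kds, kernels)
print(features.shape)  # (6,)
```

The resulting six-element vector corresponds to the concatenated features that are passed to the dense layers.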

5.2.7 Recurrent Neural Network Experiments

The architecture of our RNN-based neural network is shown in Fig. 7. The input data for this model is the fixed-KDS, as discussed in Sect. 4. The idea behind this model comes from the field of sentiment analysis. Since keystroke data is inherently sequential, we apply a two-layer bi-directional RNN. In this experiment, the cross entropy loss function is used, and the best result we obtain is an accuracy of 93.45%.

Fig. 7 Architecture of bi-RNN

5.2.8 LSTM Experiments

Next, we apply both LSTM and bi-LSTM with one-hot encoding. In these experiments, one-hot encoding is applied on both the subject and the timing features, which then serve as the feature vectors for the LSTM and bi-LSTM. The accuracies we obtain for LSTM and bi-LSTM are shown in Table 9. Although these results are reasonably strong, they are not competitive with our XGBoost experiments.

Table 9 Results for LSTM and bi-LSTM with one-hot encoding

We further consider a bi-LSTM with attention, primarily as a way of analyzing feature importance. The attention matrix, in the form of a heatmap, appears in Fig. 8. This matrix consists of the weights determined by the attention layer. In this matrix, the x-axis represents the 31 features, while the y-axis is based on 20 consecutive training samples. We observe that after several epochs, the attention seems to have a tendency to converge to specific features. At the end of training, we find that the most significant features are DD.period.t, DD.e.five, UD.Shift.r.o, DD.n.l, and DD.l.Return.

Fig. 8 Attention matrix heatmap

5.3 Summary and Discussion

We summarize our experimental results for the CMU fixed-text dataset in Fig. 9. The results show that among all models we have considered, XGBoost with data augmentation, denoted XGBoost-augment, achieves the highest accuracy at 96.39%.

Fig. 9 Summary of our results

While XGBoost with data augmentation achieves the best results, MLP does nearly as well. When comparing the training times for these two models, we find that XGBoost with data augmentation takes 18 minutes to train, while MLP requires about half an hour. Both of these are very reasonable training times, but if greater efficiency during training is required, then XGBoost with data augmentation may be the better choice.

In Fig. 10 we provide a comparison of our best result to previous work. We see that our best accuracy of 96.39% offers a modest improvement over previous work in this field.

Fig. 10 Comparison to previous work

6 Conclusion and Future Work

In this paper, we tested and analyzed a wide variety of machine learning techniques for biometric authentication based on fixed-text typing characteristics. We found that XGBoost with data augmentation performed best, with MLP performing nearly as well. Our results improved upon previous research involving the same dataset.

There are many avenues available for future work. For example, model optimization and model fusion would be interesting. For model optimization, we could consider techniques from contrastive learning and self-supervised learning to see whether these approaches can improve our model.

As another example of possible future work, the robustness of various techniques can be evaluated using an algorithm known as POPQORN [19]. The idea behind POPQORN is to observe the effect of outside disturbances to the model and thereby measure its robustness.