1 Introduction

The pervasive presence of mobile devices equipped with many powerful sensors has led to new authentication mechanisms. One of them is user authentication based on keystroke dynamics, an active research topic with remarkable results in the case of computers with hardware keyboards. Keystroke dynamics is a behavioural biometric which adds a second level of security to alphanumerical passwords by modelling the users’ typing rhythms. Attempts to access the device by impostors who have illegally obtained the user’s password (through a smudge attack or shoulder surfing) can be detected because they do not type the password in the same rhythm or because they handle the mobile device differently (device holding position, touchscreen usage).

In this paper we investigate the influence of password difficulty on the authentication system’s performance. The analysis is performed on our new dataset collected using mobile devices. This allows investigation not only of the effect of password difficulty, but also of the influence of new features provided by the sensors of mobile devices.

Our work makes several contributions. One concerns the collected data, which contains the typing patterns for three types of password: easy, strong and logical strong. Because the data was collected using mobile devices, besides time-based raw data we obtained additional data from sensors such as the touchscreen and the accelerometer. We have made this data publicly available so that it can be used by other researchers. Another contribution is the proposed secondorder feature set, which is independent of the length of the password and yields equal error rates close to those obtained from the full feature set. The final contributions concern the evaluation results and the software used for the evaluation. Overall, we hope that our work will help focus attention on the opportunities provided by mobile device sensors in user identity verification.

The remainder of this paper is organised as follows. The next section (Sect. 2) presents related work with an emphasis on studies conducted on touchscreen-based mobile devices. Section 3 addresses research methods such as data collection, feature extraction and the different feature sets used in the evaluation. Section 4 offers evaluation results including two-class classifiers and anomaly detectors. The final section concludes our study and its findings.

2 Related Work

Keystroke dynamics is a well-researched area, and several survey papers have been published to date [1, 4, 9, 17]. Most of this research has been carried out on computers or older mobile devices that use hardware keyboards. Less work has been carried out on touchscreen-equipped mobile devices. However, the influence of key press pressure was studied before the touchscreen smartphone era [8, 12, 14, 16]. In these studies special pressure-sensitive hardware keyboards were built. All these studies came to the conclusion that using key pressure as an additional feature increased the performance of the keystroke dynamics authentication system.

In recent years a few studies have been conducted on touchscreen-based mobile devices [2, 3, 6, 7, 10, 19, 21]. Except for Draffin et al.’s study [7], these papers present results related to password-based authentication using keystroke dynamics. The most important aspects for the purpose of comparison are the datasets, the features, the methods and the results. Table 1 presents the characteristics of the datasets used in the aforementioned studies. It is important to note that not all studies saved the touch-related raw data in the same way. Zheng et al. [21] and Buschek et al. [6] saved pressure and size (finger area) both at the moment of touch down and at touch up. Conversely, Antal et al. [2] saved this raw data only at the moment of key press. There are several differences between the spatial raw data too. While Antal et al. saved the xy coordinates only at the key press moment, Buschek et al. saved the coordinates of the touch point both at the moment of touch down and at touch up. These differences in raw data imply different features for the analysed studies. Only Zheng et al. used raw data obtained from the accelerometer and the gyroscope sensors.

Table 1 Characteristics of keystroke datasets collected on touchscreen-based mobile devices

We have found only three papers which have studied the influence of password difficulty on the performance of keystroke dynamics systems. Bartlow and Cukic [5] conducted the first study in this direction. Besides common short passwords of 8 lowercase letters, such as computer and swimming, they used long, randomly generated 12-character passwords, the typing of which required the Shift key. Examples of such passwords include +AL4lfav8TB= and UC8gkum5WH. In almost every EER performance measurement they observed a notable increase (at least 2 %) from short to long passwords, indicating that the use of the Shift key in a password plays a significant role. In feature ranking, the Shift key related features proved to be very discriminating.

Meng et al. [18] questioned the use of keystroke dynamics as biometrics. They built a training interface which allows intruders to train themselves in imitating another person’s password typing rhythm. For this study they used two 8-character length passwords, an easy and a difficult one. They concluded that passwords that are easier to type are also easier to imitate.

Mondal et al. [15] introduced a complexity measure related to the typing of a password, after which several performance measurements were conducted. In contrast to the previous two studies, they concluded that easier passwords are a better choice for keystroke dynamics biometrics.

3 Methods

3.1 Data Collection

An Android application was designed and implemented to collect typing data for different passwords. Users had to type three different fixed passwords: easy (kicsikutyatarka), logical strong (Kktsf2!2014) and strong (.tie5Roanl). The easy password contained only lowercase letters and was formed from the first three words of a Hungarian saying. The logical strong type, which is our proposal, is based on the same Hungarian saying, but in this case we took the first letters of the words and used sf2! for sfsf (two occurrences of sf), followed by the year of data collection. The logic behind the logical strong password was explained to the subjects before the data collection experiment. The strong password was the one used in the keystroke dataset collected by Killourhy [11].

A total of 54 volunteers took part in the experiment, 5 women and 49 men, with an average age of 20.61 years (range: 19–26). At the registration stage they stated their experience with touchscreen devices as follows: 2 inexperienced, 6 beginner, 17 intermediate and 29 advanced touchscreen users. Four of them were left-handed and the others right-handed. Data was collected in three sessions one week apart. In each session the subjects typed at least 60 passwords, at least 20 of each type. At the end of data collection each user had provided at least 60 samples of each type of password (easy: 3323 samples, strong: 3303, logical strong: 3308). The data was collected using 13 identical Nexus 7 tablets. Typos were not allowed; instead, the subjects had to retype the password. Each password had to be typed in the same way: the same keys had to be typed in the same order.

Table 2 The most important raw data saved during data collection

3.2 Feature Extraction

The application implemented a custom keyboard in order to store the time, touch and accelerometer related raw data during each user’s typing. Raw data was saved at touch events initiated by the user, namely at the point of touch down and touch up. Touch down events were generated by the system when the user touched a key on the software keyboard, and touch up events at the point of key release. Table 2 shows the raw data saved during the data collection process.

Fig. 1 Data collection. Raw data: xy (coordinates); tdown, tup (timestamps); Ax, Ay, Az (directional accelerations); P (pressure); FA (finger area). Time-based features: H (hold time); UD (up-down time); DD (down-down time)

Figure 1 shows the data saved at the moment of touch down and also the time-based features that can be extracted from these data: hold time (the time between key press and release), down-down time (the time between consecutive key presses) and up-down time (the time between key release and the next key press). The Nexus 7 tablet contains an embedded accelerometer with a range from \(-2g\) to \(+2g\) that measures acceleration along three axes (the axes are device related). Its fastest sampling rate is about 50 Hz. During data collection these values were saved at the moment the user touched the screen. Using these directional accelerations we could characterise the device holding preferences of the users.
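As an illustration of the three time-based features, the following minimal Python sketch derives them from touch-down and touch-up timestamps. The function name, the list-based input layout and the millisecond values are assumptions made for this example; they are not taken from the data collection application.

```python
def time_features(t_down, t_up):
    """Return hold (H), down-down (DD) and up-down (UD) times.

    t_down[i] and t_up[i] are the touch-down and touch-up timestamps
    of the i-th key, in the same unit (milliseconds here, by assumption).
    """
    if len(t_down) != len(t_up):
        raise ValueError("each key needs one touch-down and one touch-up timestamp")
    # H: time between press and release of the same key.
    hold = [up - down for down, up in zip(t_down, t_up)]
    # DD: time between two consecutive key presses.
    down_down = [t_down[i + 1] - t_down[i] for i in range(len(t_down) - 1)]
    # UD: time between releasing one key and pressing the next.
    up_down = [t_down[i + 1] - t_up[i] for i in range(len(t_down) - 1)]
    return hold, down_down, up_down


# Hypothetical timestamps for a three-key password.
H, DD, UD = time_features([0, 150, 310], [90, 230, 420])
```

A password of n keys yields n hold times and n - 1 values each for DD and UD.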

3.3 Feature Sets

Table 3 shows the full feature sets for each type of password. Because these feature sets contain features related to each key in a password, some feature types contain a different number of features for each password. The mean hold time (MHT) feature represents the average of the key hold time values. The other mean values were computed similarly. The total distance (TD) feature was calculated as the sum of the distances (in pixels) between consecutive buttons on the virtual keyboard. Total time (TT) represents the time needed to type the password. Velocity (V) was computed as the quotient of the total distance and the total time. Before evaluation, the data was normalized into the range [0, 1].
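The text does not state which normalization was used to map the data into [0, 1]. A common choice is per-feature min-max scaling, sketched below under that assumption; the function name and the handling of constant features are likewise illustrative.

```python
def min_max_normalize(rows):
    """Scale every feature column of `rows` into the range [0, 1].

    rows is a list of equally long feature vectors (one per sample).
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column carries no information; map it to 0.0 (assumption).
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical samples with two features each.
scaled = min_max_normalize([[10, 5], [20, 5], [30, 5]])
```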

Table 3 Full feature sets for each type of password

Besides the full feature sets presented in Table 3, some evaluations were performed on a so-called secondorder feature set. This feature set contains 9 features: mean hold time, mean pressure, mean finger area, mean x acceleration, mean y acceleration, mean z acceleration, velocity, total time and total distance. The most important characteristic of this feature set is that the number of features is password-independent. All information related to this research is available at http://www.ms.sapientia.ro/~manyi/mobikey.html.
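The nine secondorder features can be sketched as follows. This is an illustration rather than the software used in the study: the record field names are hypothetical, the distance is taken between recorded touch points, and total time is assumed to run from the first touch down to the last touch up.

```python
from math import hypot
from statistics import mean


def secondorder_features(keys):
    """Build the 9-value feature set from a list of per-key records.

    Each record is a dict with the hypothetical fields
    t_down, t_up, x, y, pressure, area, ax, ay, az.
    """
    # Sum of Euclidean distances between consecutive touch points (pixels).
    total_distance = sum(
        hypot(b["x"] - a["x"], b["y"] - a["y"]) for a, b in zip(keys, keys[1:])
    )
    # Assumption: first touch down to last touch up.
    total_time = keys[-1]["t_up"] - keys[0]["t_down"]
    return {
        "mean_hold_time": mean(k["t_up"] - k["t_down"] for k in keys),
        "mean_pressure": mean(k["pressure"] for k in keys),
        "mean_finger_area": mean(k["area"] for k in keys),
        "mean_x_acceleration": mean(k["ax"] for k in keys),
        "mean_y_acceleration": mean(k["ay"] for k in keys),
        "mean_z_acceleration": mean(k["az"] for k in keys),
        "velocity": total_distance / total_time,
        "total_time": total_time,
        "total_distance": total_distance,
    }


# Hypothetical two-key sample.
sample = [
    {"t_down": 0, "t_up": 100, "x": 0, "y": 0, "pressure": 0.5,
     "area": 0.25, "ax": 1.0, "ay": 2.0, "az": 9.0},
    {"t_down": 200, "t_up": 400, "x": 3, "y": 4, "pressure": 1.0,
     "area": 0.75, "ax": 3.0, "ay": 4.0, "az": 10.0},
]
features = secondorder_features(sample)
```

Whatever the password length, the result always has nine entries, which is the property emphasised above.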

4 Evaluation and Results

Keystroke dynamics based authentication is a typical outlier detection problem. Given the keystroke data of a typed password, the system has to decide whether the data belongs to the genuine user. This problem can be formulated either as a classification or as an anomaly detection problem. In the case of classification we typically employ a two-class classification algorithm, where the positive samples belong to the genuine user and the negatives are selected from the other users. Classifiers are more powerful since they use information about the impostors (negative samples), whereas anomaly detectors can only check the deviation from the genuine user (positive samples). We should mention that in a real-world authentication system only the anomaly detection method is viable because of the lack of negative samples. However, for comparison purposes we present the evaluation of two-class classifiers too.

4.1 Two-Class Classification

In the case of two-class classification we call the data from the legitimate user positive samples and the data from impostors negative samples. As our dataset contains data from several users and each user typed the same passwords, one can easily select negative data for each user.

The general algorithm used for the two-class classification measurements is depicted in Fig. 2. First, we select positive and negative samples for a given user (userData). As negative samples we use two randomly selected samples from each of the other users. Then we repeat nRuns times the randomization of the data, each time followed by an n-fold cross-validation evaluation for the given user. These two steps are repeated for each user.
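The sample selection and the repeated cross-validation splits can be sketched in Python as follows. This is only an outline of the procedure described above, not the software used for the measurements: the function names and data layout are hypothetical, the folds are not stratified, and the classifier call itself is omitted.

```python
import random


def select_user_data(samples_by_user, user, rng, negatives_per_user=2):
    """Label the user's samples 1 and add random negatives (label 0)."""
    positives = [(x, 1) for x in samples_by_user[user]]
    negatives = []
    for other, samples in samples_by_user.items():
        if other != user:
            negatives += [(x, 0) for x in rng.sample(samples, negatives_per_user)]
    return positives + negatives


def repeated_folds(user_data, n_runs, n_folds, rng):
    """Yield (train, test) lists: n_runs shuffles, each split into n_folds folds."""
    for _ in range(n_runs):
        data = user_data[:]
        rng.shuffle(data)
        for fold in range(n_folds):
            test = data[fold::n_folds]
            train = [s for i, s in enumerate(data) if i % n_folds != fold]
            yield train, test


# Hypothetical toy dataset: 4 users with 5 samples each.
rng = random.Random(0)
toy = {u: [[u, i] for i in range(5)] for u in range(4)}
user_data = select_user_data(toy, user=0, rng=rng)
splits = list(repeated_folds(user_data, n_runs=3, n_folds=5, rng=rng))
```

Each (train, test) pair would then be passed to a classifier, and the scores of the test samples collected for the error-rate computation.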

Fig. 2 Two-class classification measurement algorithm using n-fold cross-validation

Scores for positive and negative test samples were computed so as to form two sets, one for genuine users and the other for impostors. Then a user-independent threshold was scanned through the two sets of scores, and the False Negative (FN) and False Positive (FP) rates were computed for each threshold. Plotted as error curves, these values show the system performance (see Fig. 3).
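The threshold scan can be illustrated with a short Python sketch. It assumes that a higher score means "more likely genuine" and approximates the EER at the scanned threshold where the two error rates are closest; the function names and the toy scores are illustrative only.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted as genuine."""
    frr = sum(s < threshold for s in genuine) / len(genuine)     # false negatives
    far = sum(s >= threshold for s in impostor) / len(impostor)  # false positives
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by using every observed score as a threshold."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores for one user.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```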

Fig. 3 EER computation for user 100 (Random Forests classifier, secondorder features). The EER for each user was estimated as the intersection of the FAR (False Acceptance Rate) and FRR (False Rejection Rate) curves

Besides the Random Forests algorithm we chose to evaluate the k-nearest neighbours (kNN) and Bayes Net algorithms. All classification algorithms were taken from the Weka Data Mining toolkit [20].

4.2 Anomaly Detection

For anomaly detection we used five detectors implemented in the R script provided by Killourhy and Maxion [11]: Euclidean, Manhattan, Mahalanobis, Outlier count and k-means. The script works as follows: (i) it splits the data into three equal parts, each containing 20 samples from each user (in our case each part contained the data from a single data-collection session); (ii) detectors are trained separately for each user using two-thirds of that user’s data, and evaluation is performed on the remaining third of the positive samples and on two negative samples selected from each of the other users (20 positive \(+\) \(53 \times 2\) negative samples); (iii) step (ii) is repeated three times (threefold cross-validation), and the mean EER and its standard deviation are computed.
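To make steps (i) to (iii) concrete, the following Python sketch shows a simple Manhattan-style detector together with the session-wise splitting. It is not the R script of [11]; it assumes the detector models a user by the mean feature vector of the training samples and scores a test sample by its city-block distance from that mean, with higher scores meaning more anomalous. All names are hypothetical.

```python
from statistics import mean


def train_mean_vector(training):
    """Model a user as the per-feature mean of the training samples."""
    return [mean(column) for column in zip(*training)]


def manhattan_score(model, sample):
    """Anomaly score: city-block distance between sample and mean vector."""
    return sum(abs(m - v) for m, v in zip(model, sample))


def leave_one_session_out(sessions):
    """Yield (training, test) pairs, holding out each session in turn."""
    for held_out in range(len(sessions)):
        training = [
            s for i, part in enumerate(sessions) if i != held_out for s in part
        ]
        yield training, sessions[held_out]


# Hypothetical data: three sessions with one two-feature sample each.
sessions = [[[1, 4]], [[3, 8]], [[5, 6]]]
folds = list(leave_one_session_out(sessions))
model = train_mean_vector([[1, 4], [3, 8]])
score = manhattan_score(model, [5, 2])
```

In each fold the held-out positive samples and the selected negative samples are scored against the model, and an EER is computed from the two score sets.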

Table 4 EER results for different methods and feature sets

4.3 Results

Results for classifiers and anomaly detectors are presented in Table 4. EER values were estimated for each user (see Fig. 3), then the mean and standard deviation were computed for each classifier or anomaly detector and each dataset.

We used 100 trees for the Random Forests classifier, \(k=1\) for the kNN classifier and the default Weka settings for the Bayes Net classifier. For the anomaly detectors the following settings were used: \(k=3\) clusters and at most 20 iterations for the k-means detector, and a threshold of 1.96 for the outlier count detector, which counts how many z-scores exceed the threshold [11].
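The idea behind the outlier count detector can be sketched as follows. This is an illustrative reimplementation, not the code from [11]: it assumes the sample standard deviation is used for the z-scores and that features with zero deviation are skipped.

```python
from statistics import mean, stdev


def train_outlier_count(training):
    """Store the per-feature mean and standard deviation of the training data."""
    columns = list(zip(*training))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def outlier_count_score(model, sample, threshold=1.96):
    """Count the features whose absolute z-score exceeds the threshold."""
    means, deviations = model
    count = 0
    for value, m, s in zip(sample, means, deviations):
        # Features with zero deviation are skipped (assumption).
        if s > 0 and abs(value - m) / s > threshold:
            count += 1
    return count


# Hypothetical training data: three samples with two features each.
model = train_outlier_count([[1, 10], [2, 10], [3, 10]])
far_sample = outlier_count_score(model, [5, 10])
near_sample = outlier_count_score(model, [3.5, 10])
```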

It can be seen that very low EER values were obtained by the classification algorithms, because these used negative samples for building the user’s model. However, in real systems negative samples are not available (in the enrolment stage samples are collected only from the genuine user).

Fig. 4 DET curves for the secondorder features. a Random Forests (T \(=\) 100). b Manhattan detector

For the error curve we chose the DET (Detection Error Tradeoff) curve [13], which is the most important error curve for biometric systems. Figure 4a and b show the error curves obtained for the Random Forests classifier (number of trees: 100) and the Manhattan detector.
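A DET curve plots the false rejection rate against the false acceptance rate on normal deviate (probit) scaled axes. The Python sketch below computes such points from two score sets; it assumes that higher scores indicate the genuine user, and the clipping constant eps is an illustrative choice needed because the probit is undefined at 0 and 1.

```python
from statistics import NormalDist


def det_points(genuine, impostor, eps=1e-6):
    """Return (probit(FAR), probit(FRR)) for every observed threshold."""
    probit = NormalDist().inv_cdf
    points = []
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)
        far = sum(s >= t for s in impostor) / len(impostor)
        # Clip the rates away from 0 and 1 before applying the probit.
        far = min(max(far, eps), 1 - eps)
        frr = min(max(frr, eps), 1 - eps)
        points.append((probit(far), probit(frr)))
    return points


# Hypothetical scores for one user.
points = det_points([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

The point where both coordinates are equal corresponds to the equal error rate.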

The best equal error rates were obtained by the Random Forests classifier: around 5 % for the secondorder feature set and around 3 % for the full feature set. We mention again that these classifiers use negative samples for building the user’s typing model, and such samples are not available in real systems. No significant differences were found in this evaluation between the different types of password.

In the case of anomaly detectors, where the user’s model is based only on positive samples (the case of real systems), the equal error rates are always lower for the logical strong and strong passwords than for the easy one.

5 Conclusions

Our objective in this work was to collect a dataset on mobile devices containing different types of password and to evaluate the influence of password difficulty on the performance of keystroke dynamics authentication. We provide both the datasets and the evaluation methodology to the research community. The main contribution of this paper concerns the datasets, which not only contain three types of password, but also contain raw data collected from mobile sensors. Another contribution is the secondorder feature set, which has the same number of features regardless of the password type. Measurements show that the results obtained with this novel feature set are very close to, and sometimes better than, those obtained using the full feature set. Evaluations show that in the case of anomaly detectors the lowest equal error rates are obtained for the logical strong password, followed by the strong and the easy one. This is in concordance with the results obtained by Bartlow and Cukic [5] and Meng et al. [18].