
1 Introduction

User authentication is most commonly based on user-determined secrets such as passwords or PIN codes. In contrast to these traditional systems, biometric authentication [22, 40] provides an additional layer of security. However, such approaches come at the cost of storing sensitive data like fingerprints and require dedicated hardware.

An inexpensive and readily available alternative is offered by behavioural biometrics [14]. In contrast to their biometric peers, these methods are non-intrusive and continuously analyze user behaviour for authentication during a session. Behavioural traits of users have been analysed in handwriting [9], voice [41], or keyboard and mouse dynamics [26]. The latter in particular suggests itself for user authentication in computer-based systems, since keyboard and mouse are standard equipment. While keystroke dynamics may contain sensitive personal information like passwords, mouse movements offer an implicit and non-sensitive measurement of idiosyncratic behaviour [17]. In fact, Rodden et al. [29] show that eye and mouse movement are significantly correlated and conclude that mouse movement serves as an appropriate proxy for implicit user behaviour.

Mouse movement dynamics are typically handled in a fully supervised multi-class or multi-label setup, where class labels are identified with user IDs. While being purely supervised can add to predictive accuracy and detection performance, this formulation has important limitations. Firstly, because they learn discriminative functions, such approaches are often biased towards the data of the other users present during training. Secondly, maintaining a multi-class approach in practice is close to infeasible, as every new user requires a full re-training of all models.

By contrast, we consider user authentication in an unsupervised approach, which learns a user's representations using only the data of this very user. Learning individual models of normality for every user allows us to create features that are independent of other users. The model can thus generalise better to the target user, especially in the presence of unknown users who have not been seen during training.

In this paper, we propose a novel methodology for user authentication using mouse dynamics and a deep one-class setup. Our contributions are as follows: (i) we phrase user authentication as a deep one-class machine learning problem, (ii) we show the effectiveness of a multifaceted input for extracting appropriate features from mouse data, and (iii) we visualise the individual, relevant parts of a user's mouse trajectory to gain an understanding of our approach, showing that our model indeed focuses on characteristics previously known to be important from hand-crafted features but so far unattended in unsupervised approaches.

2 Related Work

Many studies have shown that mouse dynamics can deliver insights into a user's mood [42], satisfaction and frustration [6] or mental state [13]. Consequently, mouse movement has received much attention in behavioural analyses [2, 3, 24, 38].

The identification of users based on mouse strokes was first investigated using mouse-written signatures [9]. Many algorithmic approaches to mouse movement rely on hand-crafted features [11, 14]. Such representations are often extended to increase expressiveness and/or to incorporate additional characteristics like the number or length of pauses [25]. Matthiesen et al. [25] showed that a fixed feature set does not cover every user equally well; similar conclusions were drawn in [34]. Therefore, an individual feature set is required for every single user. Neural approaches try to overcome the dependency on hand-crafted features by mapping the input data to a feature space using corresponding objectives. Using mouse dynamics to tell users apart can serve two main objectives: user identification and user authentication. Much of the previous work addresses user identification, i.e., detecting the right user within a set of all users [1, 11, 14, 21, 36], usually cast as supervised multi-class classification. In contrast, user authentication involves a binary decision, i.e., target user vs. not the target user [7, 21, 26, 34, 35, 39]. Note that the common approach here is still supervised, i.e. it uses data of both classes. Chong et al. [7] were the first to investigate the potential of deep neural networks for mouse dynamics; they investigate multiple network architectures while casting the problem as a supervised multi-label problem. Applying a similar architecture, [1] proposes a one-dimensional convolutional network (1D-CNN) for modelling temporal aspects of mouse movement, trained in a supervised one-vs-rest manner using a binary cross-entropy loss.

Multi-class approaches imply retraining the whole model when adding/deleting users and are not feasible in dynamic environments. Binary one-vs-all strategies, on the other hand, are often biased towards the seen anomalies. Thus, we argue that an unsupervised anomaly detection (AD) method is more suitable for mouse-dynamics-based user authentication in real-world applications.

In the context of AD, unsupervised one-class approaches are appealing because they find a minimum-volume summarisation of the data at hand through hyperplanes [33] or hyperspheres [37]. Neural variants [20, 30] improve on this by identifying anomalies through their dissimilarity in feature space. This is often done by including additional data, which is not part of the target concept, into the training process, for example via semi-supervised learning [16], pre-training [19, 28], reference data [27, 28] or outlier exposure (OE) [19]. We will make use of the latter in the remainder. An overview of common setups incorporating additional data into AD and their typical optimisation can be found in Fig. 1. For example, in contrast to a binary one-vs-rest strategy, the OE approach extracts rich descriptive features from the mouse trajectories instead of focussing on increasing the distance between the two entities (normal and anomalous data). Note that the concept of OE in AD appears widely in the literature under various terms, such as reference dataset [27, 28] or auxiliary/OE dataset [19, 31]. This auxiliary dataset enables the anomaly detector to generalise better to unseen data.

Fig. 1. Setups of AD, from using only the target class to incorporating an additional dataset.

3 Representing Mouse Trajectories

Formally, a mouse trajectory is given by a sequence of spatial (x, y) coordinates ordered in time \(\tau \). In addition to the spatio-temporal information, mouse data contains events \(e \in \{\emptyset ,c_{L}, c_{R}, c_{M}, s_{\text {up}}, s_{\text {down}}\}\), with left (\(c_{L}\)), right (\(c_{R}\)) or center (\(c_{M}\)) clicks, up (\(s_{\text {up}}\)) and down (\(s_{\text {down}}\)) scrolls, or \(\emptyset \) in case there is no event. We represent a mouse trajectory as a sequence \( \boldsymbol{x}= \langle (\tau _1, x_1,y_1, e_1), \ldots , (\tau _T, x_T,y_T, e_T)\rangle . \)
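As an illustration, a trajectory in this form can be held in code as an ordered sequence of (timestamp, coordinates, event) records. The following minimal Python sketch is ours; the class and field names are illustrative and not part of any dataset format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MouseEvent(Enum):
    NONE = "none"            # no event, plain movement
    CLICK_LEFT = "c_L"
    CLICK_RIGHT = "c_R"
    CLICK_MIDDLE = "c_M"
    SCROLL_UP = "s_up"
    SCROLL_DOWN = "s_down"


@dataclass
class MouseSample:
    tau: float               # timestamp (seconds)
    x: float                 # horizontal coordinate
    y: float                 # vertical coordinate
    event: MouseEvent = MouseEvent.NONE


# A trajectory is an ordered sequence of samples.
Trajectory = List[MouseSample]
```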

Mouse data records consist of movements for interacting with the application as well as of idiosyncratic movements. In the following, we aim at devising a representation that is as independent as possible of the actual user interface (UI), capturing how a user moves the pointer to a certain location rather than where exactly an action has been performed. While purely velocity-based representations are translation invariant, they also render certain patterns almost undetectable (e.g. loops).

3.1 Image-Based Tensor Representations

We propose to represent different views of mouse trajectories (e.g., trajectory, speed, click, pause) as an image, which allows access to the shape of the motion as well as to characteristic patterns like loops or hesitation [4, 7]; see Fig. 2 for examples. A convolutional neural network (CNN) detects edges very well and is therefore well suited for such shapes. For the trajectory view, sub-sequences of the trajectory are re-scaled, plotted and saved as images. We adapt the size of the plot to the range of the respective trajectory to ensure no bias from the positioning on the screen. Later, these images are transformed into a multi-dimensional tensor that serves as input to the network. To maintain the temporal information, we encode the speed of the movement with a colour interval, where the colour is determined by the actual speed of movement \(s_t\) at that position. To encode the speed value, we test two different normalisation approaches and report on the corresponding experiments in Sect. 6.1. Both are based on the speed \(s_t = \frac{d_{t}}{\tau _t}\) of the movement, where \(d_t\) is the Euclidean distance, but are normalised (i) by the average speed, \( s_t^{(avg)} = \frac{s_t}{\frac{1}{T}\sum _{t=1}^{T}\frac{d_t}{\tau _t}}, \) and (ii) with a log-variant, \( s^{(log)}_t = \frac{\tilde{s}_t - \tilde{s}_{\min }}{\tilde{s}_{\max } - \tilde{s}_{\min }} \) with \(\tilde{s}_t= \log (1+s_t)\), where \(\tilde{s}_{\max }=\max _t \log (1+s_t)\) and \(\tilde{s}_{\min }\) analogously. The click view simply contains indicators at click positions, visualised by black crosses in the figure. The pause view contains the length of the pauses at the observed positions, visualised by circles whose radii correspond to the length of the pause. We scale every pause so that the radius of the smallest pause starts at a fixed radius. The different layers are aggregated into a multi-dimensional tensor, using one channel each, and serve as input to the model. We experiment with several combinations of input information and report the results in Table 1 (left).
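To make the two normalisations concrete, the following sketch computes them from raw timestamps and coordinates, assuming \(d_t\) and \(\tau_t\) denote the per-step Euclidean distance and time delta; the eps guards against division by zero are our addition.

```python
import numpy as np


def speeds(tau, x, y, eps=1e-8):
    """Per-step speed s_t = d_t / tau_t from consecutive coordinates."""
    d = np.hypot(np.diff(x), np.diff(y))   # Euclidean step distances
    dt = np.diff(tau)                      # time deltas
    return d / np.maximum(dt, eps)


def normalise_avg(s, eps=1e-8):
    """Variant (i): speed divided by the trajectory's average speed."""
    return s / max(s.mean(), eps)


def normalise_log(s, eps=1e-8):
    """Variant (ii): min-max normalisation of log(1 + s) into [0, 1]."""
    s_log = np.log1p(s)
    rng = s_log.max() - s_log.min()
    return (s_log - s_log.min()) / max(rng, eps)
```

Either normalised value can then be mapped onto a colour interval when plotting the trajectory view.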

Fig. 2. Views of mouse movement with different splitting criteria.

3.2 Splitting Sessions

Each overall sequence in our data covers a whole session of a user. We therefore divide the total session into sub-sequences. The length of these sub-sequences, and hence the content of the resulting images, is another aspect of representing mouse trajectory data. Since it is not obvious how to split a long user session into smaller, meaningful pieces, we study three different splitting criteria in the empirical evaluation in Sect. 6.1 and describe them in the following.

Time Difference Split (TD). TD [7] splits a sequence when the time difference between two consecutive mouse operations (movement or click) exceeds a predefined threshold \(\rho \in \{1s, 60s\}\). Since this may result in very short sub-trajectories, we only split if the resulting sub-sequences contain at least 100 data points.

Equal Length Split (EL). EL [25] splits the data into sub-sequences of the same length, \(\omega \in \{200, 1000\}\), irrespective of occurring events or movements. The last sequence is naturally shorter and usually discarded. In contrast to the TD method, the resulting sequences have the same length. Note that an identical number of data points does not result in the same number of coloured pixels in the generated image.

Equal Time Split (ET). There are two main ways of recording trajectories: (i) recording the position at equal time stamps, so that a static position results in duplicate coordinates, or (ii) recording on movements, meaning the intervals between consecutive points are not the same. Since the latter is the case for the Balabit data, we introduce an additional method, the Equal Time Split (ET). It is the temporal analogue of the previous splitting criterion and splits the trajectory after a fixed amount of time. We experiment with the thresholds \(\upsilon \in \{10s, 120s\}\). Since the mouse data used here is not recorded at equal time stamps but rather on movements, this splitting method does not result in equally sized sub-sequences. Although this extension is straightforward, there does not seem to be related work on this method.
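The three criteria can be sketched as functions that return index ranges into the recorded sequence of timestamps. The function names, the filtering of short TD pieces (rather than suppressing the split), and the default thresholds are our simplifications.

```python
import numpy as np


def split_time_difference(tau, rho=60.0, min_len=100):
    """TD: cut where the gap between consecutive operations exceeds rho
    seconds; keep only pieces with at least min_len data points."""
    cuts = np.where(np.diff(tau) > rho)[0] + 1
    pieces = np.split(np.arange(len(tau)), cuts)
    return [p for p in pieces if len(p) >= min_len]


def split_equal_length(tau, omega=1000):
    """EL: fixed number of data points per sub-sequence; drop the shorter tail."""
    n_full = len(tau) // omega
    return [np.arange(i * omega, (i + 1) * omega) for i in range(n_full)]


def split_equal_time(tau, upsilon=120.0):
    """ET: cut after a fixed amount of elapsed time (seconds)."""
    bins = ((tau - tau[0]) // upsilon).astype(int)
    return [np.where(bins == b)[0] for b in np.unique(bins)]
```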

Fig. 3. Depiction of (a) the network structure and (b) the training and testing process.

4 Deep Anomaly Detection and Outlier Exposure

The main goal in an AD task is to obtain a feature map \(\phi \) that separates the normal data from the outliers either linearly [33] or spherically [37]. Following the latter approach, a neural variant of the so-called Support Vector Data Description (SVDD) maps the data into a feature space \(\mathcal {F}\) and finds the minimal enclosing sphere with center c and radius \(R >0\) that contains the majority of points. Similar to [28], we derive two important characteristics that the features in our one-class setup should exhibit.

(i) Compactness. Following the cluster assumption [32], we want feature representations extracted from the same class to lie compactly within an enclosing hypersphere in feature space. Similar to the SVDD, minimising the radius R of the hypersphere reduces its volume and yields a more compact representation. However, if no further constraints are included, this directly results in hypersphere collapse [8], where all data is mapped to the same point.

(ii) Descriptiveness. A feature map that gives rise to compact representations may not be rich enough to distinguish other users. We thus aim to devise a rich feature representation that is general enough to not only summarize individual user data well but, at the same time, allows us to identify other users by their unique traits in moving the mouse. Thus, we need descriptiveness but cannot give up on compactness either (cf. [27]). Producing descriptive features is likewise a desired characteristic in multi-class classification, where such features ensure a large inter-class distance.

4.1 Outlier Exposure

The idea of OE originates in the observation that, when learning a target concept, myriads of labelled examples exist that live in the same space but are known not to match the target concept [19]. While this insight borders on triviality, it is particularly powerful in unsupervised learning tasks like one-class and density estimation problems. Instead of only feeding observations of the desired target concept, additional data from possibly very different origins and sources is made available to the learner, which now faces contrastive tasks: ultimately, the goal is to provide a minimal description of the desired target concept, but additionally there is a classification problem that needs to be solved simultaneously using only the auxiliary data. The goal of the learning process is to identify a set of features that not only accounts for a minimal description of the target concept but also induces high predictive accuracy on the auxiliary data. Note that OE-based approaches are generally classified as unsupervised methods in the literature [19, 27, 28, 31], since the standard approach uses only one class of data (the normal data) from the target dataset during training.

5 Authentication of Users

As the underlying architecture for our model, we consider the AlexNet CNN architecture [23]. We modify the network and separate it into two parts, \(\phi \) and \(\psi \), for feature extraction and classification, respectively. The network architecture is depicted in Fig. 3a. The feature extraction component of the network, \(\phi :\mathcal {X}\rightarrow \mathbb {R}^p\), acts on both sources, the target class \(D_u\) and the OE data \(D_{OE}\), to render learning compact as well as descriptive representations feasible. While the auxiliary data then branches into a standard feed-forward classification component \(\psi : \mathbb {R}^p \rightarrow \mathbb {R}^{|\mathcal {Y}|}\) with a final softmax layer for descriptiveness, the compactness of the user data is evaluated by a variance-based criterion. The task in the optimisation is now to find an appropriate feature extraction \(\phi \) such that the classification error and the variance of the user data are small, while simultaneously ensuring a compact representation of the features for \(D_u\). This is achieved by deploying dedicated loss functions for controlling compactness and descriptiveness, respectively, and minimising the two losses simultaneously (cf. [28]). The loss controlling compactness measures the squared intra-batch distance

$$\begin{aligned} E_C = \frac{1}{N}\sum _{n=1}^{N} (\phi (\boldsymbol{x}_n; \theta )- \bar{\boldsymbol{x}}_{\lnot n}) ^2, \end{aligned}$$
(1)

with mean \(\bar{\boldsymbol{x}}_{\lnot n} = \frac{1}{N-1} \sum _{j \ne n} \phi (\boldsymbol{x}_j; \theta )\) of the leave-one-out set \(D_u\setminus \{\boldsymbol{x}_n\}\). The descriptiveness loss is the cross entropy over all involved classes \(\mathcal {Y}\):

$$\begin{aligned} E_D = -\frac{1}{M}\sum _{m=1}^M\sum _{\bar{y}\in \mathcal {Y}} \delta _{\bar{y},y_{N+m}} \, \log \left( \psi (\phi (\boldsymbol{x}_{N+m}; \theta ))_{\bar{y}}\right) , \end{aligned}$$
(2)

where \(\delta \) is the Kronecker delta. In addition to the user data \(D_u\), we use an auxiliary and labelled M-sample \( D_{OE}=\{(\boldsymbol{x}_{N+1},y_{N+1}),\ldots ,(\boldsymbol{x}_{N+M},y_{N+M})\} \) with \(\boldsymbol{x}_{N+m}\in \mathcal {X}\) and \(y_{N+m}\in \mathcal {Y}\) for \(1\le m \le M\), where \(\mathcal {Y}\) denotes the set of (arbitrary) class labels of the auxiliary data. Recall that both the user observations \(\boldsymbol{x}^{(u)}_n\) and the auxiliary data \(\boldsymbol{x}_{N+m}\) live in the same space \(\mathcal {X}\) for all n, m. The joint objective function for the entire architecture is given by aggregating Eqs. (1) and (2): we minimise \( \,\,E_C(D_u) + \lambda E_D(D_{OE}), \) where \(\lambda >0\) is a balancing term. In summary, we propose a neural architecture that combines learning a compact representation of the target data \(D_u\) with a descriptive feature space on target and auxiliary data.
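A possible PyTorch sketch of the two losses and their combination is given below, assuming phi and psi denote the feature extractor and the classification head; the leave-one-out mean of Eq. (1) is computed in closed form from the batch mean.

```python
import torch
import torch.nn.functional as F


def compactness_loss(features):
    """E_C: mean squared distance of each embedding to the mean of the
    remaining embeddings in the batch (leave-one-out mean)."""
    n = features.size(0)
    batch_mean = features.mean(dim=0, keepdim=True)
    loo_mean = (n * batch_mean - features) / (n - 1)   # requires n > 1
    return ((features - loo_mean) ** 2).sum(dim=1).mean()


def joint_loss(phi, psi, x_user, x_oe, y_oe, lam=1.0):
    """E_C(D_u) + lambda * E_D(D_OE); cross_entropy applies the softmax."""
    e_c = compactness_loss(phi(x_user))
    e_d = F.cross_entropy(psi(phi(x_oe)), y_oe)
    return e_c + lam * e_d
```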

6 Empirical Results

We evaluate on the Balabit Mouse Dynamics Challenge [12] data for sampling \(D_u\) and incorporate instances of the Wolf of SUTD (TWOS) [18] data as \(D_{OE}\). Balabit contains mouse movements of 10 users over 65 sessions, with between 13,640 and 83,091 data points each, recorded during a set of unspecified but common administrative tasks. Since the screen resolution is not given, we normalize the trajectories based on the maximum coordinates. The TWOS data is the outcome of a gamified competition among competing companies over five days. It consists of 320 h of activity of 24 users and comprises mouse, keyboard and other actions and logs; we use only the legitimate mouse movements in our experiments.

Setup. The two parts of the network, \(\phi \) and \(\psi \), are trained jointly with samples from both \(D_u\) and \(D_{OE}\). For each user u, we train an individual model using only data from that user, \(D_{u_{\text {train}}}\) (Balabit), plus a sample from the 24 TWOS users as additional OE data \(D_{OE}\). More formally, let the size of a batch be n. Then, for every \(i\)-th sample \(x_i^{D_{u}} \in \mathbb {R}^k\), where \(1 \le i \le n\), we calculate the distance between the network's output and the mean output over the rest of the batch. For every \(i\)-th sample \(x_i^{D_{OE}} \in \mathbb {R}^k\) with label \(y_i\), where \(1 \le i \le n\), we calculate a loss for each class label and sum the results. Hyperparameters are found via grid search and given by \(\lambda = 1.0\), 300 epochs and a learning rate of \(\eta = 0.0001\) on balanced batches containing 100 user and 100 OE samples. We observe that a rather low learning rate results in better performance since it prevents overfitting on the OE data while still assuring convergence of the compactness loss. At test time, we use independent data of the target user, \(D_{u_{\text {test}}}\), and of the nine remaining Balabit users, similar to [15, 20, 28, 30]. Note that the model has never seen the other Balabit users during training. In this way, we ensure that the model can distinguish the target user even from unseen users. This allows scalability and does not require retraining of the model when new users are added.
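A schematic per-user training loop under these settings could look as follows; the choice of optimiser (Adam) and the loader interfaces are assumptions on our part, since they are not prescribed above.

```python
import torch


def train_user_model(phi, psi, user_loader, oe_loader,
                     epochs=300, lr=1e-4, lam=1.0):
    """Train one model per user: user_loader yields batches of user images,
    oe_loader yields (image, label) batches of OE data of the same size."""
    params = list(phi.parameters()) + list(psi.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x_u, (x_oe, y_oe) in zip(user_loader, oe_loader):
            opt.zero_grad()
            loss = joint_loss(phi, psi, x_u, x_oe, y_oe, lam=lam)
            loss.backward()
            opt.step()
    return phi, psi
```

At test time, a natural anomaly score (our assumption, as the scoring rule is not spelled out here) is the distance of \(\phi(x)\) to the mean of the target user's training embeddings.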

Table 1. Results of the experiments on different views (left) and splitting criteria (right). * Results are averaged over 9 users, since some images of user 07 resulted in numerical issues for those representations (see further details in Sect. 8).

6.1 Results

We first run preliminary experiments to determine the best input representation and splitting strategy as described in Sect. 3. Using the resulting best-performing representation of mouse trajectories, we train the presented model in two different setups: (i) to show the influence of the OE data, we first train the model without the additional data; (ii) building on that, we show the improvement in performance achieved through the use of OE data. We report average areas under the ROC curve (AUC) and equal error rates (EER) over five repetitions.

Results for Optimal Representation. Table 1 (left) shows the results for the different views (layers of the input tensor) presented in Sect. 3. The results show that additionally including the pause and click views in the tensor leads to higher detection rates, and that the log normalisation performs slightly better than the average normalisation. Table 1 (right) shows the results for the different splitting criteria, also presented in Sect. 3. Firstly, the table nicely shows that the heuristics lead to considerably different numbers of training instances; recall, however, that fewer instances contain longer parts of the respective user sessions. While most splitting methods are only slightly better than random guessing, the EL split performs notably better: with \(\omega = 1000\), it decreases the EER by almost half.

Results for User Authentication. We now use the best-performing representation to compare against related work. As baselines, we use the deepSVDD proposed in [30] as well as the user authentication approach using the features proposed for Balabit in [25]. To show the influence of OE data, we also compare our approach to a variant that does not leverage OE data; to prevent hypersphere collapse in this variant, we incorporate an additional regularizer into Eq. (1), similar to [37]. To additionally show the benefit of our representation over state-of-the-art hand-crafted features, we train one deepSVDD on the features taken from [25] and another on our tensor representation.

The results are shown in Table 2. Interestingly, the baselines on features and tensors leave a mixed picture in terms of AUC (right part of the table). For users 07, 09, and 20, hand-crafted features outperform the tensor-based deepSVDD as well as the proposed approach. For the other users, the image-based representation is favourable, often by a large margin, as seen for users 12, 23, or 35, which is also reflected in a slightly better average AUC over all users. Even without including OE data, our proposed approach performs on par with or improves over the stronger baselines, already constituting an improvement in AUC by a factor of 1.2; note that in this case, a regularizer has to be added to Eq. (1) to avoid a collapse of the hypersphere. Including OE improves these results further and raises the improvement in AUC to a factor of 1.5.

Table 2. Detection performance per class. Results are averaged over 5 random seeds.

7 Visualisation of Important Information in the Mouse Dynamics

Since using all three views together with the splitting method EL1000 results in the best authentication performance, we now investigate which parts of the input lead to compact and descriptive features for each user. To achieve this, we utilise layer-wise relevance propagation (LRP) [5], which highlights the input features that were decisive for the network's decision. The relevance R of every neuron is computed as \(R^{(l)}_i = \sum _j{\frac{a_{i} w_{ij}}{\sum _{i\prime }{a_{i\prime } w_{i\prime j}}} R^{(l+1)}_j}\), where \(R^{(l)}_i\) and \(R^{(l+1)}_j\) denote the relevance scores of neurons i and j in layers l and \(l+1\), respectively. The activation of neuron i is denoted \(a_i\) and the weight connecting neurons i and j is \(w_{ij}\). The LRP heatmap is obtained by applying this rule to all layers. In addition, we implement the \(z^+\)-rule and a relevance filter as suggested in [10], setting the threshold value for the filter to \(k= 0.05\).
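For a single fully connected layer, the propagation rule (including the \(z^+\) variant) can be sketched in NumPy as follows; the stabiliser eps and the assumed array shapes are ours.

```python
import numpy as np


def lrp_linear(a, w, relevance_upper, z_plus=True, eps=1e-9):
    """Redistribute relevance from layer l+1 back to layer l.

    a:                activations of layer l, shape (I,)
    w:                weights from layer l to l+1, shape (I, J)
    relevance_upper:  relevance of layer l+1, shape (J,)
    """
    if z_plus:
        w = np.maximum(w, 0.0)        # z+ rule: keep positive contributions only
    z = a[:, None] * w                # contributions a_i * w_ij, shape (I, J)
    denom = z.sum(axis=0) + eps       # sum_i' a_i' w_i'j for each upper neuron j
    return (z / denom * relevance_upper).sum(axis=1)
```

Applying this rule layer by layer, down to the input tensor, yields the relevance heatmaps shown in Fig. 4.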

The results are shown in Fig. 4. Note that the trajectory is reconstructed for better legibility: the images used as input carry one channel per view, whose different shadings are hard to distinguish for the human eye. The trajectory of user 07 in example (a) clearly shows the advantage of incorporating pauses into the input image: the pauses receive a much higher relevance score than the clicks (black crosses). Locally overlapping occurrences of pauses and clicks are likewise relevant. In contrast, the clicks are much more relevant for user 20, as can be seen in example (a). When no clicks are made, the pauses become more relevant. In [7], only the plotted trajectory was used as input for the CNN, and the edges were shown to be the relevant element for the network's decision process. Added pauses and clicks carry even more relevant information for user authentication and should not be left out of image-based deep learning approaches.

Fig. 4. Two examples per user of trajectory images and their LRP visualisations using the splitting criterion EL1000. We refer to the left example as (a) and to the right as (b).

8 Discussion and Limitations

In this study, we cast user authentication based on mouse dynamics as a one-class problem. Multiple views of the trajectories are used as input to a CNN that extracts features under the objectives of compactness and descriptiveness. Related work using deep neural networks for mouse trajectory data views the problem as a purely supervised task and often relies on information that is not always available, such as the screen resolution [1, 7]. We remove this implicit dependency on the screen to avoid identifying users based on their personal preferences or hardware, but still report state-of-the-art results.

In our setup, we reached the best performance by weighting both losses equally, setting \(\lambda = 1.0\), although we did not detect a large difference in performance for other weightings. With the EL1000 split, the trajectories of users 07, 09 and 20 cover much shorter (pixel-wise) distances than those of the remaining users. Interestingly, these are exactly the users for which the hand-crafted features performed well. However, our results are in line with [25] and show that even shorter sequences for the remaining users did not enhance performance. Mouse trajectories are not invariant under transformations such as mirroring or rotation: while some movements, like patterns of confidence (e.g. straight and direct movements), can still be detected in mirrored or rotated images, other mouse movement motifs lose their idiosyncratic character under such transformations. Therefore, we did not include additional data augmentation (e.g. mirroring or rotation) to generate more data.

In contrast to our setup, a binary one-vs-rest strategy as used in [1] assumes that the “rest” classes (i.e. the anomalous samples) are representative of all other occurring anomalies. Often the same classes are used in training and testing, resulting in high accuracy but introducing a selection bias. Using an auxiliary dataset from the same domain as the target dataset has proven beneficial for increasing authentication performance on mouse trajectories. Since the auxiliary dataset is only used for training but not for testing, the model performs well even when tested against trajectories of unseen users. In comparison to previous methods [25], the CNN-based model overcomes the dependency on hand-crafted features while learning to extract an individual feature set for every user.

For the presented approach, we compare different setups and views. To ensure a fair comparison, we left the structure of the underlying model untouched. However, when using the [\(s_{\text {avg}}\), pause] view or the [\(s_{\text {log}}\), pause] view, some images from user 07 caused numerical instabilities. There is no obvious visual difference between the data of user 07 and that of other, similar users. We excluded the models with these setups from further analysis and recommend using other combinations of trajectory representations with this particular model.

9 Conclusion and Future Work

In this paper, we proposed an unsupervised learning approach for user authentication that uses only the data of one user for training. We showed that incorporating additional data can enhance the model's performance such that a distinction even from unknown users, who were never seen during training, becomes possible. Furthermore, visualising the important parts of the mouse trajectory for individual users provides a deeper understanding of mouse cursor movements. Future research efforts should be directed towards improving the discovery of mouse cursor motifs for individual users and their interplay with pauses. We thank web-netz GmbH for funding this research and all former reviewers for their valuable feedback.