1 Introduction

Handwritten signatures have a long tradition of use in common verification tasks, such as financial transactions and document authentication. They are easy to use and well accepted by the general public, and signatures can be captured with relatively cheap devices. These are important advantages of signature recognition over other biometrics. Yet, signature recognition also has some drawbacks: it is a difficult pattern recognition problem because of the large variability between signatures made by the same person. These variations may originate from instability, emotions, environmental changes, etc., and are person dependent. In addition, signatures can be forged more easily than other biometric traits.

The signature recognition task can be split into two categories depending on the data acquisition method:

  • Off-line (static): the signature is scanned from a document and the system recognizes it by analyzing its shape.

  • On-line (dynamic): the signature is acquired in real time by a digitizing tablet and the system analyzes both the shape and the dynamics of writing, using, for example, the position along the x and y axes, the pressure applied by the pen, etc.

Using dynamic data, further information can be extracted, such as acceleration, velocity, curvature radius, etc. [22]. In this paper, we focus on the on-line (dynamic) signature recognition task.

For a signature verification system, depending on testing conditions and environment, three types of forgeries can be established [22]:

  • Simple forgery, where the forger does not attempt to simulate or trace a genuine signature.

  • Substitution or random forgery, where the forger uses his/her own signature as a forgery.

  • Freehand or skilled forgery, where the forger practices imitating the static and dynamic information of the signature to be forged as closely as possible.

From the point of view of security, the last one is the most damaging and, for this reason, some databases suitable for system development include some trained forgeries [21, 25].

The remaining sections of this paper will be devoted to the task of dynamic signature verification, also known as authentication. This paper is organized as follows: Sect. 2 is focused on template matching dynamic signature recognition, and proposes a signature recognition methodology based on vector quantization (VQ). This approach is compared with the state-of-the-art algorithm for on-line signature recognition, which is Dynamic Time Warping (DTW). Section 3 provides experimental results, and Sect. 4 summarizes the conclusions of this work.

2 Signature recognition based on multi-section VQ

In this section, we present a multi-section vector quantization algorithm for on-line signature recognition. This method is an improved version of the classical vector quantization approach that we proposed in [8], and can also be interpreted as a variant of split-VQ [10].

It has been recently found that the Dynamic Time Warping (DTW) algorithm outperforms hidden Markov models (HMMs) for signature verification [14]. For this reason, we will use DTW as the baseline algorithm for performance comparison.

Appendix 1 describes the DTW algorithm. In our case, we use five signatures per person, acquired during enrollment. DTW computes the distance between the test vector sequence and each of the five training sequences.

Distance computation implies a warping by means of dynamic programming. We compute five distances, one for each comparison between the test sequence and one of the five training repetitions. These five distances are combined using three different approaches, min{}, mean{}, and median{}, to obtain the final distance. A more detailed explanation of the VQ and DTW algorithms can be found in [8].
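As an illustration only, the following Python sketch shows how the five reference distances could be fused; `dtw_distance` is a placeholder for a DTW implementation as described in Appendix 1, and all names are hypothetical rather than taken from the original system.

```python
import numpy as np

def dtw_verification_score(test_seq, reference_seqs, dtw_distance, rule="min"):
    """Fuse the K DTW distances between a test signature and the K
    enrollment signatures of the claimed user into a single score.

    test_seq       : (I, D) array of feature vectors of the test signature
    reference_seqs : list of K arrays of shape (J_k, D), one per enrollment signature
    dtw_distance   : callable computing the DTW alignment cost (see Appendix 1)
    rule           : 'min', 'mean' or 'median' fusion of the K distances
    """
    distances = np.array([dtw_distance(test_seq, ref) for ref in reference_seqs])
    if rule == "min":
        return distances.min()
    if rule == "mean":
        return distances.mean()
    if rule == "median":
        return np.median(distances)
    raise ValueError(f"unknown fusion rule: {rule}")
```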

Template matching approaches are especially appropriate when a small number of samples are available for training a model. This is the case with on-line signature recognition. However, VQ presents one drawback: it is unable to model the temporal evolution of the signature (it averages all the vectors), which is certainly important for signature recognition. For this reason, we propose a multi-section codebook approach. This proposal is described in detail in Sect. 2.1.

2.1 Proposed algorithm based on multi-section codebook approach

DTW offers one advantage over VQ: it takes into account the temporal evolution of the signature. However, a simple model called the multi-section codebook [1, 2] was proposed in the mid-1980s for speech and speaker recognition. Although this approach was later discarded because of the higher accuracy of HMMs, we should take into account that signature recognition differs from speech/speaker recognition, because the length of the training set is rather short and it is hard in this situation to estimate an accurate statistical model. This observation is well known in the field of speaker recognition, where higher recognition rates using VQ compared with HMMs have been reported for short training/testing sets.

The multi-section codebook approach consists of splitting the training samples into several sections. For example, Fig. 1 represents a three-section approach, where each signature is split into three equal length parts (initial, middle, and final sections). In this case, three codebooks must be generated for each user, each codebook being adapted to one portion of the signature. Each branch works in a similar fashion to the VQ approach, and the final decision is taken by combining individual contributions of each section by simple averaging.

Fig. 1 A multi-section codebook approach for signature verification based on three sections

If the database contains P users and the splitter provides S sections, we will have S codebooks for each person. Thus, we will have one codebook per person and section, named \( CB_{p,s} \) for p = 1, …, P and s = 1, …, S.

This proposal is a generalization of the VQ approach, which can be seen as a multi-section approach with just a single section. The multi-section system will be operated as described next.

2.1.1 User model computation

For each person, we split each signature into S sections. For each section, we concatenate the feature vectors of that section from each of the five training signatures, obtaining S training sequences per person. The LBG algorithm is then run in the classical way to obtain one codebook per person and section, so each person is modeled by a set of S codebooks. Figure 2 schematically represents the process of splitting and generating the training sequences for a given user p. It is assumed that five signatures of the same person are used for training the user model.

Fig. 2 Schematic representation of the procedure followed to obtain the training sequences and the codebooks for a given user p
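The enrollment step just described can be sketched as follows. This is a minimal illustration, not the authors' code: it uses scikit-learn's k-means as a stand-in for the LBG training, and all function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans  # used here as a stand-in for the LBG algorithm

def split_into_sections(signature, S):
    """Split a (T, D) feature sequence into S sections of (almost) equal length."""
    return np.array_split(signature, S)

def train_user_codebooks(training_signatures, S=3, bits=4, seed=0):
    """Build one codebook per section for a single user.

    training_signatures : list of (T_k, D) arrays (the five enrollment signatures)
    Returns a list of S codebooks, each an (L, D) array with L = 2**bits vectors.
    """
    L = 2 ** bits
    codebooks = []
    for s in range(S):
        # Concatenate the s-th section of every enrollment signature
        section_vectors = np.vstack(
            [split_into_sections(sig, S)[s] for sig in training_signatures]
        )
        km = KMeans(n_clusters=L, n_init=10, random_state=seed).fit(section_vectors)
        codebooks.append(km.cluster_centers_)
    return codebooks
```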

Our strategy is similar to a constrained vector quantization approach named split-VQ (also known as partitioned VQ) [10]. The simplest and most direct way to reduce the search and storage complexity when coding a high-dimensional vector is to partition the vector into two or more sub-vectors. However, rather than splitting vectors, we split the training and testing sequences into sections, whose vectors keep the original dimension.

2.1.2 User recognition

The quantization distortion for a given test signature and person p is obtained by combining the distortions computed for each section. This can be achieved by a generalization of the procedure described in [8]: \( \mathrm{dist}_{p} = \mathrm{combination}(d_{1}, \ldots, d_{S}) \). The most straightforward combination is the mean{} function; however, the experimental results revealed that the min{} function outperforms simple averaging.

The individual \( d_{s} \) values, for s = 1, …, S and person p, are obtained by applying the equation

$$ d_{s} = \sum_{i=1}^{I'} \mathrm{NNER}\left(\vec{x}_{i},\, CB_{p,s}\right) = \sum_{i=1}^{I'} \min_{\vec{y}_{j} \in CB_{p,s}} \left\{ d\left(\vec{x}_{i}, \vec{y}_{j}\right) \right\} $$

where NNER is the Nearest Neighbor Encoding Rule [10]. It is interesting to point out that \( I' \cong I/S \), because we split the whole signature into S sections of equal length.
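Continuing the sketch above (hypothetical helper names; the Euclidean distance is assumed for d(·,·)), the per-section distortions \( d_{s} \) and their min{} combination could be computed as:

```python
import numpy as np

def section_distortion(section_vectors, codebook):
    """d_s: sum over the section's I' vectors of the nearest-neighbour
    (NNER) distance to the user's codebook CB_{p,s}."""
    diffs = section_vectors[:, None, :] - codebook[None, :, :]   # shape (I', L, D)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))                   # shape (I', L)
    return dists.min(axis=1).sum()

def multisection_distance(test_signature, codebooks, rule="min"):
    """dist_p: combine the S per-section distortions for user p."""
    S = len(codebooks)
    sections = np.array_split(test_signature, S)
    d = np.array([section_distortion(sec, cb) for sec, cb in zip(sections, codebooks)])
    return d.min() if rule == "min" else d.mean()
```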

2.2 Computational requirements comparison

In this section, we will compare the computational burden of the state-of-the-art DTW algorithm and the proposed multi-section codebook approach. We will use the following nomenclature:

  • J is the average single-signature reference template length.

  • I is the candidate’s signature length.

  • K is the number of reference templates per user.

  • L is the number of vectors inside the codebook for the VQ approach.

  • S is the number of sections in the multi-section VQ approach.

In our experiments, we have set K = 5, and in our database (MCYT) the average length per signature is J = 454 vectors.

It is interesting to observe that, due to the vector quantization of the K reference templates per user, the number of reference vectors has been significantly reduced for the VQ algorithm, because all the reference signatures have been clustered together in a single codebook per user. For a codebook of 4–7 bits, we get L = 16, 32, 64, and 128 vectors, respectively, while the original average number of vectors per signature is 454. In addition, for DTW, the procedure must be executed for each reference signature per user (in our experimental data we have used K = 5). Thus, even for a 7-bit codebook, the VQ approach requires dealing with approximately 18 times (5 × 454/128) less data.

Dynamic time warping requires the computation of KIJ distance measures. However, the search region can be restricted to a parallelogram with slopes 1/2 and 2. Searching over this parallelogram requires about \( O(KIJ/3) \) distance measures to be computed and the DP equation (1) (see Appendix 1) to be applied about \( O(KIJ/3) \) times. This latter figure is often referred to as the “number of DP searches” [3].

VQ requires the computation of O(IL) distance measures. It is interesting to observe that the number of computations is the same for VQ and multi-section VQ, because the only difference between them is the change of codebook depending on which section a given vector belongs to. Taking into account that each DTW distance computation requires the computation of at least three distances between vectors, we can establish that VQ is approximately 47 times faster than DTW (for a codebook of 4 bits).
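One way to reproduce this order of magnitude from the figures above (K = 5, J = 454, and L = 16 for a 4-bit codebook) is to compare the distance counts directly:

$$ \frac{KIJ/3}{IL} = \frac{KJ}{3L} = \frac{5 \times 454}{3 \times 16} \approx 47 $$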

In terms of database storage requirements, DTW implies storing the whole set of reference signatures, that is, KJ vectors per user. VQ requires only L vectors per user, where L is the number of vectors inside the codebook; for the multi-section VQ approach, this figure is multiplied by the number of sections S (one codebook per section).
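For the configuration used later in the experiments (K = 5, J = 454, S = 3, 4-bit codebooks, so L = 16), and assuming each of the S codebooks keeps L vectors, the stored reference data per user amounts to:

$$ KJ = 5 \times 454 = 2270 \ \text{vectors (DTW)} \qquad \text{versus} \qquad SL = 3 \times 16 = 48 \ \text{vectors (multi-section VQ)} $$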

Table 1 summarizes the computational and database memory requirements. The next section presents the identification and verification rates for several experimental scenarios.

Table 1 Database storage and computational requirements for DTW and VQ approaches

3 Experimental results

3.1 Databases

In this section, we present experimental results obtained with two publicly available databases.

3.1.1 MCYT database

We used our previously collected MCYT database [21], acquired with a WACOM graphics tablet. The sampling frequency for signal acquisition is set to 100 Hz, yielding the following set of information for each sampling instant:

  1. position along the x-axis, \( x_{t} \): [0–12 700], corresponding to 0–127 mm;

  2. position along the y-axis, \( y_{t} \): [0–9700], corresponding to 0–97 mm;

  3. pressure \( p_{t} \) applied by the pen: [0–1024];

  4. azimuth angle \( \gamma_{t} \) of the pen with respect to the tablet: [0–3600], corresponding to 0–360°;

  5. altitude angle \( \varphi_{t} \) of the pen with respect to the tablet: [300–900], corresponding to 30–90°.

We have used feature vectors composed of these five measurements. We recruited 330 different users. Each target user produces 25 genuine signatures, and 25 skilled forgeries are also captured for each user. These skilled forgeries are produced by the 5 subsequent target users, who observe the static images of the signature to be imitated, try to copy it (at least 10 times), and then produce the forgeries in a relaxed fashion (i.e., each individual acting as a forger is asked to sign naturally, without artefacts such as breaks or slowdowns). In this way, highly skilled forgeries with shape-based natural dynamics are obtained. Following this procedure, user n (ordinal index) produces a set of 5 samples of his/her genuine signature, followed by 5 skilled forgeries of user n − 1; then another set of 5 genuine samples, followed by 5 skilled forgeries of user n − 2; and so on, imitating users n − 3, n − 4, and n − 5 in the same manner. Summarizing, user n produces 25 samples of his/her own signature (in sets of 5) and 25 skilled forgeries (5 forgeries of each of users n − 1 to n − 5). Likewise, 25 skilled forgeries of user n will be produced by users n + 1 to n + 5.

We calculate the center of mass of each signature and displace this point to the origin of coordinates.

3.1.2 SVC database

The SVC database [25] is very similar to MCYT. The sub-database released for Task 2 of the First International Signature Verification Competition includes the same five features as MCYT, acquired with a WACOM Intuos graphics tablet at a sampling rate of 100 Hz. The complete SVC database had 100 sets (users) of signature data, but only a subset of 40 users was made available for research after the competition. This database also contains skilled forgery samples produced by the contributors. There are 20 genuine signatures per user, collected in two sessions of 10 signatures each, with a minimum of 1 week between sessions. Additionally, there are 20 skilled forgeries per user, produced by at least four other contributors, who were provided with a software animation viewer of the signature to forge. Thus, in this work, we have used a final set of 1,600 signatures (800 genuine signatures plus 800 skilled forgeries), which is about 10% of the MCYT database size.

It must be pointed out that the signatures in the SVC database are mostly in either English or Chinese, and no ‘real’ signatures were used. Instead, the contributors were advised to design a new signature and practice it before the acquisition sessions.

The best results in this competition were an EER of 2.84% for skilled forgeries, using a DTW-based algorithm, and an EER of 1.70% for random forgeries, using a system based on HMMs.

With both databases, we set the origin of coordinates to the center of mass.
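A minimal sketch of this preprocessing step, assuming each signature is stored as a (T, 5) array with columns (x, y, pressure, azimuth, altitude) and taking the unweighted mean of the sampled points as the center of mass:

```python
import numpy as np

def center_signature(signature):
    """Translate the (x, y) trajectory so that its center of mass lies at
    the origin of coordinates; pressure and pen angles are left untouched."""
    centered = signature.astype(float).copy()
    centered[:, 0] -= centered[:, 0].mean()  # x coordinate
    centered[:, 1] -= centered[:, 1].mean()  # y coordinate
    return centered
```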

3.2 Conditions of the experiments

Training and testing signatures have been chosen in the following way:

  • MCYT database

    • We performed identification experiments, using the first 5 signatures per person for training and 5 different signatures per person for testing (signatures 6–10). This implies a total number of 330 × 330 × 5 tests.

    • We performed verification experiments, using the first 5 signatures per person for training and 5 different genuine signatures per person for testing (signatures 6–10). In addition, we used the 25 available forgeries made by 5 other users. This implies a total number of 330 × 5 genuine tests plus 330 × 25 impostor tests (skilled forgeries) and 330 × 329 × 5 impostor tests (random forgeries).

  • SVC database

    • We performed identification experiments, using 5 signatures per person for training and 5 different signatures per person for testing. This implies a total number of 40 × 40 × 5 tests.

    • We performed verification experiments, using 5 signatures per person for training and 5 different genuine signatures per person for testing. In addition, we used the 20 skilled forgeries made by other users. This implies a total number of 40 × 5 genuine tests plus 40 × 20 impostor tests (skilled forgeries) and 40 × 39 × 5 impostor tests (random forgeries).

However, further study is needed on whether these databases can produce statistically significant results. In [11], the minimum size N of the test data set that guarantees statistical significance in a pattern recognition task is derived. The goal is to estimate N such that it is guaranteed, with a risk α of being wrong, that the true error rate P does not exceed the estimate \( \hat{P} \) obtained on the test set by more than ε(N, α), that is,

$$ \Pr \left\{ {P > \hat{P} + \varepsilon \left( {N,\alpha } \right)} \right\} < \alpha $$

Letting ε(N, α) = βP and assuming recognition errors to be Bernoulli trials (i.i.d. errors), after some approximations the following relationship can be derived:

$$ N \approx \frac{-\ln \alpha}{\beta^{2} P} $$

For typical values of α and β (α = 0.05 and β = 0.2), the following simplified criterion is obtained:

$$ N \approx \frac{100}{P} $$
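Making the substitution explicit for these typical values,

$$ N \approx \frac{-\ln 0.05}{0.2^{2}\,P} \approx \frac{3}{0.04\,P} = \frac{75}{P}, $$

which is rounded up to the more conservative rule of thumb N ≈ 100/P given above; for instance, an error rate of P = 2% would require on the order of N ≈ 5,000 tests.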

If the samples in the test data set are not independent (due to correlation factors that may include variations in recording conditions, in the type of sensors, etc.), then N must be further increased. The reader is referred to [11] for a detailed analysis of this case, where some guidelines for computing the correlation factors are also given.

Table 2 shows the number of tests performed in each condition and the smallest empirical error rate \( \hat{P} \) down to which the experiments remain statistically significant with 95% confidence. The experiments in this section are therefore statistically significant, because the error rates we obtain are higher than the values presented in Table 2.

Table 2 Statistical significance in experiments, with 95% confidence

Verification systems can be evaluated using the False Acceptance Rate (FAR, the proportion of impostor attempts that are accepted) and the False Rejection Rate (FRR, the proportion of genuine users that are incorrectly rejected), also known in detection theory as False Alarm and Miss rates, respectively. A trade-off between both errors usually has to be established by adjusting a decision threshold. The performance can be plotted as a Receiver Operating Characteristic (ROC) or a Detection Error Trade-off (DET) curve [18].

We have used a single point of the DET plot for comparison purposes: the minimum value of the Detection Cost Function (DCF). This parameter is defined as [18]:

$$ DCF = C_{\rm miss} \times P_{\rm miss} \times P_{\rm true} \, + \, C_{\rm fa} \times P_{\rm fa} \times P_{\rm false} $$

where \( C_{\rm miss} \) is the cost of a miss (rejection), \( C_{\rm fa} \) is the cost of a false alarm (acceptance), \( P_{\rm true} \) is the a priori probability of the target, and \( P_{\rm false} = 1 - P_{\rm true} \). We set \( C_{\rm miss} = C_{\rm fa} = 1 \).
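As a sketch (not the evaluation code from [18]; the helper name, the score convention, and the default prior are assumptions, since the paper leaves \( P_{\rm true} \) as a parameter), the minimum DCF over all decision thresholds can be computed from sets of genuine and impostor scores as follows, treating scores as distances so that a claim is accepted when its score falls below the threshold:

```python
import numpy as np

def min_dcf(genuine_scores, impostor_scores, p_true=0.5, c_miss=1.0, c_fa=1.0):
    """Sweep the decision threshold over all observed scores and return the
    minimum of DCF = C_miss * P_miss * P_true + C_fa * P_fa * (1 - P_true).
    Scores are distances: a claim is accepted when score <= threshold."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best = np.inf
    for t in thresholds:
        p_miss = np.mean(genuine > t)    # genuine attempts rejected (FRR)
        p_fa = np.mean(impostor <= t)    # impostor attempts accepted (FAR)
        dcf = c_miss * p_miss * p_true + c_fa * p_fa * (1.0 - p_true)
        best = min(best, dcf)
    return best
```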

3.3 Experimental results

Table 3 shows the experimental results from modeling each user with a codebook, with the experimental conditions as described in the previous section. We have studied codebook sizes ranging from 0 to 9 bits (1–512 clusters). It is interesting to observe that the 0-bit codebook corresponds to modeling each user with a single vector, equal to the average of each dynamical parameter. This is equivalent to the static feature extraction described in the introductory section.

Table 3 Experiments with VQ for the MCYT and SVC databases. Codebook sizes ranging from 0 to 9 bits

Table 4 shows the results obtained with the state-of-the-art DTW algorithm. For the sake of simplicity, in the multi-section codebook approach we have used sections of equal size. Figure 3 presents the identification rates for different numbers of sections and bits per section. We can observe that the multi-section approach outperforms the baseline VQ algorithm. In addition, the best identification result is obtained with three sections (98%), which is close to the highest identification rate obtained with DTW (see Table 4: 98.9%).

Table 4 Results obtained with DTW and the MCYT database. The model size is 5 × J × 5 per user
Fig. 3 Identification results for several multi-section codebook approaches, ranging from 1 to 5 sections, for the MCYT (left) and SVC (right) databases

Figures 4 and 5 show the minimum detection cost function value for different numbers of sections and bits per section, for random and skilled forgeries, respectively. We achieve a slight improvement (2.29% for random forgeries and 7.75% for skilled ones) over DTW (see Table 4: 2.33 and 7.84%, respectively).

Fig. 4 Minimum DCF for random forgeries with the multi-section codebook approach, for the MCYT (left) and SVC (right) databases

Fig. 5 Minimum DCF for skilled forgeries with the multi-section codebook approach, for the MCYT (left) and SVC (right) databases

4 Conclusions

In this paper, we have proposed a multi-section codebook approach for on-line signature recognition. This algorithm takes into account the temporal evolution of the signature (thanks to the use of several sections) and achieves a significant speed improvement over DTW, estimated at around 47 times. This improvement is due to the lower computational burden of the VQ approach, which has been neglected for signature recognition so far, although it has proven useful in the past for other biometric traits, such as speech, especially with short training and testing sets.

Experimental results on a large database of 330 users (MCYT) reveal:

  • The optimal configuration consists of three sections and codebooks of 4 bits per section.

  • An identification rate of 98% (three sections and 4-bit codebooks per section).

  • A minimum detection cost function of 2.29% for random forgeries and 7.75% for skilled forgeries.

These results are very similar to those provided by DTW (98.91% identification rate and minimum DCF equal to 2.33% for random forgeries and 7.84% for skilled forgeries).

In addition, our system reduces the database storage requirements thanks to vector compression, and it is more privacy-friendly because the original signature cannot be recovered from the codebooks.