1 Introduction

1.1 Background

CAPTCHA, as a network security technology, can protect passwords from being hacked violently and prevent a large number of automatic machine requests to keep normal operation of websites [1, 39]. In order to avoid bots attacking and keep user-friendly for users, a CAPTCHA design should follow three principles [19]. Firstly, randomness is necessary in generating a CAPTCHA to ensure that the recognition task varies at each time. Secondly, a CAPTCHA should be easy for humans to recognize. Finally, it should be difficult for a machine to crack. According to the above requirements, there are various types of CAPTCHAs [34], such as text-based CAPTCHAs [38], image-based CAPTCHAs [12], sound-based CAPTCHAs [20], video-based CAPTCHAs [11], etc. Currently, the text-based CAPTCHA is the most widely used type due to low generation costs, which typically consists of letters and digits [8, 41]. To make text-based CAPTCHAs unrecognizable for the state-of-the-art of pattern recognition algorithms, noises are always added to blur CAPTCHAs images which include different background colors, text distortion and clutter. Straight lines and arcs, background textures and background colors are typical examples of clutter [2, 9]. However, with the development of pattern recognition technology, the traditional simple character set composed of digits and letters is incapable to guarantee the security of websites. Thus more complicated CAPTCHAs are needed to conquer the state-of-art cracking algorithms. The set of Chinese characters is much bigger than English letters and Arabic digits, which increase the complexity of automatic recognition. Furthermore the structure of Chinese characters is more complicated than English letters and Arabic digits. Thus text-based CAPTCHAs with Chinese character set have been widely used in most Chinese websites. Figure 1 displays an example of Chinese character CAPTCHA.

Fig. 1
figure 1

A Chinese character CAPTCHA

Research on cracking CAPTCHAs can help designers discover security vulnerabilities so as to encourage more secured and user-friendly CAPTCHAs to be designed. The standard approach to crack CAPTCHAs follows the procedure: preprocessing, denoising and binarizing the original CAPTCHA images, character segmentation and character recognition. This procedure is illustrated in Fig. 2.

Fig. 2
figure 2

Procedure of attacking Chinese CAPTCHAs

During the preprocessing, different types of noises are removed to obtain the binary images [6, 36]. Image denoising methods contain spatial filters (e.g. mean filter, median filter and Gaussian filter), frequency domain filters (e.g. high-pass filter and low-pass filter) and morphological operators (e.g. open-close operations and morphological reconstruction). Morphological reconstruction is a powerful transformation method which is often used in image segmentation [29, 32, 40]. Binarization methods include local threshold methods and global threshold methods [27], among which is Otsu method [46].

In image segmentation part, positions for each character in the text-based CAPTCHA image is identified which will be segmented for recognition. Character segmentation methods include vertical projection methods [28, 43], connectivity analysis [5], etc. Besides, mean shift is a robust feature space analysis method, which could be used for clustering [3, 4], image segmentation [30], target tracking [17], etc. Actually, preprocessing and segmentation are much more difficult and time-consuming than recognitions since machine learning algorithms can efficiently solve the recognition problem but there are no general algorithms for preprocessing and segmentation problems in various scenarios.

Segmented characters are classified for character recognition, in which feature vectors are extracted as the descriptions of each character images. The recognition accuracy usually depends on pattern recognition and machine learning methods, such as template matching, feature extraction and pattern classification, neural networks and deep neural networks [18]. Methods based on feature extraction and pattern classification, Gabor feature [21, 31], local binary pattern(LBP) [10, 25] and primary component analysis (PCA) [23] are mainly used to describe character images. Pattern classification methods include logistic regression, support vector machine(SVM), etc. Obviously, above mentioned pattern recognition methods rely on feature extraction results. With the rise of deep neural network, researchers have been using them to solve character recognition problem with its strong feature learning ability [26]. CNN, as a deep neural network model, has made breakthroughs in the field of pattern recognition, which is good at learning spatial features in images [15, 33, 37]. For recognizing characters, CNN is able to learn features automatically. Furthermore, CNN achieves deformation invariance, so that it is suitable to deal with the character recognition problem [24, 42].

1.2 Contribution

There are three main contributions of our research as listed below:

  1. i

    Firstly, we are committed to dealing with variable-length Chinese character CAPTCHAs with different types of noises. To the best of our knowledge, our research is the first attempt to attack variable-length Chinese character CAPTCHAs. The Chinese character set, consisting of 5,000 frequently used Chinese characters, is much bigger than the sets of 26 English letters and 10 Arabic digits. What’s more, the complicated structure of Chinese characters also increases the complexity of CAPTCHAs segmentation and recognition.

  2. ii

    Secondly, two novel methods are proposed to crack variable-length Chinese character CAPTCHAs. One method is called MGLCR which is a combination of pattern recognition methods, which are tailored and combined in an elaborate order. The other one is called CCR which is inspired by CNN to extract features and recognize characters automatically. The network architecture is properly designed to meet the challenge.

  3. iii

    Finally, the proposed methods are also valid for the mixed character CAPTCHAs consisting of English letters, Arabic digits, Chinese characters and mathematical operators, which outperform state-of-the-art methods.

1.3 Structure

The remainder of this paper is organized as follows. In Section 2, the related work is discussed. Section 3 presents the proposed approaches for attacking Chinese character CAPTCHAs, including the procedure of preprocessing, segmentation and recognition. The experimental design and results are elaborated in Section 4. Finally, conclusions and future work are drawn in Section 5.

2 Related work

The text-based CAPTCHA is one of the reverse Turing tests that distinguish an access by a computer program such as a crawler from an access by a human being, which uses the difference between computers’ and humans’ shape recognition ability. There are some researches focusing on cracking it with the help of pattern recognition methods and deep neural networks. Most of the existing CAPTCHA cracking methods work for English letters and Arabic digits, while methods for Chinese character set are rare. What’s more, it’s worth noting that there are no existing methods available for attacking variable-length Chinese CAPTCHAs. Thus methods introduced in this section are mainly for CAPTCHAs composed of English letters and Arabic digits, or fixed-length Chinese character CAPTCHAs [26], which play a referential role in our research.

Initially the cracking methods for CAPTCHAs based on pattern recognition are introduced. Yin et al. proposed an approach based on dense scale invvariant feature transform(DENSE SIFT) and random sample consensus(RANSAC) algorithm for recognition of distorted and merged CAPTCHAs [45]. Their dataset is composed of three kinds of CAPTCHAs captured from websites, which includes 220 pictures for training set and 600 pictures for test set respectively. This method achieved good performance on CAPTCHAs with different difficulty levels. Elie Bursztein et al. introduced a method to solve CAPTCHAs in a single step which used machine learning to address the segmentation and recognition problems simultaneously [7]. Their dataset is created by a methodology focused on unbroken real-world CAPTCHA schemes. Starostenko et al. presented a method for automatic segmentation and recognition of CAPTCHAs with variable orientation and random collapse of overlapped characters based on SVM classifier [35]. Their dataset consists of 5,000 downloaded images. A system attacking method for text-based CAPTCHAs was proposed by Hussain et al. , which includes simple image processing techniques, pixel count methods and an artificial neural network [16]. They used the dataset which consists of 1,000 CAPTCHAs of Taobao, 500 CAPTCHAs of MSN and 1,000 CAPTCHAs of eBay respectively. Gabor filters are firstly used to extract character components along four different directions for a generic attack with a dataset of 1,000 images downloaded from Amazon [44]. Our MGLCR method is inspired by this research.

Additionally, some methods with neural networks for CAPTCHA segmentation are presented. Grag et al. used CNN layers followed by a dense layer and a recurrent neural network layer instead of the conventional method to segment and recognize individual characters to break text-based CAPTCHAs [14]. Their dataset is generated, which contains 20,000 images. Wang et al. combined convolutional neural network and self-adaptive algorithm to break synthetic multi-digit text-based CAPTCHA [41]. Their dataset consists of 300,000 downloaded images. More recently, Gao et al. analyzed the security of the two-layer CAPTCHA and proposed a method to attack the Microsoft two-layer CAPTCHA based on LeNet-5, which is a radical CNN model [13].

Last but not the least, the CAPTCHA recognitions with deep neural networks are discussed. In recent years, researchers have made progress in the attacking of CAPTCHAs based on deep neural networks, especially convolutional neural networks. Karthik et al. proposed a CNN method to crack the Microsoft CAPTCHAs which are composed of English letters and Arabic digits [18]. Lv et al. proposed an approach based on CNN to improve the recognition accuracy of fixed-length Chinese character CAPTCHAs with distortion, rotation and background noises [26]. They used the dataset which is generated by CAPTCHA generator for the training set of 5,000 Chinese character CAPTCHAs and the test set of 15,000 Chinese character CAPTCHAs. However the segmentation method for the CHAPTCHAs is not included in this paper and the proposed method deals with the fixed-length CAPTCHA only, which contains four Chinese characters. Compared with it, the proposed MGLCR and CCR methods can deal with variable-length Chinese character CAPTCHAs due to the proposed segmentation method discussed in Section 3.2, which is of practical usage. The proposed network structure is similar to LeNet-5 [22] while the network structure in the proposed CCR method is designed for Chinese character CAPTCHAs.

3 The proposed approach

As shown in Fig. 2, an automatic attacking approach includes three steps: preprocessing, character segmentation and character recognition. Accordingly, the proposed approach is introduced in three parts. Section 3.1 discusses the preprocessing. Section 3.2 introduces the character segmentation. Section 3.3 presents two recognition methods: the MGLCR(Multi-scale Gabor and Logistic regression based CAPTCHA Recognition) and the CCR(Convolutional neural network based CAPTCHA Recognition) methods.

3.1 Preprocessing

In the proposed methods, the preprocessing is used to binarize and denoise CAPTCHA images, which is based on Gaussian smoothing, morphological reconstruction and Otsu method. The preprocessing can be adapted to different types of noises. Particularly, it is able to effectively binarize a CAPTCHA image with gradient background color. The flow chart of proposed preprocessing is shown in Fig. 3.

Fig. 3
figure 3

Flow chart of preprocessing

As shown in Fig. 3, the CAPTCHA image is grayed firstly, then Gaussian smoothing and morphological reconstruction are used to denoise the grayscale image. Furthermore Otsu method is applied to binarize the image, which takes the noises of the CAPTCHA image as the background and the characters as the foreground.

It should be noted that, the denoising is an optional step according to the characteristics of different types of CAPTCHAs.

3.1.1 Graying

The graying method aiming at converting RGB to greyscale is designed to perform a weighted average operation. The weight of each RGB channel should follow the recommended grayscale conversion formula.

$$ V = 0.299R + 0.587G + 0.114B $$
(1)

where R, G and B are the RGB three-channel pixel values of the CAPTCHA image and V is the gray values of the grayscale CAPTCHA image.

A grayscale image is denoted as X = xi,i = 1,⋯ ,|X||xi ∈ [0,255] after graying.

3.1.2 Denoising and binarization

A global threshold obtained from Otsu method is not valid for a CAPTCHA image with gradient background color, as shown in Fig. 4c. To cope with this problem, Gaussian smoothing and morphological reconstruction are adopted to remove noises and to average gradient background color. As shown in Fig. 3, the denoising and binarization part has two branches before bitwise AND operation. Two branches both have morphological reconstruction and Otsu method-based binarization parts. Nevertheless the right branch has an additional Gaussian smoothing part which is adopted before morphological reconstruction. The Gaussian smoothing part works to denoise straight lines, but the side effect is that characters are also blurred. To obtain a denoised image which combines advantages of two branches, a bitwise AND operation is performed for these two branches. With denoising and binarization, the image with gradient background color can be effectively processed as shown in Fig. 4d.

Fig. 4
figure 4

Usage of morphological reconstruction on an CAPTCHA image with gradient background color. a Original image; b Grayscale image; c Binarized image; d Binarized image with morphological reconstruction

For the left branch in Fig. 3, morphological reconstruction and binarization are included. During morphological reconstruction process, the grayscale image is used as the mask image I. The number 102 is subtracted from the gray value of each pixel in the grayscale image to get the seed image J. Both the mask image and the seed image have a range of pixel values of 0,1,⋯ ,255. The grayscale reconstruction is performed according to the following formula:

$$\begin{array}{@{}rcl@{}} &&\forall{p}\in{D},\gamma^{rec}(J,I)(p)\\&&\quad=max\{k\in[0,N-1]|p\in\gamma^{rec}(T_{k}(J),T_{k}(I))\} \end{array} $$
(2)

where Tk(⋅) represents a binary image obtained by binarizing a grayscale image at threshold k.

Actually, the output of morphological grayscale morphology is an image after morphological expansion of the local extreme points, as shown in Fig. 6(b), denoted as Xdilated. Then Xdilated is subtracted from the grayscale image X to get the difference image Xr. Then, the binarization based on Otsu method is conducted on Xr.

The Otsu method selects a global threshold to binarize the image. The image gray level is L and the binarization threshold is k. With the binarization threshold, the image pixels are divided into two classes: the foreground and the background which are denoted as f and b respectively. The variances of two classes are denoted as \({\sigma ^{2}_{f}}\) and \({\sigma ^{2}_{b}}\) respectively. The sizes of two classes are denoted as ωf and ωb respectively. The average gray values of two classes are denoted as μf and μb respectively. Then the intra-class variance and inter-class variance are defined as follows:

$$\begin{array}{@{}rcl@{}} {\sigma^{2}_{W}}&=&\omega_{f}{\sigma^{2}_{f}}+\omega_{b}{\sigma^{2}_{b}}\\ {\sigma^{2}_{B}}&=&\omega_{f}\omega_{b}(\mu_{f}-\mu_{b})^{2} \end{array} $$
(3)

where \({\sigma ^{2}_{W}}\) is the intra-class variance and \({\sigma ^{2}_{B}}\) is the inter-class variance. These variances should obey the formula described as follows.

$$ {\sigma^{2}_{W}}+{\sigma^{2}_{B}}=\sigma^{2} $$
(4)

where σ2 is the variance of all pixels.

Therefore, minimizing the intra-class variance is equal to maximize the inter-class variance. The threshold k is obtained by minimizing the intra-class variance or maximizing the inter-class variance.

The global binarization threshold α is calculated from the difference image Xr. With this threshold, each pixel in the grayscale image is divided into foreground class or background class. The background pixel values are set to 255 and the foreground pixel values are set to 0.

$$ {x^{b}_{i}}=\left\{\begin{array}{ll} 0 &x_{i}<\alpha,\\ 255 &x_{i}\ge\alpha. \end{array}\right. $$
(5)

where \({x^{b}_{i}}\) is the i-th pixel of the binary image, xi is the i-th pixel of the grayscale image. Eventually, the binary image is describled as follows:

$$ X^{b}={{x^{b}_{i}}, i = 1,\cdots,|X^{b}||{x^{b}_{i}}\in{0,255}} $$
(6)

where Xb is binary image.

It should be noted that it is necessary to standardize the foreground and background pixel values after binarization. After that the foreground characters are black, and the background is white. If the background color is darker than the foreground color, the binary image after the preprocessing will have a white foreground and a black background. For unification, the foreground and background pixel values of substandard binary images are exchanged based on the assumption that the number of foreground character pixels is less than the number of background pixels.

After binarizing Xr, a binary image Xr + b is obtained. As shown in Fig. 5d, the main noises in the output image have been removed.

Fig. 5
figure 5

Procedure of morphological reconstruction. a grayscale image; b image after morphological expansion of the local extreme points; c difference image; d binary image

For the right branch in Fig. 3, Gaussian filtering(σ = 0.7) is conducted firstly to get image Xg. After that, morphological reconstruction is used to find local maximum points. After morphological expansion of these local extreme points, the image Xg + dilated is obtained. Then Xg + dilated is subtracted from the grayscale image X to get the difference image Xg + r. Then binarization based on Otsu method is conducted on Xg + r. Binarization procedure is same as the left branch. After binarization, the binary image Xg + r + b is obtained.

Thus, two binary difference images Xr + b and Xg + r + b are obtained. Xr + b can maintain the original structure of the characters and avoid the noises of gradient background color, which still have lines and arcs. As a result of the Gaussian smoothing, clutter can be removed partly, but the characters are also blurred in Xg + r + b, which will bring difficulties to the recognition part. Thus, bitwise AND operation on Xg + r + b and Xr + b is performed to combine advantages of these two kinds of binary images. The denoised image is denoted as Xclean. Figure 3 shows two binary images and the denoised image Xclean.

$$ X^{clean}=X^{g+r+b} \quad AND \quad X^{r+b} $$
(7)

After denoising and binarization steps, the image may still have some lines or arcs, thus connectivity analysis is needed for further denoising. The binary image Xbin is divided into eight connected regions and the area of each region is calculated. Then the regions with area less than 9 are filtered out.

3.2 Segmentation

After preprocessing, character segmentation will be carried out on denoised and binary images. The segmentation procedure includes data dimension classification, clustering, clustering correction and morphological open operation. During segmentation procedure, mean shift clustering is used on both spatial coordinates and the color space coordinates of character pixels. The obtained cluster center represents the center point of each character. Because nonparametric mean shift is a density estimation method, it is unnecessary to specify the length of characters in advance when clustering the CAPTCHA images with variable-length characters. Besides, the segmentation method based on mean shift clustering not only takes into account the spatial information, but also considers the color space information. Therefore, it is adaptive when there are different colors between characters and clutters, even between characters and characters. In addition, different clustering strategies are used for different types of CAPTCHAs to archive satisfactory results.

3.2.1 Data dimension classification

Before clustering, classifying the pattern of the CAPTCHA images is necessary so as to determine the data dimension required for clustering. It should be noted that the image pattern here refers to the color pattern of the original CAPTCHA image, so that the color information is preserved to enhance the segmentation effect.

Basically, there are three color patterns for CAPTCHA images: grayscale image, color image with uniform character color and color image with non-uniform character colors. Examples of three color patterns are displayed in Fig. 6. For the first two patterns, the spatial coordinates (r, c) and the gray value v of the character pixel are used for clustering, denoted as (r, c, v). For the third pattern, the set of data points is denoted as (r, c, l, a, b), in which r, c are the spatial coordinates and (l, a, b) are values of LAB color space.

Fig. 6
figure 6

CAPTCHA images of different color patterns. a grayscale image; b color image with uniform character color; c color image with non-uniform character colors

3.2.2 Clustering

The feature space depends on dimension of data points. After the dimension has been fixed, mean shift is used to cluster the data points by classifying character pixels and finding cluster centers.

Bandwidth used in nonparametric mean shift is estimated automatically by the following steps:

  1. (1)

    Taking all the foreground pixel sets in the binary image as the data set \(X^{b}_{fg}={x|x\in {X^{b}},x = 0}\) for estimation.

  2. (2)

    Taking 20% of the foreground pixels as the K parameter of the K-nearest neighbor method.

  3. (3)

    Calculating the furthest distance di to the nearest K pixels for each point xi in \(X^{b}_{fg}\).

  4. (4)

    Calculating the average farthest distance \(\frac {1}{|X^{b}_{fg}|{\sum }_{1}^{|X^{b}_{fg}|}d_{1}}\) of all points in \(X^{b}_{fg}\).

The estimated bandwidth is used as the clustering window diameter in the mean shift method. Calculations for each window are listed as follows:

  1. (1)

    Calculating the mean shift vector for the center of the window:

    $$\begin{array}{@{}rcl@{}} ms\!&=&\!\left[\sum\limits_{i = 1}^{n}x_{i}g \left( {\left\Arrowvert{\frac{x-x_{i}}{h}}\right\Arrowvert^{2}}\right) \right]\\&&/\left[\sum\limits_{i = 1}^{n}g \left( {\left\Arrowvert{\frac{x-x_{i}}{h}}\right\Arrowvert}^{2}\right) \right]-x \end{array} $$
    (8)

    where g(⋯ ) is a Gaussian kernel function, h is the bandwidth and x is the vector of the window center.

  2. (2)

    Moving the window center along the mean shift vector and the end point is the new window center.

  3. (3)

    If the window center does not converge, it will repeat step 1. Otherwise, the algorithm outputs an extreme point for each window and a number of parallel windows finally converge to obtain extreme points as the center of the cluster. All points converging to the same extreme point belong to a class.

3.2.3 Clustering correction

Figure 7 displays two segmentation cases without clustering correction, in which an original image, a binary image after preprocessing and the result of segmentation are listed. The orange stars marked in the binary image indicates the clustering center of each character, which is calculated by the mean shift method. In the segmentation results, different colors represent different pixel classes. Due to the adhesion between characters, it is easy to misclassify the adjacent characters into the same class as shown in Fig. 7a. What’s more, some noises are misinterpreted as characters as shown in Fig. 7b. Thus, clustering correction is needed to optimize the results of character segmentation.

Fig. 7
figure 7

Segmentation cases without clustering correction. a case of adjacent characters misclassified into one class; b case of noises misinterpreted as characters

In clustering correction, if the number of pixels in a class is less than 10 or less than 5% of total pixels, the class is neglected. Furthermore, if the pixel spatial distribution is too long or too wide, the mean shift clustering is implemented for this class again. Examples of clustering correction is shown in Fig. 8.

Fig. 8
figure 8

Examples of clustering correction. a clustering correction for adjacent characters; b clustering correction for noises

3.2.4 Morphological open operation

Since the segmentation process does not take full account of the character connectivity, there are many specks and holes in the segmented Chinese characters, as shown in Fig. 9, which is not conducive to subsequent feature extraction and character recognition.

Fig. 9
figure 9

Segmented characters with specks and holes

In order to solve this problem, a segmented character image is implemented with morphological open operation. The results are displayed in Fig. 10, which are more suitable for feature extraction and character recognition.

Fig. 10
figure 10

Segmented characters after filling holes. a original segmented character; b segmented character after filling holes

3.3 Recognition

In this part, two kinds of character recognition methods are proposed. One is based on multi-scale Gabor filter and logistic regression, which is denoted as MGLCR (Multi-scale Gabor and Logistic regression based CAPTCHA Recognition). The other one is based on CNN, which is denoted as CCR (Convolutional neural network based CAPTCHA Recognition).

3.3.1 MGLCR

MGLCR relies on the multi-scale Gabor filter to extract features and classify characters with logistic regression. The texture features of different characters are quite different in the frequency domain, thus the frequency descriptor is used for the description of texture features. It is a set of Gabor filters with different scales, rotation directions and frequencies that is chosen to describe an image. By convoluting images with these Gabor filters, the frequency distribution of each point is calculated.

Before extracting character features, it is necessary to convert pixel value to the number within [0,255]. The converted image is denoted as Xchar. Feature extraction steps based on Gabor filters are described as follows:

  1. (1)

    Based on four direction parameters \(\theta _{1}= 0, \theta _{2}=\frac {\pi }{4}, \theta _{3}=\frac {\pi }{2}, \theta _{4}=\frac {3\pi }{4}\), three scale parameters σ1 = 1,σ2 = 2,σ3 = 4 and two frequency parameters f1 = 0.05,f2 = 0.25, 24 Gabor filters are generated as shown in Fig. 11, denoted as Gabor𝜃,σ,f.

  2. (2)

    The convolution results of four directions are obtained by convoluting a character image with four filters which have same scale and frequency parameters but different direction parameters. The multiplication result of four filtering responses is the proximate isotropic decomposition of the image at this scale.

    $$ \mathcal{C}(x,y,\sigma,f)={\prod}_{i = 1}^{4}Gabor_{\theta,\sigma,f}\times{X_{char}} $$
    (9)
  3. (3)

    Multi-scale Gabor responses are calculated as follows:

    $$ \mathcal{C}(x,y)={\sum\limits_{i}^{3}}{\sum\limits_{j}^{2}}C(x,y,\sigma_{i},f_{j}) $$
    (10)
Fig. 11
figure 11

Gabor filters

In order to make the eigenvectors of character images have same dimension, PCA (principal component analysis) is used to preserve the main components in the response results and the dimensionality is reduced to 128 dimensions.

The character recognition is a typical machine learning classification process. It is divided into two parts: training and testing. During the training phase, character features are extracted on the training set to train the logistic regression classifier. During the testing phase, character features are extracted on the test set and characters are classified with previously trained classifier.

The logistic regression is a multi-class classifier, which is equivalent to applying a logistic function to ordinary linear regression so that it can only get [0,1] range of values. Assuming that the output of the function is the probability that a data class belongs to y = 1. The closer the value is to 1, the more likely it belongs to y = 1, while the closer the value is to 0, the more likely it belongs to y = 0.

$$ h_{\theta}(x)=\frac{1}{1+e^{-\theta^{\tau}x}} $$
(11)

where h𝜃(⋯ ) is the logistic regression model, 𝜃 is the model parameter.

Since the character recognition is a multi-class task, multiple logistic regression is adopted, i.e., softmax regression. The loss function is defined as:

$$ \mathcal{L}(X,Y,{\Theta}) = -\frac{1}{N}\sum\limits_{i = 1}^{N}\sum\limits_{k = 1}^{m}1(y_{i} = k)log(Pr(\hat{y}_{i} = k|x_{i},{\Theta})) $$
(12)

where X is a batch of input characters images, Y is the set of character labels corresponding to these images and Θ is the model parameter. N is the number of images, m is the total number of classes. 1(⋯ ) is a characteristic function and 1(yi = k) means that if the i-th tag yi belongs to the class k, the function value is 1, otherwise it is 0. \(Pr(\hat {y}_{i}=k|x_{i},{\Theta })\) means that when inputting xi, the estimated output \(\hat {y}_{i}\) is the probability of class k. This probability is obtained by the following formula:

$$ Pr(\hat{y}_{i}=k|x_{i},{\Theta})=\frac{e^{\theta_{i}^{\tau}x}}{{\sum}_{k = 0}{m}e^{\theta_{k}^{\tau}x}} $$
(13)

With the multi-scale Gabor, features can be extracted from segmented character images. The following figure shows the convolution result of a Chinese character image with 24 Gabor filters.

Figure 12 reveals that the larger the scale parameter σ is, the fuzzier the features will be. The smaller the scale parameters σ is, the more detailed the features will be. The filters with different directions make the feature extraction more robust, so as to achieve rotation invariance.

Fig. 12
figure 12

Results of character images under multi-scale Gabor

Each response obtains a 128-dimensional vector after the dimension reduction. Then the vectors are joined. 24 Gabor filters perform feature extraction and obtain a 3072-dimension vector, which will describe the features of a character image.

The Gabor feature extraction is carried out on the character images of training set and the obtained feature vectors are used for training the logistic regression classifier, by which character images are predicted.

3.3.2 CCR

The proposed CNN is used to extract features and recognize character images automatically. Compared with the LeNet-5, the proposed network has additional structures such as batch normalization and dropout as shown in Fig. 13, which improves the performance of networks.

Fig. 13
figure 13

Network architecture of the CCR method

The marks, like 46 × 46@64, represent the size and number of feature maps. As shown in Fig. 13, the network architecture consists of five convolution layers, two max pooling layers and one fully connected layer. All convolution layers in the network are composed of 64 convolution filters with 3 × 3 size. Predicted results are outputted through the softmax classifier.

Each max pooling layer is followed by a batch normalization layer and a dropout layer. The batch normalization layer is used to normalize the middle layer characteristics of the network, which can significantly improve the convergence efficiency and thev generalization ability of the network. Besides, the dropout layer randomly sets a part of the neuron’s output value to zero in the training process, which can improve the generalization ability of the network and avoid over-fitting.

Additionally, except the fully connected layer, ReLU is adopted as the activation function in other layers, which can avoid gradient disappearance. At the same time, because of the fact that the output values are zeros of some neurons, ReLU function can avoid over-fitting. Besides, the ReLU function is simpler than the sigmoid function so as to alleviate the complexity of the network.

The categorical cross entropy between the outputs and the ground truth is calculated by the following formula.

$$ \mathcal{L}(X,Y;{\Theta}) = -\frac{1}{N}\sum\limits_{i = 1}^{N}\sum\limits_{k = 1}^{m}1(y_{i} = k)log(Pr(\hat{y}_{i} = k|x_{i};{\Theta})) $$
(14)

Obviously, the loss function is the same as the logistic regression described in MGLCR method. X is a batch of characters images, and Y is the set of character labels corresponding to these images and Θ is the network parameter. N is the number of images, and m is the total number of classes. 1(∙) is a characteristic function, in which 1(yi = k) means that if the i-th tag yi belongs to the class k, the function value is 1, otherwise it is 0. \(Pr(\hat {y}_{i}=k|x_{i};{\Theta })\) means that when inputting xi, the estimated output \(\hat {y}_{i}\) is the probability of class k. This probability is obtained by the following formula:

$$ Pr(\hat{y}_{i}=k|x_{i};{\Theta})=\frac{exp(o_{i})}{{\sum}_{k = 0}{m}exp(o_{i})} $$
(15)

where oi is the output of the last layer in the network.

During the training phase, character images are predicted by feed-forward propagation firstly. Then the loss between the estimated values and the ground truth is minimized by back-propagation. Until convergence, the trained network is able to recognize characters.

Automatic feature extraction and classification can be accomplished with the proposed CCR method. In the training procedure of CCR method, simple stochastic gradient descent with learning rate of 0.01, weight decay of 0.00001 and momentum of 0.9 is applied for optimizing the training loss. The size of mini-batch in training is 64. After trained with around 800 epochs, the model was ready for predicting characters.

4 Experiments

4.1 Dataset

To prove the general efficiency of proposed methods, twenty subclasses of CAPTCHA images with different type of noises are tested, which are divided into four classes. Figure 14 lists all kinds of CAPTCHAs within four classes and twenty subclasses. Class One includes groups of English letters or Arabic digits or mixtures of English letters and Arabic digits. Although, the proposed method is for Chinese character CAPTCHAs, Class One is still included in the dataset to check the general validity of proposed methods. Class Two contains CAPTCHAs with variable-length Chinese characters. Class Three includes mathematical formulas consisting of Chinese characters only. Class four contains other types of mathematical formula which is the combination of Chinese characters, Arabic digits and mathematical operators. In view of the above four classes of CAPTCHAs, a CAPTCHA generator is developed to generate training sets with labels. The CAPTCHA generator has two sets of parameters: one set is the color, rotation angle, position, size and font of characters, and the other set is the background colors, density of straight lines and arcs and random layout of noises.

Fig. 14
figure 14

CHAPTHAs of four classes

According to 2,000 common Chinese characters, 26 capital letters of English, 26 lowercase letters of English, 10 digits and 5 operational symbols, training sets of 60,000 CAPTCHAs are generated, which include 20,000 CAPTCHAs for Class One, 20,000 CAPTCHAs for Class Two, 10,000 CAPTCHAs for Class Three and 10,000 CAPTCHAs for Class Four. For each class, half of CAPTCHAs have no noises such as straight lines and arcs and the other half have noises. Furthermore, background colors are randomly generated for all classes. The generated CAPTCHAs are displayed in Fig. 15.

Fig. 15
figure 15

Examples of synthetic CAPTCHA images

Test set consists of 10,000 images from the website of Chinese National Enterprise Credit Information Publicity System, where are 2500 CAPTCHAs for each class.

4.2 Experimental results of preprocessing

With the proposed preprocessing method, experiments are carried out on the test set. Figure 16 depicts 10 successful preprocessing cases, in which the left column lists original images and the right one lists denoised images after preprocessing. Experimental results show that the proposed preprocessing method can denoise CAPTCHA images efficiently with crossed straight lines, background textures and gradient background colors from Fig. 16a to Fig. 16j. What’s more, proposed preprocessing method is self-adaptive to do local binarization on a CAPTCHA with gradient background as shown in Fig. 16d. Last but not the least the proposed preprocessing method could deal with circular noises as shown in Fig. 16a.

Fig. 16
figure 16

Successful cases of preprocessing

Unfortunately the proposed preprocessing method is not applicable with intersecting thick straight lines as shown in Fig. 17b whose original CAPTCHA is Fig. 17a. Especially, it is difficult to remove the straight lines with high-density layouts. It is thick straight lines that could not be filtered by the denoising method with morphological operators, which will probably cause difficulties in subsequent segmentations.

Fig. 17
figure 17

Failure cases of preprocessing

Besides thick straight lines with high-density layouts, the distribution of background color also has a side effect on the denoising results. If the background color is similar to the foreground color, it could lead to failure as shown in Fig. 17 (d) whose original CAPTCHA is Fig. 17c.

4.3 Experimental results of segmentation

The segmentation results are presented with the proposed self-adaptive mean shift. A good segmentation should follow two principles. On the one hand the amount of segmented characters should be correct. On the other hand, the segmented character can be recognized easily. Based on above principles, 10,000 images for all and 500 images for each subclass (shown in Fig. 14) are tested. The segmentation accuracies are listed on Fig. 18.

Fig. 18
figure 18

Segmentation accuracies of all classes CAPTCHAs

According to the Fig. 18, the proposed segmentation method performed well in the CAPTCHAs with groups of English letters or Arabic digits, whose accuracy is about 100%. If the CAPTCHAs have gradient background colors, the proposed segmentation method can still achieve a high accuracy around 96.8%. As the background color is similar to the foreground character color, the segmentation accuracy decreases slightly to 90%. It is unfortunate that the accuracy of CAPTCHAs consisting of thick straight lines with high-density layouts is about 50% which is obviously lower than other subclasses. Figure 19 displays the segmentation results, which reveals that variable-length CAPTCHAs can be successfully segmented by the proposed method without the parameter of text length. Furthermore proposed segmentation method could also achieve good results in CAPTCHAs with colorful characters because it can distinguish the color space distribution in clustering.

Fig. 19
figure 19

Successful segmentation results on training set

4.4 Experimental results of recognition

After segmentation, 453,280 segmented character images from 60,000 CAPTCHAs are obtained. The ratio of training set and test set is 7:3. The recognition results of proposed MGLCR and CCR methods are compared by precision and recall rates. Traditionally, equations of precision and recall rate are defined as follows:

$$ Precision=\frac{N_{tp}}{N_{tp}+N_{fp}} $$
(16)
$$ Recall=\frac{N_{tp}}{N_{tp}+N_{fn}} $$
(17)

where Ntp is the number of true positive cases, Nfp is the number of false positive cases and Nfn is the number of false negative cases.

The above (16) and (17) concern binary classification whereas CAPTCHA classification is a multiclass problem. Thus the equations for precision and recall rate should be deformed as shown in (18) and (19).

$$ Precision_{i}=\frac{M_{ii}}{{\sum}_{j = 1}^{n} M_{ji}} $$
(18)
$$ Recall_{i}=\frac{M_{ii}}{{\sum}_{j = 1}^{n} M_{ij}} $$
(19)

where Mij is the corresponding value in confusion matrix and Precisioni and Recalli is the precision and recall of the i th class.

The recognition precisions and recall rates of MGLCR and CCR are compared in the Table 1, which lists the average values of all four classes. From the table, we can see that both MGLCR and CCR achieve good results on Class One, Class Three and Class Four. However the MGLCR only get 82.3% precision on Class Two with variable-length Chinese character CAPTCHAs, which is much lower than CCR. There are pros and cons of proposed MGLCR and CCR methods. MGLCR is less time-consuming and suits for the CAPTCHAs with small dataset of characters such as English letters and Arabic digits. Whereas CCR has relatively longer training time but it performs well in all cases no matter how large the dataset is and how complicated the CAPTCHA character structure is.

Table 1 Comparison between two recognition methods

Besides making comparisons between MGLCR and CCR methods, state-of-the-art methods are also evaluated. For ease of presentation, the method breaking Microsoft CAPTCHAs with CNN is called M-CNN [18]and the CNN-based method cracking fixed-length Chinese character CAPTCHAs is called C-CNN [26]. The datasets used by [18, 26] and our proposed method are different. The segmentation in [18] was done assuming equal distance among the characters, so the image was divided in six equally distributed segments, which could not meet the challenge in our proposed scenarios. The segmentation method for the CHAPTCHAs in [26] is not included in their paper. To enable a fair comparison of previously mentioned three methods, the proposed Class-One and Class-Two CAPTCHAs are segmented with our proposed segmentation method into single characters and classified into two separate datasets for experiments.

On one hand, M-CNN applies LeNet-5 to break Microsoft CAPTCHAs which are groups of English letters or Arabic digits or mixtures of English letters and Arabic digits. Thus the dataset for the comparative experiments is composed of 20,000 letter-digit CAPTCHAs from Class One. The ratio of training set and test set is 7:3. The Table 2 lists the precisions and recall rates of M-CNN and the proposed MGLCR and CCR. The precisions and recall rates of the proposed MGLCR and CCR are higher than M-CNN . The reason could be that M-CNN uses a similar structure of LeNet-5 instead of a carefully designed network structure. Besides, the number of training epochs can also affect the results.

Table 2 Comparison on fixed-length letter-digit CAPTCHAs

On the other hand, C-CNN can crack fixed-length Chinese character CAPTCHAs, which is similar to proposed CCR. Thus the dataset for the comparative experiments is composed of 10,000 Chinese character CAPTCHAs from Class Two and all the length of the CAPTCHA is four. The ratio of training set and test set is also 7:3. The comparative results of C-CNN and the proposed MGLCR and CCR are listed in Table 3. As can be seen from the table, the proposed CCR has higher precision than C-CNN and they achieve the similar recall rates. However the recognition results based on the good segmentation results with our proposed method, which is not included in [26]. Unfortunately MGLCR performs not so well in fixed-length Chinese character CAPTCHAs with intersecting thick straight lines. Thus, the methods based on CNN are more recommended than the method based on multi-scale Gabor to crack Chinese character CAPTCHAs.

Table 3 Comparison on fixed-length Chinese character CAPTCHAs

5 Conclusions

A new attempt is made on attacking variable-length Chinese character CAPTCHAs, which includes preprocessing, character segmentation and character recognition.

Although variable-length Chinese character CAPTCHAs can be cracked within acceptable time, there are some future works can be done to improve the performance.

  1. (1)

    Removing intersected thick straight lines more effectively.

  2. (2)

    Exploring a more effective segmentation method for sticky characters.

  3. (3)

    Improving the recognition precision on the CAPTCHAs with similar foreground color and background color.

Hints for better design of the text-based CAPTCHA include that increasing the density of straight lines and arcs and improving the similarity of the foreground color and the background color of CAPTCHAs. However the balance between user friendliness and the security of CAPTCHAs should be noticed. Furthermore other types CAPTCHAs, like interactive CAPTCHAs or sound-based CAPTCHAs are alternative options.