
1 Introduction

Although passwords suffer from well-known security problems and a variety of new authentication methods are constantly proposed, passwords promise to remain the dominant authentication method for the foreseeable future due to their ease of deployment and ease of change [2, 3]. Thus, it is of great importance to understand password security, and a number of guessing algorithms have successively been proposed, including statistics-based ones (e.g., probabilistic context-free grammars (PCFG) [18] and Markov models [10, 12]) and deep-learning based ones (e.g., PassGAN [5] and FLA [11]). Password guessing algorithms study password security from the perspective of attackers, who focus on the vulnerability of passwords, and they in turn can be used to build protection countermeasures, such as password strength meters (PSMs) that evaluate password strength.

Password guessing attacks can be divided into targeted guessing attacks and trawling guessing attacks [16]. The former aims to crack the password of a given user as quickly as possible [10], while the latter aims to crack as many passwords as possible in a given password set within a limited number of guesses [16, 19]. This paper focuses on trawling guessing attacks. A key challenge for trawling guessing attacks is to extract password features effectively, and data preprocessing is a feasible way to improve the performance of deep-learning based models.

1.1 Related Work

Major password guessing models based on statistical probability include PCFG [18] and the Markov model [10, 12]. The main idea of PCFG is to divide a password into several segments according to character types, and these segments can be regarded as password features. For instance, the password abc123 is parsed into the letter segment “abc” and the digit segment “123”. PCFG can also be integrated with other guessing models as a data preprocessing method [9, 17, 20]. The Markov model records the frequency of the characters that follow each password substring in the training phase, and then generates guesses character by character according to the resulting frequency distribution. Various improvements have been made on the basis of PCFG. Xie et al. [19] focused on targeted guessing attacks and added the recognition of special dates and names. Wang et al. [15] added the recognition of Chinese Pinyin and six-digit dates (e.g., 201862) to further exploit the features of Chinese passwords. Houshmand et al. [6] added the recognition of keyboard patterns. Yang et al. [21] also studied keyboard patterns and analyzed their frequency distribution. However, these PCFG-based methods are not optimal, because most of the added recognition rules only consider targeted guessing and are not comprehensively designed for trawling guessing. In this paper, we focus on password features and add comprehensive recognition rules for keyboard patterns, words, website names, and years for trawling guessing.

Recently, deep learning has provided a new way of password attacking, and models based on supervised learning were used first. In 2016, Melicher et al. [11] used a Recurrent Neural Network (RNN) to build a password guessing model (i.e., FLA), which can be considered a character-level model because the smallest unit it handles is each character of the password. In 2018, Liu et al. [9] proposed a multi-source PCFG+LSTM model (an LSTM based model using PCFG for preprocessing, short for PL) with adversarial generation, which can maintain high accuracy across different datasets. A password guessing model using PCFG for preprocessing can be regarded as a segment-level model, which divides passwords into different segments. In 2021, Wang et al. [17] found that the PL model could crack more passwords than PCFG and Markov within 50 million guesses. However, they only use the original PCFG, which cannot comprehensively extract password features. Xu et al. [20] proposed a new preprocessing method based on the Byte-Pair-Encoding (BPE) algorithm to divide passwords into chunks that consist of frequently occurring characters, and then built three models: a Markov based model using the chunk-based preprocessing method, a model using the Chunk+PCFG preprocessing method, and an LSTM based model using the chunk-based preprocessing method. Password guessing models using the chunk-based preprocessing method are considered chunk-level. In this paper, we integrate the Chunk+PCFG preprocessing method with neural networks.

Unsupervised learning methods have also been used in password cracking. In 2019, Hitaj et al. [5] first proposed a password guessing model based on Generative Adversarial Networks (GAN) and named it PassGAN. However, the cracking performance of PassGAN is not ideal, and is even lower than that of traditional methods. We apply three preprocessing methods to the PassGAN model and find that using the basic PCFG for preprocessing can dramatically improve the cracking rate.

1.2 Our Contributions

In this work, we make the following key contributions:

(1) Character feature based encoding method. Character-level models usually adopt the canonical one-hot encoding method, which cannot fully utilize character features. Therefore, we propose a new approach based on the type of each character and its keyboard position, where each character is represented as a 4-dimensional vector: (character type, character serial number, keyboard row number, keyboard column number). Although this encoding method does not improve the cracking performance, it still provides a new feasible direction for password guessing.

(2) Refined PCFG. The existing PCFG [6] divides passwords into four types of segments (letters, digits, special characters, and keyboard patterns). We propose a refined PCFG based preprocessing method that adds recognition rules for words, website names, and years to enable more comprehensive password feature extraction. Inspired by Wang et al.'s finding that the distribution of passwords follows Zipf's law [14], we find that the frequency distribution of the extracted segments also follows the PDF-Zipf model.

(3) An extensive evaluation. We perform a series of experiments on nine models, including two baseline models (i.e., the LSTM based model using the one-hot encoding method in Wang et al.'s work [17] and the original PassGAN model in Hitaj et al.'s work [5]) and seven models using preprocessing (i.e., the LSTM based model using our new encoding method, the LSTM based model using the basic PCFG for preprocessing [17], the LSTM based model using our refined PCFG for preprocessing, the LSTM based model using Chunk+PCFG for preprocessing, the PassGAN model using the basic PCFG for preprocessing, the PassGAN model using our refined PCFG for preprocessing, and the PassGAN model using Chunk+PCFG for preprocessing). Our empirical results show that character-level models can improve their performance by using PCFG based preprocessing methods. In particular, the PassGAN model using PCFG for preprocessing improves the success rate drastically (avg. 26.79%) compared to the original PassGAN model.

2 Background

In this section, we briefly introduce the background on deep learning based password guessing models (i.e., LSTM based models and GAN based models).

2.1 LSTM Based Models

Fig. 1. The training process of LSTM based models using the one-hot encoding method. The character set is {a, b, c, Bos, Eos}, where Bos represents the beginning of a password and Eos represents the end of a password.

Recurrent Neural Networks (RNN) and their variants, such as Long Short-Term Memory networks (LSTM), can all be used in password guessing models [11, 17]. To avoid the gradient vanishing problem [8], we use LSTM instead of the vanilla RNN. One-hot encoding is usually performed on each character to convert a password string into a matrix. Moreover, because the model is trained with supervised learning, it is necessary to construct the corresponding label y for the input password x. We use an example to illustrate the training process (see Fig. 1). Suppose the character set that contains all characters appearing in the dataset is {a, b, c, Bos, Eos}, where Bos represents the beginning of a password and Eos represents its end. Then the password abc is converted to the input matrix [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]], and the corresponding label y is [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]].
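As a minimal sketch of this conversion (assuming the character-to-index order a, b, c, Bos, Eos used in the example above), the input x and label y can be built as follows; the helper names are ours, not the paper's.

```python
import numpy as np

CHARSET = ['a', 'b', 'c', 'Bos', 'Eos']          # character set from the example
IDX = {c: i for i, c in enumerate(CHARSET)}

def one_hot(symbol):
    vec = np.zeros(len(CHARSET), dtype=int)
    vec[IDX[symbol]] = 1
    return vec

def encode_training_pair(password):
    """x is Bos followed by the password; y is the password shifted by one, ending with Eos."""
    x = np.array([one_hot(s) for s in ['Bos'] + list(password)])
    y = np.array([one_hot(s) for s in list(password) + ['Eos']])
    return x, y

x, y = encode_training_pair('abc')
# x == [[0,0,0,1,0], [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0]]
# y == [[1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,0,1]]
```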

The LSTM based model is a probability model that assigns probabilities to the generated guesses. The probability distribution of the next character is obtained by feeding the prefix of the password string into the LSTM. For example, a password generation process could be B\(\rightarrow \)Ba\(\rightarrow \)Bab\(\rightarrow \)Babc\(\rightarrow \)BabcE, where at each step we select the character with the highest probability as the next character. The LSTM based model with this training and generation method is character-level.
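A greedy character-by-character generation loop could look like the sketch below; `predict_next` is a hypothetical wrapper around the trained LSTM that returns a probability distribution over the character set for a given prefix.

```python
import numpy as np

def generate_password(predict_next, charset, max_len=30):
    """Greedy decoding: repeatedly append the most probable next character
    until Eos is produced or the length limit is reached."""
    prefix = ['Bos']
    while len(prefix) <= max_len:
        probs = predict_next(prefix)                 # distribution over charset
        nxt = charset[int(np.argmax(probs))]
        if nxt == 'Eos':
            break
        prefix.append(nxt)
    return ''.join(prefix[1:])                       # drop the Bos marker
```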

2.2 PassGAN

PassGAN, proposed by Hitaj et al. [5], is based on WGAN-GP [4] due to the difficulty of training the original GAN. PassGAN consists of a generator and a discriminator. The generator captures the real data distribution by building a mapping function from a prior noise distribution to the real data space and generates fake samples. The discriminator learns to determine whether a sample comes from the fake samples or from the real data. The two networks are trained adversarially in this way until the discriminator cannot identify the source of the data. The main structure of the PassGAN model is shown in Fig. 2.

The original PassGAN is character-level, but not as effective as LSTM based character-level models [17]. Another problem of the original PassGAN is that it cannot assign probabilities to the generated guesses, so we cannot obtain a priority among different guessing passwords. Both problems are addressed by the preprocessing methods used in this paper.

Fig. 2. The training process of the PassGAN model. Passwords are first converted into matrices and then fed to the discriminator. The score is the output of the discriminator, which is used to calculate the gradient.

3 Preliminaries

In this section, we first describe the datasets used in this paper, including their basic information, length distributions, character compositions, and top-10 passwords. In addition, since passwords are confidential data related to personal privacy, we also discuss ethical considerations.

3.1 Datasets

We compare the password guessing models with different preprocessing methods on six large password datasets (see Table 1) with a total of 57 million passwords. These datasets differ in service, size, user localization, and language, which suggests that our models can well characterize different user-chosen passwords. We name each dataset after its website's domain name. The first three datasets, namely CSDN, YueJunYou, and Renren, are from Chinese websites. CSDN is a well-known community website for Chinese programmers, founded in 1999. Renren is a real-name social networking platform, founded in 2005. YueJunYou is a website for making friends and traveling, founded in 2014. The last three datasets, namely Rockyou, Yahoo, and Youporn, are from English websites. Rockyou is a gaming website whose leak contains about 32 million passwords, the largest of the six datasets. Yahoo is a famous Internet portal in the United States, founded in 1995. Youporn is an adults-only video website, founded in 2006.

Table 1. Basic information about six web services.

Dataset Cleaning. We note that the original datasets contain some abnormal passwords that are either too long or too short, which are unlikely to be user-chosen passwords or are simply junk information. Thus, we clean the datasets before any experiment. We first remove passwords that contain symbols beyond the 95 printable ASCII characters, and then remove passwords with length <4 or length >30, because such passwords do not comply with the password policies of most websites or may not be considered by attackers who care about cracking efficiency [1]. Overall, the removed passwords account for less than 1% of each dataset.
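A minimal cleaning filter along these lines (the thresholds are taken from the text above) might be:

```python
def clean(passwords):
    """Keep passwords made only of the 95 printable ASCII characters
    and with length between 4 and 30 (inclusive)."""
    printable = {chr(c) for c in range(32, 127)}     # the 95 printable ASCII characters
    return [pw for pw in passwords
            if 4 <= len(pw) <= 30 and all(ch in printable for ch in pw)]
```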

Here we also provide a concrete picture of user-chosen passwords: 1) Most passwords have a length between 6 and 9, accounting for 62.08%–83.84% of each web service (see details in Table 7 of Appendix A); 2) Chinese users favor digits (avg. 54.64%), whereas this figure for English users is 18.62%; English users favor letters (avg. 42.79%), whereas this figure for Chinese users is 15.43% (see details in Table 8); 3) The top-10 passwords account for 7.18%–10.43% of Chinese users' passwords, and this figure for English users is 2.05%–5.29%, indicating that Chinese passwords are more concentrated, as found in [15] (see Table 9).

3.2 Ethical Considerations

Although these datasets are widely used in the literature [5, 10, 11, 20, 21], they are still private data. Therefore, we only report aggregated statistical information and treat each individual account as confidential, so that using them in our research does not increase the risk to the corresponding victims. Furthermore, while these datasets may be utilized by attackers as cracking dictionaries, our use is beneficial both for the academic community to understand the strength of users' password choices and for security administrators to secure their passwords. In addition, we have consulted privacy experts a number of times. Since our datasets are all available on the Internet, the results in this work are reproducible.

4 Preprocessing Methods

In this section, we describe different preprocessing methods to improve the effect of the deep-learning based password guessing models.

4.1 Important Abbreviations

To facilitate reading, we introduce the important abbreviations used in this article. GAN means Generative Adversarial Networks; PassGAN is short for the password guessing model based on GAN; LSTM is short for Long Short-Term Memory neural networks; PCFG is short for probabilistic context-free grammars; LSTM/PassGAN+X means the LSTM/PassGAN based password guessing model using X for preprocessing.

4.2 Character Feature Based Encoding Method

The canonical one-hot encoding method used in [11, 17, 20] only distinguishes different characters, but cannot fully reflect other character features. Moreover, the matrices produced by the one-hot encoding method are sparse [13]. The main challenge for character encoding is how to distinguish characters and reflect their features with as little space as possible. Therefore, we comprehensively consider different kinds of character features and propose a new character encoding method that greatly reduces the occupied space.

A password character has two important features: its type and its keyboard location, since keyboard patterns are also popular in password creation [16]. Thus, we represent each character in four dimensions. The first dimension represents the character type, where we use 1, 2, 3, and 4 to denote digits, uppercase letters, lowercase letters, and special characters, respectively. The second dimension is the serial number of the character within its type. For example, a–z can be represented by 1–26 according to dictionary order, and digits can be represented by themselves. The third dimension represents the keyboard row number, and the fourth dimension represents the keyboard column number. The row number increases from top to bottom, and the column number increases from left to right. For example, the string “1234567890-=” lies in the first row, and the string “1qaz” lies in the first column according to the keyboard coordinates. Using our new encoding method, the password 1234 is converted to the matrix [[1, 1, 1, 1], [1, 2, 1, 2], [1, 3, 1, 3], [1, 4, 1, 4]].
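The sketch below illustrates this 4-dimensional encoding on a simplified US keyboard layout; the layout, the handling of characters outside it, and the serial numbering of special characters are our assumptions.

```python
import string

# Simplified layout: "1234567890-=" is row 1 and "1qaz" is column 1, as in the text.
KEYBOARD_ROWS = ["1234567890-=", "qwertyuiop[]\\", "asdfghjkl;'", "zxcvbnm,./"]
KEY_POS = {ch: (r + 1, c + 1)
           for r, row in enumerate(KEYBOARD_ROWS) for c, ch in enumerate(row)}

def encode_char(ch):
    """Return (character type, serial number, keyboard row, keyboard column)."""
    if ch.isdigit():
        ctype, serial = 1, int(ch)                            # digits represent themselves
    elif ch.isupper():
        ctype, serial = 2, ord(ch) - ord('A') + 1
    elif ch.islower():
        ctype, serial = 3, ord(ch) - ord('a') + 1
    else:
        ctype, serial = 4, string.punctuation.find(ch) + 1    # assumed numbering
    row, col = KEY_POS.get(ch.lower(), (0, 0))                # (0, 0) for keys off the layout
    return [ctype, serial, row, col]

def encode_password(pw):
    return [encode_char(ch) for ch in pw]

print(encode_password("1234"))  # [[1, 1, 1, 1], [1, 2, 1, 2], [1, 3, 1, 3], [1, 4, 1, 4]]
```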

4.3 Refined PCFG

The basic PCFG [18] only divides passwords into letter, digit, and special-character segments, which may destroy the integrity of some segments and ignores users' habits in creating passwords. For example, the string “1!2@3#” should be considered a complete segment because its characters are adjacent on the keyboard, yet it would be converted to \(D_1S_1D_1S_1D_1S_1\) by the basic PCFG. Therefore, we add recognition rules for four important password features.

Table 2. The proportion of the top-10 most used years (1900–2100) for each web service\(*\).

Year Recognition: Years have been found to be popular in passwords [7, 15]. We count the occurrences of years (from 1900 to 2100, 201 in total) and record the proportion of the top-10 most used years in Table 2. The results show that a few years occupy a large proportion, indicating that users tend to concentrate on particular years. However, the basic PCFG would simply merge years into digit segments. Thus, we extract digit segments of length four whose value lies between 1900 and 2100 from the passwords.
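A sketch of this rule is shown below; treating only standalone four-digit runs as candidates is our interpretation of "digit segments with length of four".

```python
import re

def extract_years(pw):
    """Return (position, year) pairs for standalone four-digit runs in 1900-2100."""
    return [(m.start(), m.group())
            for m in re.finditer(r'(?<!\d)\d{4}(?!\d)', pw)   # exactly four digits
            if 1900 <= int(m.group()) <= 2100]

print(extract_years("alice1989!"))  # [(5, '1989')]
```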

Table 3. The proportion of the top-10 most used website names for each web service.

Website Name Recognition: We count, for the first time, the occurrences of website names in each dataset and record the proportion of the top-10 most used website names in Table 3. Although website names are less concentrated than years, some website names still account for a large proportion, and passwords containing website-name segments may be considered strong by the basic PCFG. For instance, the password “csdn.net” is converted to the base structure \(L_4S_1L_3\) and may be assigned a low probability of being cracked by the basic PCFG. To address this problem, we add the recognition of website names. First, common website-name suffixes, such as “.com” and “.net”, are used to construct a suffix list, and then complete website-name segments are extracted from the passwords according to this list. For example, the website-name segment “csdn.net” is extracted from the password “123csdn.net123”.
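The suffix-based extraction could be sketched as follows; the suffix list shown is only an example rather than the paper's full list, and extending leftwards over letters is our interpretation of how the site name is completed.

```python
SUFFIXES = (".com", ".net", ".org", ".cn")   # example suffix list (assumption)

def extract_website(pw):
    """Return the first website-name segment found, e.g. 'csdn.net' in '123csdn.net123'."""
    for suffix in SUFFIXES:
        pos = pw.find(suffix)
        if pos == -1:
            continue
        start = pos
        while start > 0 and pw[start - 1].isalpha():     # extend left over the site name
            start -= 1
        return pw[start:pos + len(suffix)]
    return None

print(extract_website("123csdn.net123"))  # csdn.net
```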

Keyboard Pattern Recognition: Since keyboard patterns are also popular in password creation [6, 16], we add the corresponding recognition rule. Keyboard patterns with only one character type can already be extracted completely by the basic PCFG, so our refined PCFG (i.e., the basic PCFG with the additional recognition rules) focuses on keyboard patterns with multiple character types. For example, the password q1w2e3 is converted to the base structure \(L_1D_1L_1D_1L_1D_1\) by the basic PCFG, which destroys the integrity of the keyboard pattern, whereas it is converted to \(K_6\) by our refined PCFG (K represents a keyboard pattern and 6 represents the length of the segment).
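A hedged adjacency check for multi-type keyboard patterns is sketched below (diagonal neighbours allowed on a simplified layout); the exact rule used in the paper may differ.

```python
KEYBOARD_ROWS = ["1234567890-=", "qwertyuiop[]\\", "asdfghjkl;'", "zxcvbnm,./"]
KEY_POS = {ch: (r + 1, c + 1)
           for r, row in enumerate(KEYBOARD_ROWS) for c, ch in enumerate(row)}

def is_keyboard_pattern(segment):
    """True if every successive pair of keys is adjacent (diagonals allowed)
    and the segment mixes at least two character types."""
    coords = [KEY_POS.get(ch.lower()) for ch in segment]
    if len(segment) < 3 or None in coords:
        return False
    adjacent = all(abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1
                   for (r1, c1), (r2, c2) in zip(coords, coords[1:]))
    types = {'D' if ch.isdigit() else 'L' if ch.isalpha() else 'S' for ch in segment}
    return adjacent and len(types) >= 2

print(is_keyboard_pattern("q1w2e3"))  # True
print(is_keyboard_pattern("abc123"))  # False (keys not adjacent)
```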

Word Recognition: Keyboard pattern recognition has inherent limitations and may extract wrong segments [21]. For instance, the segment “password” should be regarded as a complete segment, but “assw” would be recognized as a keyboard pattern. Although our keyboard pattern recognition method avoids most of these error cases, wrong results may still occur. Our solution is to use a dictionary of common words and extract the segments that appear in the dictionary from the passwords before the keyboard pattern extraction. However, the effect of this method depends on the quality of the dictionary: once the dictionary misses the corresponding word, the keyboard pattern extraction will still cause errors. Therefore, we construct the word dictionary from the training set. This process is described in Algorithm 1.

Algorithm 1
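Algorithm 1 itself is not reproduced here; a plausible sketch of the idea, collecting frequent letter segments from the training set, is shown below, with the minimum length and frequency thresholds being our assumptions.

```python
import re
from collections import Counter

def build_word_dictionary(training_passwords, min_len=3, min_freq=5):
    """Collect letter segments from the training set and keep the frequent ones as 'words'."""
    counter = Counter()
    for pw in training_passwords:
        for seg in re.findall(rf'[A-Za-z]{{{min_len},}}', pw):
            counter[seg.lower()] += 1
    return {word for word, freq in counter.items() if freq >= min_freq}
```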

Frequency Distribution: Wang et al. [14] found that the distribution of popular passwords follows PDF-Zipf:

$$\begin{aligned} f_r=\frac{C}{r^s} \end{aligned}$$
(1)

where \(f_r\) is the frequency of the password with rank r, and C and s are constants depending on the dataset. To verify that the data conform to this distribution, we use the following equation:

$$\begin{aligned} \log (f_r)=\log C-s \cdot \log (r) \end{aligned}$$
(2)

where \(\log (f_r)\) and \(\log (r)\) are linearly related. We verify that the extracted segments follow the Zipf distribution based on Eq. 2. The extracted results are sorted in descending order of frequency on the six datasets, and the \(\log (f_r)\)–\(\log (r)\) graphs are shown in Fig. 3. Moreover, the coefficients of determination (\(R^{2}\)), which measure how well the regression line fits the sample data, are shown in Table 4. The closer the coefficient of determination is to 1, the better the fit. The results indicate that the frequency distributions of keyboard patterns, words, website names, and years with a frequency of more than five fit the PDF-Zipf model well.
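A sketch of the fitting procedure based on Eq. 2 (a least-squares fit in log-log space plus the coefficient of determination); the log base does not affect \(R^2\).

```python
import numpy as np
from collections import Counter

def zipf_fit(segments, min_freq=5):
    """Return (s, C, R^2) for log(f_r) = log C - s * log(r), keeping
    only segments with frequency greater than min_freq."""
    freqs = sorted((f for f in Counter(segments).values() if f > min_freq), reverse=True)
    ranks = np.arange(1, len(freqs) + 1)
    log_r, log_f = np.log10(ranks), np.log10(freqs)
    slope, intercept = np.polyfit(log_r, log_f, 1)        # slope = -s, intercept = log C
    residual = log_f - (slope * log_r + intercept)
    r2 = 1 - np.sum(residual ** 2) / np.sum((log_f - log_f.mean()) ** 2)
    return -slope, 10 ** intercept, r2
```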

Table 4. The coefficient of determination (\(R^2\)) for fitting different extracted segments.
Fig. 3. Frequency distribution of different types of extracted segments.

Recognition Order: Some recognition rules may conflict with each other. For example, “989” in “1989” could be recognized as a keyboard pattern, and “csdn” in “csdn.net” could be recognized as a word. Therefore, it is essential to set a reasonable extraction order and begin with the rule that causes the fewest conflicts. Since year recognition may conflict with keyboard pattern recognition, and website name recognition may conflict with word recognition, our recognition order is: year\(\rightarrow \)website name\(\rightarrow \)word\(\rightarrow \)keyboard pattern\(\rightarrow \)basic PCFG recognition. Overall, our refined PCFG differs from the basic PCFG in the tags of the base structures, where several tags are added, namely K (keyboard patterns), W (words), E (website names), and Y (years). The complete extraction process of our refined PCFG is described in Algorithm 2. The structure of the neural networks using the PCFG based preprocessing method is shown in Fig. 4.

Algorithm 2

4.4 PassGAN Using PCFG for Preprocessing

PassGAN [5] performs worse than LSTM based models within \(10^7\) guesses, so we infer that the ability of GAN to learn text features is weaker than that of LSTM. Character-level passwords are relatively complex for GAN due to their length and multiple character types. Therefore, we use the PCFG based preprocessing method to simplify the data and train the model with the base structures obtained from PCFG. In the generation process, PassGAN generates duplicate base structures without probabilities, so we count the occurrences of different base structures until the number of unique base structures reaches the target value. Then we assign each base structure the probability \(f_i / total\), where \(f_i\) represents the frequency of the corresponding base structure and total represents the total number of generated base structures.
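The de-duplication and probability assignment step can be sketched as follows; `generate_batch` is a hypothetical wrapper that samples a batch of base structures from the trained generator.

```python
from collections import Counter

def base_structure_probabilities(generate_batch, target_unique):
    """Sample base structures until `target_unique` distinct ones have been seen,
    then assign each structure the probability f_i / total."""
    counts = Counter()
    while len(counts) < target_unique:
        counts.update(generate_batch())
    total = sum(counts.values())
    return {structure: freq / total for structure, freq in counts.items()}
```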

Fig. 4. An illustration of a deep learning based model using the PCFG based preprocessing method. K represents a keyboard pattern and W represents a word.

Algorithm 3

4.5 Chunk+PCFG Preprocessing Method

We adopt Xu et al.'s PCFG improvement with chunk segmentation presented at CCS'21 [20], and integrate the Chunk+PCFG preprocessing method with LSTM and PassGAN separately. The Byte-Pair-Encoding (BPE) algorithm is used to divide passwords into chunks, and the chunk-level passwords are then converted to base structures by applying PCFG to each chunk. Since chunk-level passwords are already fine-grained, we only use the basic PCFG, and a chunk is represented as L (only letters), D (only digits), S (only special characters), Two (two character types), or Three (three character types). For example, the password iloveu4ever is first converted to the chunk-level password [“iloveu”, “4ever”] and then converted to the base structure \(L_6Two_5\). The Chunk+PCFG preprocessing process is described in Algorithm 3. The remaining training and generation process is the same as that of the LSTM based models using PCFG for preprocessing.
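Given a chunk segmentation (the BPE step itself is omitted here), mapping chunks to the base structure can be sketched as:

```python
def chunk_tag(chunk):
    """L/D/S for a single character type, Two/Three for mixed chunks."""
    has_letter = any(c.isalpha() for c in chunk)
    has_digit = any(c.isdigit() for c in chunk)
    has_special = any(not c.isalnum() for c in chunk)
    n_types = has_letter + has_digit + has_special
    if n_types == 1:
        return 'L' if has_letter else 'D' if has_digit else 'S'
    return 'Two' if n_types == 2 else 'Three'

def base_structure(chunks):
    return ''.join(f"{chunk_tag(c)}{len(c)}" for c in chunks)

print(base_structure(["iloveu", "4ever"]))  # L6Two5
```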

5 Experiments

In this section, we first describe the attacking strategies, and then evaluate the results of five different preprocessing methods combined with neural networks on multiple datasets. The details of the datasets are described in Sect. 3.1.

5.1 Attacking Strategies Design

The combination of five different preprocessing methods with two leading deep-learning models gives rise to a total of ten guessing strategies, and we focus on eight promising ones: the LSTM based model using the one-hot encoding method [17], the LSTM based model using our new encoding method, the LSTM based model using the basic PCFG for preprocessing (short for PL) [17], the LSTM based model using our refined PCFG for preprocessing, the LSTM based model using Chunk+PCFG for preprocessing, PassGAN using the basic PCFG for preprocessing, PassGAN using our refined PCFG for preprocessing, and PassGAN using Chunk+PCFG for preprocessing. Here, the LSTM consists of a hidden layer with 128 neurons and a softmax layer, and the structure of the GAN is the same as that of the original PassGAN model described in Sect. 2.2. For chunk-level models, the minimum average length of chunks is 3.0. All models are trained and tested on the six datasets, which are divided into training, test, and validation sets in a ratio of 8:1:1, as recommended in [17].
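A straightforward 8:1:1 split, with shuffling and its seed being our assumptions for reproducibility, might look like:

```python
import random

def split_dataset(passwords, seed=42):
    """Split into training, test, and validation sets in an 8:1:1 ratio."""
    pw = list(passwords)
    random.Random(seed).shuffle(pw)
    n = len(pw)
    return pw[:int(0.8 * n)], pw[int(0.8 * n):int(0.9 * n)], pw[int(0.9 * n):]
```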

5.2 Evaluation Results

To extensively evaluate the effect of the five different preprocessing methods, 50 million guessing passwords are generated by each model and sorted by probability to obtain a smooth result curve. We use guess-number graphs and record the cracking results to intuitively reflect the effects of different models.

Overall Analysis. Tables 5 and 6 and Figs. 5 and 6 show that, with 50 million guessing passwords: (1) The LSTM based model using one-hot encoding outperforms the one using our new encoding by 4.22% on average; (2) The LSTM based model using the basic PCFG for preprocessing (short for PL) outperforms the LSTM based model using one-hot encoding by 7.85% on average; (3) The refined PL model outperforms the basic PL model by 1.36% on average; (4) The basic PL model outperforms the LSTM based model using Chunk+PCFG for preprocessing (short for CKPL) by 6.49% on average; (5) The PassGAN model using the basic PCFG for preprocessing (short for PCFG+PassGAN) outperforms the original PassGAN model by 26.79% on average; (6) The PCFG+PassGAN model outperforms the refined PCFG+PassGAN model by 1.41% on average; (7) The original PassGAN model outperforms the PassGAN model using Chunk+PCFG for preprocessing (short for CKP+PassGAN) by 13.85% on average.

Table 5. Cracking results of LSTM based models with five different preprocessing methods (Guess number = \(5*10^7\))\(\dagger \)

Encoding Method. The experimental results on the six datasets show that our new encoding method does not perform well, which can be attributed to two reasons. First, the sparsity caused by one-hot encoding may not have serious side effects because passwords are generally short. Second, since our new encoding method is only used to represent each character, its advantage of reflecting multiple character features is not fully exploited.

Table 6. Cracking results of GAN based models with four different preprocessing methods (Guess number = \(5*10^7\))\(\dagger \)
Fig. 5. Cracking results of LSTM based models with five different preprocessing methods (Guess number = \(5*10^7\)). The training set and test set are from the same dataset, with a division ratio of 8:1. LSTM one-hot represents the original model using one-hot encoding [17]; LSTM 4-dim means using our new encoding method; b_PCFG means using the basic PCFG for preprocessing [17]; r_PCFG means using our refined PCFG; CK_PCFG means using the Chunk+PCFG method. The results show that our refined PCFG method increases the cracking rate of the original LSTM model [17] by 9.21% on average.

Fig. 6. Cracking results of GAN based models with four preprocessing methods (Guess number \(=5*10^7\)). The training set and test set are from the same dataset, with a division ratio of 8:1. PassGAN represents the original model in [5]; b_PCFG means using the basic PCFG for preprocessing; r_PCFG means using our refined PCFG; CK_PCFG means using the Chunk+PCFG method. The results show that using the basic PCFG for preprocessing increases the cracking rate of the PassGAN model [5] by 26.79% on average.

PCFG Based Preprocessing Method. Using PCFG for preprocessing improves the performance, and the LSTM based model with our refined PCFG is even better than the one with the basic PCFG, which indicates that fine-grained rules can extract more effective features. However, PassGAN with our refined PCFG performs worse due to the complexity of our method. Furthermore, since the Chunk+PCFG method first performs PCFG on each chunk of the passwords, it generates even more complicated base structures than our refined PCFG, which may explain why all models using the Chunk+PCFG preprocessing method do not perform well compared to the corresponding original models. To explore the impact of the different extraction rules added to the basic PCFG, we set up a series of experiments in which only one extraction rule is added at a time. The results (see details in Table 10 of Appendix B) indicate that word recognition performs best among the four rules and that using separate recognition rules is not as good as our refined PCFG (i.e., with all four rules).

Limitations. Firstly, our new encoding method cannot improve the performance of the LSTM based models compared with one-hot encoding, which may indicate that password guessing models cannot be improved by character encoding alone. Secondly, we only improve the effect of PassGAN through preprocessing, without changing the structure of PassGAN itself.

Future Directions. Firstly, new training methods matched to our new encoding method are a viable direction, given the underutilization of character features. Secondly, research on PCFG based preprocessing methods should become more fine-grained: one viable direction is to add more fine-grained recognition rules on top of our refined PCFG, and another is to apply natural language processing (NLP) techniques to extract password characteristics. Thirdly, this paper contributes to a better understanding of deep-learning based guessing models: preprocessing can effectively enhance the use of a neural network's learning ability, which in turn is intrinsically determined by the model's network structure. For instance, the text learning ability of GAN is weaker than that of LSTM, which leads to the poor performance of fine-grained preprocessing methods integrated with GAN. Thus, the ability of GAN to learn text features may be improved by changing its structure, for example, by using an LSTM as the generator of the GAN.

6 Conclusion

This paper studies deep-learning based password guessing models from the perspective of preprocessing. Firstly, considering the limitations of the one-hot encoding method, we propose a new encoding method that comprehensively reflects character features. Secondly, considering that the basic PCFG does not fully extract password features, we propose a refined PCFG with comprehensive recognition rules. Thirdly, we adopt the idea of chunk segmentation from CCS'21 and apply the Chunk+PCFG preprocessing method to LSTM and GAN.

Extensive experimental results show that: 1) Our refined PCFG outperforms the basic PCFG by 1.36% on average when integrated with LSTM; 2) Using the basic PCFG for preprocessing drastically improves the PassGAN model, by 26.79% on average; 3) Although our new encoding method does not improve the performance compared with one-hot encoding, it still provides a feasible new research direction; 4) The performance of the Chunk+PCFG preprocessing method is not ideal due to the complexity of its base structures.

Our results suggest that using PCFG for preprocessing is an effective way to improve deep-learning based guessing models. Still, it should be used with care: although a more fine-grained PCFG (e.g., our refined PCFG and Chunk+PCFG) extracts passwords more comprehensively, it also generates more complicated base structures, which increases the training complexity for neural networks and may even reduce the cracking rates of neural networks with weak text feature learning ability.