1 Introduction

Web Application Penetration Testing (WAPT) is a proactive strategy employed by companies to identify vulnerabilities in web applications using a black-box approach. The process encompasses various phases, including information gathering, enumeration, exploitation, and analysis. Within the information gathering phase, exploring the web application’s structure and identifying libraries and frameworks are crucial components. OWASP [1] provides a standard methodology for testing web application security, emphasizing the significance of discovering the application’s structure in the “Fingerprinting Web Application” Security Test (WSTG-INFO-09).

Given the black-box nature of Web Application Penetration Testing, where testers lack a priori knowledge about the web application’s structure, techniques such as spidering and dirbusting are employed to explore it. Spidering automates the process of analyzing internal links within HTML pages. However, the discovery of hidden links calls for alternative approaches. Dirbusting is a technique that involves brute-forcing a target with predictable folder and file names, and it proves valuable in discovering hidden pages by monitoring HTTP responses. In this context, the choice of wordlists in dirbusting becomes critical, determined by factors like developer conventions, the framework or CMS (Content Management System) used, and the web application’s development language.

Although the technique is used extensively by security experts, it requires time-consuming manual effort to select proper wordlists. At the time of writing, no research work has attempted to analyze whether such wordlists can be selected automatically in order to optimize the performance of the dirbusting process. This study aims to fill the gap by introducing “dirclustering”, a novel approach to dirbusting optimization through semantic clustering. The paper demonstrates the effectiveness of this strategy in emulating the decision-making process of security experts when selecting wordlists.

We conduct a preliminary experiment based on eight well-known web applications that shows a performance improvement close to 50% compared to legacy brute-force approaches. The remainder of this paper is structured as follows. Section 2 delves into the details of the dirbusting process and provides an introduction to semantic clustering approaches. Section 3 surveys related work that uses semantic clustering techniques for cybersecurity purposes, while Section 4 illustrates our approach. The designed experimental campaign and the related results are described in Section 5 and Section 6, respectively. The last two sections summarize the results and provide details about future evolutions of the proposed work.

2 Setting the context

In this section, we introduce the main concepts behind the dirbusting process, providing information about the techniques it employs. Additionally, we discuss semantic clustering along with the most well-known approaches it leverages.

2.1 Dirbusting

Dirbusting is a technique used to brute force a target with predictable folder and file names while monitoring HTTP responses to enumerate server contents. This technique uses wordlists to send HTTP requests to a target website and discover hidden pages. It is useful during the first phase of a Penetration Testing activity to discover the target application’s structure. It is important to remark that Penetration Testing is usually carried out as a black-box activity. The security expert has no access to information about the web application under test and typically has just low-privileged access to the system. For this reason, she/he cannot see all the pages of a web application by just using spidering. Thus, one of the goals of dirbusting is to discover pages that are not visible by using standard spidering techniques. Hidden pages might allow a security expert to find sensitive content on the website or valid entry points to perform other vulnerability injection tests. Dirbusting accepts a properly constructed list of words as input and starts sending HTTP requests to the website to discover new pages. In order to successfully complete its task, the dirbusting process needs a proper discovery condition. A common approach is to use the response code for that purpose: a new page is found when an HTTP response contains a status code other than 404 (Page Not Found). In this work, we define “valid requests” as those HTTP requests that have a response code other than 404. The choice of wordlists plays a crucial role in obtaining good results in terms of discovered pages. Such a choice depends on the acquired knowledge about the web application. Security experts choose the wordlists based on several criteria, such as:

  • which convention the developer has used to define paths;

  • which framework/CMS (Content Management System) has been used;

  • which language has been adopted to develop the web application.

If the web application contains a page whose name comes with camel case notation (e.g., loginPage), it is advisable to use camel case wordlists (logoutPage, adminPage, etc.). Similarly, if the fingerprinting phase detected the existence of a WordPress Content Management System, an optimized wordlist should contain WordPress-specific words (wp-login.php, wp-logout.php, etc.). On the other hand, if the web application contains files with well-known extensions (e.g., JSP, PHP), it is better to use a wordlist whose stems properly fit them.

In this work, we demonstrate that a semantic clustering strategy is able to optimize dirbusting activities by correctly mimicking the behavior of a security expert when it comes to choosing the most appropriate wordlist. Before delving into the details of the proposed approach, we will briefly introduce semantic clustering in the following subsection.

2.2 Semantic clustering

Clustering is the process of partitioning a set of data objects into subsets in such a way that items in the same group are more similar to each other than to those in other groups. The objective is to maximize intra-cluster similarity while at the same time minimizing inter-cluster similarity. It is a widely used technique in data mining for text domains, where the items to be clustered are textual, and they can be of different granularity (documents, paragraphs, sentences, or terms).

Simple text clustering algorithms represent textual information as a document-term matrix. Features are computed from term frequencies, and semantically related terms are not considered. Thus, documents that share no terms are never grouped together, even when they are conceptually similar, because semantic relationships are ignored.
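
As a minimal illustration of this limitation, a purely frequency-based document-term matrix could be built with scikit-learn as follows (the toy documents are of our own choosing and purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents: the first two are conceptually related (authentication),
# yet they share no terms.
docs = ["admin login page", "user sign in form", "product image gallery"]
X = CountVectorizer().fit_transform(docs)   # document-term matrix of raw counts

# Rows 0 and 1 have no overlapping non-zero columns, so a purely
# frequency-based clustering would never group them together.
print(X.toarray())
```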

Semantic clustering, instead, consists in grouping items into semantically related groups [2, 3]. This requires measuring the semantic similarity between textual information, which can be accomplished by vectorizing the text corpus using, among others, one of the following resources:

  • Semantic networks like WordNet [4]: a large lexical database, originally of English, with wordnets now available for more than 200 languages. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked using conceptual-semantic and lexical relations.

  • Word embedding techniques such as Word2Vec [5, 6] and GloVe [7]: a word embedding model is a shallow neural network trained to reconstruct the linguistic contexts of words. It takes a large corpus of words as input and produces a vector space, with each unique word assigned a corresponding vector. Word embeddings allow us to use an efficient, dense representation in which similar words have a similar encoding.

  • Sentence Embeddings: while word embeddings encode words into a vector representation, sentence embeddings represent a whole sentence in a way that a machine can easily work with. These are capable of encoding a whole sentence as one vector. Examples are Doc2Vec [8], an adaptation of word2vec for documents, or more recent approaches such as the Universal Sentence Encoder (USE) [9] and InferSent [10].

  • Language representation models like the Bidirectional Encoder Representations from Transformers (BERT) [11]: BERT is a method for pre-training language representations, meaning that a general-purpose “language understanding” model is trained on a large text corpus (e.g., Wikipedia), and then used for downstream Natural Language Processing tasks. Pre-trained representations can either be context-free or contextual. Context-free models such as word2vec [5, 6] or GloVe [7] generate a single word embedding representation for each word in the vocabulary, so, for example, the word basket would have the same representation in sports and e-commerce. Contextual models, instead, generate a representation of each word that depends on the other words in the sentence.

Our approach leverages the Universal Sentence Encoder (USE) [9] as the chosen sentence embedding technique. Indeed, one of the main tasks used to train a USE encoder is identifying the semantic textual similarity (STS) [12] between sentence pairs, evaluated through the Pearson correlation with human judgments, a task that perfectly fits our needs.

After the text corpus encoding phase, it is required to use a clustering algorithm in order to create the semantic clusters. Several clustering techniques can be effective for this purpose [13], and among the available choices, the K-means algorithm is used for its simplicity and accuracy.

One of the issues with K-means is the effective choice of the parameter K, i.e., the number of target clusters. In our case, such an issue is solved by leveraging the well-known elbow method, a heuristic used in determining the number of clusters in a data set. Figure 2 provides a graphical representation of such a heuristic.

This paper demonstrates that the proposed method allows us to improve dirbusting techniques by leveraging artificial intelligence. The approach was tested on eight web applications with 30 repetitions each, demonstrating a substantial performance improvement in each case.

3 Related works

The use of artificial intelligence techniques to improve security tasks is well proven, especially for realizing anomaly detection [14,15,16,17,18] and malware detection systems [19,20,21,22]. Despite the extensive literature, to the best of our knowledge, the exploration of how to leverage artificial intelligence in dirbusting has not been undertaken before. Relevant works typically fall in the wider area of semantic clustering methodologies that have been extensively explored in other application domains [2, 3]. Regardless of their application to dirbusting, Natural Language Processing techniques have been extensively used in security. Karbab [23] uses Natural Language Processing and machine learning techniques to create a behavioral data-driven malware detection tool. Malhotra [24] shows that NLP can help evaluate the completeness, contradiction, and inconsistency of security requirements of a software system.

Fig. 1

The semantic clustering process used to group similar words. Each word in the wordlist is encoded using the Universal Sentence Encoder (USE); then K-means clustering is used to group similar words

In general, the use of Artificial Intelligence techniques for Penetration Testing has not been fully explored yet. However, several techniques, such as fuzzing, have been used in other domains. As an example, in the software testing field, several works show how it is possible to optimize fuzzing techniques by using machine learning [25]. Our work shows how it is possible to optimize a bruteforce technique (i.e., dirbusting) by using Artificial Intelligence. Hitaj [26] demonstrates that a Deep Learning approach can outperform both rule-based and state-of-the-art password guessing methods. Since password guessing is fundamentally a brute-force attack, our work asserts that it is feasible to enhance Penetration Testing tasks through AI. Specifically, Natural Language Processing techniques have the potential to improve tasks related to word usage. With special reference to dirbusting, semantic clustering can indeed optimize a brute-force approach by finding both syntactic and semantic relations among words.

Semantic clusters can be modeled in different ways, including the usage of external resources such as Wikipedia like in [27, 28], where authors clustered the text corpus with an ensemble approach using knowledge and concepts from Wikipedia. Other works [29,30,31,32] leverage semantic networks, such as WordNet [4], which is used as word sense disambiguation to capture the main theme of the text and identify relationships among words. More recent applications use word embedding techniques [33,34,35] and sentence embedding techniques [36,37,38] where unsupervised embeddings models are used to encode the text corpus prior to the clustering process.

Our semantic clustering approach follows the works described in [36] and [37]. In [36], the Universal Sentence Encoder (USE) [9] and InferSent [10] are used to find semantic similarities among user questions and cluster them using the K-means clustering algorithm. In [37], such techniques are instead used to group semantically similar tweets.

4 Proposed solution

The proposed approach groups common files and directories contained in the wordlist based on the chosen semantic clustering technique. Semantic clusters define the execution order of the entries in the wordlist, with the aim of optimizing the dirbusting process. Namely, the approach involves a data pre-processing step on the entries of the wordlist and the subsequent creation of semantic clusters using the Universal Sentence Encoder (USE), in conjunction with the K-means clustering algorithm, as shown in Fig. 1.

4.1 Semantic clustering

The first step of the clustering process is data pre-processing, which consists of splitting each entry of the wordlist according to the naming convention (camelCase, snake_case, kebab-case) and punctuation characters (! " # $ % & ' ( ) * + , - . / : ; < = > ? [ ] ^ _ { } ~ @).

For instance, the sentence “comments/add_comment.php” becomes “comments / add _ comment. php” and “UnicodeTest.txt” becomes “Unicode Test. txt”. This is a fundamental step because the naming convention and the punctuation characterizing each entry of the wordlist affect the encoding of the entry itself into embedding vectors. This may lead to a wrong similarity measure. For this reason, we detect the words in each entry to treat them as sentences instead of single words. This approach allows us to capture the semantic similarity among names contained in a wordlist, irrespective of the specific naming convention adopted by developers.
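
As an illustration of this step, a minimal splitter along these lines could be used (the regular expressions and the function name are our own assumptions, not the authors’ implementation):

```python
import re

# Hypothetical pre-processing sketch: split a wordlist entry on punctuation and
# on camelCase boundaries so it can be treated as a sentence; snake_case and
# kebab-case are covered by the punctuation split itself.
PUNCT = r"""!"#$%&'()*+,\-./:;<=>?\[\]^_{}~@"""

def split_entry(entry: str) -> str:
    entry = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", entry)   # loginPage -> login Page
    entry = re.sub(f"([{PUNCT}])", r" \1 ", entry)          # isolate punctuation
    return re.sub(r"\s+", " ", entry).strip()               # collapse whitespace

print(split_entry("comments/add_comment.php"))   # comments / add _ comment . php
print(split_entry("UnicodeTest.txt"))            # Unicode Test . txt
```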

Then, a sentence embedding technique is used to encode each wordlist entry as a 512-dimensional vector so that similar words, often used in similar contexts, have a similar embedding vector representation.

More specifically, the Universal Sentence Encoder (USE) [9], version 4, implemented in TensorFlow 2.2.0 [39] is used. This model encodes text into high-dimensional vectors used for text classification, semantic similarity, clustering, and other natural language tasks. The model is trained on a variety of data sources and optimized for greater-than-word length text, such as sentences, phrases, or short paragraphs. One of the main tasks for the USE training is identifying the semantic textual similarity (STS) between sentence pairs, a task that perfectly fits our needs and justifies our choice.
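
A rough sketch of this encoding step, assuming the publicly released TF-Hub module, could look like the following (the pre-processed example entries are illustrative):

```python
import tensorflow_hub as hub

# Load USE v4 from TensorFlow Hub (public module URL); loading code is a sketch.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

entries = ["comments / add _ comment . php", "wp - login . php", "Unicode Test . txt"]
embeddings = embed(entries).numpy()   # shape (3, 512): one 512-d vector per entry
```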

Finally, the extracted embeddings are used with clustering techniques to create the semantic clusters. Our approach uses the K-means clustering technique for its simplicity and accuracy. The number of clusters is chosen using the elbow method, a heuristic used to determine the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking a point slightly to the right of the elbow of the curve as the number of clusters to use, 20 in this case.
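
The clustering step can then be sketched with scikit-learn as follows (here `entries` and `embeddings` are assumed to hold the full pre-processed wordlist and its encodings from the previous step; parameter values other than K = 20 are illustrative):

```python
from sklearn.cluster import KMeans

# Elbow method: inertia (within-cluster sum of squares) for a range of K values;
# the K just past the bend of the curve is chosen (20 in this work).
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
                  .fit(embeddings).inertia_
            for k in range(2, 41)}

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(embeddings)

# cluster_id -> list of wordlist entries, later used to drive the dirbusting order
clusters = {}
for entry, label in zip(entries, kmeans.labels_):
    clusters.setdefault(int(label), []).append(entry)
```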

In Fig. 3, Principal Component Analysis (PCA) is used to show in a two-dimensional space the similarities of the words of the wordlist encoded using USE. Each of the points in the picture represents a word (word_ik), and each color represents the cluster (cluster_k) where the word belongs. As shown in the picture, semantically similar words are closer in the embedding space and are grouped in the same cluster.
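
For reference, a two-dimensional projection in the spirit of Fig. 3 can be produced with a few lines (assuming `embeddings` and `kmeans` from the previous sketches):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 512-d USE embeddings onto two principal components and color
# each point by its K-means cluster label.
xy = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(xy[:, 0], xy[:, 1], c=kmeans.labels_, cmap="tab20", s=10)
plt.title("Wordlist entries in the USE embedding space (PCA)")
plt.show()
```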

Other examples are presented in Table 1 where words belonging to 5 different clusters are analyzed.

Fig. 2

Elbow method, used to select the optimal number of clusters. The number of clusters corresponding to the “elbow” point is considered to be optimal. It is a compromise between the number of clusters and the quality of clustering

Fig. 3

The figure shows semantic clusters in a two-dimensional space generated by the Principal Component Analysis (PCA). The space shows the similarities of the words in the wordlist encoded using the Universal Sentence Encoder (USE)

4.2 Intelligent dirbusting strategy

We implemented an intelligent dirbusting strategy that uses the semantic clusters created according to the proposed approach to improve the legacy brute force techniques.

Commonly used dirbusting techniques require a huge number of requests to the target website, attempting to guess the names or identifiers of hidden functionality based on the common files and directories contained in the wordlists.

Table 1 Examples of similarities in semantically clustered words

The choice of the wordlist depends on the information gathered by the security experts during the spidering process, whose main task is to enumerate the target’s visible content and functionality. Based on the knowledge acquired during this phase, the experts select the proper wordlist following these criteria:

  • Naming conventions: developers are used to following a naming convention (camel case, snake case, kebab case) when implementing a web application. For this reason, if a security expert identifies a specific naming convention, the wordlist is chosen or adapted to it.

  • Content Management Systems (CMS): if the fingerprinting phase detected the existence of a certain CMS, a corresponding wordlist is chosen.

  • Used programming language: if the web application under test contains files with well-known extensions (e.g. .jsp, .php), it is advisable to use a wordlist whose stems properly fit them.

As a general rule, in order to discover as much hidden content as possible, it is fundamental to choose the wordlist that best suits the target’s characteristics. Once the wordlist is chosen, the dirbusting process starts by sending HTTP requests to the target according to the directory and file names in the list. The order of execution of the requests follows the wordlist order, as described in Algorithm 1 below.

Algorithm 1

Dirbusting Process. Send an HTTP request and check the response code for each word in the wordlist. If the response code is different from 404 (i.e., the “not found” status code), then store the path.
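
A compact sketch of this baseline loop, written with the Python requests library (the actual implementation in the paper is not shown, so names are assumptions), is:

```python
import requests

def dirbust(base_url, wordlist):
    """Baseline dirbusting (Algorithm 1): one request per word, keep non-404 paths."""
    found = []
    for word in wordlist:
        resp = requests.get(f"{base_url}/{word.lstrip('/')}")
        if resp.status_code != 404:      # "valid request": anything but Not Found
            found.append(word)
    return found
```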

Our approach aims at making dirbusting more intelligent by leveraging artificial intelligence. As described in the flow in Fig. 4, the process starts by choosing a random word (word_ik) from a common wordlist. When a valid URL is detected, the proposed strategy consists of choosing the cluster (cluster_k) where the current word belongs. In this way, the next words picked from the chosen cluster will likely target another valid URL. Examples of words grouped in the same cluster are given in Tab. 1.
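
One plausible rendering of this flow as code (the data structures, names, and exact promotion policy are our assumptions about the strategy described above) is:

```python
import random
import requests

def dirclustering(base_url, clusters, seed=0):
    """Cluster-guided dirbusting sketch: clusters maps cluster_id -> list of paths.
    After every hit (non-404), the remaining words of the hit's cluster are moved
    to the front of the queue, so semantically related paths are tried next."""
    rng = random.Random(seed)
    pool = [(cid, w) for cid, words in clusters.items() for w in words]
    rng.shuffle(pool)                 # random starting order over the whole wordlist
    priority, found = [], []
    while priority or pool:
        cid, word = priority.pop(0) if priority else pool.pop()
        resp = requests.get(f"{base_url}/{word.lstrip('/')}")
        if resp.status_code != 404:   # valid request: a new path was revealed
            found.append(word)
            promoted = [(c, w) for (c, w) in pool if c == cid]
            pool = [(c, w) for (c, w) in pool if c != cid]
            priority.extend(promoted)
    return found
```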

While common dirbusting techniques force the expert to manually select a wordlist according to the target’s characteristics, the intelligent dirbusting strategy accomplishes this task automatically, since the above-described semantic clusters capture the following aspects:

  • the CMS used in web applications: as shown in Table 1, in clusters 1 and 2, words related to different CMSs are grouped together. In cluster 1, Joomla-related words are included, while in cluster 2 we can find words commonly used in WordPress;

  • web application programming languages: clusters 3 and 4, in Table 1, include words related to specific languages. In these clusters, the programming languages PHP and JSP, respectively, are represented;

  • semantic similarities: semantic clusters are able to consider semantic similarities as well. In this way, dirbusting is able to automatically understand the context of the website. In Tab. 1, cluster 4 contains words related to e-commerce, whereas in cluster 5, words associated with images are considered.

Fig. 4

The semantic clustering strategy flow used to optimize the dirbusting process. At the start, a random word is chosen from a common wordlist. Then, depending on the valid URLs found, the word selection extracts words belonging to the cluster obtained by the semantic clustering process

5 Trials and experimentation

The experiments aim to demonstrate that a dirbusting strategy based on semantic clustering enhances the discovery of a website structure by reducing the number of HTTP requests required to successfully complete the entire process. To this purpose, we developed a virtualized environment composed of 8 distinct target websites. Then, we created a wordlist containing the full paths of each website and instrumented a dirbusting tool that can be configured to run either in brute-force mode or in the semantic clustering mode we propose. To build the wordlist, we started all of the target applications and retrieved the full paths by executing OS-level commands. In Table 2, the word count for each web application under attack is reported.

Table 2 Web application word counts

After obtaining the wordlist, we applied semantic clustering in order to group words according to their semantic meaning. The resulting clusters are stored in an ad hoc configuration file that is used by the instrumented dirbusting module when carrying out the clustering-based testing campaigns.

Finally, for each website, we have performed the experiments by executing the dirbusting tool both in bruteforce mode and in semantic clustering mode. Each experiment’s results have been logged to enable further offline analysis of the collected data.

The virtualization environment is a useful alternative to a real-world setup for several reasons:

  • we do not have to deal with network issues that might affect the environment;

  • we do not create potential Denial Of Service conditions. Indeed, as the dirbusting process sends lots of requests against a web application, if the tested webserver is not designed to support high traffic loads, it might crash;

  • we do not run into legal issues: a brute-force directory listing might be flagged as a brute-force attack. Dirbusting is an integral part of Penetration Testing and, as such, should be regulated by contracts.

As described in Section 2, we define a “valid request” as an HTTP request whose response code differs from 404, i.e., one that reveals a new path within the target website. We compare the total number of executed HTTP requests with the number of valid HTTP requests. To improve the significance of the results, we repeated each experiment 30 times for each website.

With the bruteforce approach, the unique wordlist is shuffled, and words are used to perform the classic dirbusting procedure. On the other hand, with the clustering approach, we use the algorithm we have described in the previous section to select the words from the wordlist in a non-random way.

We compared the results graphically by plotting the relationship between the total requests sent to the target website and the number of valid requests.
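
A sketch of how such plots can be produced is given below; the run-log format (one NumPy array per strategy, with the cumulative valid requests per repetition) is our assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_strategy(runs, label):
    """runs: array of shape (30, n_requests); runs[r, t] holds the cumulative
    number of valid requests after t + 1 total requests in repetition r."""
    mean, std = runs.mean(axis=0), runs.std(axis=0)
    x = np.arange(1, runs.shape[1] + 1)
    plt.plot(x, mean, label=label)
    plt.fill_between(x, mean - std, mean + std, alpha=0.3)

# Usage (with logs loaded into NumPy arrays):
#   plot_strategy(bruteforce_runs, "brute force")
#   plot_strategy(clustering_runs, "semantic clustering")
#   plt.xlabel("Total requests"); plt.ylabel("Valid requests"); plt.legend(); plt.show()
```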

Our goal is to compare the above-mentioned approaches by measuring the overall number of requests sent to the web application in order to thoroughly discover the website’s structure. Such a comparative evaluation aims to show that performance increases when using the semantic clustering approach.

5.1 Experiment information

The following details help characterize the performed experiments. Firstly, we do not evaluate the time required to complete the task: our aim is to verify that the proposed solution reduces the number of HTTP requests needed to reconstruct the entire structure of the target website. Hence, we do not compare the execution time of the two approaches.

As already anticipated, we created an isolated environment by using container-based virtualization. No resource-intensive processes were run during the tests. We continuously monitored CPU, RAM, and disk usage during each experiment to ensure that none of these resources came under pressure during any of the trials. We also verified that the tool did not crash during any of the runs.

Network issues might impact the results of the campaign. While we are not specifically focusing on response times, a stable network is crucial, since network problems could lead to response timeouts and potentially invalidate the results. For these reasons, the websites are deployed in an isolated Docker network, and the instrumented dirbusting tool operates within the same environment. Therefore, we can safely assume that there are no network reliability issues during the experiments.

5.2 Architecture of the benchmarking tool

To perform the experiments, we instrumented a dirbusting tool, illustrated in Fig. 5.

Fig. 5

Architecture of the dirbusting tool used to perform experiments. The input parameters are the target URL, a boolean value that enables the semantic clustering approach, and a seed value used to increase the randomness of the wordlist’s shuffling. The tool is instrumented through a configuration module, while the HTTP requests are performed through the dirbuster module

The tool in question was developed in Python 3.6 and is made of three components:

  • Entrypoint: accepts input parameters needed to set up a trial associated with a specific target;

  • AiDirBuster: dirbusting module that implements the dirbusting process either in bruteforce mode or by leveraging the semantic clustering strategy proposed by us;

  • Config: configuration module containing several configuration parameters, such as groups of words identified through semantic clustering.

The tool accepts the following parameters as inputs:

  • use_clustering: a boolean value. If true, dirbusting uses the semantic clustering strategy; otherwise, a brute-force approach is adopted;

  • target_url: the target web application used to run the experiment;

  • seed: a seed used to increase randomness when the wordlist is shuffled during the brute-force approach.
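
A minimal entrypoint wiring these three inputs together might look like the following; the argument names are taken from the list above, while the parser itself is an illustrative sketch rather than the tool’s actual code:

```python
import argparse

def parse_args():
    # Entrypoint sketch for the instrumented dirbusting tool
    p = argparse.ArgumentParser(description="Dirbusting benchmark entrypoint")
    p.add_argument("--target_url", required=True,
                   help="target web application used to run the experiment")
    p.add_argument("--use_clustering", action="store_true",
                   help="use the semantic clustering strategy (default: brute force)")
    p.add_argument("--seed", type=int, default=0,
                   help="seed used when shuffling the wordlist in brute-force mode")
    return p.parse_args()
```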

Semantic clusters are computed offline and subsequently stored in the above-mentioned Config module.

Universal Sentence Encoder (USE) [9], version 4, implemented in TensorFlow 2.2.0 [39], is used to extract sentence embeddings.

We leveraged the K-means clustering algorithm implementation made available by the sklearn Python library to find and collect the clusters. The default set of parameter values was used, except for the parameter K, which was configured with the elbow method.

5.3 Experimental environment setup

To simulate the dirbusting process, we have built a docker environment composed of 8 publicly available web applications, some of which are also typically used for experimenting with vulnerability assessment and penetration testing.

Table 3 shows the characteristics of the web applications in question.

Table 3 Web applications used for the experiment

Among the applications reported in the table, the ones that are usually used to experiment with web application penetration testing are Bodgeit, Bricks, DVWS (Damn Vulnerable Web Services), XVWA (Xtreme Vulnerable Web Application), and Wacko.

On the other hand, WordPress, Drupal, and Joomla are among the most widespread PHP Content Management Systems used to create web applications.

The environment realized for the experiment is built using an Infrastructure as Code (IaC) approach. A docker-compose file including eight services describes the architecture of the system under test, as shown in Fig. 6. Each service exposes the standard HTTP service (port 80) and maps it onto an unassigned TCP port of the hosting machine. The semantic clustering dirbusting tool sends requests to the eight web applications by targeting such exposed TCP ports on the host. With this approach, it is easy to add new target web applications and thus extend the experiment.

Fig. 6

The container-based testbed architecture used to evaluate the proposed algorithm. Each web application runs inside a Docker container and exposes an unassigned TCP port in the host machine

5.4 Wordlist acquisition

Fig. 7

Performance plots for Bodgeit, Bricks, Drupal, and DVWS web applications. The plots depict the mean and standard deviation trends of the detected valid requests relative to the total requests sent to the target server, averaged over 30 experiments. As observed, the semantic clustering approach (blue) demonstrates a performance improvement of nearly 50% compared to the legacy brute-force strategy (orange)

Fig. 8

Performance plots for Joomla, Wacko, Wordpress, and XVWA web applications. The results mirror those in Fig. 7, with the exception of Joomla, where the performance improvements are less pronounced. This is attributed to the fact that the wordlist used for the experiment contains more than half of the words related to Joomla. Hence, selecting a comprehensive wordlist is crucial for achieving favorable results

We extracted the paths from each webserver and merged them into a single file to create the integrated wordlist. Given n, the number of webservers used for the experiments, the following formula applies:

$$\begin{aligned} UniqueWordlist = rand(\sum _{w=1}^{n} absolute\_paths_{w}) \end{aligned}$$

In a nutshell, UniqueWordlist can be obtained as the randomized concatenation of all absolute paths contained in each webserver.
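
A minimal realization of this construction could look as follows; the directory layout, function name, and the use of pathlib are our own assumptions for illustration:

```python
import random
from pathlib import Path

def build_unique_wordlist(webroots, seed=0):
    """Collect the absolute paths of every file under each webserver's document
    root, merge and deduplicate them, then shuffle the result (UniqueWordlist)."""
    paths = set()
    for root in map(Path, webroots):
        for f in root.rglob("*"):
            if f.is_file():
                paths.add("/" + str(f.relative_to(root)))
    wordlist = sorted(paths)
    random.Random(seed).shuffle(wordlist)
    return wordlist
```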

The path extraction task for a specified webserver can be carried out by executing OS-level commands inside the related Docker service, e.g., a one-line command that lists every file under the web root and prints its path.

The concatenation of all of the collected words creates a unified wordlist. As described in the previous section, the words in the wordlist are basically absolute paths. For our experiments, the final unified wordlist is composed of 8367 words.

6 Experimental results

Our experiments allowed us to demonstrate that the enhanced dirbusting strategy we propose outperforms the legacy brute-force approach. Indeed, for each of the eight web servers under test, we achieved a performance improvement of up to 50%.

In Fig. 7 and Fig. 8, we show the results of our campaign. For each web server, we plot both the mean and the standard deviation (std) trend of the detected valid requests over the number of total requests addressed to the target server. Each experiment has been replicated 30 times in order to improve the significance of the collected results.

As we anticipated above, a “valid request” is a request with a response code other than 404 (Not Found HTTP error message).

In each of the plots, the two approaches are compared. In orange, we show the results of the legacy random brute-force approach, while in blue, we report the performance of the proposed semantic clustering strategy.

As can be observed, the random brute-force strategy shows a linear trend. Indeed, as we apply randomization in each experiment, the number of requests needed to find all the paths is, on average, equal to the number of paths. In this way, the longer the wordlist, the higher the number of requests required to find the valid paths.

On the other hand, semantic clustering shows a steeper growth rate: the curve of valid requests rises faster and stops increasing much earlier than with the brute-force one.

The relative trend of the two approaches remains consistent across all web servers under test, which confirms the performance gain that can be achieved by leveraging the proposed semantic clustering approach.

Furthermore, our approach can identify almost all available target URLs with only half of the requests compared to the brute-force approach. This results in a performance improvement of close to 50%.

The only exception is Joomla, where the performance improvements are comparatively lower than those observed in the other web applications under test. The reason behind this finding is that the number of words collected for Joomla is 4672, which is more than half the total number of words included in our wordlist. This is evident when looking at the word count in Table 2: Joomla covers about half of the integrated wordlist. This means that with a random approach, we can find valid URLs in Joomla with a probability of about 50%, reducing the potential gain achievable by leveraging the alternative approach we propose. Nevertheless, there is still a clear improvement for this web application as well, with around 2000 fewer requests than those needed to find all of the available paths with the brute-force approach.

7 Discussion

Many studies explore security testing in the web application domain, but current works often neglect the problem of optimizing the enumeration phase. The primary objective of our work has been to fill this gap. Specifically, we explore a novel approach based on semantic clustering to optimize the “dirbusting” technique, which is one of the most common approaches used to discover the structure of web applications during the enumeration phase of web application penetration testing.

This technique employs a “brute-forcing” fuzz of known path names to uncover hidden folders and files within the web application. The approach is well defined in industry security standards, such as the Common Weakness Enumeration (CWE) [40] and the Common Attack Pattern Enumeration and Classification (CAPEC) [41] methodology. Additionally, it can be instrumental in identifying security disclosure flaws, as the discovered paths may contain hidden sensitive data that could be exposed to malicious attackers.

Regrettably, conventional approaches are typically time-consuming, involving security experts who manually inspect the components of the web application and attempt to guess potential valid paths.

We demonstrate the feasibility of optimizing this activity through a semantic clustering approach. Our experiments indicate that the total number of HTTP requests needed to uncover the structure of a web application can be reduced by up to 50% compared to the classic dirbusting approach.

It is important to note that performance is highly dependent on the chosen wordlist. In our experiment, where the wordlist consists of approximately 50% Joomla paths, the approach does not yield significant benefits. Therefore, it is crucial in the initial setup to properly configure the solution with a comprehensive wordlist.

It is crucial to emphasize that this study serves as an introduction to the effectiveness of semantic clustering in addressing the challenge of web application structure discovery. The primary goal was to demonstrate that a semantic clustering approach can be applied in the unexplored realm of the enumeration phase, optimizing the typically manual dirbusting process. As such, trials and experiments were conducted in a controlled, small, and isolated environment, primarily focused on vulnerable applications and Content Management Systems.

Further studies should be undertaken to assess the solution with more comprehensive wordlists, diverse features, and a broader range of applications.


The proposed approach should be further analyzed in production environments. In real-world scenarios, web applications are often deployed behind web application firewalls that could potentially block certain HTTP requests, impacting the effectiveness of dirbusting. Bypass techniques for such firewalls are an important consideration but are outside the scope of this work.

For the purposes of this study, the assumption is made that the security tester is authorized to perform the assessment, and defense firewalls are disabled. This assumption aligns with the primary goal of penetration testing, which is to thoroughly discover vulnerabilities in web applications.

The proposed ideas are generalizable and can be adapted to specific needs. Future work could involve extending experiments to include real-world applications for a more comprehensive evaluation. Additionally, exploring the application of large language models (LLMs) could be beneficial to identify further improvements to the approach. Combining different techniques, such as standard web spidering with dirbusting, is another avenue for improvement. This could involve using spidering to detect the overall structure of the target web application and then leveraging dirbusting to uncover hidden or private pages.

It is important to note that our work focuses on demonstrating the effectiveness of a semantic clustering approach compared to a brute-force one for discovering the web application structure. As a result, strategies for exploring subpaths are not considered in this study. The wordlists used during the experimentation consist of full paths (e.g., /users/mooney/ir-course/). However, dirbusting techniques can be recursive, meaning that whenever a new path is found, dirbusting can be recursively applied to it to discover new subpaths. This recursive aspect could be explored in future work for a more comprehensive analysis. In view of the above considerations, it would be interesting to investigate the use of our semantic clustering approach in a recursive way, evaluating possible alternative strategies for navigating the dynamically identified sub-paths (e.g., breadth-first, depth-first, etc.).

8 Conclusions

Web application security testing is paramount for shielding digital assets against potential threats. Security testers, pivotal in identifying and addressing vulnerabilities, often engage in directory busting during the enumeration phase. In this study, we introduce “dirclustering”, a semantic clustering approach, as a significant optimization to traditional dirbusting techniques. Through meticulous experimentation in a controlled testbed, our results demonstrate the superior effectiveness of “dirclustering” compared to conventional dirbusting methods. This work opens a promising avenue for enhancing the enumeration phase of web application security testing, providing valuable insights for security testers to optimize their processes and achieve more efficient results in subsequent assessment phases.