1 Introduction

The DNS, first introduced in 1983, has undergone significant advancements and emerged as an indispensable element of the Internet [1]. Serving as the fundamental framework for mapping human-readable domain names to machine-routable Internet Protocol (IP) addresses of online resources, DNS represents a vital standard in contemporary Internet infrastructure [2, 3]. Beyond its initial objective of facilitating the translation between domain names and IP addresses, DNS has assumed a substantial role in Content Distribution Networks (CDNs) for effective traffic redirection [1, 4]. However, DNS poses security threats by transferring unencrypted queries and answers over network connections. Consequently, DNS traffic becomes a target for eavesdropping, compromising user privacy and enabling Malicious activities such as botnet command and control (C&C) servers, phishing websites, and spamming [5,6,7,8,9,10].

To address growing security concerns, various enhancements have been introduced. The latest advancements are DNS-over-TLS (DoT) [7, 11] and DNS-over-HTTPS (DoH) [12, 13]. These protocols improve security by establishing a TLS session between the client and the resolver. DoH, in particular, utilizes an HTTP connection within the TLS session. While DoT has gained limited traction, DoH has gained substantial momentum with support from influential players like Mozilla, Cloudflare, and Google [1, 14, 15]. DoH enables DNS queries and responses to be transmitted via the HTTPS protocol, making it accessible to any application supporting HTTPS [15].

Compared to DoT, which requires a specialized stub resolver, DoH has lower deployment overhead on the client side [15]. DoH enhances privacy by encrypting DNS traffic within the HTTPS protocol, effectively shielding it from third-party observers. This encryption safeguards user privacy, preventing ISPs and network intermediaries from monitoring or tampering with DNS data. Leveraging the widely used HTTPS infrastructure, DoH traffic blends in with regular web traffic, further strengthening privacy [14, 16, 17]. The adoption of the HTTP/2 protocol in DoH also offers benefits such as multiplexing, header compression, and other optimizations, potentially improving the efficiency and speed of DNS queries and resulting in faster response times compared to DoT [16].

Despite its advantages, DoH poses security threats due to its encrypted nature. Most network security analysis tools rely on unencrypted DNS information to identify and detect Malicious activities and attacks. As a result, DoH traffic reduces visibility for current security tools and applications, making it challenging to analyze and detect Malicious behavior patterns [18, 19]. Due to the significance of DoH, this paper concentrates on profiling Malicious behavior within DoH traffic, considering it a primary concern.

Behavior profiling, which covers the identification and characterization of Malicious and Benign data, is a crucial aspect of cybersecurity [20, 21]. As the volume of data generated and collected by organizations grows, the risk of encountering Malicious data that can compromise the security of systems and networks also increases [22,23,24]. Profiling Malicious data involves analyzing patterns and characteristics that may indicate Malicious intent or behavior [25]. This can include identifying anomalous data points, detecting data that has been tampered with, or recognizing patterns that indicate the presence of Malicious data [25, 26].

Previous studies have applied different pattern recognition methods, including Template matching, Neural networks, Hybrid, Statistical, Structural, and Fuzzy (Fig. 2), to identify various forms of Malicious data patterns, including Local Binary (LB), Structural, Frequent, and Sequential patterns (Fig. 1). However, our investigation revealed that none of these studies has tried to detect the statistical relationships between data features. As a result, statistical profiling remains unexplored in this area of research.

Several studies have attempted to use advanced machine learning and deep learning methods like Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) [27, 28] to detect and profile Malicious behavior. However, discovering statistical connections in data offers several advantages over these complex approaches, including simplicity and compatibility, efficient data usage, computational efficiency, and the ability to choose essential features [29,30,31].

To address these issues, this paper proposes two statistical models for recognizing Malicious behavioral patterns using linear and logistic regression. The first model is based on linear regression and aims to generate a data pattern using Malicious data. The second model uses logistic regression to create a single pattern for classifying Benign and Malicious data. To evaluate the performance of these models, we used the most recent related dataset, CIRA-CIC-DoHBrw-2020.

Fig. 1: Different types of pattern

Fig. 2: Malicious behavior pattern recognition methods

The major contributions of this research work can be stated as follows:

  • To propose two statistical pattern recognition solutions based on linear and logistic regression models.

  • To optimize the performance of the linear regression-based model using the Extra Tree Classifier (ETC) technique.

  • To identify the most suitable feature set that maximizes the performance of the proposed models using ETC and Variance Inflation Factor (VIF) techniques.

  • To profile Malicious DoH activities by calculating and analyzing the correlation coefficient.

This paper is structured as follows. Section 2 reviews the existing approaches for detecting and classifying Malicious DoH traffic. Section 3 provides the pattern recognition background, surveying the available pattern recognition methods, and Sect. 4 introduces our proposed models for recognizing Malicious behavior patterns. The implementation details of the proposed models are then discussed in Sect. 5. Section 6 summarizes the results obtained by employing the proposed linear and logistic regression-based models. Next, Sect. 7 evaluates these results and compares the models' performance with state-of-the-art classifiers. Finally, Sect. 8 concludes the study.

2 Literature review

This section presents the state-of-the-art approaches for DoH Malicious traffic detection and summarizes prior research endeavors.

2.1 Detecting Malicious DoH traffic

DoH boosts security by using a secure channel for DNS transactions via HTTPS. Encryption is a significant advantage of DoH over traditional DNS [19, 32]. However, encrypted DoH traffic can pose security risks because many security tools rely on readable DNS data to spot attacks. Also, only the users and service providers at the endpoints can inspect encrypted DoH traffic, making profiling challenging [12, 18, 33, 34].

Mitsuhashi et al. [35] pointed out that the increasing availability of DoH-supporting operating systems can be exploited by Domain Generation Algorithm (DGA) malware to hide its generated domain names. To address this issue, the authors proposed a novel system to detect and analyze communications of DGA-based malware within DoH traffic. The system uses a hierarchical machine learning approach, using Gradient Boosting Decision Tree (GBDT) and tree-ensemble models to classify network traffic.

Jerabek et al. [36] observed that existing DoH detection proposals rely on specialized flow monitoring software, export complex features that are challenging to compute in real time, or suffer from low accuracy, which hinders their widespread implementation. To address this issue, they proposed a novel DoH detector that uses IP-based, machine learning, and active probing techniques to detect DoH using standard flow monitoring software. Using traditional flow features enables the deployment of this method in any network infrastructure equipped with flow-monitoring appliances, including intelligent switches, firewalls, or routers.

Hynek et al. [18] discussed several methods for analyzing and detecting DoH abuse and categorized security and privacy analysis of DoH work into two groups: detection of DoH presence in the network and detection of Malicious DoH traffic. Vekshin [37] leveraged machine-learning models to distinguish DoH connections from other traffic. However, the proposed model cannot recognize DoH connections consisting of a single query or those with a masked traffic shape.

Bushart and Rossow [38] discussed a traffic analysis method for encrypted and padded DNS by developing a DNS sequence classifier using the K-Nearest Neighbor (KNN) algorithm. KNN classifier searches for the most similar DNS transaction sequences in a previously trained model. Singh and Roy [39] also developed machine learning-based classification methods to detect the Malicious DNS in DoH using Naive Bayes (NB), Logistic Regression (LR), Random Forest (RF), KNN, and Gradient Boosting (GB). They reported that machine learning-based algorithms would be a better option for preventing DNS attacks on DoH traffic.

MontazeriShatoori et al. [40] presented a two-layered approach to detect and characterize the DoH traffic using time-series classifiers. Similarly, a two-layer method is presented in [41] for detecting DoH traffic from Malicious DoH traffic where layer one is used for classifying DoH traffic from non-DoH traffic, and layer two is used for characterizing Benign DoH from Malicious DoH traffic.

Wu et al. [42] developed an Auto-Encoder (AE)-based detection of DoH resolvers: it first detects DoH traffic in encrypted traffic based on the AE algorithm, and then verification is used to discover the DoH resolvers. Gonzalez Casanova and Lin [43] proposed a deep-learning-based approach for DoH traffic classification with Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) models. In this work, the authors developed a data processing pipeline, including feature selection and data imbalance handling, to prepare the CIRA-CIC-DoHBrw-2020 time-series dataset for deep learning models. Chen et al. [44] adopted an LSTM model to detect DNS covert channels using Fully Qualified Domain Names (FQDNs). They used the FQDNs of DNS packets as the input and implemented end-to-end detection with LSTM.

Zhan et al. [45] presented a classification method to detect DoH-based data exfiltration with TLS fingerprints of DoH clients. Wang et al. [46] proposed a one-dimensional CNN-based approach for encrypted traffic classification. Le et al. [47] presented a character-level and word-level CNN model to detect Malicious URLs. The CNN model was applied to the URL string's characters and words to learn the URL embedding. The authors claimed that this approach captures several types of semantic information, which was not possible with other existing models. Liu et al. [48] proposed a byte-level deep learning approach to detect DNS tunnels. This method converts bytes into vectors and then uses the CNN method to detect the DNS tunnels.

2.2 Classifying DoH traffic

Mitsuhashi et al. [49] proposed a three-level hierarchical traffic classification method to identify DoH traffic through statistical analysis of packet features. At each stage of this work, XGBoost, LightGBM, and CatBoost machine-learning models are used to find the best model and parameters for classifying DoH traffic.

Moustafa et al. [50] presented a method for extracting and analyzing statistical features (flow-based and service-based features) from network traffic with MQTT, DNS, and HTTP protocols. The flow-based and service-based statistical features (e.g., packet length, and the length and mean of the DNS queries) define the basic information of the protocols that carries the patterns of legitimate and Malicious user behaviors. They adopted a correntropy-based measure for analyzing the feature vectors.

Liu et al. [51] introduced an attention-based Bidirectional Gate Recurrent Unit (BGRU) to classify encrypted HTTPS traffic. The BGRU model learns packet features through forward and backward GRU operations and uses an attention mechanism to assign greater weights to valuable features for encrypted HTTPS traffic classification. Similarly, Wang et al. [52] proposed a feature-fusion (FF)-based DoH-encrypted DNS covert channel detection method using Multi-Head Attention [53] and a Residual Neural Network. This method fuses statistical session features with byte sequence features extracted by Residual Neural Networks. Essential features, selected with the Multi-Head Attention mechanism, are used to detect and identify the encrypted DNS covert channels.

Zhan et al. [45] proposed a method to detect DoH data exfiltration (DoH tunneling) using TLS handshake information and traffic analysis. In the traffic analysis, the difference between DoH tunneling and Benign DoH traffic was estimated and used to detect DoH tunneling. They extracted flow-based features from the TLS fingerprints of DoH clients and then built classifiers for detecting DoH tunneling. A statistical analysis of the features, computed as the minimum, maximum, mean, and standard deviation, shows promising results.

Recent work [54, 55] presented feature representation learning with advanced RNNs to detect Malicious DoH traffic. Ding et al. [54] suggested an anomaly detection model with an end-to-end architecture, leveraging a Variational Autoencoder (VAE) enriched with an attention mechanism. Similarly, Du et al. [55] developed an Autoencoder (AE)-based anomaly detection using Bidirectional Long Short-Term Memory (Bi-LSTM). The proposed model consists of input, embedding, encoder, decoder, and reconstruction loss layers, which are trained with stochastic gradient descent to minimize the loss.

2.3 Shortcomings and remaining issues

Based on this review, the state-of-the-art methods employed for DoH traffic analysis can be grouped into the following categories:

  • Conventional machine learning models (e.g., LR, RF, KNN, NB, and GB) for the DoH traffic classification.

  • Deep neural network models (e.g., AE, LSTM, and CNN) for the DoH traffic classification.

  • Correntropy-based statistical methods for feature analysis.

  • Ensemble tree-based gradient boosting (e.g., XGBoost, LightGBM) methods for classification and feature selection.

  • Time-series and attention-based approaches (e.g., AE, VAE, LSTM, Bi-LSTM, Feature Fusion (FF), and Multi-Head Attention) to detect and recognize Malicious DoH traffic.

Even with several solutions like statistical pattern recognition, deep learning models, and ensemble techniques, it is essential to recognize that we still face various challenges and issues, including:

  • Insufficient identification and analysis of statistical patterns in DoH traffic

  • Lack of profiling for detecting Malicious behavior

  • Performance limitations of feature-fusion models with attention mechanisms

  • Limited extraction and utilization of traffic features

  • Narrow focus on data exfiltration detection and limited applicability

  • Inadequate detection of obfuscated and disguised Malicious traffic

  • Vulnerability to adversarial attacks

This research addresses the first three shortcomings mentioned above by proposing a new pattern recognition model for profiling Malicious data behavior.

3 Pattern recognition background

A pattern can be defined as a sequence or arrangement that repeats according to a particular rule or set of rules. Figure 1 presents an overview of various pattern types and their respective subsets as identified in our research. Patterns can be classified into five distinct groups: Statistical, Local Binary (LB), Structural, Frequent, and Sequential. Frequent patterns, as outlined in previous studies [56, 57], occur frequently in data, such as patterns found within a single sequence or a set of sequences. Structural patterns encompass patterns with inherent structures, such as graphs and trees. Statistical patterns highlight the statistical relationships between data features.

Another pattern type relevant to image analysis is the Local Binary Pattern (LBP), a suitable image texture descriptor. LBPs employ a thresholding technique based on the current pixel's value to determine the neighboring pixel values. LBP descriptors prove valuable in image analysis by efficiently capturing local spatial patterns and grayscale contrast [58]. Figure 2 provides an extensive taxonomy of pattern recognition methods and the corresponding algorithms utilized in these methods. This taxonomy was established based on current research advancements. As depicted in Fig. 2, our research classifies pattern recognition methods into six groups: Template matching, Statistical, Neural networks, Structural, Hybrid, and Fuzzy. Each of these methods employs specific algorithms tailored to its respective approach. For instance, the structural pattern recognition method can use KNN, FP-growth, and RF algorithms to identify Malicious behavioral patterns. This section defines the common pattern recognition methods and presents the state-of-the-art approaches that utilize them to detect Malicious and Benign data. Following this, Sect. 3.2 synthesizes prior research endeavors.

3.1 Available pattern recognition methods

This subsection defines the common pattern recognition methods and reviews the current research on Malicious behavior patterns.

  I. The statistical method in pattern recognition involves using statistics and probabilities to represent patterns as numerical vectors, often converting features into numeric representations and comparing them using distance measures within a statistical space [59]. For example, Fan et al. [60] introduced MSPMD, a malware detection framework combining sequential pattern mining (MSPME) and Artificial Neural Network (ANN) classification, while Nawaz et al. [61] developed MalSPM, an intelligent system for recognizing Malicious behavior during execution based on sequential pattern mining (SPM).

  II. In structural pattern recognition, which is particularly suited for complex patterns composed of simple sub-patterns, structures like trees and graphs are employed to represent patterns [62]. For example, Nguyen et al. [56] used frequent pattern theory in data mining to mine combinations of requested permissions.

    Additionally, Tao et al. [63] proposed the MalPat system for recognizing patterns in Malicious and Benign Android applications through permission-related APIs. Roseline et al. [64] introduced a vision-based method using a hybrid, non-deep learning model for analyzing Malicious patterns represented as 2D grayscale images. Likewise, Liu et al. [65] presented a multi-layer learning framework that extracts feature descriptors from Malicious images. Furthermore, Arzu et al. [66] introduced a metamorphic Malicious data recognition method called Higher-level Engine Signature-based Metamorphic Malware Identification (HLES-MMI), utilizing a unique co-opcode graph structure and higher-level signatures.

  III. The template matching method involves defining a measure or a cost to find the "similarity" between the (known) reference patterns and the (unknown) test pattern by performing the matching operation. It finds application in speech recognition, robot vision automation, motion estimation for video coding, and image database retrieval systems [67]. For instance, Taha et al. [68] utilized API call functions to recognize Malicious data families alongside five pattern-matching algorithms (Naïve, Rabin-Karp, Brute-Force, Knuth-Morris-Pratt, and Boyer-Moore) that can potentially be used for string similarity detection, which in turn can be used to differentiate one Malicious data sample from another. Based on their analysis, the algorithms that produced the best results when comparing Malicious API-call functions were Naïve, Rabin-Karp, Brute-Force, and Knuth-Morris-Pratt (KMP).

  IV. The neural network (ANN) approach is a self-adaptive trainable process that can learn to resolve complex problems based on available knowledge [59]. Neural networks are massively parallel structures composed of neuron-like sub-units. This approach provides efficient results in pattern recognition and data classification [62]. Recent research has tried utilizing neural networks to recognize Malicious behavioral patterns from visualized data.

    For example, Xiao et al. [27] proposed a Malicious data classification framework (MalFCS) based on Malicious data visualization and automated feature extraction. The model showed that combining a deep CNN with an entropy graph can provide a better discriminative pattern of Malicious data families.

    Similarly, Bendian et al. [28] proposed a Malicious traffic analysis approach using deep learning and visual representation for Malicious behavior recognition and classification. The experiments and comparison with multiple neural networks show that the Residual Neural Network with 50 layers (ResNet50) is the most effective in recognizing Malicious behavior patterns.

  V. The fuzzy-based method can partition the patterns by soft boundaries. As a result, a pattern may be classified into one or more classes with a certain degree of membership in each class [69]. Fuzzy sets are essential in pattern recognition because some forms of uncertainty cannot be modeled entirely using probability theory [62]. Modiri Dovom et al. [70] implemented a fast fuzzy pattern tree with two modifications: the first speeds up the detection of the best model among a set of candidate models, and the second restricts the total number of candidates.

  VI. The hybrid method combines multiple methods. In most emerging applications, it is clear that a single Malicious recognition method may not work efficiently; hence, multiple methods must be combined, resulting in a hybrid Malicious recognition model [71]. For example, Yoo et al. [72] presented a hybrid decision model that combines an RF and a deep learning model with 12 hidden layers to determine Malicious and Benign files, together with specific proposed voting rules for making the final decision. Moreover, Jerbi et al. [73] developed a new Android Malicious detection approach named the artificial malware-based detection approach (AMD). This model uses an algorithm to generate artificial patterns to detect Malicious apps.

Fig. 3: Statistical pattern recognition model using linear regression

Fig. 4: Statistical pattern recognition model using logistic regression

3.2 Synthesis

Based on this review, prior research on data patterns has concentrated on four categories: Frequent, Structural, Sequential, and Local Binary. In Fig. 2, a comprehensive taxonomy of pattern recognition methods and the corresponding algorithms employed by these methods is provided. As illustrated in Fig. 1, the statistical pattern is the type that has not been adequately explored in previous research regarding modeling the statistical relationships between Malicious data features for pattern creation.

While existing research has applied statistical pattern recognition methods to identify data patterns, they have not specifically focused on uncovering the statistical relationships between features inherent in the data. Therefore, the main objective of this research is to propose behavioral pattern recognition models utilizing statistical methods. This study introduces two models for creating Malicious behavior patterns and classifying data into Malicious and Benign categories, leveraging statistical relationships between features.

We chose linear and logistic regression models over more advanced techniques like CNN and Bidirectional LSTM (Bi-LSTM) for several reasons [29,30,31]:

  • Simplicity and interpretability: Linear and logistic regression models are relatively simple and are inherently interpretable. They provide straightforward explanations of the relationship between the input features and the output variable. In contrast, more advanced models like CNN and Bi-LSTM are often more complex. They may lack interpretability, making it difficult to understand the underlying patterns and factors driving the predictions.

  • Data efficiency: Linear and logistic regression models can perform well even with limited data. They require fewer parameters to estimate and are less prone to overfitting, which can be advantageous when the available dataset is small or the feature space is not high dimensional.

  • Computational efficiency: Linear and logistic regression models are computationally efficient compared to deep learning models like CNN and Bi-LSTM. These advanced models typically require more computational resources and training time, making them less suitable for scenarios with limited computing power or time constraints.

  • Feature importance and selection: Linear and logistic regression models can effectively identify and rank the importance of features in the prediction task. This can be particularly useful when the goal is to understand the impact of specific features on the outcome or when feature selection is essential for interpretability or model simplification.

4 Proposed models

This research addresses the issues mentioned in Sect. 3.2 by proposing two pattern recognition models. The proposed models recognize a Malicious pattern and classify Malicious and Benign data based on statistical relationships. The first model employs linear regression to generate a data pattern utilizing Malicious data. The second model uses logistic regression to create a single pattern for classifying Benign and Malicious data. The proposed models are summarized in Figs. 3 and 4. Sections 4.1 and 4.2 explain each part of the proposed models separately.

4.1 Linear regression-based model

In this model, we applied linear regression to reveal the statistical associations among the data features. As outlined in [74], linear regression is a statistical approach that seeks to establish a linear correlation between one or more independent variables and a dependent variable. This was achieved by identifying the line of best fit for the data, which is mathematically represented as:

$$\begin{aligned} y=ax+b \end{aligned}$$
(1)

The dependent variable is represented by y, the independent variable by x, the slope of the line by a, and the intercept by b. When multiple independent variables are used to predict the dependent variable, multiple linear regression is employed as an extension of linear regression, mathematically represented as:

$$\begin{aligned} y=\beta _0+\beta _1 x_1+ \beta _2 x_2+...+\beta _p x_p \quad \end{aligned}$$
(2)

Consider a dataset with N rows and M features \((x_1, x_2, \ldots , x_M)\) collected from applications. A system of equations is formed for the linear relationships between rows and features, as shown in Eq. 3:

$$\begin{cases} y'_1 = \beta _0 + \beta _1 x_{1,1} + \beta _2 x_{1,2} + \ldots + \beta _M x_{1,M} \\ y'_2 = \beta _0 + \beta _1 x_{2,1} + \beta _2 x_{2,2} + \ldots + \beta _M x_{2,M} \\ y'_3 = \beta _0 + \beta _1 x_{3,1} + \beta _2 x_{3,2} + \ldots + \beta _M x_{3,M} \\ \qquad \vdots \\ y'_n = \beta _0 + \beta _1 x_{n,1} + \beta _2 x_{n,2} + \ldots + \beta _M x_{n,M} \end{cases}$$
(3)

In Eq. 3, \(y'_1, y'_2, \ldots , y'_n\) are the predicted values, each a linear combination of the features \((x_1, x_2, \ldots , x_M)\) of the corresponding row, and \(\beta _i\) indicates the effect of feature i on the prediction. The goal is to find suitable \(\beta _i\) parameters for the linear regression model, making the predictions \(y'_1, y'_2, \ldots , y'_n\) approximately equal to the actual values \(y_1, y_2, \ldots , y_n\). The y-intercept (constant term) is represented by \(\beta _0\), while the other \(\beta \)s represent the slope coefficients for each explanatory variable; these \(\beta \)s are also referred to as weights. The error rate of a linear regression model can be quantified using the Mean Absolute Error (MAE) or Mean Squared Error (MSE), as represented by Eqs. 4 and 5, respectively, in which \(y_i\) is the ith actual value and \(y'_i\) is the corresponding predicted value.

$$MAE=\frac{1}{n}\sum _{i=1}^{n} |y_i-y'_i|$$
(4)
$$MSE=\frac{1}{n}\sum _{i=1}^{n} (y_i-y'_i)^2$$
(5)
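To make these definitions concrete, the following is a minimal sketch of fitting a multiple linear regression (Eq. 2) and computing the MAE and MSE metrics (Eqs. 4 and 5) with Scikit-Learn, the library used in our implementation; the feature matrix and target here are synthetic placeholders, not the paper's data.

```python
# Minimal sketch: fit a multiple linear regression and compute MAE/MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 rows, M = 5 placeholder features
y = X @ np.array([1.5, -2.0, 0.3, 0.0, 4.1]) + 0.7 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)     # estimates beta_0 ... beta_M
y_pred = model.predict(X)                # y'_1 ... y'_n of Eq. 3

print("intercept (beta_0):", model.intercept_)
print("slopes (beta_1..beta_M):", model.coef_)
print("MAE:", mean_absolute_error(y, y_pred))   # Eq. 4
print("MSE:", mean_squared_error(y, y_pred))    # Eq. 5
```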

As Fig. 3 shows, the linear regression-based model comprises two steps: data preprocessing and pattern recognition. The specifics of this model will be discussed in the subsequent subsections.

Linear regression is appropriate for distinguishing DoH from other data because it can capture the statistical relationships between the features and the classification label. If certain features strongly correlate with the Malicious/Benign classification, linear regression can effectively represent this relationship.

Additionally, the error rate of a linear regression model can be quantified using the Mean Absolute Error (MAE) or Mean Squared Error (MSE). The MAE measures the average absolute difference between the predicted and actual values, while the MSE calculates the average squared difference. These error metrics are suitable for linear regression because they evaluate the model's accuracy and quantify the discrepancy between predicted and actual values. Overall, linear regression is suitable for distinguishing DoH from other data because it can capture statistical associations and generate a data pattern using Malicious data, and MAE and MSE are appropriate error metrics as they provide insights into the model's predictive performance and the magnitude of prediction errors.

  I. Data preprocessing: This phase aims to select the optimal feature sets for recognizing the best data pattern. The ETC feature selection technique was employed to find the optimal feature set; the rationale behind this choice is elaborated in Sect. 6.1. A threshold called True Error was defined to differentiate the behavior of different data types. The True Error was determined through experimentation, utilizing either MSE or MAE, depending on the data type. The experiments and analysis section describes the procedure for determining the threshold value (Sect. 6).

  II. Pattern recognition: The primary objective of the pattern recognition phase is to utilize linear regression to recognize patterns within the selected feature sets. The linear regression model is applied to the feature set, and an appropriate threshold is determined, enabling pattern recognition and data classification. Since the linear regression model distinguishes two data types, if the model's error for a sample is less than the threshold, the sample is mapped to one label; otherwise, it is mapped to the other (a minimal sketch of this decision rule follows this list).
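The sketch below illustrates this decision rule, assuming Scikit-Learn and placeholder Malicious training data; the True Error threshold is a free parameter determined experimentally as described in Sect. 6.

```python
# Minimal sketch of the linear regression-based pattern recognition step:
# the model is fitted on Malicious flows only, and a new flow is labeled
# Malicious when its squared prediction error falls below the True Error
# threshold. The inputs are placeholders, not the paper's exact pipeline.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_pattern(X_malicious, y_malicious):
    """Learn the Malicious data pattern (the beta coefficients of Eq. 3)."""
    return LinearRegression().fit(X_malicious, y_malicious)

def classify(model, X, y, true_error):
    """Label each flow Malicious when (y - y')^2 < True Error (cf. Eq. 9)."""
    squared_error = (y - model.predict(X)) ** 2
    return np.where(squared_error < true_error, "Malicious", "Benign")
```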

4.2 Logistic regression-based model

In the linear regression-based model, linear regression was employed for data classification and pattern recognition of Malicious and Benign data. The second proposed model instead utilizes logistic regression to classify data based on statistical relationships between features. This technique is specifically designed for binary classification problems with two possible outcomes, namely 0 or 1. Logistic regression can be considered an extension of linear regression for classification problems, as it uses a logistic function to transform the output of a linear equation into a value between 0 and 1. The logistic function is mathematically defined as:

$$\begin{aligned} Logistic(\eta ) = \frac{1}{1+exp(-\eta )} \quad \end{aligned}$$
(6)

To ensure that the probability output of the classification model falls within the appropriate range of 0 and 1, the right-hand side of equation 2 is transformed via the logistic function. This function compresses the output, resulting in values restricted to 0 to 1. The logistic function used in this context is mathematically defined as:

$$P(y_i) = \frac{1}{1+\exp (-(\beta _0+\beta _1 x_{i,1}+ \beta _2 x_{i,2}+\ldots +\beta _p x_{i,p}))}$$
(7)

As previously established, the output of Eq. 7 is restricted to the range 0 to 1. In binary classification using logistic regression, a threshold of 0.5 is used to assign class labels: if the output of Eq. 7 is greater than or equal to 0.5, the instance is assigned a class label of 1; otherwise, it is assigned a class label of 0 [75, 76].
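As an illustration, the following is a minimal sketch of fitting the logistic model (Eq. 7) with Statsmodels, the library used in our implementation, and applying the 0.5 decision threshold; the data here are synthetic placeholders.

```python
# Minimal sketch: fit a logistic model and apply the 0.5 threshold.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                       # placeholder features
labels = (X[:, 0] - 0.5 * X[:, 2] + rng.normal(size=300) > 0).astype(int)

X_const = sm.add_constant(X)                        # adds the beta_0 column
result = sm.Logit(labels, X_const).fit(disp=0)      # estimates beta_0 .. beta_p
probs = result.predict(X_const)                     # P(y_i) of Eq. 7, in (0, 1)
predicted = (probs >= 0.5).astype(int)              # class 1 iff P(y_i) >= 0.5
```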

As Fig. 4 shows, the logistic regression-based model consists of two steps: data preprocessing and pattern recognition. The specifics of this model will be discussed in the following subsection. Logistic regression’s ability to handle binary classification, its probability-based approach, consideration of statistical relationships, and threshold-based classification make it a suitable technique for distinguishing DoH from other data types. By leveraging these characteristics, logistic regression enables the creation of behavioral patterns and accurate classification based on the statistical relationships among the features.

  I. Data preprocessing: This phase aims to identify the optimal feature set for recognizing patterns in the data. We employ the Variance Inflation Factor (VIF) to determine the most appropriate feature set. VIF is a metric that measures the degree of linear association between an explanatory variable and the other predictor variables in a multiple regression model [74]. This is important because, for logistic regression to be practical, the features should have minimal correlation [77, 78].

  II. Pattern recognition: The objective of this phase is to utilize logistic regression to recognize patterns within the data. Using the logistic regression model, one class is mapped to the value one and the other class to the value zero. This implies that if the output of the logistic regression model is greater than or equal to 0.5, the data is classified as belonging to the first class; otherwise, it is classified as belonging to the second class.

Table 1 The label count for Layer 1 vs. Layer 2

5 Experimental setup

This section discusses the implementation details of the proposed models.

5.1 Dataset

To evaluate the performance of our proposed models, we utilized Layer 2 of the CIRA-CIC-DoHBrw-2020 dataset, which includes Malicious and Benign DoH network traffic. The CIRA-CIC-DoHBrw-2020 dataset is a publicly available dataset generated in 2020. It is the most comprehensive related dataset, organized in two layers and containing Non-DoH, DoH, Malicious-DoH, and Benign-DoH network traffic labels [40]. The dataset captures the DoH protocol as implemented by five different browsers and tools and four servers, covering Benign-DoH, Malicious-DoH, and non-DoH traffic. The classification process involved a two-layered approach: Layer 1 was employed to distinguish DoH traffic from non-DoH traffic, while Layer 2 was used to differentiate between Benign DoH and Malicious DoH traffic. The browsers and tools utilized to capture network traffic were Google Chrome, Mozilla Firefox, dns2tcp, DNSCat2, and Iodine. These tools were used with the following servers to respond to DoH requests: AdGuard, Cloudflare, Google DNS, and Quad9 [40].

To create the DoH flows (the combination of all request and response packets in one connection) and extract features from the captured PCAP files, the DoHMeter tool was utilized. The data in the generated CSV file are labeled flow-wise based on the IP addresses of the servers used in the network diagram. The Python programming language was employed to implement this feature extraction process. The tool outputs a CSV file and can extract 28 statistical features from captured DoH traffic. Table 2 provides a comprehensive list of the extracted features. As the table shows, some of the features are flow-based and some are packet-based; they can be categorized as rate-, length-, and time-based features.

Table 1 shows the number of DoH, Non-DoH, Malicious-DoH, and Benign-DoH samples in the CIRA-CIC-DoHBrw-2020 dataset. The second layer of the dataset exhibits an imbalanced distribution, with approximately 90% of the data corresponding to Malicious DoH traffic and the remaining 10% representing Benign DoH traffic. This imbalance is significant as it allows us to effectively profile the Malicious behavior within network traffic while also ensuring a higher level of certainty regarding the available variables with respect to the response variable [79].
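As a minimal sketch, and assuming a local copy of the Layer 2 CSV (the file and label-column names below are hypothetical), this imbalance can be inspected with pandas:

```python
import pandas as pd

# Hypothetical file and column names for the local Layer 2 CSV.
df = pd.read_csv("l2-doh.csv")
print(df["Label"].value_counts(normalize=True))  # approx. 90% Malicious, 10% Benign
```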

Table 2 List of extracted statistical traffic features

5.2 Models implementation

In this study, we implemented two models, a linear regression-based and a logistic regression-based model, utilizing the Scikit-Learn and Statsmodels libraries, respectively. These libraries offer a comprehensive range of state-of-the-art machine learning algorithms and simplify the implementation of various machine learning techniques for predictive data analysis [80, 81].

6 Experiments and results

This section summarizes the results obtained by employing the proposed linear and logistic regression-based models. Firstly, Sect. 6.1 explains the feature selection techniques used for selecting the best feature sets. Following this, Sects. 6.2 and 6.3 discuss the performance of the proposed models.

6.1 Feature selection

Feature selection involves identifying essential features from an initial set through the application of specific selection criteria [82]. The goal is to obtain a smaller subset of the original set [82]. Feature selection techniques are commonly divided into three categories: Wrapper, Filter, and Embedded [82]. In this work, we applied ETC and VIF techniques to select and utilize the best feature sets in the linear and logistic regression models. Sections 6.1.1 and 6.1.2 explain the applied feature selection techniques for selecting the best feature set for the proposed models.

6.1.1 Feature selection technique for linear regression-based model

We utilized the ETC feature selection technique for selecting the best feature set for the linear regression-based model. ETC is a filter method. The main advantage of filter methods over other feature selection methods is that they are generally less computationally demanding and thus can easily be scaled to very high dimensional data [83, 84]. ETC is a type of ensemble learning that aggregates the results of several de-correlated decision trees collected from a forest into the classification results of features [85]. ETC is a decision-making-based method to select relevant features [86]. The ETC methodology is similar to the RF classifier but differs in constructing decision trees in the forest [86, 87]. The ETC algorithm utilizes the dependency relation extracted from data samples to build each decision tree [86].

During the evaluation at each test node, the algorithm randomly selects k attributes and employs the Gini Index as the mathematical criterion to determine the best attribute [86]. The feature selection process involves ranking each attribute in descending order according to its Gini importance, allowing the user to choose the top k attributes [86]. In summary, ETC is suitable for selecting the best feature set because of its ensemble learning approach, random feature selection, ability to handle nonlinear relationships between features, robustness to outliers, and computational efficiency. These characteristics make it a potentially effective method for pattern recognition and classification tasks. Because of the randomness of ETC, we ran the ETC algorithm 500 times to obtain the optimal ordering of feature sets. The condition for selecting the feature of rank K is that it occurred at rank K most often compared to the other features and had not already been chosen for a rank less than K. Table 4 shows the optimal feature ranking for Layer 2 of the CIRA-CIC-DoHBrw-2020 dataset generated by ETC.
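The following is a minimal sketch of this ranking procedure with Scikit-Learn's ExtraTreesClassifier, under the assumption that X and y hold the Layer 2 feature matrix and Malicious/Benign labels:

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import ExtraTreesClassifier

def etc_ranking(X, y, n_runs=500):
    """Rank features by how often they win each rank over repeated ETC runs."""
    n_features = X.shape[1]
    rank_votes = [Counter() for _ in range(n_features)]
    for seed in range(n_runs):
        etc = ExtraTreesClassifier(random_state=seed).fit(X, y)
        order = np.argsort(etc.feature_importances_)[::-1]  # descending Gini importance
        for rank, feature in enumerate(order):
            rank_votes[rank][feature] += 1
    # Assign each rank K the feature that occurred most often at rank K
    # and was not already chosen for a rank less than K.
    chosen, ranking = set(), []
    for votes in rank_votes:
        for feature, _ in votes.most_common():
            if feature not in chosen:
                chosen.add(feature)
                ranking.append(feature)
                break
    return ranking
```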

Table 3 VIF of the CIRA-CIC-DoHBrw-2020 dataset
Table 4 Optimal features ranking for the CIRA-CIC-DoHBrw-2020 dataset generated by ETC

6.1.2 Feature selection technique for logistic regression-based model

VIF was utilized to select the best feature set for the logistic regression-based model. VIF is based on the concept of feature filters [74, 77]. The reason for choosing this feature selection method is that a critical problem in a binary logistic regression model arises when the explanatory variables are highly correlated. Multicollinearity causes unstable estimates and inaccurate variances that affect confidence intervals and hypothesis tests. VIF shows how much the variance of a coefficient estimate is inflated by multicollinearity. In logistic regression-based models, VIF values above 2.5 may be a cause for concern [88].
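A minimal sketch of this filtering step with Statsmodels, assuming df is a DataFrame of the numeric explanatory features, is:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def low_vif_features(df: pd.DataFrame, threshold: float = 2.5) -> list:
    """Return the features whose VIF is below the given threshold."""
    X = sm.add_constant(df)              # VIF is computed against an intercept
    return [
        col for i, col in enumerate(X.columns)
        if col != "const" and variance_inflation_factor(X.values, i) < threshold
    ]
```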

In summary, VIF is an effective feature selection technique for logistic regression. It addresses the issue of multicollinearity, where highly correlated variables can lead to unstable estimates and inaccurate results. By quantifying the inflation of coefficient variance caused by multicollinearity, VIF helps identify problematic variables. Researchers typically set a threshold of 2.5, removing variables above this value to improve the reliability and interpretability of logistic regression models. Table 3 shows the VIF of the Layer 2 CIRA-CIC-DoHBrw-2020 dataset features. Based on Table 3, the features with a VIF of less than 2.5 are Mode Packet Length, Coefficient of Variation of Request/response time difference, Skew from mode Request/response time difference, and Skew from median Request/response time difference. The difference in results between Tables 4 and 3 can be attributed to the distinct methodologies employed by ETC and VIF: ETC focuses on ensemble learning and random feature selection, while VIF addresses multicollinearity. Each technique has advantages and suitability for specific modeling scenarios, leading to variations in the selected feature sets.

Table 5 Performance of the linear regression-based model with Top K features on the CIRA-CIC-DoHBrw-2020 dataset

6.2 Performance of the best feature set for linear regression-based model

The results from the proposed linear regression model are presented in Table 5, considering the threshold metrics of MAE and MSE. The feature ordering in the table corresponds to the optimal arrangement achieved through the ETC feature selection technique. Table 5 provides compelling evidence that employing MSE as the threshold metric yields significantly superior outcomes compared to using MAE as the threshold metric. Additionally, Fig. 5 and Table 5 demonstrate that, when considering MSE as the threshold, the accuracy of the feature set comprising 3 to 16 features remains relatively stable. The findings further indicate that the model’s performance declines when MSE is employed as the threshold and the feature set contains more than 16 features.

However, fluctuations in accuracy are observed when evaluating the model’s performance using MAE as the threshold. Furthermore, both Table 5 and Fig. 5 illustrate that training the model on Malicious data leads to improved accuracy in terms of MSE. Specifically, the optimal performance of the linear regression model, as depicted in Table 5 and Fig. 5, is achieved when using the top 13 features, with an accuracy of 94.5%. Conversely, when considering MAE as the threshold, the best performance of the linear regression model is obtained by utilizing the top 3 features, resulting in an accuracy of 82.14%. These findings suggest that the most effective feature set for the linear regression-based pattern recognition method consists of the top 13 features when MSE is used as the threshold metric.

To achieve the model's optimal performance, the top 13 features are employed, with 'Packet Time Standard Deviation' as the dependent feature and the remaining features as independent variables. The coefficients for the linear regression-based model with the top 13 features are provided in Table 6. The best Malicious pattern is described by Eqs. 8 and 9, in which y is the real value of the dependent variable, 'Packet Time Standard Deviation,' and y' is its value calculated by Eq. 8. If their squared difference is less than the threshold, the flow is interpreted as Malicious; otherwise, it is Benign.

$$\begin{aligned} y' ={}& -3.01 -6.63 F_7 + 2.63 F_{16} -7.54 F_5 -42.17 F_9 \\ & -10.62 F_{10} + 2.63 F_6 +12.59 F_{14} - 8.07 F_{19} \\ & -37.50 F_{13} -11.47 F_{27} + 13.00 F_{28} -32.69 F_{12} \end{aligned}$$
(8)
$$(y-y')^2 < 2.385$$
(9)
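For illustration, the pattern of Eqs. 8 and 9 can be applied directly; mapping the F indices to the corresponding Table 2 columns is left as an assumption here:

```python
# Coefficients from Table 6 (Eq. 8); F-indices refer to Table 2 features.
COEFFS = {"F7": -6.63, "F16": 2.63, "F5": -7.54, "F9": -42.17,
          "F10": -10.62, "F6": 2.63, "F14": 12.59, "F19": -8.07,
          "F13": -37.50, "F27": -11.47, "F28": 13.00, "F12": -32.69}
INTERCEPT, TRUE_ERROR = -3.01, 2.385

def is_malicious(features: dict, y: float) -> bool:
    """features maps the F-indices above to values; y is the actual
    Packet Time Standard Deviation of the flow."""
    y_pred = INTERCEPT + sum(c * features[f] for f, c in COEFFS.items())
    return (y - y_pred) ** 2 < TRUE_ERROR            # Eq. 9
```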
Fig. 5: Comparison of the accuracy of the linear regression-based models with different numbers of features. LR-MAE: linear regression-based model trained on Malicious data with MAE as the threshold. LR-MSE: linear regression-based model trained on Malicious data with MSE as the threshold. LR-MB: linear regression-based model trained on both Malicious and Benign data.

Table 6 Coefficients of linear regression-based model

6.3 Performance of the best feature set for logistic regression-based model

Table 7 presents the performance of the logistic regression-based model on the CIRA-CIC-DoHBrw-2020 dataset [40], utilizing the features with a VIF score of less than 2.5: 'Mode Packet Length,' 'Skew from median Request/response time difference,' 'Skew from mode Request/response time difference,' and 'Coefficient of Variation of Request/response time difference.' The model achieved high performance, with 93.61% accuracy and a 96.65% F1-score. The coefficients for the logistic regression-based model are provided in Table 8. Based on Table 8, the feature 'Mode Packet Length' has the highest coefficient, and the feature 'Skew from mode Request/response time difference' has the lowest. Because we mapped Malicious data to 1 and Benign data to 0, this indicates that 'Mode Packet Length' has the highest impact in profiling Malicious data behavior, and 'Skew from mode Request/response time difference' has the highest impact in profiling Benign behavior. The pattern extracted by logistic regression is given by Eqs. 10 and 11: if y is less than 0.5, the result is interpreted as Benign; otherwise, it is interpreted as Malicious.

$$X = 4.55 + 12.55 F_7 + 0.24 F_{27} - 1.00 F_{28} - 0.51 F_{26}$$
(10)
$$y = \frac{1}{1 + e^{-X}}$$
(11)
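Similarly, the logistic pattern of Eqs. 10 and 11 reduces to a few lines; mapping F7, F26, F27, and F28 to the four VIF-selected columns is again an assumption:

```python
import math

def is_malicious(f7: float, f26: float, f27: float, f28: float) -> bool:
    x = 4.55 + 12.55 * f7 + 0.24 * f27 - 1.00 * f28 - 0.51 * f26  # Eq. 10
    y = 1.0 / (1.0 + math.exp(-x))                                # Eq. 11
    return y >= 0.5                     # >= 0.5 is Malicious, < 0.5 is Benign
```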
Table 7 Performance of the logistic regression-based model
Table 8 Coefficients of logistic regression-based model
Table 9 Evaluation of the models' performance using standard metrics
Fig. 6: Comparison of the accuracy of the linear regression-based models with different numbers of features. LR-MAE: linear regression-based model trained on Malicious data with MAE as the threshold. LR-MSE: linear regression-based model trained on Malicious data with MSE as the threshold.

Table 10 Models' performance comparison with the state-of-the-art classifiers

7 Evaluation and discussion

Table 9 evaluates the models' performance on the CIRA-CIC-DoHBrw-2020 dataset [40] using several metrics, including Accuracy, F1-score, Precision, and Recall. As the table demonstrates, the linear regression-based model with MSE as the threshold performed slightly better than the logistic regression-based model on all evaluation measures. However, the linear regression-based model required the top 13 features to achieve these results, whereas the logistic regression-based model reached its results using only the 4 features with a VIF of less than 2.5.

To evaluate the proposed linear regression-based model, we also trained the linear regression method using both Malicious and Benign DoH data; this variant performed best using the top 15 features. As Table 9 shows, although the proposed linear regression-based model was trained only with Malicious DoH data while the reference method was trained with both Malicious and Benign DoH data, the performance of the two models was very close, which demonstrates the stability of the proposed linear regression-based model. Moreover, previous research tried classifying and recognizing Malicious DoH using state-of-the-art classifiers. For example, MontazeriShatoori et al. [40] evaluated the performance of state-of-the-art classifiers, including RF, C4.5, Support Vector Machine (SVM), NB, Deep Neural Network (DNN), and two-dimensional CNN (2D CNN), on the CIRA-CIC-DoHBrw-2020 dataset [40]. Table 10 compares the performance of these state-of-the-art classifiers with the proposed linear and logistic regression-based models.

As the table shows, the state-of-the-art classifiers, except SVM and NB, performed better than the proposed linear and logistic regression-based models. However, certain issues in the training process of the state-of-the-art classifiers were not taken into account. Firstly, as mentioned previously, the proportion of Malicious DoH to Benign DoH in the CIRA-CIC-DoHBrw-2020 dataset [40] is approximately 90:10, so training a model on this dataset risks overfitting, making the resulting classification unreliable. Secondly, all 28 features in the dataset are fed into the classifiers for training, which is expensive. Finally, recent papers such as MontazeriShatoori et al. [40] that utilized state-of-the-art classifiers for Malicious DoH detection focused on data classification and detection; they did not try to profile the Malicious DoH data behavior.

In this paper, we aim to profile Malicious DoH traffic and find the statistical DoH data pattern utilizing statistical relationships between features. Moreover, the linear and logistic regression-based models also provide several benefits over more complicated machine learning and deep learning approaches, including the following [29,30,31]:

  1. Simplicity: Linear and logistic regression are versatile algorithms that can be easily applied to various problems. Their simplicity and straightforwardness make them accessible to users with varying levels of expertise.

  2. Interpretability: Linear and logistic regression provide interpretability, allowing for a clear understanding of the relationship between dependent and independent variables. By examining the model's coefficients, users can determine the direction and magnitude of the relationship between variables.

  3. Fast training: Linear and logistic regression models can be trained quickly, making them ideal for large datasets and real-time applications. Their fast training time enables quick results, making them well-suited for time-sensitive tasks.

  4. Speed: Linear and logistic regression models are computationally efficient and can provide rapid results. This efficiency makes them particularly useful for real-time applications where speed is crucial.

  5. Regularization: Linear and logistic regression can be regularized to prevent overfitting, a common issue in machine learning and deep learning algorithms. Regularization allows the model to generalize well, ensuring it is not overly fit to the training data.

Our experimental results revealed that the linear regression-based model yielded the most optimal outcomes when utilizing the top 13 features selected through the ETC feature selection technique. Analysis of Tables 2 and 4 suggests that the selected features can be categorized as length-based and timing-based. The length-based features identified by the ETC feature selection technique include Mode Packet Length, Mean Packet Length, Standard Deviation of Packet Length, Coefficient of Variation of Packet Length, Median Packet Length, and Skew from Mode Packet Length. Conversely, the timing-related features consist of Standard Deviation of Packet Time, Variance of Packet Time, Median Packet Time, Skew from Median Packet Time, Mean Packet Time, Skew from Median Request/Response Time Difference, and Skew from Mode Request/Response Time Difference.

In our research, the logistic regression-based model demonstrated optimal results with the top 4 features selected through the VIF technique. Examination of Tables 2 and 3 indicates that the selected features can also be categorized as length-based and timing-based. The timing-based features include the Coefficient of Variation of Request/Response Time Difference, Skew from Mode Request/Response Time Difference, and Skew from the Median Request/Response Time Difference. It can be noticed that the only length-based feature is Mode Packet Length. In this section, we provide a detailed description of the top 13 selected features that exhibited the best performance in the linear regression-based model and the features chosen by VIF and utilized in the logistic regression model.

7.1 Analysis of the selected feature set

In this subsection, we present an in-depth analysis of these top 13 features, highlighting their significance in Malicious DoH data profiling and pattern recognition; a short sketch of how these statistics can be computed follows the list.

  • Length-based features

    1. Mode packet length is a statistical feature that measures the most frequently occurring packet length in a network flow. The mode packet length can indicate the type of traffic being transmitted; DoH traffic may exhibit certain characteristic packet lengths that differ from other types of traffic [89, 90].

    2. Mean packet length is a statistical feature that measures the average length of packets in a network flow. The mean packet length can indicate the type of traffic being transmitted. This feature was selected because DoH traffic can have a distinctive mean packet length compared to other types of traffic [90].

    3. Standard deviation of packet length is a statistical metric for assessing packet length variability within a network flow. Monitoring fluctuations in the standard deviation of packet length can prove invaluable in pinpointing potentially suspicious network activity; notably, an elevation in the standard deviation of packet length can be a sign of such suspicious behavior [89]. This characteristic becomes particularly important when assessing variations in packet length distribution between DoH traffic and other traffic types, as it effectively highlights and quantifies this difference in behavior [90].

    4. Coefficient of variation of packet length is a statistical feature that measures the variation in packet length relative to the mean packet length in a network flow. The coefficient of variation of packet length can be used to identify irregularities in packet length that may indicate the presence of Malicious traffic. This feature helps to distinguish DoH traffic, which might exhibit specific patterns of packet length variations [90].

    5. Median packet length is a statistical feature that measures the middle value in a set of packet lengths in a network flow. Analyzing the median packet length can provide insights into the characteristics of transmitted traffic. In this study, this feature is particularly significant due to the distinctive median packet length exhibited by DoH traffic, which sets it apart from other types of network traffic [90].

    6. Skew from mode packet length is a statistical feature that measures the degree of asymmetry in the distribution of packet lengths around its mode (i.e., the most frequently occurring packet length). A high skewness from mode packet length can indicate a skewed or irregular distribution of packet lengths around the mode, which may reveal irregularities in the traffic. For instance, this feature may detect a pattern where a few packets are significantly larger or smaller than the rest, indicating data exfiltration, DNS tunneling, or other Malicious activities [91]. This feature indicates that DoH traffic has a different distribution shape compared to other traffic types [90].

  • Time-based features

    1. Standard deviation of packet time is a statistical feature that measures the time variability between packets in a network flow. A low standard deviation of packet time suggests that packets are being sent at a consistent rate, while a high standard deviation indicates variability in packet transmission timing. Malicious DoH behavior could exhibit irregular timing patterns due to the nature of the attack or the need to evade detection. Higher standard deviation values indicate greater deviations from the expected timing patterns [90, 92].

    2. Variance of packet time is a statistical measure that describes how spread out or dispersed the timing of packets is in network communication between the DoH client and server [45]. It measures the degree of variation in the time it takes for packets to travel between the client and server. A high variance in packet timing can indicate a significant amount of delay or jitter in the network, which could negatively impact the performance of the DoH connection; conversely, a low variance may indicate a stable and reliable network connection. Malicious DoH activities might introduce irregularities or unusual timing patterns, leading to higher variance values.

    3. Median packet time is a statistical feature that measures the middle value in a set of packet times in a network flow. Regular DNS traffic has a predictable timing pattern, so any significant deviation from the median packet time may indicate Malicious activity [89]. This feature can capture the typical timing pattern, even in the presence of outliers. Malicious DoH traffic might exhibit distinctive median packet times compared to legitimate DoH traffic.

    4. Skew from median packet time is a statistical feature that measures the degree of asymmetry in the distribution of packet times in a network flow relative to the median packet time. A high positive or negative skew from median packet time can be used to identify suspicious traffic. Deviations from a symmetric distribution might indicate irregular or suspicious timing patterns associated with Malicious DoH behavior [89].

    5. Mean packet time is a statistical feature that measures the average time between the transmission of packets in a network flow. Malicious DoH activities might introduce variations in timing intervals that deviate from legitimate DoH traffic. Anomalous mean packet times could indicate Malicious behavior [89].

    6. Skew from median request/response time difference is a statistical feature that measures the degree of asymmetry in the distribution of the time difference between a DNS request and its corresponding response in a network flow relative to the median time difference. Skewed distributions might suggest specific patterns or irregularities in the request/response timing, potentially associated with Malicious DoH activities [89].

    7. Skew from mode request/response time difference is a statistical feature that measures the degree of asymmetry in the distribution of the time difference between a DNS request and its corresponding response in a network flow relative to the mode time difference. Normal DNS traffic has a predictable pattern of request/response time differences, so any significant deviation from the mode time difference may indicate Malicious activity [89].

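To make these definitions concrete, the following is a minimal sketch of how such per-flow statistics could be computed, assuming flows have already been parsed into arrays of packet lengths (bytes) and packet timestamps (seconds). The skew-from-mode and skew-from-median expressions follow Pearson's first and second skewness coefficients, which is our reading of these features rather than the dataset's published extraction code; all names are illustrative.

```python
import numpy as np

def flow_features(lengths, times):
    """Per-flow statistics corresponding to the features described above."""
    lengths = np.asarray(lengths, dtype=float)
    # Inter-packet times: differences between consecutive packet timestamps
    gaps = np.diff(np.sort(np.asarray(times, dtype=float)))

    vals, counts = np.unique(lengths, return_counts=True)
    mode_len = vals[counts.argmax()]                     # most frequent length
    mean_len, std_len = lengths.mean(), lengths.std(ddof=1)
    mean_t, std_t, med_t = gaps.mean(), gaps.std(ddof=1), np.median(gaps)

    return {
        "PacketLengthMode": mode_len,
        "PacketLengthMean": mean_len,
        "PacketLengthStandardDeviation": std_len,
        # Coefficient of variation: dispersion relative to the mean
        "PacketLengthCoefficientOfVariation": std_len / mean_len,
        "PacketLengthMedian": np.median(lengths),
        # Pearson's first skewness coefficient: (mean - mode) / std
        "PacketLengthSkewFromMode": (mean_len - mode_len) / std_len,
        "PacketTimeStandardDeviation": std_t,
        "PacketTimeVariance": gaps.var(ddof=1),
        "PacketTimeMedian": med_t,
        "PacketTimeMean": mean_t,
        # Pearson's second skewness coefficient: 3 * (mean - median) / std
        "PacketTimeSkewFromMedian": 3 * (mean_t - med_t) / std_t,
    }
```

Per-flow dictionaries of this form can then be assembled into the feature matrix consumed by the regression models discussed below.
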
As previously mentioned, the top 4 features selected by the VIF method in our analysis are Mode Packet Length, Coefficient of Variation of Request/Response Time Difference, Skew from Mode Request/Response Time Difference, and Skew from Median Request/Response Time Difference. Since the first, third, and fourth of these have already been described among the top 13 features selected by the ETC feature selection technique, we discuss here the significance of the Coefficient of Variation of the request/response time difference in identifying and detecting Malicious DoH activity. The Coefficient of Variation is a statistical metric that quantifies the variability or dispersion of data points around the mean; applied to the request/response time difference, it provides insight into the spread of these time differences.

Monitoring the Coefficient of Variation helps establish a baseline of normal behavior, enabling the identification of deviations from this baseline. A significant increase in the Coefficient of Variation indicates a departure from the expected pattern, potentially signaling suspicious or anomalous DoH activity. Consequently, the Coefficient of Variation can be used as part of an anomaly detection system to flag potential Malicious DoH requests for further investigation and mitigation. However, it is crucial to acknowledge that the Coefficient of Variation alone is insufficient; it should be combined with other detection techniques and metrics to ensure accurate and effective DoH detection [89]. In summary, all of the features described above provide valuable information for profiling the behavior of Malicious DoH traffic.
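
As a hedged illustration of this baseline idea, the sketch below computes the Coefficient of Variation of request/response time differences and flags flows that depart markedly from a Benign baseline; the file name and tolerance are illustrative assumptions, not tuned values from our experiments.

```python
import numpy as np

def coefficient_of_variation(x):
    """CV = sample standard deviation relative to the mean."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean()

# Baseline learned from known-Benign request/response time differences
# (one value per line; the file name is a hypothetical placeholder).
benign_diffs = np.loadtxt("benign_response_time_diffs.csv")
baseline_cv = coefficient_of_variation(benign_diffs)

def flag_flow(diffs, tolerance=2.0):
    """Flag a flow whose CV departs markedly from the Benign baseline."""
    return coefficient_of_variation(diffs) > tolerance * baseline_cv
```
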

7.2 Malicious DoH profiling

This section examines and evaluates Malicious DoH data profiling using the correlation coefficients between features. We present the Malicious DoH profiles obtained by the proposed linear regression-based and logistic regression-based models, and we first provide background on the correlation coefficient and its significance in profiling.

7.2.1 Correlation coefficient

In statistical terms, correlation assesses a potential linear association between two continuous variables [93]. The strength of this association is measured using a statistic called the correlation coefficient, which ranges from -1 to +1 and quantifies the degree of the linear relationship between the variables [93].

A correlation coefficient of zero indicates no linear relationship, while a coefficient of -1 or +1 represents a perfect linear relationship [93]. The stronger the correlation, the closer the correlation coefficient comes to ±1 [93]. A positive coefficient indicates a direct relationship, where one variable’s increase corresponds to the other’s rise. Conversely, a negative coefficient suggests an inverse relationship, where an increase in one variable corresponds to a decrease in the other.
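
For reference, the sample Pearson correlation coefficient between two variables X and Y observed as pairs (x_i, y_i) can be written in its standard textbook form:

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

where \bar{x} and \bar{y} denote the sample means of the two variables.
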

In multiple regression modeling, the correlation between independent variables can significantly impact the accurate estimation of the model and its subsequent results and interpretation. This phenomenon, often overlooked, is known as "collinearity," referring to the linear dependence between two independent variables [94]. This concept can be extended to "multicollinearity," which involves multiple independent variables [94]. While the term "collinearity" is frequently used to describe strongly but not perfectly correlated variables, it is essential to distinguish between exact collinearity and near-collinearity, where a strong but not perfect correlation exists between a pair of independent variables [94].

To identify near-collinearity, all pairwise correlations between independent variables intended for inclusion in the model should be less than 1. Correlation values above 0.9 strongly suggest the presence of collinearity [36, 94]. Two commonly used correlation coefficients are Pearson’s product-moment correlation coefficient and Spearman’s rank correlation coefficient [95, 96]. In this study, we employed the Pearson correlation coefficient to calculate the correlation coefficient, as it is suitable for determining the presence of a linear relationship in the data [93].
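
The following is a minimal sketch of this near-collinearity screen: it computes all pairwise Pearson coefficients over a feature table and reports pairs whose absolute correlation exceeds 0.9. The file name is an illustrative assumption.

```python
import pandas as pd

# Hypothetical per-flow feature table (one column per feature)
X = pd.read_csv("doh_features.csv")
corr = X.corr(method="pearson")   # pairwise Pearson coefficients

# Report every pair of distinct features with |r| > 0.9
high = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
for a, b, r in high:
    print(f"possible collinearity: {a} ~ {b} (r = {r:+.3f})")
```
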

7.2.2 Linear regression-based model

In our linear regression-based model, the optimal accuracy was achieved by considering the first 13 features from the feature set. Specifically, the ’Standard Deviation of Packet Time’ feature exhibited the strongest relationship when treated as the dependent variable, while the remaining 12 features served as independent variables. The correlation coefficients of the features used in the linear regression-based model are presented in Fig. 7.

A higher correlation coefficient indicates a stronger influence on the model’s output. From the figure, we can observe that the correlation coefficients between the features are generally equal to or less than 0.9, except for ’PacketLengthStandardDeviation’ and ’PacketLengthCoefficientOfVariation’. Although these two variables display a correlation exceeding 0.9, implying a strong association, it is essential to note that correlation does not necessarily signify redundancy [94]. Both ’PacketLengthStandardDeviation’ and ’PacketLengthCoefficientOfVariation’ offer distinct insights into the data and contribute to the overall predictive power of our model. ’PacketLengthStandardDeviation’ measures the packet length dispersion, highlighting the dataset’s variability.

Conversely, ’PacketLengthCoefficientOfVariation’ provides a normalized measure of variability relative to the mean, enabling us to capture the relative scale of deviations. By including both variables in our model, we can leverage the valuable information they provide. While their high correlation suggests overlapping characteristics, it does not warrant discarding either variable; removing one would discard specific and distinctive characteristics that contribute to the predictive model. Additionally, the correlation between ’PacketTimeVariance’ and ’PacketTimeStandardDeviation’ is higher than that between ’PacketTimeStandardDeviation’ and any other feature, indicating that ’PacketTimeVariance’ holds the most significance in predicting the value of the dependent variable.

Furthermore, the absolute value of the correlation between ’ResponseTimeSkewFromMedian’ and ’PacketTimeStandardDeviation’ is lower than the correlation between ’PacketTimeStandardDeviation’ and any other feature. This finding suggests that ’ResponseTimeSkewFromMedian’ has the most negligible impact on predicting the value of ’PacketTimeStandardDeviation’.
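
A minimal sketch of this setup, with ’PacketTimeStandardDeviation’ as the dependent variable and the other 12 selected features as predictors, might look as follows; column names follow the paper, while the file name and train/test split are illustrative assumptions rather than our exact experimental pipeline.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical extract containing the 13 ETC-selected features
df = pd.read_csv("doh_top13_features.csv")
y = df["PacketTimeStandardDeviation"]          # dependent variable
X = df.drop(columns=["PacketTimeStandardDeviation"])  # 12 predictors

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
print("R^2 on held-out flows:", model.score(X_te, y_te))
```
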

Fig. 7 Correlation coefficient between features employed in the linear regression-based model

Fig. 8 Correlation coefficient between features employed in the logistic regression-based model

7.2.3 Logistic regression-based model

Figure 8 illustrates the correlation coefficient among the features employed in the logistic regression-based model. The depicted correlations demonstrate values below 0.9 for all feature pairs. Notably, the correlation between ’ResponseTimeSkewFromMedian’ and ’ResponseTimeSkewFromMode’ exhibits a higher absolute value than the correlations observed between other feature combinations. Conversely, the absolute value of the correlation coefficient between ’PacketLengthMode’ and ’ResponseTimeSkewFromMedian’ is the lowest, which can be attributed to the inherent dissimilarity of these features. Specifically, one feature is time-based, while the other is length-based.
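
For completeness, the following is a minimal sketch of the logistic regression-based model on the four VIF-selected features, including the VIF check mentioned earlier. The feature names follow the paper; the label column, the file name, and the exact Coefficient-of-Variation column name are illustrative assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor

FEATURES = [
    "PacketLengthMode",
    "ResponseTimeCoefficientOfVariation",  # hypothetical column name
    "ResponseTimeSkewFromMode",
    "ResponseTimeSkewFromMedian",
]

# Hypothetical labeled extract: Label = 1 for Malicious, 0 for Benign
df = pd.read_csv("doh_flows_labeled.csv")
X, y = df[FEATURES], df["Label"]

# Confirm low multicollinearity: each VIF should stay below 2.5
for i, name in enumerate(FEATURES):
    print(name, variance_inflation_factor(X.values, i))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred), "F1:", f1_score(y_te, pred))
```
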

8 Conclusion and future works

The DNS plays a crucial role in facilitating Internet functionality by enabling users to access websites using easily recognizable domain names, which are then mapped to numerical IP addresses for computer communication. However, DNS is vulnerable to various attacks that compromise its security and availability. To mitigate these concerns, DoH was introduced to enhance the privacy, security, and reliability of DNS queries. Nonetheless, DoH brings security challenges of its own: its encryption hides DNS traffic from the security tools that rely on plaintext DNS for detection. This research paper introduces two statistical pattern recognition models to profile Benign and Malicious DoH traffic, one utilizing linear regression and the other employing logistic regression.

We tested the models on the ’CIRA-CIC-DoHBrw-2020’ dataset, currently the most comprehensive DoH network traffic dataset. The ETC and VIF feature selection techniques were utilized to determine the most effective feature sets for both models. The linear regression-based model achieved an accuracy of 94.50% and an F1-score of 97.17% with the top 13 features selected by ETC, while the logistic regression-based model achieved an accuracy of 95.35% and an F1-score of 97.61% with only four features selected by VIF, each with a VIF value below 2.5. Additionally, the proposed models were compared with state-of-the-art machine learning and deep learning algorithms, including RF, C4.5, SVM, NB, DNN, and 2D CNN. The results showed that the linear and logistic regression-based models performed below some of the more complex machine learning and deep learning models; however, it is essential to note that our models utilized fewer features than previous studies.

Furthermore, our proposed models offer several advantages over more intricate machine learning and deep learning approaches, including low computational complexity, simple implementation, robustness to noise, and reduced data requirements. Moreover, this research represents the first attempt to present the Malicious DoH profile obtained by the proposed linear regression-based and logistic regression-based models utilizing the correlation coefficients between the features. One limitation of the proposed models is the lack of a comprehensive dataset for testing: while the models achieve strong results with the feature set selected from the ’CIRA-CIC-DoHBrw-2020’ dataset, they could yield different results on other DoH datasets, since the features that perform best on CIRA-CIC-DoHBrw-2020 will not necessarily do so elsewhere; feature values can be influenced by the network environment. In future work, we aim to extend the models to profile the behavior of individual Malicious data families and to evaluate their effectiveness by creating a comprehensive new dataset, further enhancing the security and performance of the DoH system.