1 Introduction

Privacy and anonymity of web users can be endangered by attackers who use sniffing/spying tools to eavesdrop on their activities. When Internet users browse the web, the identities of the websites they visit are revealed to several routers along the way. External observers such as private and governmental security agencies may passively collect information by monitoring, and possibly blocking, users’ activities on the Internet. In this context, anonymity systems protect the privacy of people who want to surf the web for critical needs such as sending sensitive e-mails and shopping online. Common encryption techniques such as TLS encrypt the content of web communication and protect it against eavesdroppers. However, attackers can still infer the identities of the client and server from meta-information extracted from transferred packet headers. Therefore, to hide both the content of transmitted data and the client/server identities, advanced anonymity systems such as Tor [1] and JAP [2] have been designed. Although Tor is a popular anonymity system, the privacy of its users can be compromised through traffic analysis attacks that infer the identity of visited websites, as this study and related works confirm.

1.1 Privacy issues on the Internet

Warren and Brandeis [3] define privacy as the “right to be let alone”. Roger Clarke stated that “Privacy is the interest that individuals have in sustaining a personal space, free from interference by other people and organizations” [4]. Web users’ privacy is preserved only when the usage, exchange, and release of their data remain fully under their control [5]. Unfortunately, this is not the case: web users’ data is transmitted across cyberspace without them maintaining full control over it. For example, a local observer at an Internet service provider (ISP) or an attacker on a wireless local area network (WLAN) can analyze and track the data sent to and received from a victim efficiently, inexpensively, and stealthily. The privacy concerns of most Internet users include [6]:

  • Personal information such as passwords and credit card numbers, which can be misused against users’ preferences and interests.

  • E-mail addresses and contact information, which can be harvested for targeted advertisements and attacks.

  • The identities of visited websites and their locations/origins on the Internet.

The entities that may carry out such privacy violations are shown in (Fig. 1).

Fig. 1
figure 1

Percentage of possible eavesdroppers who may violate web users’ privacy [47]

A typical approach to providing privacy is to use defense mechanisms that disguise the transferred data content and the web users’ identities. However, even when data contents are encrypted, attackers can observe characteristics of the traffic pattern (e.g., packet sizes, directions, ordering, and timing) to reveal confidential information about the communication source and/or destination. In particular, website fingerprinting attacks exploit traffic pattern characteristics to infer the identity of the website a victim visits.

1.2 Website fingerprinting

Hintz [7] was the first to use the term “Website Fingerprinting” to describe the traffic analysis variant that identifies visited websites. Website fingerprinting is a type of traffic analysis attack that allows an adversary to infer the identity of a webpage visited by a victim, and hence violate the victim’s privacy, even if an anonymity system such as Tor is used [8]. An attacker can identify which webpage a victim visited based on features inferred from its traffic pattern. When a victim visits a certain webpage, the HTML document of that page is fetched together with its referenced content (e.g., CSS, JavaScript, images, and text).

The traffic resulting from fetching this content has specific characteristics (packet size, order, direction, and delay). Encryption protocols (e.g., SSL) hide the content of transmitted data, but they do not effectively hide features such as packet sizes, directions, and timing [9]. Therefore, it is possible for an eavesdropper to monitor/sniff a victim’s network traffic and fingerprint website visits based on the order, the direction, the timing, and, in particular, the size of the packets. Since a website’s fingerprint is distinctive, a visited website can be identified even if the connection is encrypted using a security/anonymity protocol such as SSL, SSH, or Tor. Website fingerprinting on encrypted traffic can be used by governments to censor and block blacklisted webpages. This is achieved by generating fingerprints for all blacklisted websites and then matching observed traffic against this list of fingerprints. Due to the dynamic nature of modern websites, the fingerprints are continuously updated by censorship entities to cope with such changes [7].

Existing works [9,10,11,12,13] showed that traffic analysis attacks are possible against various security and anonymity protocols, such as SSH, IPsec tunnels, JAP, and Tor. Anonymity protocols are typically equipped with features, such as mixing and padding, to defeat website fingerprinting attacks [14]. This paper focuses on the Tor anonymity system, as it is the most popular anonymity solution in use today, with around 500,000 daily users and a bandwidth of 2000 MB per second [15].

1.3 Anonymity protocols

Pfitzmann and Köhntopp [16] defined anonymity as “the state of being not identifiable within a set of subjects”, where the anonymity set is the set of all possible players in the system. As traffic analysis attacks proliferated, several anonymity systems were developed to protect users’ privacy in cyberspace. Chaum [17] was the first to propose a system that provides a level of anonymity by mixing the traffic of a given connection/session with the traffic of other connections/sessions.

Later, several systems were proposed that employ a wide range of techniques, such as padding and mixing, to make it very difficult for attackers to trace and analyze the traffic [14]. Anonymity systems can be classified into two classes: high-latency and low-latency systems. High-latency systems, such as the Mixmaster protocol [18], provide much better protection against attacks based on packet timing. These systems employ strategies such as mixing, reordering, and padding to defend against traffic analysis attacks that rely on packet timings/delays [19]. However, high-latency anonymity systems are not widely used due to the extra delays they add to data transmission.

Low-latency systems, on the other hand, do not use techniques that delay communication, and therefore they are suitable for web browsing protocols such as HTTP and interactive protocols such as SSH. Anonymity systems in this category include Tor (The Onion Router) [1], Java Anon Proxy (JAP) [20], and the Invisible Internet Project (I2P) [21]. In this paper, we focus on the Tor anonymity system, the most commonly used anonymity system for web browsing, to evaluate the resistance levels of the top five browsers, namely Chrome, Internet Explorer (IE), Firefox (FF), Safari, and Opera. Anonymity systems can be used for both legal and illegal activities. For example, they can be misused for misappropriation of funds and terrorist actions, so research on breaking anonymity systems, such as ours, can help governments track such criminal activities. On the other hand, many people employ anonymity systems legitimately in e-banking, e-voting, e-commerce, e-auctions, etc.

1.4 Web browsers

A web browser is the client-side application in Internet communication; its main function is to fetch requested content from web servers and display it in the browser window. Web content is fetched through requests and responses exchanged between web browsers and web servers, mainly over the HTTP and HTTPS protocols.

A few years back, web pages were simple HTML pages containing basic content (e.g., text, input boxes, and buttons) [22]. Today, web pages carry far richer content, including multimedia objects such as JavaScript, CSS, Flash, and audio. This has opened up many possibilities for how people use computers to perform tasks online (e.g., e-mail access, e-shopping, and scientific research collaboration). Garsiel and Irish [23] provide full details about the structure and functions of different browsers. In our study, we consider the top five web browsers, namely Chrome, Firefox, Internet Explorer, Safari, and Opera, as shown in (Fig. 2).

Fig. 2
figure 2

Statistics and trends of top web browsers, StatCounter Global Stats [48]

Figure 2 illustrates the worldwide desktop browser market share for the period January 2014 to May 2017. Accordingly, we conducted our study on the most commonly used web browsers worldwide.

1.5 Paper contribution

The website fingerprinting attack has been studied thoroughly in the literature. However, all existing works except one consider only a single web browser, mainly Firefox. The main contributions of this paper are as follows:

  • We identify the key differences (rendering engines, fetching schemes, etc.) among the most commonly used web browsers that substantially affect their traffic patterns.

  • We study the website fingerprinting attack on the Tor anonymity system, taking into consideration the most relevant features that improve the accuracy of the attack.

  • We evaluate the resistance of the top five web browsers against the website fingerprinting attack.

  • We identify the factors that cause the browsers’ different resistance levels against the website fingerprinting attack.

1.6 Paper organization

The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 details the website fingerprinting attack that we implemented and evaluates the experimental results. Section 4 discusses factors that may affect the accuracy of the attack. Section 5 illustrates the main factors behind the different resistance levels of the tested browsers. Finally, Sect. 6 concludes the paper and suggests future research directions.

2 Related work

Previous works can be categorized into two classes. The first class focuses on traffic secured by standard encryption, such as HTTPS and SSH. Liberatore and Levine [24] proposed two techniques for identifying the source identity over such connections. They used a single feature, packet size, of data transmitted over the Secure Shell (SSH) protocol. Jaccard’s coefficient and Naive Bayes (NB) classifiers were used to measure the similarity between captured traffic and predefined website fingerprints. Their experiments show that if IP packets are padded and frequencies of packet lengths are considered, the NB classifier is more robust than Jaccard’s classifier. They identified websites on a simple SSH tunnel with an accuracy of about 70%.

The second class focuses on fingerprinting traffic encrypted by anonymity systems. Herrmann et al. [12] identify websites under popular encryption methods using a text mining technique. They used a Multinomial Naive Bayes (MNB) classifier trained on the frequency distributions of IP packet lengths. They optimized their classifier by applying a set of text mining transformations, achieving higher accuracy than previous work under comparable conditions. Their experiments show an excellent accuracy of 96% against single-hop encryption systems (e.g., SSL and OpenSSH), while they obtained much lower accuracy on multi-hop systems: 20% on JAP and 2.96% on Tor. This indicates that website fingerprinting on Tor is more challenging than on other encryption systems.

Shi and Matsuura [25] proposed a novel method for website fingerprinting in which they divide both the incoming and outgoing packets into several intervals and convert these intervals into vectors. The similarity between observed vectors and predefined fingerprints is calculated using their proposed formula. Their practical and theoretical evaluations show that the method is effective at degrading the anonymity of Tor users.

Panchenko et al. [10] proposed a website fingerprinting attack on the Tor and JAP anonymity systems using a Support Vector Machine (SVM) classifier. They represented a traffic trace as a sequence of packet lengths, where incoming and outgoing packets are distinguished by negative and positive integers. In addition, they injected extra features into these sequences to raise the classification accuracy, such as size markers (whenever the flow direction changes, insert the total size of packets in the interval), number markers (the number of packets in every interval), and the total number of transmitted bytes. They used the Weka tool to fine-tune the SVM parameters and evaluated their method under both closed-world and open-world scenarios. In the closed-world scenario, they conducted experiments on the same dataset as Herrmann et al. [12], estimating accuracy using ten-fold cross-validation. In the open-world scenario, their experimental results show that their approach improves website recognition rates from 3 to 55% on JAP and from 20 to 80% on Tor.

Cai et al. [13] implemented string alignment using the Damerau-Levenshtein distance algorithm to compare previously built fingerprints with observed traffic, based on packet size and direction features. They identified web pages with an accuracy of 87.3% in the closed-world model. They also classified whole websites, instead of individual web pages, using Hidden Markov Models (HMMs).

Wang and Goldberg [26] improved the accuracy of website fingerprinting by interpreting Tor data cells as units instead of TCP/IP packet sizes and by removing Tor SENDME cells, which provide no useful data, in order to reduce noise. They compared the similarity between predefined fingerprint instances and observed traffic instances using a new Optimal String Alignment Distance (OSAD) metric with limited computational resources. Their closed-world experiments show that their method achieves a better accuracy rate (91%) than previous work.

Most of the related works used only one browser for their empirical analysis, namely Firefox (Tor Browser). The only work that studied the website fingerprinting attack across several web browsers is that of Zhioua [27]. Zhioua’s work introduced five measures to distinguish differences between the traffic of the various web browsers. However, that work did not provide the reasons behind the observed differences. In this regard, our current work can be seen as complementary to Zhioua’s work in that we provide a detailed study of the root causes behind the resistance level of each web browser. It is important to mention that the results obtained in our work closely match the results in Zhioua’s work.

Our empirical analysis examines the impact of browsers on the encrypted traffic pattern through a more detailed and fine-grained inspection of web browser technologies and third-party loaded objects. These factors have a clear impact on the traffic pattern produced by each browser and thus lead to different resistance levels against traffic analysis attacks. The resistance of each studied browser is evaluated in the next section.

3 Implementation and evaluation

In order to investigate the resistance level of browsers to website fingerprinting, we implemented a website fingerprinting attack against the most commonly used web browsers using the software tools listed in Table 1.

Table 1 Software tools used in our experimental evaluation

3.1 Threat model

Unlike typical attacks on anonymity systems such as [1] and [19], which assume a global active adversary, we consider a much simpler local passive adversary. A global active adversary is capable of tapping any network wire he or she requires and can inject any number of compromised relays into the network. As one can notice, such an adversary is extremely powerful and can realistically only be an Internet service provider (ISP) [28, 29] or a nation-state global surveillance system [30, 31]. A local passive adversary, on the other hand, is only capable of monitoring a particular user’s traffic without modifying or stopping it. One example of an attack that can be launched under such a modest threat model is the timing correlation attack [32].

One of the design goals of the Tor anonymity system is to protect only against a local passive adversary. Tor is not designed to protect against global adversaries, because this would require significant packet delays, whereas Tor is a low-latency protocol.

Launching a website fingerprinting attack does not require a powerful global adversary; a local passive adversary suffices. Hence, in this research, only a local passive adversary is considered, as shown in (Fig. 3). Such a threat model is very common and does not require a significant amount of resources. Indeed, the adversary only observes and records the traffic packets between the client and the first relay it uses. More importantly, no packet generation, modification, deletion, or delay is performed, which makes the attack very lightweight and stealthy.

Fig. 3
figure 3

Local passive adversary model

Our website fingerprinting attack model goes through the set of phases illustrated in (Fig. 4).

Fig. 4
figure 4

An overview of our website fingerprinting attack model over Tor

3.2 Data sets

Most existing works on website fingerprinting (e.g., [10, 13, 25]) used the list of the most popular Internet webpages from the web statistics service Alexa (Alexa Top Ranked). To investigate the effectiveness of our fingerprinting approach in a practical environment, we selected the top 20 most visited websites from the Alexa Top Ranked list, which are listed in Table 2.

Table 2 Data Sets of our website fingerprinting

We selected this dataset because of its global popularity and because it is representative of diverse activities on the Internet, such as online shopping (e.g., amazon.com, tmall.com, ebay.com), search engines (e.g., google.com, yahoo.com, ask.com), and social websites (e.g., facebook.com, twitter.com, linkedin.com). Due to their popularity, most of these sites might be targets of website fingerprinting attacks, and some of them are currently banned in certain countries (e.g., Facebook and Twitter), which makes them likely to be accessed through Tor for anonymous browsing. Therefore, this dataset provides enough statistical significance for evaluating web browsers with respect to website fingerprinting, as well as applicability to realistic scenarios.

Our dataset is collected over the most commonly used web browsers. To the best of our knowledge, we are the first to consider website fingerprinting over the top five web browsers, whereas most existing works used a single web browser, typically Firefox. Table 3 compares our dataset and browsers with those of relevant existing works.

Table 3 The dataset and browsers of our approach compared to relevant previous works

3.3 Data collection

We collected traffic traces and conducted several experiments using automation scripts. Our study proceeds in three phases. The first phase is data collection, where the encrypted traffic resulting from retrieving websites with the top five browsers is captured. The second phase is feature extraction, where the raw data of the collected traffic packets is preprocessed by removing noisy data, extracting features, etc. Finally, the polished features/characteristics of the encrypted website traffic are analyzed in the data classification phase (Sect. 3.5). These phases are explained in detail in the subsequent sections.

In order to create website fingerprints, we first established a Tor network connection and configured the web browsers to use the Tor proxy, so that all HTTP traffic is tunneled through the Tor anonymity system. We then collected our dataset under a closed-world scenario, in which the attacker creates fingerprints for a list of websites; when a victim visits a website from this predefined/fingerprinted list, the attacker matches the fingerprints and infers the identity of the visited website.

Website visits are automated through Python scripts that simulate a typical user action, namely typing the Uniform Resource Locator (URL) of a website into the browser’s address bar. The scripts drive each browser to visit the list of 20 websites; each browser visited each website 15 times in a round-robin fashion. The scripts remove the browser cache after each visit and limit each visit to a maximum of 25 s. We repeated this process for each of the five studied browsers.
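The following is a minimal sketch of this automation, assuming each browser exposes a command-line launcher that accepts a URL; the browser commands, the site list, and the cache-clearing stub are illustrative placeholders rather than our exact scripts.

```python
# Hypothetical sketch of the visit automation: launch a browser with a URL,
# give the page up to 25 s to load, then terminate the browser and clear its cache.
import subprocess
import time

WEBSITES = ["https://www.google.com", "https://www.amazon.com"]   # 20 sites in the study
BROWSERS = {"firefox": ["firefox"], "opera": ["opera"]}            # assumed launch commands
VISITS_PER_SITE = 15
VISIT_TIMEOUT = 25   # seconds allotted to each page load

def clear_cache(browser_name):
    """Placeholder: cache clearing is browser-specific (profile reset, CLI flag, etc.)."""
    pass

for visit in range(VISITS_PER_SITE):                    # round-robin over visits
    for browser_name, launch_cmd in BROWSERS.items():
        for url in WEBSITES:
            proc = subprocess.Popen(launch_cmd + [url])  # simulates typing the URL
            time.sleep(VISIT_TIMEOUT)                    # allow the page to load
            proc.terminate()                             # end the visit
            clear_cache(browser_name)
```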

During the web browsing visits, we scripted tshark, the command-line version of the Wireshark traffic analyzer, to capture the traffic packets of each visited website and record them in trace/log files. We ended up with 1500 log/trace files (20 web pages × 15 visits × 5 browsers). Each log file is labeled with the browser name, the visited web page, and the visit number for further analysis.
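A sketch of the capture wrapper is shown below; it assumes tshark is installed, and the capture interface name and output directory are placeholder values.

```python
# Hypothetical sketch: start a tshark capture before each visit and write the packets
# to a trace file labelled with browser name, web page, and visit number.
import subprocess

def start_capture(browser_name, page, visit, iface="eth0", out_dir="traces"):
    out_file = f"{out_dir}/{browser_name}_{page}_{visit:02d}.pcap"
    # -i: capture interface, -w: write raw packets to the given file
    return subprocess.Popen(["tshark", "-i", iface, "-w", out_file],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Usage around a single visit:
#   cap = start_capture("opera", "amazon.com", 3)
#   ... launch the browser visit and wait up to 25 s ...
#   cap.terminate()
```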

3.4 Features extraction

The data collection process produced 1500 dump files containing raw information about the network packets resulting from the website visits. To preprocess the collected raw data for the classification phase, we conducted multiple feature extraction passes to extract the webpage features (packet size, direction, and order) needed to generate webpage fingerprints. The feature extraction process is illustrated in (Fig. 5).

Fig. 5
figure 5

Feature extraction process used to generate webpage fingerprints for classification

In the feature extraction phase, we conducted several processes. First, we removed the Acknowledgment (ACK) packets of Transmission Control Protocol (TCP) connections, as they add no useful information to the classification phase. Moreover, Cai et al. [13] report that keeping TCP ACK packets reduces the performance of the classifier, since ACK packets acknowledging receipt of every packet make webpage traces more similar to each other. Hence, we deleted all TCP ACK packets, i.e., packets of 40 and 52 bytes, from our webpage traces. Second, based on the IP addresses of Tor relays, we separated Tor packets from any non-Tor packets that might have been captured during data collection. Finally, we built a scripting program to extract the traffic features (packet size, order, and direction) that represent a “webpage fingerprint” for each visited webpage. The extracted fingerprints are represented as sequences of signed integers in the observed order: a positive integer (e.g., 565) denotes an outgoing packet, and a negative integer (e.g., −1360) denotes an incoming packet. This form (e.g., 565, −1360, −88, −1271, 565, −1360, −264, …) of website fingerprints is passed to the classification process.
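The sketch below illustrates this extraction step, assuming the captured packets have been exported to CSV rows of (source IP, destination IP, frame length); the client address and relay list are hypothetical placeholders.

```python
# Hypothetical sketch of fingerprint extraction: drop TCP ACK-sized packets,
# keep only packets exchanged with known Tor relays, and encode each remaining
# packet as a signed size (positive = outgoing, negative = incoming).
import csv

ACK_SIZES = {40, 52}                # TCP ACK packet lengths removed from traces
TOR_RELAY_IPS = {"185.220.101.1"}   # placeholder set of Tor relay addresses
CLIENT_IP = "192.168.1.10"          # placeholder Tor client address

def extract_fingerprint(trace_csv_path):
    fingerprint = []
    with open(trace_csv_path, newline="") as f:
        for src_ip, dst_ip, frame_len in csv.reader(f):
            size = int(frame_len)
            if size in ACK_SIZES:
                continue                                    # discard bare TCP ACKs
            if src_ip not in TOR_RELAY_IPS and dst_ip not in TOR_RELAY_IPS:
                continue                                    # keep only Tor traffic
            # outgoing packets (client -> relay) are positive, incoming are negative
            fingerprint.append(size if src_ip == CLIENT_IP else -size)
    return fingerprint
```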

3.5 Data classification

A variety of packet features (e.g., size, order, timing, and direction) can be used to decide whether two webpage fingerprints belong to the same webpage. In our approach, we use packet size, direction, and order. After collecting the webpage dump files and extracting features, we obtain webpage fingerprints represented as sequences of positive and negative integers (e.g., 565, −1360, −88, −1271, 565, −1360, −264, …), as shown in Fig. 5.

These fingerprints, which represent the website content transmitted by the browsers over the Tor anonymity system, are then classified. The packet size feature is represented as an integer in bytes (e.g., 565 bytes), and the sign (+ or −) in front of the integer represents the packet direction: a positive integer (e.g., 565) represents an outgoing packet (from the Tor client to the web server), while a negative integer (e.g., −1360) represents an incoming packet (from the web server to the Tor client). The order of transmitted packets is used for classification as well: the sequence of incoming and outgoing packets is stored in the observed order. Packet ordering is useful for identifying webpage fingerprints because, when a browser requests webpage objects, there is always some similarity in the ordering of incoming and outgoing packets across visits to the same webpage.

This representation of webpage features is then passed to classification. The prepared website fingerprints are classified using the method of Cai et al. [13], in which the similarity between website fingerprints is calculated using the Damerau-Levenshtein Distance (D-LD) algorithm [33]. The D-LD algorithm estimates the similarity between two fingerprints by calculating the number of operations (insertions, deletions, substitutions, and transpositions) required to transform trace t into trace t′, as in the example shown in (Fig. 6).

Fig. 6
figure 6

Calculating the similarity between webpages fingerprints

The smaller the number of operations needed to transform fingerprint t into fingerprint t′, the more similar the two fingerprints are, and the more likely they are to correspond to two visits to the same website. This distance metric is a good match for real network traffic, as its deletion, insertion, substitution, and transposition operations mimic packet dropping, retransmission, and reordering in real network behavior.
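A minimal sketch of this edit distance computation is given below. It implements the restricted Damerau-Levenshtein (optimal string alignment) variant over signed packet-size sequences; this is an illustrative implementation, not the exact code of Cai et al. [13].

```python
# Hypothetical sketch: restricted Damerau-Levenshtein (optimal string alignment)
# distance between two fingerprints, i.e. lists of signed packet sizes.
def dld_distance(t, u):
    n, m = len(t), len(u)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                        # delete all of t[:i]
    for j in range(m + 1):
        d[0][j] = j                        # insert all of u[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if t[i - 1] == u[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and t[i - 1] == u[j - 2] and t[i - 2] == u[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[n][m]

# e.g. dld_distance([565, -1360, -88], [565, -88, -1360]) == 1 (one transposition)
```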

In our empirical analysis, we noticed large variations between traces even for visits to the same website, because of the dynamic nature of website retrieval. Therefore, the edit distances between website traces are normalized to compensate for those variations using the following formula.

$$L(t, t^{\prime}) = \frac{d(t, t^{\prime})}{\min(|t|, |t^{\prime}|)}$$

where d(t, t′) is the D-LD distance between trace t and trace t′, and |t| and |t′| denote the number of packets in traces t and t′, respectively. The classifier thus normalizes the distance by the minimum number of packets of the two traces.
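This normalization can be expressed directly in code; the short sketch below assumes an edit-distance callable such as the dld_distance function outlined above, and the pairwise-matrix helper is our own illustrative addition.

```python
# Normalized edit distance: raw distance divided by the length of the shorter trace.
import numpy as np

def normalized_distance(t, u, edit_distance):
    """L(t, t') = d(t, t') / min(|t|, |t'|); edit_distance is any callable, e.g. dld_distance."""
    return edit_distance(t, u) / min(len(t), len(u))

def pairwise_distances(traces, edit_distance):
    """Symmetric matrix of normalized distances between all collected traces."""
    n = len(traces)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = normalized_distance(traces[i], traces[j], edit_distance)
    return D
```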

The similarity distances between website fingerprints are then classified using a Support Vector Machine (SVM) classifier. The classification accuracy of the website fingerprints is given in (Fig. 7), which reflects the different resistance levels of the browsers to the website fingerprinting attack.
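As an illustration of this step, the sketch below feeds a precomputed matrix of normalized distances into scikit-learn's SVC with a precomputed kernel; the exponential conversion of distances into similarities and the parameter values are our own assumptions, not necessarily the exact setup of Cai et al. [13].

```python
# Hypothetical sketch: classify fingerprints with an SVM over a precomputed kernel
# derived from the pairwise normalized D-LD distances.
import numpy as np
from sklearn.svm import SVC

def train_classifier(D_train, labels, gamma=1.0, C=10.0):
    """D_train: (n_train, n_train) normalized distances; labels: website IDs."""
    K_train = np.exp(-gamma * D_train)       # convert distances to similarities
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K_train, labels)
    return clf

def classify(clf, D_test_train, gamma=1.0):
    """D_test_train[i, j]: distance between test trace i and training trace j."""
    return clf.predict(np.exp(-gamma * D_test_train))
```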

Fig. 7
figure 7

Classification accuracy of the website fingerprinting attack over Tor using different web browsers

The higher the classification accuracy of website fingerprints over a browser, the lower the privacy protection provided by that browser. The highest recognition rate, 74%, is achieved over IE, which indicates that it provides the lowest protection against the website fingerprinting attack. On the other hand, the lowest recognition rate, 41.6%, is obtained over Opera, followed by Safari at 53.8%, which indicates that their traffic patterns are more challenging for traffic analysis attackers. The recognition rates of Firefox and Chrome are 70.4% and 69.6%, respectively. We investigate the root causes behind the browsers’ different resistance levels in the subsequent sections.

4 Discussion

To the best of our knowledge, no other research in the literature evaluates website fingerprinting on the Tor anonymity system across browsers and identifies the root causes of traffic analysis efficiency. Our study investigates the accuracy of website fingerprinting under the most commonly used web browsers, namely Chrome, Firefox, Internet Explorer, Safari, and Opera. In this section, we discuss important factors that might affect the accuracy of website fingerprinting across different browsers.

4.1 Dynamic webpages

The website fingerprinting attack goes through a training phase in which several samples of each target website are collected and used for training. Webpages consist of static and dynamic objects; some webpages, such as Yahoo, have small ads that tend to change frequently. However, compared to the major part of the page content (e.g., the webpage template/structure), the ratio of dynamic/ad content is negligible, so that part of the page is treated as noise in the website fingerprinting attack. Other dynamic webpages (e.g., Amazon and YouTube) have their content updated from time to time, which might affect the consistency of their fingerprints. For such pages, if the training samples are not fresh (collected recently), the testing and evaluation phase will produce increasingly poor results as the traffic pattern changes. Hence, to deal with dynamic webpages, the webpage samples used for training should be refreshed frequently. This is a standard assumption that allows website fingerprinting attacks to maintain good accuracy in the presence of dynamic webpages.

Our choice in this study was to reflect the live conditions of an attacker, who cannot account for such differences in webpage content, since the aim of the research is to study website fingerprinting across browsers. The exact impact of dynamic webpages on the accuracy of website fingerprinting remains an open question and is a potential direction for future traffic analysis research.

4.2 Similarity of webpages

According to Shi and Matsuura [25], a webpage consists on average of about 20–30 objects (e.g., images, CSS files, and Flash objects), each with its own size. As a result, each webpage’s content has its own features (e.g., packet sizes and directions), which means that the proportion of distinguishable webpages is large. However, with dynamic webpages and varying network conditions, there is still a probability that two instances from different webpages match. To handle the case of similar instances collected from different webpages, we adopted the Damerau-Levenshtein Distance (D-LD) classification approach used by Cai et al. [13]. This approach combines the D-LD distance with an SVM and relies on several features extracted from the data, so even in the presence of apparently “similar” website traffic, the classifier can identify subtle differences that allow the pages to be distinguished. Cai et al. showed that the D-LD algorithm outperforms previously used distance algorithms in website fingerprinting, with high accuracy (more than 80%).

5 Root causes behind different resistance levels of browsers

The differences between web browsers span a wide range of features, from the visual level to the traffic/trace level. These variations are caused by the different functionalities supported by each browser. In this study, we focus on browser-dependent features that have an impact on the shape of the network traffic pattern, such as JavaScript handling, third-party advertising domains, browser performance optimizations, and parallel downloads. For these experiments, we turned off Tor so that the impact of these factors on the traffic pattern can be observed clearly. The next subsections present the results of several fine-grained tests of browser-dependent features and identify their impact on the shape of the network traffic pattern. The experiments are conducted using stable browser versions, with no add-ons/extensions and without any external programs that might affect the behavior of the tested browsers.

5.1 Browsers’ web technology features

To identify the main causes behind the browsers’ different resistance levels to traffic analysis attacks, we carried out several tests. These tests evaluate the browsers’ support for various web technology features and how these features are reflected in the browsers’ traffic patterns when retrieving various web content.

We started by testing the browsers’ support for various web technologies using a set of tools [34,35,36]. Table 4 shows which of the tested web technologies are supported by each browser.

Table 4 Web technologies that affect the shape of web browsers’ traffic patterns

The Asynchronous Script Execution (ASE) feature [37] allows JavaScript to be loaded and executed asynchronously alongside other objects during page loading. We noticed that the Opera browser does not support the ASE feature, so all other web objects (e.g., images, text, and CSS) are blocked while Opera retrieves and executes JavaScript files. This behavior has a great impact on the order of retrieved objects and on page-loading performance, as shown in (Sect. 5.5).

The “Navigation Timing API” feature [38] provides a browser with accurate measurements of TCP connection establishment and of the time spent retrieving webpage content. The experimental results show that Opera and Safari do not support this timing feature, which affects their traffic patterns as illustrated in (Sect. 5.5).

The test results show that the ActiveX feature [39] is supported by Microsoft in its IE browser. This feature hosts ActiveX controls within webpage content, allowing such webpages to automatically download scripts, execute small applications, and embed animations such as banner ads. The results show that this behavior makes IE’s traffic pattern the most distinguishable compared to the others; as a result, the highest recognition rate of website fingerprints is achieved over the IE browser.

During our testing, we also noticed that the Opera browser includes a native Flash-blocking feature that blocks some Flash content, which results in more regular traffic patterns. The “cached compiled programs” feature, also supported by Opera, caches JavaScript libraries internally so that they can be reused during JavaScript retrieval without reloading. All of the features above heavily affect the browser-side footprint when retrieving rich website content.

5.2 The impact of browsers-dependent features on their traffic patterns

When a browser sends an HTML request, the corresponding server handles the request and delivers an HTML document to the browser. The browser’s rendering engine then parses the HTML document, and the objects embedded in the HTML code (e.g., images, JavaScript, and Flash) are fetched from their referenced servers. Each web page has its own fingerprint in terms of the number and types of web objects, and each browser retrieves website content differently based on its support for different web technology features, so each browser produces its own version of a website’s fingerprint. Characterizing browser features essentially involves characterizing the website objects (e.g., application/javascript, application/x-shockwave-flash) retrieved by the corresponding browser. Therefore, we matched each browser feature mentioned above with its corresponding behavior in website retrieval to see its impact on the browser’s traffic pattern. We conducted a deep traffic analysis and aggregated analytics on real browsing data, aggregating web browsing traffic based on a number of metrics presented in (Table 5). These metrics are evaluated in the next sections.

Table 5 Traffic metrics generated by browsers

5.3 The impact of retrieved Java Scripts on browsers’ traffic patterns

JavaScript is enabled by default in all major browsers. Alexa reports that 98 out of the 100 most popular websites use JavaScript [40]. Furthermore, Michael et al. showed that JavaScript content accounts for about 25% of all downloaded web content [41]. Each browser has its own JavaScript engine, so it behaves differently when executing JavaScript files depending on its support for the web technology features mentioned above. JavaScript code is loaded and executed on the browser side, and there are APIs that allow scripts to communicate with remote servers [22]. As a result, loading and executing JavaScript content has a significant impact on the shape of the browsers’ traffic patterns. Our empirical analysis shows that different browsers’ JavaScript engines load and execute different amounts of JavaScript data, as shown in (Fig. 8).

Fig. 8
figure 8

Amount of JS data retrieved by different web browsers

The results in (Fig. 8) show that IE has the largest amount of JS data flow, followed by FF and Chrome. On the other hand, the lowest amount of JS data is fetched by Opera, followed by Safari.

IE has the largest amount of JS data flow because Microsoft provides the distinctive ActiveX web technology feature. This feature allows certain webpages to automatically execute small applications and download scripts/animations in order to enhance the browsing experience, as evaluated in (Sect. 5.1). Moreover, IE loads all Flash animation content created with the JavaScript Flash language (JSFL), as it is integrated with Adobe Flash by default. The impact of these features is reflected in IE’s traffic pattern, which makes it more distinguishable for website fingerprinting. As a result, IE has the lowest resistance to traffic analysis attacks, as confirmed by the results in (Fig. 7).

The FF and Chrome browsers are built with a security mechanism called “Safe Browsing” [42], which blocks suspected JavaScript files in order to provide additional phishing and malware protection. Accordingly, our results (Fig. 8) show that FF and Chrome retrieved similar amounts of JS data. Safari retrieves a relatively small amount of JS data because Apple maintains an updated blacklist of malicious JS and Flash, so Safari blocks certain versions of JS and Flash provided by some websites. The lowest amount of JS data is retrieved by Opera. The reason is that Opera’s JS engine, Carakan [43], implements internal caching of compiled JS programs, as shown in (Table 4). This feature is quite effective in reducing the amount of retrieved JS data because the same JS libraries can be reused internally without being reloaded. Furthermore, the results in (Table 4) show that Opera also has a Flash-blocking feature. These supported features are reflected in the shape of Opera’s network traffic, which gives its traffic pattern the lowest recognition rate for the website fingerprinting attack, as shown in (Fig. 7).

5.4 The impact of third-party domains’ web contents on browsers’ traffic pattern

Modern websites comprise a mix of web content and services distributed across multiple servers/domains. Indeed, webpage content is fetched from third-party servers located in several domains. For example, a visited webpage can host services from digital advertising publishers (e.g., doubleclickbygoogle.com), analytics services that track web users’ activities (e.g., quantcast.com, analytics.google.com), and content distribution networks (e.g., limelight.com, akamai.com). The network traffic (advertising, analytics, and content distribution services) fetched from these third-party servers comprises a significant fraction of total website traffic. Therefore, in our study we conduct a deep analysis to determine the impact of third-party network traffic on the shape of the observed traffic patterns. To assess this factor, we tracked the number of third-party domains contacted by each web browser (Fig. 9).

Fig. 9
figure 9

Number of third-party servers contacted by web browsers

The results show that the IE browser communicated with the highest number of third-party servers. This is because Adobe Flash is integrated with IE by default, so Flash ads are loaded from all contacted third-party servers, as shown in (Fig. 9). The Chrome browser filters pop-up ads as a security precaution, and the FF browser has a sandbox security model [44] that restricts access to data from third-party domains, as shown in (Fig. 9). The Safari browser retrieves data from the smallest number of third-party servers, as it blocks third-party servers that rely on cookies; hence, third-party services that require cookies to be enabled are restricted by Safari. We noticed that the Opera browser, which has stricter security restrictions, blocks data from all third-party servers, as shown in (Fig. 9). Comparing (Figs. 7, 9), one can notice that the higher the number of third-party domains contacted by a web browser, the lower that browser’s resistance to the website fingerprinting attack.

5.5 The impact of simultaneous objects retrieval on browsers’ traffic pattern

When a browser requests the URL in its address bar, the fetched HTML document is retrieved along with its embedded resources (e.g., images, scripts, style sheets, and Flash). Requesting each element individually through separate HTTP requests makes the retrieval process slow. To reduce this performance issue, web browsers retrieve several objects simultaneously by opening several connections and loading website content in parallel. The browsers’ behavior in parallel downloading and loading-time management is a major source of variation in their traffic patterns. In this section, we evaluate the differences in how web browsers parallelize the retrieval of website content.

To this end, we analyze web browsers’ traffic patterns during website visits. We notice that web browsers differ in how they parallelize the retrieval of website content and how they optimize loading time. The results (Fig. 10) show that Opera’s parallel download mechanism deviates from the other browsers because Opera does not support the Asynchronous Script Execution feature described in (Table 4). Therefore, when Opera retrieves a website’s scripts, it blocks all other downloads until each script is loaded, parsed, and executed, as seen in (Fig. 10).

Fig. 10
figure 10

Impact of JavaScript blocking on Opera’s traffic pattern

The waterfall chart in (Fig. 10) shows the staircase pattern of Opera’s retrieval behavior, in which JavaScript intervals block Opera from fetching website content in parallel. These intervals result in large inconsistencies in the order of retrieved website content, which in turn create large variations in website fingerprints/traces. This behavior explains why Opera yields the lowest accuracy in our website fingerprinting results presented in (Fig. 7).

Web browsers implement different techniques to optimize the retrieval time of website content. The W3C Navigation Timing API specification, which is supported by the Chrome, FF, and IE browsers, improves the optimization of their loading time [45]. Hence, web browsers that support the Navigation Timing API have a major retrieval-time advantage, which is reflected in the shape of their traffic patterns. Indeed, this feature provides web browsers with fine-grained measurements of real browsing timings (e.g., TCP connection setup and timing information related to loaded elements). Dutton [46] provides more details about this feature. Our results in (Fig. 11) show the average and standard deviation (SD) of the retrieval time for the Amazon website visited with different web browsers.

Fig. 11
figure 11

The variations in retrieval time with different web browsers

The results in (Fig. 11) show that Opera and Safari have the largest SD of timings during website visits, and hence the largest variations in their website fingerprints/traces, as they do not support the Navigation Timing feature. This explains why we obtained the lowest accuracy in our website fingerprinting results (Fig. 7) when Opera and Safari were used. IE and FF have approximately the same SD, as they both support the Asynchronous Script Execution and Navigation Timing API features mentioned above. IE has the smallest SD, which indicates high stability in its retrieved website fingerprints and makes it very vulnerable to traffic analysis attacks, in particular website fingerprinting.

Although Chrome supports the Asynchronous Script Execution and Navigation Timing features, like FF and IE, the SD of Chrome’s retrieval timings is higher than that of FF and IE. The reason is that, in some website traces/fingerprints, we noticed that the Chrome browser downloads data (e.g., the most recent Safe Browsing list, apps, and themes) from Google’s servers, which are updated periodically. This data is loaded at random times during normal website visits with the “application/x-chrome-extension” content type, as observed in the left part of (Fig. 12), which shows Chrome’s fetching behavior and reveals the random updates from Google’s servers that affect its website fingerprints/traces.

Fig. 12
figure 12

Google’s servers related traffic in Chrome websites’ visits

6 Conclusion

The website fingerprinting attack is a recent variant of traffic analysis attacks that aims to reveal the identity of visited websites. This paper studies the resistance of the top five browsers to the website fingerprinting attack and identifies the main factors behind their different resistance levels. We implemented a real website fingerprinting attack on popular web browsers over the Tor anonymity system and provided a detailed illustration of how each stage of our attack model works. Furthermore, we carried out a set of experiments and analyses to identify the main factors that affect the accuracy of the website fingerprinting attack under each browser. Finally, we conducted a thorough comparison of web technologies and techniques showing how each browser behaves in terms of its network traffic pattern and how this is reflected in the fingerprinting accuracy.

As future work, we recommend building a more extensive website fingerprinting attack model to investigate the resistance of web browsers at a finer granularity, across classes of websites (e.g., business, news, social networking, gallery, gaming, and search engine sites), in order to evaluate the extent to which each browser protects against website fingerprinting. In addition, this attack framework can be extended to investigate traffic analysis attacks on more restricted mobile browsers (e.g., Android browsers, Dolphin, and iOS Safari), since many people today use smart devices such as mobile phones and tablets for their daily Internet browsing.