1 Introduction

In the past, Internet traffic relied on the client–server paradigm, in which a client requested data and a server provided it, leading to asymmetric network traffic. With the evolution of the Internet and the so-called Web 2.0, Internet hosts became able to provide their own multimedia content, which could be shared with other peers on the Internet. Further, Peer-to-Peer (P2P) traffic started evolving towards the end of the 20th century, incorporating direct distribution of content between peers. In such a scenario, peers act as client and server simultaneously, downloading the content they require from other peers and distributing their own content to others. As a result, network traffic has become more symmetric. From the network management point of view, P2P traffic needs to be identified because it flows in both directions at the same time and thus consumes more bandwidth. In a P2P system, peers share the distribution cost of the service instead of relying on a dedicated server. This is advantageous for service providers distributing content, but it comes at the cost of more traffic in the network. Searching for content held by remote peers increases the number of communications between peers, resulting in a large number of connections compared to a client–server system, where only a few connections are formed. Thus, P2P systems produce a large amount of traffic compared to client–server systems. Network traffic therefore needs to be monitored and controlled so that P2P traffic alone does not consume a large portion of the available bandwidth, and a balance needs to be maintained so that other kinds of traffic such as HTTP, FTP, SMTP, etc. also get their fair share of bandwidth. This ensures that the Internet Service Provider (ISP) is able to provide Quality of Service for each application by implementing specific policies. Further, conventional devices are unable to control P2P traffic effectively, so ISPs face several other challenges, such as paying for added traffic, satisfying customers with an excellent broadband experience, and purchasing costly backbone links and upstream bandwidth.

Internet traffic has been growing rapidly over the past few years [1]. This is attributed to the rapid growth of P2P traffic, with various types of applications emerging over time. Application protocols such as HTTP, SMTP, etc. no longer dominate Internet traffic, which has instead been taken over by P2P traffic to a large extent [2]. P2P file sharing has been a significant trend in recent years. The content shared or distributed through P2P applications is mainly audio, video and games, which tend to be large in size [3]. This also includes illegal file sharing. P2P applications nowadays account for more than 60 % of total network traffic [2, 4–6] and thus consume a major portion of network bandwidth. Azzouna and Guillemin [7] found in their study that 49 % of the traffic on an Asymmetric Digital Subscriber Line (ADSL) link was due to P2P applications. A worldwide study of Internet traffic conducted by ipoque [8] in 2007 showed that P2P file-sharing applications produce more traffic than all other applications taken together. Therefore, identifying the application that produces the traffic is crucial for tasks such as implementing billing mechanisms, maintaining Quality of Service for applications, implementing security measures, etc. This is now a very difficult task, as numerous issues are associated with it.

The traditional method of network traffic classification associates transport-layer port numbers with well-known application protocols. This technique soon became ineffective as various applications started using random port numbers for data transfer. Some other applications used masquerading techniques, utilizing well-known port numbers (such as port number 80, used by HTTP) to hide their traffic. Karagiannis et al. [9] identified that many P2P applications utilize port number 80 to transfer their data and also found that 30 to 70 % of the traffic generated by P2P applications utilized random port numbers. Madhukar and Williamson [10] showed in their study that Internet traffic could not be identified correctly by using port-based methods. Due to these issues, another technique based on payload inspection was adopted. Although this technique proved to be highly accurate, it has various limitations, such as the requirement of a large amount of computational resources, the privacy issues involved, and its inability to work when the payload is encrypted. Hence, another alternative was adopted that identifies traffic using statistical or behavioural properties such as packet length, number of packets sent, number of packets received, etc., and does not share the limitations of port-based or payload-based techniques.

The main goal of this survey is to provide a comprehensive overview of traditional as well as current techniques for classifying P2P traffic. Although some surveys on Internet traffic classification exist [11, 12], this survey explicitly focuses on identifying P2P traffic, which is one of the major contributors to Internet traffic. It explains the working of various techniques along with the advantages and limitations of each. The remainder of this survey is organised as follows. Section 2 describes some related work in traffic classification. To give a better understanding of traffic identification, Section 3 addresses some important concepts and techniques from the viewpoint of traffic monitoring. Verification of the ground truth of traffic is discussed in Section 4. Section 5 covers various metrics that can be used to evaluate the performance of the techniques. Section 6 covers various P2P classification techniques from the published literature and is followed by the Conclusion section.

2 Related work

The topic of network traffic identification and hence classification has gained more interest recently in scientific contributions due to various factors associated with it, such as providing network security, quality of service for applications, billing information, among others. As new applications and protocols keep on emerging over time, various studies propose novel techniques to address the challenges posed by them in their identification process.

For the identification of P2P traffic, Madhukar and Williamson [10] compared three distinct techniques in terms of efficiency, namely port-based, payload-based and transport-layer heuristics. To provide a longitudinal performance study of each technique, they evaluated each method on traffic traces collected over a period of 2 years. Li et al. [13] compared four different classification methods in terms of effectiveness and efficiency, namely port-based, payload-based, C4.5 decision tree and Naive Bayes. The authors collected traffic traces over several years at two different locations in order to evaluate performance from spatial and temporal perspectives. Nguyen and Armitage [11] provided a survey on traffic classification based on Machine Learning techniques that focused on the identification of application-level protocols. The authors also described the issues posed by recent Internet applications in the classification process and, by highlighting the limitations of older classification techniques, the reasons for developing newer ones. Callado et al. [12] gave an introduction to traffic analysis and described the state of the art of flow-based traffic analysis using several flow properties of the Internet. They also explained various research works conducted using distinct traffic classification techniques and theoretically compared the results obtained by them.

This survey focuses mainly on P2P traffic classification and the various challenges associated with it. First, traffic measurement is introduced from the viewpoint of traffic classification in order to provide a better understanding of the topic. Then, various approaches are compared and analysed, and an overview of the techniques, studies and approaches for identifying P2P traffic is presented.

3 Network traffic measurement

Over the past few decades, various authors have highlighted the role of Internet/network traffic measurement, which is crucial for understanding the behaviour of computer networks [14–16]. It is not an easy task, as it involves many issues and challenges. Paxson [16] mentioned some of the challenges encountered while performing this task, along with some approaches for conducting sound Internet measurements. McGregor [15] also describes several technical challenges in conducting quality measurements. The next subsection discusses some important concepts and techniques which should be considered while conducting traffic measurement.

3.1 Measurement of internet traffic

Williamson in [14] categorised the research tools for the purpose of network study as: Online & Offline, LAN & WAN, Hardware & Software, Protocol level, and Active & Passive. The significance of each category depends upon the research purpose. Their brief description for the purpose of traffic classification is given below:

Online and offline

The Online approach involves analysing traffic while it is flowing through the network. This process requires high computational power and resources in high-speed networks, but is very useful in applications such as NIDSs and firewalls, where instant decisions or actions are required for the packets currently flowing in the network. The Offline approach, in contrast, involves collecting network traces into a file for analysis at a later time, when the packets have already crossed the network. This approach is mostly preferred when real-time analysis is not required. It is also useful for research and validation, as one can run several approaches on the same set of traces and compare the results.

LAN and WAN

Measurements conducted for traffic classification purposes are preferably done on a LAN instead of a WAN, since the former involves no loss of information, whereas access to the latter is difficult to obtain.

Hardware and software

Dedicated hardware tends to give better performance, which is useful in real-time analysis. For the purpose of traffic measurement, monitoring or capturing, some companies such as Endace [17], ipoque [18], Wildpackets [19] and Napatech [20] provide hardware-based solutions. As researchers involved in traffic classification are mostly interested in analysing the IP packets or Ethernet frames in the network, it is of less significance whether the analysis is done using a hardware-based or a software-based solution.

Protocol level

Traffic measurement can be performed at different protocol levels or even multiple protocol levels; but for the purpose of traffic classification, mostly Internet traffic is measured at IP level or Ethernet level by the researchers.

Active and passive

The Active approach involves injecting actual packets into the network to analyse the behaviour of the traffic. It allows one to control the simulation scenario, such as the type of traffic flowing in the network, its frequency, etc. Its limitation is that it puts extra load on the network bandwidth and can affect the performance of routers or switches. Also, this approach does not truly reflect the actual behaviour of the traffic flowing in the network, which may affect the results. The Passive approach, on the other hand, does not inject any packets into the network; it captures and analyses the actual traffic flowing through it. Hence, it does not affect the bandwidth or the performance of any network equipment, and measurements made using this approach reflect the actual behaviour or properties of real traffic. Its limitation is that it produces a large amount of data which needs to be processed and analysed in order to obtain useful information.

3.2 Measurements on basis of Per-Flow and Per-Packet

For traffic identification or classification purposes, researchers mostly focus on IP packets or Ethernet frames. In the Per-Packet approach, each individual packet travelling in the network is captured for the purpose of analysing the traffic. It can be useful in certain scenarios such as Network Intrusion Detection Systems (using tools like Snort [21] and Bro [22]), where decisions need to be made on each packet travelling through the network. These packets can also be captured and stored for offline analysis using tools such as Wireshark [23] and Ettercap [24], which can inspect each individual packet and mine useful information from all layers of the protocol stack.

Although packets flowing through the network are individual data units, certain relationships exist between them, such as packets generated by the same request or response, or packets carrying data of the same application. Such hidden information can be mined using Per-Flow analysis. A flow is mostly defined as a set of packets sharing common characteristics: Source-IP, Destination-IP, Source-Port, Destination-Port and Protocol [25–27]. It is considered an active flow as long as the time interval between packets belonging to it stays below a certain threshold value, which depends upon the purpose of the analysis or study. Claffy et al. [28] identified that a threshold value of 64 s is a good compromise considering the size of flows and the initialization and termination of flows. Also, a flow can be defined as unidirectional if packets travelling in each direction are considered separately as two independent flows [28, 29], or as bidirectional if no differentiation is made between packets travelling in each direction, so that both directions form a single flow [28]. Unidirectional flows are useful in studies such as measuring network performance and bandwidth management, where there is a need to measure differences in traffic between the two directions. Bidirectional flows, on the other hand, are useful in scenarios such as analysing TCP sessions, and are more appropriate for traffic classification, since the traffic flowing between the two sides belongs to the same class and is generated by the same application. For performing flow-based analysis, tools such as Coral-Reef [30] can analyse traffic from network adaptors or from offline packet traces, while tools such as Cisco Netflow [31] and Internet Protocol Flow Information eXport (IPFIX) [32] can receive the flow information directly from routers and other network elements.
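The flow definition above can be illustrated with a minimal Python sketch. It groups packets by the 5-tuple and starts a new flow whenever the idle time exceeds the 64 s threshold reported by Claffy et al. [28]. The `Packet` record, the function names and the `merge_directions` switch (which decides whether both directions of a conversation share one flow) are illustrative assumptions and are not taken from any of the cited tools.

```python
from collections import namedtuple

# Hypothetical packet record: timestamp (s), 5-tuple fields and size (bytes).
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto size")

FLOW_TIMEOUT = 64.0  # seconds; threshold value reported by Claffy et al. [28]


def flow_key(pkt, merge_directions=True):
    """Return the 5-tuple key of a packet.

    With merge_directions=True the two endpoints are sorted, so packets
    of both directions map to the same key; otherwise each direction
    keeps its own key.
    """
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    if merge_directions and b < a:
        a, b = b, a
    return (a, b, pkt.proto)


def build_flows(packets, merge_directions=True, timeout=FLOW_TIMEOUT):
    """Group packets into flows, starting a new flow after an idle timeout."""
    active = {}  # key -> index of the currently active flow in `flows`
    flows = []   # each flow is a list of packets in time order
    for pkt in sorted(packets, key=lambda p: p.ts):
        key = flow_key(pkt, merge_directions)
        idx = active.get(key)
        if idx is None or pkt.ts - flows[idx][-1].ts > timeout:
            flows.append([])
            idx = len(flows) - 1
            active[key] = idx
        flows[idx].append(pkt)
    return flows
```

Keeping the direction handling in a single switch makes it easy to run the same trace through both flow definitions and compare the resulting flow counts.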

3.3 Traffic data collection and trace reduction

Traffic data collection in a network should be done with care in order to protect users’ privacy and other data containing sensitive information. Some good practices and considerations are mentioned in [33]. In the Passive approach, traffic can be captured by polling routers to obtain flow data using protocols like IPFIX, or trace files can be created by capturing packets with software such as tcpdump [34], WinDump (Windows version) [35], or other available tools based on the libpcap [34] or WinPcap [35] libraries. However, such techniques generate large trace files, which require more processing power and storage space in high-speed networks. To handle this issue, trace reduction can be performed, which reduces the amount of data collected by applying packet filtering techniques. One may focus exclusively on capturing traffic belonging to a particular application, which can be done using transport-layer port numbers. Alternatively, depending upon the technique used to classify traffic, one may capture only the packets that request or establish a connection, or only the first few packets of a flow. Trace files can also be reduced: i) by storing a summary of the protocol-specific requests of each application; ii) by capturing a limited number of packets instead of all packets of a flow; iii) by storing only the header information of the TCP/IP protocol stack; or iv) by storing just the flow information instead of the information of each packet. Further, packet filtering can also be done using various packet sampling methods, where packets are chosen randomly (or pseudo-randomly) for analysis in such a way that they represent, to a great extent, the traffic which one wants to measure. The choice of sampling method depends upon the purpose of the study, the state of the network, traffic characteristics, resource constraints, etc. Jurga and Hulbój [36] and Duffield [26] elaborated on the subject of packet sampling in traffic measurement.
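Some of the reduction strategies described above can be sketched in a few lines of Python. The sketch below shows count-based systematic sampling, independent random sampling and truncation of each flow to its first few packets. All function names and parameters are illustrative assumptions; the sampling schemes actually analysed in [26] and [36] are more varied.

```python
import random


def systematic_sample(packets, n):
    """Keep every n-th packet (count-based systematic sampling)."""
    return [pkt for i, pkt in enumerate(packets) if i % n == 0]


def random_sample(packets, probability, seed=None):
    """Keep each packet independently with the given probability."""
    rng = random.Random(seed)
    return [pkt for pkt in packets if rng.random() < probability]


def first_packets_per_flow(packets, key, limit):
    """Keep only the first `limit` packets of each flow.

    `key` is a function that maps a packet to its flow identifier.
    """
    seen = {}
    kept = []
    for pkt in packets:
        k = key(pkt)
        seen[k] = seen.get(k, 0) + 1
        if seen[k] <= limit:
            kept.append(pkt)
    return kept
```

Passing a fixed `seed` to `random_sample` makes a sampled trace reproducible, which helps when several classification approaches are compared on the same reduced data.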

4 Verification of ground truth of traffic

In early days, traffic identification was an easy task which involved port-based identification by mapping transport layer port numbers with the applications or signature-based identification by matching payload signatures with application protocols. But, as various Internet applications, especially P2P applications evolved, the traditional approaches for traffic identification started becoming ineffective, as applications based on P2P architecture used random or well-known port numbers to hide their traffic. Hence, in order to address various issues involved in traffic identification, several new techniques based on statistical or behavioural methods have been developed and adopted over the time.

In order to test a new technique for traffic classification, it is essential to assess the ground truth application information of the pre-collected traffic; otherwise the evaluation has very limited value [37]. Due to privacy concerns, publicly available packet traces contain only header information, which makes it difficult to verify the ground truth regarding the applications. This issue can be addressed if the packet traces are labelled with ground truth information before the headers are made publicly available. Another method is to verify the ground truth information of the traces manually [38], but this is very slow and only feasible for smaller datasets. One may also assess the ground truth by using port number matching or payload inspection [39], but these techniques have their own limitations: port-based matching is inconsistent, as many applications use random port numbers, whereas the DPI technique is ineffective if traffic is encrypted. Hence, using such approaches to verify the ground truth of the traffic would produce inconsistent results when testing newer techniques. Due to such issues, researchers mostly collect their own traffic traces to verify the ground truth of the applications and test the accuracy of their techniques; but this gives inconsistent results when comparing various methodologies, as their performance is evaluated under different conditions [40]. It is also possible to collect traffic traces from small computer networks which run pre-defined applications in a controlled environment, but such traces may not contain properties that reflect human behaviour. Some studies have tried to address the subject of ground truth verification. Canini et al. [41] presented a framework called GTVS for improving and simplifying the process of ground truth verification of application traffic, which makes use of a DPI mechanism and multiple heuristic rules. Gringoli et al. [42] proposed a toolset called GT, which includes a daemon that runs on each client and returns information about the process which initiated a network connection. A similar client-based approach is also proposed by Szabó et al. in [43].

None of the techniques proposed by various authors is perfect; each has its own merits and demerits. Hence, the measured performance of a new classification technique will depend upon the accuracy of the reference classification model, which may lose its effectiveness if the communication pattern of the applications changes. Therefore, a proper method for assessing the ground truth should be chosen by looking at the capabilities and limitations of each, as this is one of the factors on which the quality of the evaluation results depends.

5 Evaluation metrics for performance analysis

All network traffic classification techniques make use of some metrics in order to evaluate the classification results by comparing them with ground truth information of traces. Each individual case falls in one of the following categories:

a) True Positive (TP): It specifies that a case is correctly classified as belonging to a certain class.

b) True Negative (TN): It specifies that a case is correctly classified as not belonging to a certain class.

c) False Positive (FP): It specifies that a case is incorrectly classified as belonging to a certain class.

d) False Negative (FN): It specifies that a case is incorrectly classified as not belonging to a certain class.

A good classifier will minimize FP and FN. Various metrics for evaluating the performance of classifiers can be derived from TPs, TNs, FPs and FNs [44, 45]; some of them may be equivalent, but most measure different aspects of classification. Therefore, it is essential to know what is measured by a certain metric. The most commonly used metrics for traffic classification are defined as follows:

$$ \mathrm{Accuracy}=\left(\mathrm{TP}+\mathrm{TN}\right)/\left(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}\right) $$

Accuracy measures the capability of a classifier to identify positive and negative cases. It measures the overall effectiveness of the classification model and hence reflects its predictive power. However, relying only on accuracy is insufficient if imbalanced datasets are used, which have a large number of either positive or negative cases; in that case the result is dominated by the more popular class. Therefore, it is desirable to use additional metrics which can evaluate other aspects as well. The most popular are Recall and Precision, which are used together for evaluating classifiers [11] and are defined as follows:

$$ \mathrm{Recall}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right) $$
$$ \mathrm{Precision}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FP}\right) $$

Recall measures the percentage of the positive cases present in the dataset that are correctly identified by the classifier. It is also referred to as hit rate or true positive rate. Precision measures the percentage of the cases identified as positive by the classifier that are actually positive. It is also referred to as positive predictive value. Both precision and recall evaluate the ability of the classifier to correctly identify positive cases, but they share a limitation: neither gives information about the number of negative cases correctly classified. Therefore, if required, one can make use of another metric called Specificity [46], which can be used together with Recall to evaluate positive and negative cases separately (in that case, Recall is usually called Sensitivity [47]) and is defined as follows:

$$ \mathrm{Specificity}=\mathrm{TN}/\left(\mathrm{FP}+\mathrm{TN}\right) $$

Specificity measures the percentage of negative cases that are correctly identified as negative by the classifier. Karagiannis et al. [39] also defined another metric called Completeness, which they used together with Precision (which they refer to as accuracy) and which is defined as follows:

$$ \mathrm{Completeness}=\left(\mathrm{TP}+\mathrm{FP}\right)/\left(\mathrm{TP}+\mathrm{FN}\right) $$

Completeness measures the ratio of all cases classified as positive, whether correctly or incorrectly, to the total number of positive cases. Therefore, depending upon the context and purpose of each classifier, proper metrics should be chosen in order to evaluate it. Table 1 summarises the various metrics along with their definitions and the aspects they measure.
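As a small worked illustration of the definitions in this section, the following Python sketch computes the five metrics from the four case counts. The function name `evaluate` and the example counts are hypothetical; the example shows why accuracy alone can mislead on an imbalanced trace.

```python
def evaluate(tp, tn, fp, fn):
    """Compute the metrics of this section from the four case counts.

    A metric whose denominator is zero is reported as None.
    """
    def ratio(num, den):
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, fp + tn),
        "completeness": ratio(tp + fp, tp + fn),
    }


# Hypothetical imbalanced trace: 10 P2P flows among 1000 flows, and a
# classifier that labels every flow as non-P2P.
result = evaluate(tp=0, tn=990, fp=0, fn=10)
print(result)
```

In this hypothetical case the accuracy is 0.99 although no P2P flow is found: the recall is 0 and the precision is undefined, which is why the metrics are best read together.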

Table 1 Various evaluation metrics for performance measurement, where TP → true positive, TN → true negative, FP → false positive, FN → false negative

6 P2P traffic classification techniques

Earlier, traffic identification and hence classification was an easy task. However, as the P2P architecture evolved, it started using random port numbers or the port numbers assigned to other well-known protocols (such as HTTP), due to which another method based on inspection of payload was adopted to identify the application traffic, but that too had various limitations. So, new approaches employ statistical or behaviour-based methods that overcome various limitations which were present in traditional techniques. The following sections elaborate different types of techniques for traffic classification along with their merits and de-merits.

6.1 Port-based traffic classification

This technique relies on identifying application protocols using TCP or UDP port numbers, since each application is associated with well-defined port numbers which are assigned by the Internet Assigned Numbers Authority (IANA) [48]. For example, HTTP traffic uses port number 80, DNS traffic uses port number 53 and SMTP uses port number 25. This is a simple technique, as it relies only on packet headers, from which the port numbers are extracted. A classifier placed in the middle of the network looks for SYN packets (TCP packets used in the 3-way handshake to establish a connection) to determine the server side of a TCP connection, and identifies the type of traffic by looking up the target port number of the TCP SYN packet in IANA’s registered list of port numbers [48]. Similarly, UDP traffic can be identified using the port numbers used during communication between the hosts, although no connection establishment or maintenance takes place here. Gomes et al. [49] presented a list of TCP and UDP port numbers utilized by several well-known P2P protocols, which is shown in Table 2.
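The lookup described above can be sketched as follows. The port table contains only the three examples named in the text, and the function names and the `"unknown"` label are illustrative assumptions; a real classifier would load the complete IANA registry.

```python
# Port-to-application table limited to the examples given in the text.
WELL_KNOWN_PORTS = {80: "HTTP", 53: "DNS", 25: "SMTP"}


def classify_tcp(dst_port, syn, ack):
    """Classify a TCP connection from its initial SYN packet.

    Only a SYN without ACK reveals the server side of the connection,
    so any other packet is skipped (None is returned).
    """
    if not (syn and not ack):
        return None
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")


def classify_udp(src_port, dst_port):
    """Classify a UDP packet; without a handshake, both ports are tried."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"
```

For instance, `classify_tcp(80, syn=True, ack=False)` returns `"HTTP"`, while the same lookup for a port that is absent from the table returns `"unknown"`.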

Table 2 Various P2P protocols utilizing well-known port numbers

The main advantage of this technique is that it does not involve any calculations and hence identifies network traffic quickly. Its implementation is also simple, requiring only the addition of port numbers to the database when new applications emerge. However, with the evolution of the Internet, this approach started to become obsolete [10, 50, 51], as some applications such as P2P started using dynamic port numbers and port numbers which may not be registered with IANA (e.g. Napster and Kazaa) [52]. Also, in order to get through firewalls, many applications masquerade by hiding their traffic behind well-known port numbers such as port number 80, which maps to HTTP traffic. This technique also fails if there is encryption at the IP layer which obfuscates TCP or UDP port numbers, making it impossible to recognize the actual port numbers utilized by the applications.

Earlier, some P2P applications utilized specific port numbers or ranges, which could be used to identify P2P application protocols. Moore and Papagiannaki [50] found that a byte accuracy of at most 70 % could be achieved using the port-based classification technique. As port-based classification is a traditional technique, most of the related work is covered in [49].

6.2 Payload-based traffic classification

This technique is usually the most accurate and is based on inspecting packet headers and packet payloads. It relies on a database which contains previously stored signatures of application protocols. The packet payload is inspected bit-wise to locate a bit-stream that contains the signature (a pre-defined byte sequence) of an application protocol. Hence, the traffic can be identified accurately when the signatures found in the packets of a network application match the signatures stored in the database. For example, the ‘\xe3\x38’ string is contained in eDonkey P2P traffic, the ‘\GET’ string is contained in web traffic, and so on. This technique is employed not only for P2P traffic identification [51, 53, 54] but also in scenarios which involve the identification of threats, such as network intrusion detection [55], malicious data and other traffic anomalies. It is also significant for accounting solutions and charging mechanisms, where accuracy is crucial.
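A minimal Python sketch of signature matching is given below. The two signatures are the examples from the text, with the web signature written as the plain byte string `GET` (an assumption about the notation used above); the database layout, the function name and the optional `depth` parameter are illustrative assumptions as well, and real signature databases hold many more specific patterns per protocol.

```python
# Signature database: application name -> byte sequence.
# Both entries are taken from the examples in the text.
SIGNATURES = {
    "eDonkey": b"\xe3\x38",
    "Web": b"GET",
}


def classify_payload(payload, depth=None):
    """Return the names of all applications whose signature occurs in the payload.

    If `depth` is given, only the first `depth` bytes are inspected.
    """
    window = payload if depth is None else payload[:depth]
    return [app for app, sig in SIGNATURES.items() if sig in window]
```

Returning every matching application instead of only the first one makes ambiguous payloads visible, and a small `depth` shows how a signature located further into the payload can be missed.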

The main advantage of this technique is that it identifies network traffic fairly accurately. However, it also suffers from various limitations. It places significant complexity and processing load on the network equipment used to identify network traffic, which makes it infeasible in high-speed networks. To resolve this issue, some mechanisms inspect only a few packets of each flow, which is a compromise between accuracy and efficiency; in such cases, the signature may not be contained in the captured part, which may lead to inaccurate identification of traffic. The database or the device needs to be kept updated with the signatures of newly emerged application protocols, or else some new traffic may remain unidentified. Furthermore, it is difficult to maintain signatures with a high hit ratio and a low false-positive ratio. For example, the payloads of both Gnutella and HTTP traffic contain the ‘\GET’ string, which leads to ambiguity. The major drawback of this technique is that identification of network traffic becomes almost impossible if the traffic is encrypted or uses proprietary protocols. Direct analysis of packet payload may also breach the privacy policies of some organisations or violate relevant privacy legislation.

Song and Zhou [56] proposed a file-aware P2P traffic classification mechanism based on the DPI technique to identify a file and its associated flows. It consists of two strategies based on: i) per-file bandwidth consumption, and ii) the number of per-file concurrent active flows. This approach maintains 6-tuple (source-ip, destination-ip, source-port, destination-port, protocol and file-id) file-level information in a flow table. In order to reduce the computational overhead involved in the traditional DPI technique, pattern matching (involving only simple pattern sets) is performed at the beginning of the payload, and the depth of inspection is only a dozen bytes. The authors evaluated their approach on a dataset collected from a campus network, in which the majority of P2P applications were BitTorrent, eDonkey and Gnutella, and whose ground truth was verified using GTVS. The proposed approach achieved 100 % accuracy and completeness ranging from 88 to 93 %. As payload-based classification is a traditional technique, most of the related work is covered in [49].

6.3 Classification of traffic in the dark

As various limitations exist in the port-based and payload-based techniques, new approaches have been developed and adopted which do not rely on port numbers or payload inspection to identify the traffic. Such an approach is often called classification in the Dark [39, 57]; it classifies the traffic using generic properties of packets [38] such as packet size, total bytes sent, ports, etc. or by observing behavioural or statistical patterns of the flows. The main advantage of this technique is that it is able to classify the traffic without inspecting the payload or relying on port numbers. It is not as accurate as the payload-based technique, but recent studies have achieved good accuracy in classifying the traffic. This approach is also applicable to unknown applications, since methods based on it assign the traffic to a particular class instead of identifying specific applications. The various methods which fall under this approach are discussed as follows.

a) Statistical or behavioural signatures: Such methods rely on packet or flow level properties of traffic such as packet size, total bytes sent or received, flow duration, flow size, packet inter-arrival time, TCP or UDP ports used, etc., which can be used individually or collectively to calculate statistical measures such as average, variance and probability density function. In order to classify the traffic, such methods require a prior learning phase to build a reference model.

Freire et al. in [58] and [59] proposed a technique to identify VoIP calls hidden in Web traffic by analysing several properties of network data: size of Web request and response, number of per-page requests, inter-arrival time between requests and retrieval time of page. They evaluated their approach on VoIP data of Google-Talk and Skype which was collected from ISP and university links and achieved recall rates of about 90 % for VoIP calls and 100 % for VoIP calls hidden in Web traffic. Gomes et al. [60] analysed several P2P and non-P2P applications to identify their behaviour patterns and found that there is high heterogeneity in P2P packet sizes when compared to that of non-P2P traffic. The degree of heterogeneity was represented using entropy, whose value was calculated for a sliding window containing a fixed number of packets. It was found that P2P traffic related to VoIP services returned high entropy values while regular client–server traffic returned consistently smaller values. Sun and Chen [61] proposed a novel technique based on the C4.5 decision tree for identifying the application associated with a TCP flow, using two characteristics: ACK-Len ab and ACK-Len ba, which are the data volumes first sent continuously by each communicating party. Using this approach, the authors classified four different types of applications: www, ftp, e-mail and P2P; P2P traffic was identified by observing that both parties involved in the communication send considerable volumes of data to each other, thus reflecting P2P behaviour. Three datasets were used: the first was taken from Moore [62], the second from the working environment (called Set1) and the third was extracted from Set1 by using the characteristic mentioned in ref. [63]. 
The proposed approach can be used for online traffic classification as it only depends on data’s total length of first few packets on the flow which greatly save storage space and classified P2P traffic with accuracy, recall and precision rates ranging from 97.648 to 99.694 %, 30 to 80 % and 65 to 93 %, respectively. He et al. [64] proposed fine-grained host-based P2P traffic classification by simply counting special flows (i.e. clustering flows). This approach locates all P2P hosts within monitored network and identifies the types of P2P application running. It builds application profiles of each P2P application by using the flow information that describes its most significant network activity pattern and is learned from traffic traces generated by corresponding P2P application. The performance is evaluated on traffic datasets consisting of P2P applications namely BitComet, BitTorrent, eMule, Vagaa and Thunder. The ground truth verification is done by manually investigating each host running P2P application. The experimental results achieved average true positive and false positive rate of 97.22 and 2.78 % respectively. The proposed approach does not use complicated statistical features of traffic or machine learning algorithms and can readily include new P2P applications in classification scope. It is also able to classify encrypted traffic in real-time. Yang et al. [65] proposed a method to identify P2P live streaming based on union features by analyzing its behavioural characteristics. The datasets consisted of mixture of traffic from BitTorrent and Thunder which are file sharing applications and traffic from PPTV, PPStream, QQlive and UUSE which are on-demand and live streaming applications. The experimental results achieved 95 % accuracy in identifying P2P live streaming traffic. Qin et al. [66] developed a framework named CUFTI (Core Users Finding and Traffic Identification) for identifying and managing P2P traffic of core users (i.e. long-lived peers). 
They studied peer’s life-time in PPlive system and identified core users from the overlay. The model utilized payload length and direction of first few control packets of different P2P applications (PPlive, BitTorrent and Thunder) as statistical features that were extracted using the longest common subsequence (LCS) and performed flow identification. The experimental results achieved false positive and false negative rates of 3.49 and 8.47 %, respectively in identifying PPlive traffic. Further the model can be employed for real-time identification of traffic. Zhang et al. [67] proposed component based method to detect P2P traffic utilizing UDP for communication. In graph theory, component is defined as connected sub-graphs from a disjoint graph. The approach uses graph-level statistics to detect P2P traffic (utilizing UDP) and does not use packet level information. The dataset consisted of records taken from netflow version 5 and exported from university campus network border-link.
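
The sliding-window entropy measure described above for Gomes et al. [60] can be illustrated with a minimal sketch. It assumes Shannon entropy over the empirical packet-size distribution of each window; the window length and the function names are illustrative choices and are not taken from [60].

```python
import math
from collections import Counter, deque


def window_entropy(sizes):
    """Shannon entropy (in bits) of the packet sizes seen in one window."""
    counts = Counter(sizes)
    total = len(sizes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_entropy(packet_sizes, window=4):
    """Return one entropy value per full window of consecutive packets.

    `window` is a hypothetical default; the surveyed work only states that
    the window holds a fixed number of packets.
    """
    buf = deque(maxlen=window)
    values = []
    for size in packet_sizes:
        buf.append(size)
        if len(buf) == window:
            values.append(window_entropy(buf))
    return values
```

A window of identical packet sizes yields an entropy of zero, while a window in which every size differs yields the maximum, log2 of the window length. This matches the observation that heterogeneous P2P packet sizes produce higher values than regular client–server traffic.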

b) Heuristic-based methods: These methods classify the traffic by observing its behavioural patterns using a pre-defined set of heuristics such as hosts acting both as client and server, number of connections made by a host, number of distinct addresses or ports a host is connected to, hosts using both TCP and UDP for communication, etc. The heuristics are analysed sequentially and the packets or flows are classified as belonging to a particular class depending upon the results obtained. There are some studies that make use of heuristics to identify P2P traffic.

Perényi et al. [68] proposed a technique for identification of P2P traffic that is based on a set of six heuristics: usage of UDP and TCP simultaneously; well-known P2P port numbers; number of consecutive connections existing between two peers; several flows having the same flow identities; flow-duration greater than 10 min or flow-size greater than 1 MB; and an IP address using the same port number more than 5 times in the measurement period. A small set of labelled traffic traces was used for validation of this approach, which achieved recall rates of 99.14 % for P2P traffic and 97.19 % for non-P2P traffic. John and Tafvelin [69] redefined the combination of heuristics used in [68] and [54] and proposed the heuristics: usage of UDP and TCP simultaneously; well-known port numbers of P2P protocols; the port numbers that are used very often; relationship between number of ports and IP addresses; flow-duration greater than 10 min or flow-size greater than 1 MB. They collected the traffic traces from a university link and achieved a recall rate of 98 %. Hong [70] proposed a novel method to identify P2P traffic utilizing the UDP protocol and revealed and validated three characteristics that do not appear together in TCP or UDP traffic produced by non-P2P applications: i) almost all UDP traffic of the local host is transferred through a fixed port number; ii) nearly all remote peers use a single port number for communication with the local host; and iii) the size of UDP packets produced by P2P applications is relatively fixed. These characteristics were examined by collecting 100 blocks of P2P traffic (consisting of BitSpirit, Emule and other P2P applications), each ranging from 100 M bytes to 200 M bytes, and evaluation of this approach achieved an accuracy ranging from 98.4 to 99.6 %. Reddy and Hota [71] proposed a new set of heuristics to identify a P2P host based on its connection patterns; the heuristics do not require any payload signatures. 
The datasets used were realistic in nature and consisted of the applications Http, FTP, Dropbox, SMTP, eMule, Frostwire, Skype, uTorrent and Vuze. The authors verified their approach in real time and only 0.2 % of P2P traffic remained unclassified. As their approach consists of minimal heuristics, it can be used for real-time identification; but it can only identify P2P traffic as a broad class rather than distinguish individual P2P applications. Bashir et al. [72] proposed an approach based on heuristics to identify BitTorrent activities using Netflow records by observing 3 major segments of traffic: a) traffic from peers contacted via DHT, b) TCP traffic from peers contacted via trackers and c) UDP traffic from peers contacted via trackers. The approach was tested on 5 real life datasets having a mixture of applications consisting of BitTorrent, a p2p radio streaming application, Skype, SopCast and PPStream. The experimental results achieved byte accuracy ranging from 91.3 to 95.4 % in identifying BitTorrent activity.
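
Two of the heuristics listed above for [68] and [69] (simultaneous use of TCP and UDP between a pair of hosts, and flow-duration greater than 10 min or flow-size greater than 1 MB) can be sketched as follows. The `Flow` record layout, the interpretation of 1 MB as 1,000,000 bytes and the combination of the two rules with a simple OR are assumptions made for illustration; the surveyed papers apply a larger rule set in sequence.

```python
from collections import defaultdict, namedtuple

# Hypothetical flow record; real flow exporters carry more fields.
Flow = namedtuple("Flow", "src dst proto duration_s size_bytes")


def tcp_udp_pairs(flows):
    """Return the unordered host pairs observed using both TCP and UDP."""
    protos = defaultdict(set)
    for f in flows:
        protos[frozenset((f.src, f.dst))].add(f.proto)
    return {pair for pair, seen in protos.items() if {"tcp", "udp"} <= seen}


def is_long_or_large(flow, max_duration_s=600, max_size_bytes=1_000_000):
    """Flow lasting more than 10 min or carrying more than 1 MB."""
    return flow.duration_s > max_duration_s or flow.size_bytes > max_size_bytes


def suspected_p2p(flows):
    """Flows matching either heuristic (OR combination is an assumption)."""
    pairs = tcp_udp_pairs(flows)
    return [
        f for f in flows
        if frozenset((f.src, f.dst)) in pairs or is_long_or_large(f)
    ]
```

Because these rules inspect only flow-level fields, they can be evaluated without access to the payload.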

c) Machine learning methods: Machine learning techniques based on supervised or unsupervised methods have been adopted in various studies, such as clustering [73], Bayesian estimators or networks [74] and decision trees [75]. They work on a set of traffic characteristics, correlating them using probability functions, and hence classify the packets or flows as belonging to a particular class.

Mohammadi et al. [2] proposed a hybrid approach using genetic algorithm neural networks to classify P2P traffic. Genetic algorithm was used in calculating minimum classification error (MCE) matrix which is then used to map features of dataset into new space where they can easily be separated into different classes. The mapped dataset is fed into classifier named neural networks. Three different indexes namely mutual information, Dunn and SD were measured to compare proposed methodology against standard MCE-based and normal (i.e. no feature mapping) approaches. The experimental results showed that proposed mapping technique reduces overlap among classes and gives improved classification accuracy of 96 %. Schmidt and Soysal [76] proposed a technique involving Bayesian network to identify P2P traffic by using the parameters: well-known port numbers, IP packets-per-flow distribution, packet-size distribution, octets-per-flow distribution and flow-time distribution. They collected the traffic from academic network to evaluate the performance of classifier in their technique as well as in signature-based method and showcased the results of false positive ranging from 22 to 28 % and false negative ranging from 16 to 26 %. Cao et al. [77] proposed a technique using Classification And Regression Tree (CART) for real-time identification of application protocols at both flow-level and host-level. They collected the traffic traces of HTTP, SMTP & FTP from enterprise network by port number filtering method and traces of BitTorrent were collected actively at home environment in controlled manner to assess the ground truth. By evaluating this technique, the classification results obtained showed false positive rate ranging from 0.05 to 12.7 % and false negative rates ranging from 0 to 17.9 %. Raahemi et al. [47] proposed a technique using set of network level packet attributes to identify P2P traffic by using Concept-adapting Very Fast Decision Tree (CVFDT). 
In order to evaluate the performance of their technique, they used labelled datasets and achieved the accuracy ranging from 79.50 to 98.65 % and specificity ranging from 82.96 to 95.89 %. Angevine and Zincir-Heywood [78] classified TCP and UDP flows of Skype using C4.5 decision tree and AdaBoost algorithms. They collected the labelled traffic traces from university network and achieved recall rate ranging from 94 to 99 % with their technique. Wang et al. [79] identified traffic of multiple P2P protocols using classifier based on decision tree called Random Forest. They captured the traffic traces from academic and residential networks and evaluated their technique using manually labelled dataset to achieve accuracy ranging from 89.38 to 99.98 % and precision ranging from 32.69 to 100 %. Dainotti et al. [80] proposed a classification technique based on hidden Markov models and using parameters: packet size & inter-packet time. They carried out classification on real-traffic traces of HTTP, SMTP, eDonkey, P2P-TV, MSN messenger, PPlive & two multi-player games; whose traces were verified manually as well as using DPI technique, to achieve recall rates ranging from 90.23 to 100 %. Valenti et al. [81] adopted a mechanism based on Support Vector Machine (SVM) and number of packets exchanged between peers during short interval of time; to identify P2P-TV applications. They tested their approach on traffic captured in larger test-bed to achieve recall rates ranging from 91.3 to 99.6 %. Liu et al. [82] proposed a mechanism by utilizing supervised ML algorithm and ratio of amount of downloaded and uploaded traffic in each minute as an identification pattern. They classified P2P applications of Maze, PPlive, BitTorrent, eDonkey and thunder and achieved accuracy ranging from 78.5 to 99.8 %. Raahemi et al. [83] identified P2P traffic using the neural network: Fuzzy Predictive Adaptive Resonance Theory; which was built by utilizing IP headers data. 
This approach utilized labelled datasets to achieve the classification accuracy ranging from 78 to 92 %. Hu et al. in [84, 85] proposed a novel approach to identify the various applications by building behavioural profiles using association rule mining. They extracted flow statistics by selecting five flow tuples and correlated them using Apriori algorithm. The authors collected the traffic traces from on-campus network, which were verified manually as well as using DPI technique and tested this mechanism on BitTorrent and PPlive to achieve the recall rates ranging from 90 to 98 %.

Liu and Sun [86] proposed a new approach called P2PTIAL that doesn’t require fully labeled samples-set for P2P traffic identification by active learning which consists of two parts: Support Vector Machine (SVM) and uncertainty selection policy. SVM acts as learner which repeats learning process on both labelled & unlabelled sample; whereas uncertainty selection (which is based on distance) selects unlabelled sample to be labelled by oracle (e.g., a human annotator). Further, to improve its effectiveness, authors employed support vector data description (SVDD) technique to filter unlabelled samples having little contribution in active learning to reduce storage space & save computation cost; and used unlabeled sample’s pre-labeled information to avoid imbalanced learning. They utilized Moore-dataset [38, 87], which includes traffic from applications: P2P, www, bulk, database, interactive, mail, services, attack, games & multimedia and evaluated their technique on both un-balanced & balanced learning to achieve the accuracy rate ranging from 79.65 to 86.86 % and 93.00 to 93.07 %, respectively. Jiang and Tao [88] proposed P2P traffic identification model based on SVM that can work on encrypted traffic and selected 3 characteristics: i) change of mean square value of packet size, ii) average flow duration, and iii) ratio of IP address and port numbers. The performance achieved in terms of precision, false-positive and false-negative rates range from 96.55 to 97.89 %; 2 to 2.8 % and 2.45 to 5.29 %, respectively. Gong et al. [89] proposed improved SVM incremental learning algorithm for P2P traffic identification which is able to save storage space and increase identification accuracy (87.89 %), when its performance is compared with standard SVM incremental learning algorithm (having 80.35 % accuracy) and SVM-based re-training algorithm (having 78.90 % accuracy) for increased number of test samples. Deng et al. 
[90] proposed the ensemble learning model which integrates Random Forests and feature weighted Naive Bayes for P2P traffic identification. Network traces considered for evaluation consisted of both P2P traffic (BaiDuYingYin, BaoFengYingYin, PPS, PPlive, QQlive, XunLeiKanKan and Thunder) and non-P2P traffic (Web, Youku and Souhu) and achieved accuracy of 92.47 %; which overall performs better when compared to simple machine learning methods. Jie et al. [91] proposed a novel and fine-grained P2P traffic classification approach that relied on count of most frequent and steady flows generated by corresponding P2P applications called Clustering Flows. This approach exploited only basic properties of flows (protocol, packets size and number) to perform the classification using SVM algorithm and doesn’t require any other complicated traffic statistical or behavioural features. The experiment performed on traffic traces of P2P applications include BitTorrent, eMule, PPTV & Cbox and achieved true positive rate ranging from 95.4 to 98.63 % and false positive rate of 0.01 %. Bozdogan et al. [92] evaluated the performance of machine learning algorithms for classification of P2P applications, which include BitCommet, uTorrent and BitTorrent. Four supervised algorithms (C4.5, Ripper, SVM and Naive Bayes) and one un-supervised algorithm (K-means) were evaluated using the metrics: detection rate, false positive rate, f-measure and correctly classification rate. The experimental results showed that Ripper algorithm performs better in identifying P2P network traffic.

d) Methods involving combined approaches: There also exist some studies which combine different classification approaches to identify network traffic, which are discussed below.

Karagiannis et al. [54] adopted a cross-validation mechanism to identify traffic from FastTrack, eDonkey, Gnutella, BitTorrent, Direct-Connect, MP2P and Ares by using port numbers, payload signatures and behavioural patterns. In addition to using payload signatures for particular applications, the non-payload based method used two heuristics to identify flows belonging to P2P applications: (i) identification of source and destination IP pairs that use both TCP and UDP; and (ii) identification of destination IPs for which the number of distinct connected IP addresses is equal to the number of distinct ports used for making connections. The behavioural approach achieved recall rates ranging from 90 to 99 %. They also compared the results of the payload-based approach with the behavioural approach and found false positive rates ranging from 8 to 12 % of overall P2P traffic. Dedinski et al. [93] adopted an approach for identification of P2P traffic that made use of active crawlers for collecting information about peers of a certain application to infer the topology of the overlay network. In addition, for analysing behavioural patterns, the authors applied wavelet analysis to the traffic to analyse network-level properties: per-packet or inter-packet arrival times. The performance of this architecture was evaluated on traffic belonging to eDonkey and FTP. Adami et al. [94] proposed a real-time mechanism using a payload-based method and a statistical method to identify different Skype clients in the network, whose communication consists of: file transfer, direct calls, calls to phone service and calls using relay nodes. They collected the traffic traces from a university network and the ADSL link of a small network. The performance of this mechanism (evaluated both online and offline) was tested for both TCP and UDP against five other classifiers and achieved false positive rates ranging from 0 to 0.01 % and false negative rates ranging from 0.06 to 0.64 %, in terms of bytes and flows.
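
The second non-payload heuristic described above for [54] compares, for each destination, the number of distinct peer addresses with the number of distinct source ports. A minimal sketch follows; the flow tuple layout, the grouping by destination IP and port, and the `min_peers` threshold are illustrative assumptions rather than details taken from [54].

```python
from collections import defaultdict


def ip_port_candidates(flows, min_peers=2):
    """Return destination (ip, port) pairs whose distinct-peer count equals
    their distinct-source-port count.

    flows: iterable of (src_ip, src_port, dst_ip, dst_port) tuples.
    min_peers: hypothetical threshold to skip destinations with one peer.
    """
    peers = defaultdict(set)
    ports = defaultdict(set)
    for src_ip, src_port, dst_ip, dst_port in flows:
        key = (dst_ip, dst_port)
        peers[key].add(src_ip)
        ports[key].add(src_port)
    return {
        key for key in peers
        if len(peers[key]) >= min_peers and len(peers[key]) == len(ports[key])
    }
```

The intuition is that a P2P peer is typically contacted once by each remote host, so peers and ports grow together, whereas a Web client opens several parallel connections to one server and produces more ports than addresses.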

Yan et al. [95] proposed a novel technique for P2P identification based on host heuristics and flow statistics. In order to find out whether a host is participating in a P2P application, the authors first matched its behaviour against pre-defined heuristic rules (IP-popularity ratio, port-pair difference, ephemeral-port ratio, failed-connection ratio) and then refined the identification by comparing statistical features of each flow with the flow features: flow-bytes and flow-duration, and byte-ratio of forward and backward direction. The traffic traces were collected at the edge router of the campus network and consist of Web (http and https), Mail (pop3, pop3s, imap, imaps) and P2P (bittorrent, edonkey, skype) traffic; the accuracy rates achieved by this technique in terms of flows and bytes were 93.9 and 96.3 %, respectively. Ye and Cho [96] proposed a two-step hybrid P2P traffic classification approach combining packet-level and flow-level classifiers. The first step (packet-level classification) is a combination of signature-based and heuristic-based techniques, where packets not classified by the former approach are checked with the latter one. The second step (flow-level classification) is based on a combination of statistical and pattern-heuristics approaches, and is applied to the traffic that remains unclassified after the first step. The authors used the REPTree algorithm for the statistical approach after comparing the performance of six ML algorithms, and then applied pattern heuristics (a set of rules) to rectify faulty results caused by the former approach. Four datasets were used for evaluation of this technique; the first two were taken from University of Brescia and Ericsson Research in Hungary, and the other two were captured in a controlled environment inside Dankook University and labelled with actual application types. 
The proposed scheme showed low overhead and high scalability and was able to achieve accuracy rates of 98.19 and 99.82 % in terms of flows and bytes. The authors in [97] used a similar hybrid approach to distinguish P2P botnet traffic from P2P traffic. The botnet traffic of Storm, Waledac, Conficker, C&C and Zeus was mixed to create three datasets. The proposed approach provides low overhead and achieved flow and byte accuracy of 97.10 and 97.06 % respectively using real datasets. Wang et al. [98] proposed a novel Application Behavior Characterization technique for P2P identification. It extracts behavioural features (number of external IP addresses, number of flows, number of packets and number of bytes) from the set of flows belonging to certain applications and classifies P2P traffic using a machine learning algorithm, the C4.5 decision tree. The datasets used involved TCP and UDP flows belonging to Skype, Thunder, PPTV and non-P2P applications. The experimental results achieved for PPTV, Skype and Thunder include precision values of 93.66, 91.01 and 90.96 % and recall values of 92.82, 86.69 and 95.73 %, respectively. Yang et al. [99] proposed a cocktail approach consisting of three sub-methods for identifying BitTorrent traffic. The first sub-method uses a signature-based approach to identify un-encrypted BitTorrent traffic. The second sub-method uses a message-based approach to identify encrypted BitTorrent traffic. Here, after reassembling the bidirectional flows into message streams, if the direction and length of the first three messages satisfy certain criteria of message stream encryption (a protocol used to obfuscate traffic), the flow is classified as encrypted BitTorrent traffic. The third sub-method uses a signalling-based approach to perform pre-identification of BitTorrent traffic. Here, BitTorrent flows are predicted using only the first packet with the SYN flag. 
The authors evaluated their approach by using modified Vuze clients which not only generated real BitTorrent traffic but also labelled the traffic in benchmark traces by themselves. The experimental results achieved false positive, precision and recall rates ranging from 1.31 to 2.47 %, 98.26 to 99.03 % and 85 to 98 %, respectively. This approach has the ability for real-time identification with low overhead.

6.4 Classification of encrypted traffic

Nowadays, identification accuracy is dropping due to the widespread use of encrypted communication to protect personal information and/or to conceal exchanged information. For example, encryption is used in P2P file sharing, in VoIP and by ISPs offering virtual private networks for communication. These factors indicate that the use of encryption is going to increase, which makes it harder for network administrators to identify applications, since the traffic and its characteristics change when it is encrypted. Consequently, most identification methods either classify encrypted traffic as unknown traffic or wrongly infer that it belongs to a single application, even though different encrypted applications are mixed in the traffic. Hence, most of the existing methods can be expected to become less effective. Some studies make use of the P2P traffic classification techniques discussed in the previous section to address this issue; they are discussed below.

Korczynski and Duda [100] proposed stochastic fingerprints based on first-order homogeneous Markov chains to identify encrypted traffic flows of various applications. They studied twelve representative applications (including Skype), whose parameters were identified by observing training application traces. Their technique achieved good accuracy as the fingerprint parameters of applications differ considerably. The issue with this technique is that application fingerprints change over time, so they need to be updated periodically. For the P2P application (Skype), the experimental results achieved a true positive rate of 98.6 % and a false positive rate of 0.1 %. Alshammari and Zincir [101] proposed a novel machine learning based technique that generates robust signatures to identify encrypted VoIP traffic. They used statistical calculation on network flows to extract the feature set without using information regarding payload, port numbers or the IP addresses of source and destination. Three different sampling techniques (uniform random sampling, stratified sampling, continuous data stream) were studied on three machine learning algorithms (C5.0, AdaBoost, Genetic programming) that were trained on various training datasets; uniform random sampling was found to be most appropriate for enhancing automatic generation of robust signatures. Experimental results showed that C5.0 performs much better than the GP and AdaBoost algorithms in classifying multiple VoIP applications and classified Skype traffic with detection rate ranging from 80.3 to 99.6 % and false positive rate ranging from 0.7 to 3.8 %. However, the accuracy of this technique for other network applications remains to be explored. Kumano et al. [102] focused on identifying encrypted traffic in real-time by reducing the number of packets needed to obtain traffic features while maintaining high accuracy. 
They used two types of encryption (IPSec and PPTP) and employed two machine learning algorithms (C4.5 and SVM) for classifying the type of encryption and identifying the application. Their work shows how accuracy degrades as the number of packets is reduced and also proposes a procedure to identify a sufficient number of packets for each traffic feature. They compared overall accuracy by varying the number of features and packets; it ranged from 79.3 to 92.5 %. The number of packets can be further reduced for some features by eliminating initialization packets, but detailed exploration and estimation remain to be done. Wang et al. [103] proposed a novel approach based on the Hidden Markov Model for identifying network activities in encrypted traffic. In their technique, time series and statistical characteristics of packets are considered for analysis. Four time series sequences during the interaction of four activities (session request, data transfer, response to session request, and response to data transfer) are analysed for distinction; for this reason packet inter-arrival time is considered as a feature element. Similarly for statistical characteristics, because the packet sequences of the four activities differ, packet length and packet inter-arrival time are selected as feature elements. To verify the effectiveness of the approach, TeamViewer (which allows encrypted communication between hosts) is used. The datasets utilized include audio, video, transfer and chat traffic types. Experimental results achieved a true positive rate ranging from 96.4 to 99.1 % and a maximum false positive rate of 3.6 %. However, unsupervised learning methods of modelling and further analysis of complex activities need to be considered further. Du and Zhang [104] identified P2P traffic by utilizing a k-means algorithm that monitors flow information of TCP connections and calculates distance. Their approach focused on three TCP file-sharing P2P applications, namely BitTorrent, BitSpirit and eMule. 
Experimental results achieved average true positive rates of 92.64, 96.22 and 99.76 % for BitTorrent, BitSpirit and eMule, respectively. The algorithm proposed by the authors is simple and feasible, has low time overhead and can be used for real-time detection of traffic. Datta et al. [105] proposed a novel technique using application behaviour based feature extraction to detect Google-hangout traffic, taking it as a case study. Three machine learning algorithms, namely Naive Bayes, J48 decision tree and AdaBoost, were used to classify traffic. The datasets consisted of traffic traces of google-hangout, gmail and google-plus, since these google services share common behaviour. The classification results had recall values of 100 % with J48 and AdaBoost separately and 99.98 % with Naive Bayes.
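
The idea behind fingerprints based on first-order homogeneous Markov chains, as described above for [100], can be illustrated with a minimal sketch: transition probabilities are estimated per application from training sequences, and a new sequence is assigned to the application whose chain gives it the highest likelihood. The state labels, the probability floor for unseen transitions and the function names are hypothetical; the actual states and parameters used in [100] are not reproduced here.

```python
import math
from collections import defaultdict


def fit_chain(sequences):
    """Estimate first-order transition probabilities from state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    chain = {}
    for prev, row in counts.items():
        total = sum(row.values())
        chain[prev] = {cur: n / total for cur, n in row.items()}
    return chain


def log_likelihood(chain, seq, floor=1e-9):
    """Log-probability of the transitions in `seq` under `chain`.

    Unseen transitions receive the small probability `floor` (an assumption)
    instead of zero, so the logarithm stays defined.
    """
    return sum(
        math.log(chain.get(prev, {}).get(cur, floor))
        for prev, cur in zip(seq, seq[1:])
    )


def classify(fingerprints, seq):
    """Return the application whose fingerprint best explains `seq`."""
    return max(fingerprints, key=lambda app: log_likelihood(fingerprints[app], seq))
```

Since the fingerprints are learned from traces, retraining `fit_chain` on fresh traces corresponds to the periodic update that the technique requires.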

Table 3 provides the summary of different P2P classification approaches along with the methodologies adopted by various studies. For each study, the performance evaluation is also mentioned; which makes use of the metrics: accuracy, precision, recall, completeness, sensitivity, specificity, false-positive, false-negative or true-positive (TP). Additionally, P2P traffic involved in a study is also mentioned to give an idea of the kind of traffic on which the corresponding performance is achieved. The comparison between various methods in the Table 3 cannot be done, as evaluations were made by authors using distinct metrics and under different conditions. Hence, it only provides an overview of various methods used for classifying P2P traffic which are presented in this literature.
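
For reference, the metrics named above can be computed from the four confusion-matrix counts as in the following sketch. It uses the standard textbook definitions with P2P as the positive class; individual studies may define some metrics (for example completeness) differently, and the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (all assumed non-zero sums).

    tp: P2P flows labelled P2P        fp: non-P2P flows labelled P2P
    tn: non-P2P flows labelled non-P2P fn: P2P flows labelled non-P2P
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called sensitivity or TP rate
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }
```

Accuracy alone can be misleading when P2P flows form a small share of a trace, which is one reason why studies reporting different metrics are hard to compare directly.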

Table 3 Summary of traffic classification studies involving different approaches, including P2P traffic involved and their performance in terms of accuracy (A), precision (P), recall (R), completeness (C), sensitivity (SN), specificity (SP), false-positive (FP), false-negative (FN) or true-positive (TP)

Table 4 presents the summary of various studies conducted to classify P2P traffic along with their references and publication year. The technique/approach adopted by various studies for classifying P2P traffic is categorised based on: port, payload, statistical (or behavioural signatures), machine learning and heuristic. If a study uses machine learning approach for classifying P2P traffic, then corresponding algorithms used in it have also been specified. In addition, two columns are added to describe the ability of a method to be applied to encrypted traffic and for real-time classification. Although the studies based on port numbers did not address the issues of encryption and real-time classification, they still have the ability to identify the traffic. It is because TCP and UDP port numbers are not usually encrypted and traffic can be quickly categorized online by matching their port numbers with the stored database of applications.
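
The port matching mentioned above amounts to a table lookup, which is why it is cheap enough for online use. The sketch below assumes a small, illustrative table of commonly cited default ports; it is not a complete or authoritative database, and applications that use random ports will fall through to the unknown class.

```python
# Illustrative default ports only; a deployed table would be far larger.
PORT_DB = {6346: "Gnutella", 4662: "eDonkey"}
PORT_DB.update({port: "BitTorrent" for port in range(6881, 6890)})


def classify_by_port(src_port, dst_port):
    """Return the application registered for either port, else 'unknown'."""
    return PORT_DB.get(dst_port) or PORT_DB.get(src_port) or "unknown"
```

The lookup reads only the transport header, so it still works when the payload is encrypted, but a flow on a non-default port is not recognised.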

Table 4 A summary of cited papers that mentions references (Ref), publication year (Year), authors (Studies) and classification technique used/applicable which includes: port (Port), payload (Payl), statistical/behavioural signature (Stat/Beha), machine learning (Mach), heuristic (Heu), machine learning algorithm (Algorithm), Real-time classification (Real) and encryption (Encryption)

By considering the studies discussed in previous sections and the advantages as well as limitations of the various identification techniques (i.e. port-based, payload-based and classification in the dark), Fig. 1 compares them in terms of implementation, resource requirements and performance in classifying traffic. The comparison factors are: ease of implementation, low computation requirement, classification accuracy, classification of encrypted traffic, classification in real-time and classification of unknown traffic. Each technique is given a value from 1 to 3 on each factor, where 3 represents the comparatively highest performing technique and 1 the comparatively lowest performing technique. The port-based technique has the highest value for ease of implementation and low computation requirement. In principle it can be applied to encrypted traffic and in real-time, but it has the lowest value on all remaining factors (i.e. classification of encrypted traffic, classification in real-time, classification accuracy and classification of unknown traffic), since current generation P2P applications masquerade or utilize random port numbers, due to which it will not give accurate results. Payload-based classification has the highest performance when classification accuracy is of prime importance. For this reason it is widely used for ground truth verification of traffic, which is discussed in section 4; but comparatively it doesn’t perform well on the remaining factors. Classification in the dark has the highest performance when considering encrypted traffic classification, real-time classification and unknown traffic classification.

Fig. 1 Comparison of P2P identification techniques based on their performance on various factors
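The rating scheme described for Fig. 1 can be encoded as a small lookup structure. The following is a minimal sketch, not a reproduction of the figure: it records only the values that the text states explicitly (3 for the highest-performing technique, 1 for the lowest) and leaves every unstated value absent rather than guessing it. The names `FACTORS`, `RATINGS` and `best_technique` are hypothetical and introduced purely for illustration.

```python
from typing import Dict, Optional

# The six comparison factors named in the text (identifiers are illustrative).
FACTORS = [
    "ease_of_implementation",
    "low_computation",
    "accuracy",
    "encrypted_traffic",
    "real_time",
    "unknown_traffic",
]

# Only ratings stated in the text are recorded (scale 1..3).
# A missing key means the text does not give the value; it is NOT a rating.
RATINGS: Dict[str, Dict[str, int]] = {
    "port_based": {
        "ease_of_implementation": 3,
        "low_computation": 3,
        "accuracy": 1,
        "encrypted_traffic": 1,
        "real_time": 1,
        "unknown_traffic": 1,
    },
    "payload_based": {
        "accuracy": 3,
    },
    "classification_in_dark": {
        "encrypted_traffic": 3,
        "real_time": 3,
        "unknown_traffic": 3,
    },
}


def best_technique(factor: str) -> Optional[str]:
    """Return the technique rated 3 on `factor`, or None if none is stated."""
    if factor not in FACTORS:
        raise ValueError(f"unknown factor: {factor}")
    for technique, scores in RATINGS.items():
        if scores.get(factor) == 3:
            return technique
    return None


if __name__ == "__main__":
    # Print the top-rated technique for each factor, as stated in the text.
    for factor in FACTORS:
        print(f"{factor}: {best_technique(factor)}")
```

Keeping unstated ratings out of the table makes the gap visible to the caller, so a missing entry is never mistaken for a low score.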

7 Conclusion

P2P traffic makes up a major portion of Internet traffic and consumes a large share of network bandwidth. As P2P applications and services evolve and more hosts adopt them, network administrators and ISPs face various challenges in managing network issues such as billing, security, fault diagnosis and quality of service, among others. Hence, network administrators and ISPs need to identify the kind of traffic flowing through their networks accurately and efficiently. Traditionally, the port-based mechanism was used for traffic identification, but it lost its utility as applications started masquerading or using random port numbers. Because of these limitations, the payload-based mechanism was adopted; it has very high accuracy but suffers from issues of its own, such as traffic encryption and privacy. Therefore, newer approaches based on Classification in Dark have been adopted to identify network traffic, overcoming various limitations of the previous approaches.

This paper presents a survey of P2P traffic identification approaches and analyses some of the methodologies and achievements of each approach. Nowadays, because most applications encrypt their communication, the existing approaches lose effectiveness: encryption changes the traffic and its characteristics, which reduces accuracy and makes it harder for network administrators and ISPs to classify network traffic correctly. Real-time traffic classification is also of great importance. Future work therefore needs to focus on identifying encrypted P2P traffic efficiently in real time, in a way that also works in high-speed networks. Research should focus on developing techniques that can identify traffic from individual P2P applications (i.e. fine-grained classification) instead of merely identifying P2P traffic (i.e. coarse-grained classification), so that ISPs and network administrators can manage traffic better. Also, a new generic technique should be developed that can identify not only existing P2P applications but also any new P2P application that emerges in the future. This requires detailed knowledge of the existing techniques and their shortcomings.