1 Introduction

The Internet of Things (IoT) is becoming a pervasive and ubiquitous network paradigm offering distributed and transparent services [1]. Through IoT, large numbers of smart devices, such as sensors and mobile phones, are connected and can communicate with each other to exchange information. According to IDC statistics, there are over 50 billion IoT devices in the world, and they will produce over 60 ZB of data by 2020 [2,3,4]. By collecting and analyzing the data from these devices to sense and understand the environment, complex systems can be constructed to enhance the quality of life, e.g., for machine condition diagnosis, human activity recognition, health monitoring, localization, and structural monitoring.

With the growing popularity and widespread use of IoT, massive numbers of sensors and devices are generating huge volumes of data, and various IoT applications are being developed to provide more accurate and fine-grained services to users. This IoT big data can be further processed and analyzed to provide intelligence for IoT service providers and users. Emerging IoT applications involve many data-driven analytic procedures to efficiently utilize big IoT sensing data [5]. Recently, AI algorithms have been introduced into these IoT data analytic procedures [6,7,8].

Over the past decade, artificial intelligence (AI) has achieved great success, driven by advances in cloud computing, graphics processing unit (GPU) computing, and other hardware enhancements [9]. Machine learning is the most representative AI technique and has already been applied in many fields, such as computer vision, computer graphics, natural language processing (NLP), speech recognition, decision-making, and intelligent control. Machine learning can likewise benefit computer networking: several studies have investigated how to use it to solve networking problems, including routing, traffic engineering, resource allocation, and security [10,11,12,13,14], and it has been regarded as a key technology for autonomous, intelligent network management and operation. This is especially relevant for IoT: most IoT systems are becoming increasingly dynamic, heterogeneous, and complex, which makes them difficult to manage, and their services need to improve in effectiveness and diversity to attract more users. Many studies have made progress on applying machine learning to IoT, showing that IoT can indeed benefit from such support. Applying machine learning to IoT enables users to obtain deep analytics and develop efficient intelligent IoT applications, because machine learning provides feasible solutions for mining the information and features hidden in IoT data.

In this paper, we survey the application of machine learning to IoT, illustrating possible cooperation between the two through use-case scenarios. We also study the aspects where the integration of machine learning and IoT is still missing, in order to designate challenges and future directions.

In summary, the original contributions of this paper are as follows:

  • We illustrate the potential of machine learning for traffic profiling. Both unsupervised and supervised solutions are presented in detail.

  • We summarize machine-learning-based IoT device identification, covering both mobile phone identification and general IoT device identification.

  • We review IoT system security based on machine learning approaches, in terms of device security and network security.

  • We summarize typical IoT applications leveraging machine learning, including personal health applications and industrial applications.

  • We investigate edge computing and SDN in IoT using machine learning, including edge computing infrastructure design and IoT network management.

  • We also discuss the challenges and open issues in the reviewed areas, including traffic profiling, IoT device identification, security, edge computing, and SDN via machine learning.

The remainder of this paper is organized as follows. Section 2 introduces progress in applying machine learning to traffic profiling. Section 3 discusses how to use machine learning to identify IoT devices. Section 4 presents machine-learning-based security solutions for IoT systems. Section 5 presents edge computing infrastructures based on machine learning. Section 6 describes how to use SDN with machine learning to manage IoT networks. Section 7 summarizes typical IoT applications with machine learning. Finally, Sect. 8 concludes this paper.

2 Traffic profiling

Traffic profiling refers to the fundamental task of characterizing and understanding the traffic patterns in communication networks, including IP, wireless, and mobile networks. It provides insightful information about the underlying traffic and thus helps manage and engineer the network for better performance. For instance, detecting abnormal traffic enhances the security of the underlying networks, which has gained considerable research effort in recent years.

Fig. 1  Traffic profiling model

We define the traffic profiling problem as follows: the input of a traffic profiling task is the captured real network communication data; the output is a collection of patterns underlying the traffic. Figure 1 shows the traffic profiling problem. Traditionally, researchers focused on investigating statistical properties of network traffic, e.g., heavy hitters, heavy-tailed distributions, and self-similarity [15,16,17,18]. While this approach obtains useful information for engineering networks, it is limited to particular networks. In recent years, researchers have been leveraging the power of machine learning to profile network traffic, which yields more general results.

Here we review the progress of this area in the last decade, with a focus on security applications. We categorize the works into unsupervised and supervised solutions. We note that the categorization is based on whether labeled background information is employed in the proposed solutions, which differs from the traditional, theoretical notion of unsupervised/supervised learning; here we deal with domain-specific problems. For each work, we first summarize the core machine learning technique, then present the detailed approach, and finally discuss its merits and limits. Table 1 lists a short summary of the reviewed works.

Table 1 Recent machine learning based traffic profiling works

2.1 Unsupervised solutions

Xu et al. [19] used clustering to profile IP network traffic. The scheme first captures traffic data and aggregates it into flows, each described by the same five dimensions: (source IP, destination IP, source port, destination port, protocol type). Next, the scheme clusters the data along each dimension, i.e., the source IP dimension, the destination IP dimension, etc., and outputs the significant clusters (according to the distributions). We note that the significant clusters denote the patterns of the network traffic: the IP address information reveals the node-level patterns of the communication, while the port information shows the service patterns; both contain important patterns of the traffic. The scheme employs a newly proposed entropy-based metric to determine how many clusters are output. Then, for each cluster, the scheme analyzes the structure of the traffic, i.e., similarities and dissimilarities, and studies how the observed structure evolves over time. Based on the found structures, the scheme uses dominant state analysis to model the interaction of the five dimensions within each cluster.

The authors finally validated the proposed approach on core network traffic. The experimental results confirmed that the scheme successfully found common, stable, and anomalous behaviors. One noteworthy feature of the scheme is that the number of clusters is determined adaptively. Along this line, the clustering technique is also used in [29] to profile higher-level applications, specifically Email. An interesting open problem for future work is how different clustering algorithms influence the analysis.
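
As a rough illustration of this adaptive, entropy-guided cluster extraction, the sketch below peels off dominant values along one flow dimension until the residual distribution looks close to uniform. The flow records, the relative-uncertainty threshold, and the stopping rule are illustrative assumptions, not the exact algorithm of [19].

```python
# Illustrative sketch of entropy-guided significant-value extraction
# along one flow dimension (not the exact algorithm of Xu et al. [19]).
from collections import Counter
import math

def relative_uncertainty(counts):
    """Entropy of the distribution normalized by its maximum value."""
    total = sum(counts)
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))

def significant_values(flows, dim, ru_threshold=0.9):
    """Peel off the most frequent values until the remainder looks random."""
    counts = Counter(f[dim] for f in flows)
    significant = []
    while counts and relative_uncertainty(list(counts.values())) < ru_threshold:
        value, _ = counts.most_common(1)[0]
        significant.append(value)
        del counts[value]
    return significant

flows = [{"srcIP": "10.0.0.1"}] * 10 + [{"srcIP": "10.0.0.2"}, {"srcIP": "10.0.0.3"}]
print(significant_values(flows, "srcIP"))  # -> ['10.0.0.1']
```

In the full scheme, the same procedure would be repeated for each of the five flow dimensions.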

Brauckhoff et al. [23] used frequent item set mining to detect anomalies in network traffic. The proposed scheme processes the captured traffic into seven-entry tuples (srcIP, dstIP, srcPort, dstPort, protocol, #packets, #bytes). It first employs a traditional histogram-based detector to filter out suspicious flows. For the filtered traffic, the scheme builds a transaction with the seven items for each suspicious flow, and then uses frequent item set mining to find the anomalies. For instance, if an IP address is flagged as a frequent item set, it may be an anomaly. The output of the scheme is the set of all frequent item sets found.

The scheme was validated on a medium-sized ISP. First, ground truth was established using manual analysis based on top-k queries; then the scheme was used to identify anomalies. Experimental results show that the scheme incurs a very small number of false positives. Its biggest advantage is that it reduces the time needed to analyze anomalies once detected. One challenging aspect of this approach is parameter selection: in its current form, the threshold for frequent item set mining is set by trial and error.
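
A minimal sketch of such a mining step is shown below, using the Apriori implementation from mlxtend; the toy suspicious flows and the min_support value are our own assumptions (the paper sets its threshold by trial and error).

```python
# Hedged sketch of frequent-item-set mining over suspicious flows,
# in the spirit of Brauckhoff et al. [23]; data and threshold are illustrative.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Each suspicious flow becomes a transaction of tagged items.
transactions = [
    ["srcIP=10.0.0.9", "dstPort=445", "proto=TCP"],
    ["srcIP=10.0.0.9", "dstPort=445", "proto=TCP"],
    ["srcIP=10.0.0.7", "dstPort=80",  "proto=TCP"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
# Item sets appearing in at least 60% of suspicious flows are reported.
print(apriori(onehot, min_support=0.6, use_colnames=True))
```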

Glatz et al. [25] used frequent item set mining to profile and visualize network traffic. The proposed scheme first captures the traffic, obtaining a five-tuple for each flow; flow statistics, e.g., payload size, may also be included. The scheme then employs frequent item set mining to find the top traffic flows, and finally plots the traffic using a hypergraph.

The scheme was validated on campus networks. By visualizing the traffic, it is easy to find the dominant traffic patterns, including popular network visits, network attacks, network misconfigurations, etc.

Bakhshi et al. [26] employed K-means clustering to profile network users into different behavior groups, which are later used to engineer software-defined networks (SDN). The proposed scheme first categorizes the captured traffic into different traffic types, producing a 9-entry tuple characterizing application-layer service visits per user. The scheme then uses K-means clustering to group different user behaviors, and the obtained behavior groups are finally used to support SDN designs.

The scheme was validated on a residential network, with NetFlow employed to capture the traffic. The clustering was run with 2 to 7 clusters in order to understand user behaviors. The idea of using this scheme for SDN design is interesting.
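
The clustering step might be sketched as follows; the per-user 9-entry visit vectors are random placeholders, and selecting k by silhouette score is our own assumption, since the paper simply examines k = 2 to 7.

```python
# Hedged sketch of the K-means user-profiling step of Bakhshi et al. [26];
# the data and the silhouette-based choice of k are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
user_vectors = rng.random((100, 9))  # placeholder per-user service-visit counts

best_k, best_score = None, -1.0
for k in range(2, 8):  # sweep k = 2..7 as in the paper
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(user_vectors)
    score = silhouette_score(user_vectors, labels)
    if score > best_score:
        best_k, best_score = k, score
print(f"chosen k = {best_k} (silhouette = {best_score:.2f})")
```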

2.2 Supervised solutions

Hu et al. [20] employed frequent item set mining to identify P2P traffic within mixed network traffic. The main idea of the proposed scheme is to extract dominant, unique features of P2P traffic using frequent item set mining. To obtain the features, the scheme first carefully records P2P traffic as training data. The data is processed into the standard five-tuple, i.e., (source IP, destination IP, source port, destination port, protocol type), plus some manually defined statistical properties of the communication flow. The scheme then uses frequent item set mining to find the patterns that occur above a threshold; these patterns are later used to identify P2P traffic in online network traffic. It is worth noting that considerable engineering and heuristic effort is required to obtain good results.

In the performance evaluation, 10–15 patterns were obtained for BitTorrent traffic captured in a campus network. These extracted patterns achieve more than 90% accuracy when tested on real traffic. An interesting future work along this line is to adapt this approach to other application types.

Iliofotou et al. [21] used a clustering technique, specifically community mining in graphs, to profile traffic into different applications. The proposed scheme first uses the captured traffic to construct an IP-level connection graph; only IP addresses are used, without relying on ports or payloads. The scheme then finds clusters/communities in the graph. For each cluster, it identifies the underlying application using traditional signature analysis, and the identified application label is assigned to the whole cluster, based on the intuition that a cluster shares the same application type.

Performance evaluation on traces from four backbone networks confirmed the effectiveness, with accuracy around 90%. Besides, the scheme runs fast and works on encrypted traffic. An interesting future work could be improving the accuracy by employing other useful information, e.g., ports and payloads.
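
The graph construction and community-finding steps might look like the sketch below; the toy flow list and the greedy-modularity algorithm are illustrative assumptions, as the survey does not prescribe which community-mining algorithm [21] uses.

```python
# Hedged sketch of graph-based traffic profiling in the spirit of
# Iliofotou et al. [21]: build an IP-level connection graph, find communities.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# (srcIP, dstIP) pairs only -- no ports or payloads are needed.
flows = [("10.0.0.1", "10.0.0.2"), ("10.0.0.2", "10.0.0.3"),
         ("192.168.1.1", "192.168.1.2")]

g = nx.Graph()
g.add_edges_from(flows)
for i, community in enumerate(greedy_modularity_communities(g)):
    # In the full scheme, one known flow per community would be labeled by
    # signature analysis and the label propagated to the whole community.
    print(f"community {i}: {sorted(community)}")
```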

Iliofotou et al. [22] combined clustering and statistical methods to profile network traffic, specifically P2P traffic. The proposed scheme works on the traditional (source IP, destination IP, source port, destination port, protocol type) tuple. It first groups captured traffic into similar flows using general clustering algorithms. Then, for each cluster, the scheme generates a traffic dispersion graph and, leveraging statistical graph metrics of typical P2P traffic, determines whether the cluster is P2P traffic.

The performance evaluation shows that the proposed scheme identifies 90% of the P2P traffic in the tested backbone traces, with an accuracy of 95%. Whether this approach adapts to other application types is worth studying in the future.

Huang et al. [24] employed Naive Bayes and decision tree classifiers to classify network traffic into different high-level applications. The proposed scheme defined several statistics on the early negotiation rounds of upper-layer applications. Taking these statistics as features and carefully captured flows of known type as training data, the scheme trained different classifiers for the traffic, which are later used to detect future traffic.

The proposed scheme was evaluated on campus traffic. Experimental results show that classifiers using the newly defined statistics achieve 92% accuracy on average; specifically, the accuracy increases by around 7% compared with the same classifier without the new features. To employ this approach, one needs to know the total number of application types in advance; how the scheme behaves on unknown traffic types remains unclear.
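
A hedged sketch of this train-then-detect workflow is given below, with random placeholder features standing in for the paper's early-negotiation statistics.

```python
# Minimal sketch of training the classifier types used by Huang et al. [24];
# the feature matrix and labels are placeholders, not the paper's data.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((500, 6))          # e.g., sizes/timings of the first packets
y = rng.integers(0, 4, size=500)  # known application-type labels

for clf in (GaussianNB(), DecisionTreeClassifier(max_depth=5)):
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, f"accuracy = {acc:.2f}")
```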

Kirchler et al. [27] used a K-means variant to profile network users according to their traffic. The proposed scheme focuses on DNS queries. Each data point is a vector counting how many times a flow visits a specific domain name; the dimension of the vector is the total number of distinct domains. The scheme then employs a modified K-means clustering algorithm to group the flows belonging to a specific user, thereby identifying Internet users.

The scheme was validated on a campus network DNS server. Over a period of two months, up to 19% of users were identified completely, and 73% of the users in this subgroup could be linked over a period of 56 days; the accuracy is high. It is worth noting that the scheme does not use traditional IP and port information. Adapting this technique to other network traffic would be interesting.

Das et al. [28] used several machine learning techniques to profile users in network traffic in order to identify user locations from traffic information alone. The proposed scheme defined several flow-level and application-level statistics and used them as features to train machine learning algorithms; ground truth was obtained by manual selection. The trained classifiers are later used to identify user locations.

The proposed scheme was validated on network traffic captured at WiFi access points (APs). The highest accuracy, 89%, was obtained by the Bayesian network classifier. An advantage of this scheme is that it does not record personal user information and thus favors user privacy. However, the choice of machine learning algorithm remains heuristic.

2.3 Challenges and open issues

Model reliability  All the proposed schemes are validated only on the traffic they were tested on. It is not known whether the models remain effective on different traffic in different ISPs, enterprises, or countries. One inherent reason is that traffic patterns change over time and space. Addressing model reliability is both interesting and challenging.

Huge traffic volume  Another challenge is how to deal with large volumes of traffic, both storing and processing them without lowering accuracy. Parallel machine learning algorithms or effective sampling may help; this is worth further investigation.

Model security  A very interesting and challenging problem is what happens if the traffic is perturbed with malicious flows. That is, the input is not clean: a malicious adversary may purposely pollute the network traffic in order to fool the classifiers. This needs to be further addressed in future work.

3 IoT device identification

Device identification refers to a mechanism that predicts the type of an IoT device according to the device's characteristics. Understanding the identities of IoT devices is critical to service providers (e.g., mobile apps) for commercial purposes (e.g., advertising), and to infrastructure (system/network) managers for security (e.g., finding vulnerable devices).

Fig. 2  Device identification model

Specifically, we define the IoT device identification problem as follows: the input is various data collected from a device, e.g., sensor data, network data, etc.; the output is a label indicating the type of the device. Figure 2 shows the device identification model. This problem has received extensive attention in recent years due to the proliferation of mobile computing, IoT deployment, and smart everything. Since the area is rapidly evolving with fast wireless and mobile technology innovation, we review efforts from the last five years on leveraging machine learning to identify IoT devices. Table 2 presents a short summary of the reviewed works.

It is worth noting that proactive approaches based on IP address, MAC address, or unique device numbers assigned by the manufacturer or operating system are not stable; thus, researchers have turned to machine learning approaches, which can also identify devices passively. In the following, we first review approaches for identifying mobile phones, and then works aimed at identifying general IoT devices.

Table 2 Recent machine learning based device identification works

3.1 Mobile phone identification

Stober et al. [30] used kNN and SVM to identify mobile phones based on network communication traffic. The proposed scheme records mobile traffic using tcpdump over a time interval and transforms the captured data into a 23-entry feature vector. The features are heuristically defined from traffic burst patterns and are not related to payload contents. kNN and SVM classifiers are then trained on the captured data; identifying future phones only requires applying the trained classifier.

The proposed scheme was validated on 20 Android mobile devices communicating over 3G links, achieving 90% accuracy. The computation is also efficient: identifying a mobile phone takes about 15 min of traffic. The results indicate that mobile phones can be reliably identified and tracked even when the communication traffic is encrypted.
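
The classification step of such schemes can be sketched as follows; the 23-entry burst-pattern vectors here are random placeholders for features derived from real captures.

```python
# Hedged sketch of the classification step in Stober et al. [30]:
# kNN and SVM over 23-entry burst-feature vectors (placeholder data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_train = rng.random((200, 23))          # burst features of known phones
y_train = rng.integers(0, 20, size=200)  # one label per phone

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)

X_new = rng.random((1, 23))              # features of an unknown capture
print(knn.predict(X_new), svm.predict(X_new))
```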

Das et al. [31] employed kNN and a Gaussian mixture model to identify mobile phones based on acoustic data. The proposed scheme captures acoustic signals from the microphone and the micro-speaker of a phone and extracts 25 features in total. The scheme then trains machine learning models (i.e., Gaussian mixtures) on the captured data; the trained model or kNN is later used to identify mobile phones.

The proposed scheme was validated in the lab on 52 mobile phones running both iOS and Android. Experimental results showed that devices from different manufacturers can be effectively identified, and even devices of the same manufacturer and model can be distinguished, with accuracy as high as 98%.

Bojinov et al. [32] used kNN and maximum-likelihood classification to identify mobile phones based on acoustic and accelerometer sensors. The scheme first identifies features for these sensors; the features reflect the basic fact that each sensor is imperfectly manufactured and thus has its own unique noise. Based on the unique features, the scheme uses simple kNN to identify mobile phones.

The proposed scheme was tested both on the acoustic sensors (i.e., microphone and micro-speaker) and on the accelerometer. For the former, an accuracy as high as 90% is achieved; for the latter, a 50% accuracy is obtained when augmented with the user-agent string from web browsing. An interesting future work along this line is to employ more powerful machine learning algorithms and more features to pursue higher accuracy, since currently only a few features are used.

Huynh et al. [34] used kNN and Gaussian mixtures to identify mobile phones via the touch screen sensor. The proposed scheme is based on the fact that every mobile phone touch screen is slightly different. The scheme employs 16 different features derived from the signals generated by capacitive sensing; a Gaussian mixture model and kNN are then used to identify mobile devices.

The proposed scheme was tested on 14 mobile phones running Android and iOS. The identification accuracy reaches 98%. Such a high accuracy has various potential applications in authentication scenarios, e.g., ATM authentication and smart unlocking, as pointed out in the paper [34].

Kurtz et al. [36] used a threshold-based classifier and an SVM to identify mobile phones running Apple's iOS. The proposed scheme employs manually defined features from phone settings, covering both public resources (device model, current iOS version, etc.) and protected resources (location data, photos, contacts, calendar data, reminders, sensor data). In total, 29 features are defined, and their effectiveness is tested. A threshold-based classifier and a linear SVM classifier are trained and employed to identify phones.

The proposed scheme was implemented as an iOS app and tested. The threshold-based classifier achieved more than 90% accuracy; the SVM classifier obtained slightly higher accuracy, but with added computation overhead. An interesting future work is to adapt the proposed scheme to Android and other IoT devices.

Baldini et al. [37] employed SVM to identify mobile phones based on the magnetometer sensor. The proposed scheme first captures the magnetometer's digital output at a given sampling rate using an app on the phone. The scheme then extracts Shannon entropy, log-energy entropy, variance, standard deviation, skewness, and kurtosis from the captured data as features, which are fed into an SVM to train a classifier for mobile phones.

The proposed scheme was validated on ten phones of different brands and models. The classification accuracy reached 94% across brands and models; however, intra-model classification accuracy is only around 54%.
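
The feature extraction might be sketched as below: six statistics of a magnetometer trace feed an SVM. The histogram-based entropy estimate and the synthetic traces are our own assumptions; the paper does not fully specify its estimator.

```python
# Hedged sketch of the six-feature extraction in Baldini et al. [37];
# data and the entropy estimator are illustrative assumptions.
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.svm import SVC

def magnetometer_features(signal):
    hist, _ = np.histogram(signal, bins=32)
    p = hist / hist.sum()
    p = p[p > 0]
    shannon = -np.sum(p * np.log2(p))                      # Shannon entropy
    log_energy = np.sum(np.log(signal ** 2 + 1e-12))       # log-energy entropy
    return [shannon, log_energy, np.var(signal), np.std(signal),
            skew(signal), kurtosis(signal)]

rng = np.random.default_rng(3)
traces = rng.normal(size=(40, 1024))     # placeholder magnetometer windows
labels = np.repeat(np.arange(10), 4)     # ten phones, four windows each
X = np.array([magnetometer_features(t) for t in traces])
clf = SVC().fit(X, labels)               # trained phone classifier
```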

3.2 General IoT device identification

Patel et al. [33] employed decision trees combined with random forests and multi-class AdaBoost to identify ZigBee devices. The proposed scheme first defined statistical features on radio signals, including the signal's instantaneous amplitude, phase, frequency, etc. The scheme then collected these features for known devices and trained decision tree models. To identify a device, one simply captures its features and inputs them into the classifiers.

The proposed scheme was validated with accuracy up to 90%. The scheme is also capable of detecting unknown ZigBee devices in a given networked system. One interesting future work is to enlarge the feature space and check whether the accuracy can be improved.

Tuama et al. [35] employed a support vector machine (SVM) to identify cameras from images. The proposed scheme leverages the detailed photo-taking process of cameras to deduce camera features; specifically, more than 10,000 features based on co-occurrence matrices, color dependencies, and conditional probabilities of an image are used. The scheme then trains an SVM model to assign images to cameras; the trained model is later used to identify cameras.

The proposed scheme was validated on a public image database. An SVM with a radial basis function (RBF) kernel was trained on 100 images and tested on another 100. Experimental results showed that the identification accuracy exceeds 97%.

Miettinen et al. [38] used random forests to identify general IoT devices, e.g., smart lighting, home automation, security cameras, household appliances, and health monitoring devices. The proposed scheme extracts features from network communication data; specifically, 23 features of the first few network packets during initial communication are employed. The scheme then trains a classifier model for each IoT device using the captured 23-feature data. Finally, the trained models are used to identify IoT devices in the network quickly.

The proposed scheme was tested on a representative set of consumer IoT devices on the European market. A data set of 540 fingerprints representing 27 device types was obtained for training and validation. Experimental results showed that the identification accuracy exceeded 95% for 17 device types, and was around 50% for the remaining 10 types, which are similar devices from the same manufacturers. As shown in the paper, the identification results can further be used by SDN controllers to enforce security policies in an IoT network composed of various devices. An interesting future work along this line could be using other machine learning models and more features to increase identification accuracy.

Meidan et al. [39] employed binary classifiers to identify general IoT devices, e.g., smart TVs, IP cameras, baby monitors, etc. The proposed scheme works on the network traffic between the IoT devices and the access points; various packet and payload characteristics are used to extract features. The scheme trains a binary classifier for each IoT device type using captured network traffic. To identify a device, each model is applied to multiple network sessions, and a majority vote determines the exact device.

The proposed scheme was validated in a local wireless network environment with multiple IoT devices, including PCs, smartphones, and several sensors. Experimental results showed an accuracy of more than 99%.
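
A minimal sketch of this one-classifier-per-device, majority-vote scheme follows; the feature vectors are placeholders for the packet/payload statistics, and the random-forest choice is our own assumption rather than the paper's exact classifier.

```python
# Hedged sketch of the per-device binary classifiers plus majority vote
# in the spirit of Meidan et al. [39]; data and classifier are illustrative.
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
device_types = ["smart_tv", "ip_camera", "baby_monitor"]
X = rng.random((300, 12))                 # placeholder session features
y = rng.integers(0, 3, size=300)          # true device type per session

# One binary (one-vs-rest) classifier per device type.
models = {i: RandomForestClassifier(n_estimators=50, random_state=0)
              .fit(X, (y == i).astype(int))
          for i in range(len(device_types))}

sessions = rng.random((7, 12))            # several sessions of one unknown device
votes = [i for s in sessions
         for i, m in models.items()
         if m.predict(s.reshape(1, -1))[0] == 1]
winner = Counter(votes).most_common(1)[0][0] if votes else None
print(device_types[winner] if winner is not None else "unknown")
```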

3.3 Challenges and open issues

Defense approaches  One interesting future direction is the interplay between IoT device identification and defense. Traffic encryption/padding, false traffic injection, and mobile phone OS priority protection are potential defense strategies. How to defend against device identification, and how to identify devices that deploy such protection mechanisms, are important research directions.

Understanding the effects of different machine learning approaches  Another interesting problem is how to choose machine learning algorithms. Researchers have proposed many algorithms; how different algorithms influence IoT device identification, and how to define the features fed into them, are yet to be understood.

Privacy evaluation  Furthermore, all the works reviewed here show that IoT devices can be identified with high accuracy, which breaks user privacy. A deeper privacy evaluation and potential protection methods are worth studying.

4 Security

Security problems in IoT networks are increasingly important given the growing number of attacks. IoT networks are more vulnerable than traditional networks because of the characteristics of IoT devices and their communication protocols. For example, (1) IoT devices are usually equipped with small batteries and micro-controllers, making them easy to flood; (2) IoT devices communicate with each other through Bluetooth, ZigBee, WiFi, or GSM, which are more vulnerable to attacks.

An IoT network usually has three components: devices, gateways, and a controller. All of these components could become targets of attackers. Figure 3 shows the IoT security problem model. In the model, traffic, wireless signals, device events, and configuration files can be analyzed; with features extracted from these data sources, a variety of machine learning methods are used to classify the data. The results can be used for privacy and authentication, or for judging whether an event constitutes an intrusion or anomaly.

Here we review the progress in this area. We categorize the works into device security, which focuses on problems of an isolated device, and network security, which focuses on problems across the whole network. The network security mechanism can be installed on a gateway or a controller, depending on the specific environment. Table 3 lists a short summary of the reviewed works.

Fig. 3  IoT security model

4.1 Device security

Kotenko et al. [40] used a combination of a multilayer perceptron and a probabilistic neural network to forecast the state of an IoT element. The paper used the traffic volume, the service rate, the loss rate, the packet-queue length, and the time from user action to inquiry as indicators that unambiguously define the state of an element. The solution is expected to reduce the cost of administration, especially during emergencies.

The proposed solution was evaluated in MathCAD. The results show that the combined ANN model provides higher precision: compared with a recurrent ANN model, the convergence of the combined model was close to 100%, whereas the recurrent model fell to 15% in certain cases.

Baldini et al. [41] used RF fingerprinting, computed from permutation entropy and dispersion entropy, for the authentication of wireless communication devices, including WiFi and GSM devices. The mechanism is not easy to crack because it is based on the physical properties of the devices. The authors also compare different machine learning classification methods, including kNN, SVM, and decision trees.

The authors used a set of nine nRF24LU1+ wireless devices to transmit RF signals, which were collected by a software-defined radio (SDR). The results show that the overall accuracy can reach 80%, which is sufficient to support the authentication of wireless IoT devices.

Sharaf-Dabbagh et al. [42] presented a demo that also uses RF fingerprinting for wireless device authentication. In addition, the proposed framework monitors the noise of the communication channels and the environment surrounding the source object; hence, the fingerprinting is more robust than in previous solutions. The authors set up a demo consisting of multiple Raspberry Pi boards.

Jeong et al. [43] proposed a new approach that protects user privacy when running cloud-based machine learning algorithms. Traditionally, the cloud collects raw user data, which is sensitive and vulnerable. In the new approach, the clients compute partially processed feature data using the early stages of a neural network, and the server continues the remaining stages after receiving the feature data. The service is thus safer, since the data in transit is hard to reverse-engineer. The authors measured performance with a testbed in which an embedded board with an ARM big.LITTLE CPU acted as the client and a desktop PC with an x86 CPU acted as the server. The results show that the new approach achieves shorter prediction time while improving privacy.

Jincy et al. [44] created a general security framework for IoT devices. Given the increasing variety of IoT devices in use, there is currently no single security mechanism that adapts to all kinds of devices. The framework classifies devices into types indicating their capability to support security mechanisms, based on capabilities and parameters such as power, processing, scalability, and network layer. The authors found the naive Bayes algorithm appropriate for this purpose: after inputting a file listing a device's properties, the system outputs the device's class, e.g., Class A for critical, Class B for medium, and Class C for non-critical devices.

Table 3 Recent machine learning based security mechanism works

4.2 Network security

Nobakht et al. [45] proposed an intrusion detection and mitigation system for smart homes called IoT-IDM. The proposed scheme collects traffic and event data from various devices in a smart home; the data is transported to the SDN controller, where IoT-IDM is deployed. Finally, IoT-IDM uses a linear regression model and SVMs to obtain the optimal classification model.

The proposed scheme was tested in an experimental setup employing a Philips lighting system under realistic settings. The accuracy of the linear regression model was 96.2%, and that of the SVM model was 100%.

Canedo et al. [46] deployed machine learning within an IoT gateway to address challenges including the heterogeneity and quantity of devices across an IoT network. The edge devices in the network collect data and transfer it to the gateway devices, which use artificial neural networks (ANNs) to learn the healthy state of each device and make informed decisions.

The authors created an IoT testbed in which Arduino devices simulated the edge devices and Raspberry Pi devices simulated the gateways. The simulation results show that the ANN makes correct predictions over 99% of the time.

Do et al. [47] proposed a mobile security system using machine learning. The system can be used to address vulnerabilities in mobile networks, such as phone and IoT networks. Compared with previous systems, the machine-learning-based system better handles problems such as zero-day attacks and the construction of conclusive attack signatures. The authors also presented a case study of a man-in-the-middle attack with an IMSI catcher, in which SVMs and neural networks are used to detect the anomalies.

Stroeh et al. [48] proposed a security mechanism that does not rely on network traffic but instead correlates attacks with security events or alerts provided by sensors, such as IDSs, logs, etc. The system first collects raw data and transforms it into a standardized format. Second, it takes the standardized alerts and clusters them into meta-alerts, each containing a structure called alert_taxonomy_set, a bit array representing each of the supported alert types. Finally, the system sorts the meta-alerts into attacks and false alarms.

The authors implemented and tested the new system against two major data sources: the DARPA challenge and SotM from the Honeynet project. Three machine learning techniques, SVM, Bayesian network, and decision tree, were used. The results show that the detection rate increases from 40–60% to 50–78%, depending on the operating system and attack type.

Rathore et al. [49] proposed a bio-inspired machine learning mechanism for improving wireless sensor network security, addressing the challenges that current wireless sensor networks face as the number of nodes and the complexity of network topologies increase. The authors were inspired by the human immune system, which can intelligently detect anomalies in the body. The system first classifies nodes as fraudulent or benevolent; based on the classification results, it generates virtual antibodies, which in turn affect the trust rate. Finally, the gateway decides whether or not to act against the fraudulent nodes. SVM and K-means algorithms can be used in the classification phase.

4.3 Challenges and open issues

DDoS attack  Unlike previous attack detection systems, machine-learning-based systems consume more computing resources; thus, they themselves become choke points vulnerable to DDoS attacks. How to trade off accuracy against computational complexity is an important research direction for machine-learning-based security mechanisms.

Security infrastructure  Many works are devoted to intrusion and anomaly detection for IoT networks and elements. There is much room for research on detecting attacks that target the security infrastructure and the key distribution mechanism themselves.

Data acquisition  Almost all works use traffic, event, and signal data for security analysis. However, some attacks are hard to detect using these data alone. At the same time, a large volume of security-related data exists in an IoT network, including administration, configuration, and routing data. How to make better classification decisions with such data is worth investigating.

5 Edge computing with machine learning

In the IoT world, sensors and equipment are spread throughout the network, including the network edge. Many IoT applications place latency, bandwidth, and security requirements on the network that cloud computing cannot satisfy; edge computing is a promising technology that can meet such demands [50]. For example, (1) VR and AR applications that need high bandwidth can fetch content from the edge network; (2) vehicles can exchange data with each other through edge networks, supporting vehicles on the road in acting cooperatively and providing a better user experience [51]. In the following sections, we use "edge computing" and "fog computing" interchangeably for convenience.

Figure 4 shows the edge computing problem model in IoT networks. In the model, traffic and sensor data can be analyzed; with features extracted from the data sources, a variety of machine learning methods are used to classify the data. The results can be used for intrusion detection, image recognition, disease identification, traffic engineering, etc. Table 4 lists a short summary of the reviewed works.

Fig. 4  Edge computing in IoT network model

Table 4 Recent machine learning based edge computing works

5.1 Edge computing applications

Borthakur et al. [52] proposed a new framework for a smart telehealth system with various wearable devices. The framework advocates the use of edge computing devices, which have fewer resources but are located closer to the end user. The paper first describes a new architecture for telehealth computing that decentralizes services at the network edge. Second, speech signal processing algorithms are used for telehealth monitoring, with K-means clustering applied to identify Parkinson's disease.

Drolia et al. [53] proposed a system called Precog, which accelerates image recognition by enabling caching and prefetching on edge devices. The system is a collaboration among three parts: devices, the edge server, and the cloud server. Unlike previous edge computing solutions in which all computing tasks are completed on the edge server, Precog also uses computing resources on the devices and in the cloud, owing to the computational complexity and data volume of image recognition. Both the edge server and the device employ a recognition cache that stores relevant parts of the trained model; moreover, devices prefetch the parts of the trained classifiers predicted to be needed in the near future.

Azimi et al. [54] proposed a hierarchical computing architecture (HiCH) for healthcare IoT networks. In the architecture, existing machine learning methods are partitioned among the layers of the fog network: the sensor devices are responsible for sensing and monitoring; the edge devices for local decision-making and system management; and the cloud for heavy training procedures. The authors devise a system based on IBM's MAPE-K model and demonstrate a complete implementation focused on arrhythmia detection. The results show that HiCH outperforms traditional systems in response time, bandwidth utilization, and storage, while maintaining acceptable accuracy.

Grassi et al. [55] devised a low-cost crowdsourcing architecture for visual analytics called ParkMaster, which evaluates parking availability. Unlike traditional centralized monitoring systems, ParkMaster makes use of a smartphone located inside the car, which captures the video stream along the street and counts the detected cars after processing the video with machine learning methods. The processing results are uploaded to the ParkMaster cloud, which processes data from multiple cars and recommends a parking slot to the driver.

Wang et al. [56] proposed a service recommendation system based on QoS prediction in a mobile edge computing environment. Unlike other context-aware service recommendation systems, the proposed system takes mobility into consideration: based on the mobility information, the system recommends services to users using collaborative filtering algorithms. The authors carried out a series of experiments on data from Shanghai Telecom, and the results show that the new system achieves higher prediction accuracy.

5.2 Improving edge computing infrastructure based on machine learning

Zissis [57] presented an intelligent intrusion detection system (IDS) that secures the underlying edge computing infrastructure, improving on the traditional "self-protecting" system proposed by IBM. By collecting data from sensors and applying recent unsupervised machine learning methods, the system can intelligently detect abnormal devices that could be harmful to the whole system. The author also developed a proof-of-concept system that detects anomalies in the real world.

Schneible et al. [58] also presented an anomaly detection system for edge computing environments using artificial neural networks. In traditional neural networks, all sensor data and the training model reside in one place, i.e., the centralized cloud, which brings congestion and latency along the communication path. The new system is based on federated learning: the training data is split among the edge devices, each of which stores a copy of the model, and the centralized cloud repository only needs to aggregate the training results from the edge devices. The federated learning mechanism improves latency and bandwidth and makes full use of the computation power across the edge networks.
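
The aggregation step at the cloud can be sketched as generic federated averaging; the plain numpy parameter vectors below stand in for the paper's actual neural network weights, and the weighting by local sample counts is a standard assumption rather than the paper's exact rule.

```python
# Hedged sketch of federated aggregation in the spirit of Schneible et al.
# [58]: edge devices train locally; the cloud only averages parameters.
import numpy as np

def federated_average(local_weights, sample_counts):
    """Average per-device parameter vectors, weighted by local sample counts."""
    counts = np.asarray(sample_counts, dtype=float)
    stacked = np.stack(local_weights)
    return (stacked * (counts / counts.sum())[:, None]).sum(axis=0)

# Three edge devices, each reporting a locally trained parameter vector.
local_models = [np.array([0.1, 0.4]), np.array([0.2, 0.5]), np.array([0.0, 0.3])]
print(federated_average(local_models, sample_counts=[100, 50, 50]))
```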

Abeshu et al. [59] further investigated the distributed attack detection problem, which is more difficult than detecting non-distributed attacks. In edge computing environments, traditional machine learning mechanisms have low accuracy and limited scalability due to the diversity and complexity of devices. The authors therefore proposed a new scheme based on deep learning, which has become popular in recent years thanks to advances in GPU hardware and deep neural network theory. The authors evaluated the new mechanism through simulations on publicly available datasets; the results show that the deep-learning-based mechanism outperforms traditional methods.

Besides security, privacy is also an important aspect of fog computing environments. Yang et al. [60] proposed a machine-learning-based privacy protection mechanism for devices that aggregate data from sensors in a fog computing architecture. More importantly, the new mechanism supports multifunctional data aggregation and can thus serve a wide range of data sources. The system also distributes the computationally heavy tasks to the network edge, making it more scalable than centralized systems. The experimental results show that the system achieves high accuracy without disclosing user privacy.

Hogan et al. [61] proposed a solution for traffic engineering in edge networks. Edge networks may offer multiple end-to-end paths, each with different delay and bandwidth, and choosing the one that best matches user requirements is especially important. The solution computes the results based on portfolio theory, which maximizes the expected return for a given level of risk (representing the expected throughput during the path's lifetime). Given the model, the authors use machine learning to evaluate the level of risk for each path. Using real-world latency traces, the paper compares the proposed solution with other techniques, and the results show that it leads to better performance.

5.3 Challenges and open issues

Heterogeneous data types  In edge network environments, the data sources consist of heterogeneous sensors. The collected information can span a diversity of data types and may even contain uncertainty under some circumstances; some data may be incomplete, which adds extra complexity to the system. New edge computing architectures should therefore take this heterogeneity into consideration.

DDoS attack on edge devices  Much work has been devoted to DDoS attacks on cloud computing, which has tremendous capacity and is well designed to defend against such attacks. In edge computing, (1) the edge devices have lower capacity, and (2) the infrastructure is less mature than traditional cloud computing. The edge computing environment is thus more vulnerable to DDoS attacks, especially when attackers flood a specific attack point.

Convergence speed for distributed machine learning  Previous work proposed using distributed machine learning in edge computing, where sensors and edge devices are responsible for local, lightweight training procedures and the centralized cloud for the global, heavy training procedure. Distributed machine learning can improve latency while achieving similar training accuracy. However, previous work has not taken convergence time into consideration: the transmission time between the local and global training points could become a bottleneck of the whole system.

6 Software-defined networking with machine learning in IoT

Recently, both academia and industry have seen the emergence of software-defined networking (SDN) owing to its flexibility. SDN separates the control plane from the forwarding plane, so network operators can manipulate the network with a high-level configuration language and need not deal with complex forwarding-table configuration. Due to the complexity and diversity of IoT devices, data-path configuration in IoT networks is even more difficult than in traditional networks; thus, SDN can play an important role in the IoT world [62]. However, also because of this complexity, the control plane needs machine learning for better management of the networks.

Figure 5 shows the software-defined networking problem model. In the model, traffic and sensor data can be analyzed; with features and flows extracted from the data sources, a variety of machine learning methods are used to classify the data. The results can be used for intrusion detection, traffic management, fault detection, DDoS attack detection, etc.

In this section, we review previous work on machine learning for SDN in IoT networks, covering two important aspects: (1) how machine learning can make SDN-based IoT network management easier and more effective; (2) how machine learning can help detect possible intrusions and increase detection accuracy.

Fig. 5  Software-defined networking in IoT network model

6.1 IoT network management

Kim et al. [63] proposed a new solution that identifies the service context of a flow and infers its QoS requirements. Because SDN assigns each flow to a virtual network, service context identification is important for flow assignment and virtual network construction. However, it is not straightforward: the context cannot be derived directly from protocol fields because the packets may be encrypted. The authors therefore proposed using machine learning to classify flows based on characteristics such as mean packet length and mean inter-arrival time.
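
A hedged sketch of this idea is given below: simple per-flow statistics feed a classifier that maps flows to service contexts. The flow data, the label set, and the random-forest choice are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of flow-context classification in the spirit of
# Kim et al. [63]; features, labels, and classifier are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def flow_features(pkt_lengths, timestamps):
    """Per-flow statistics usable even when payloads are encrypted."""
    gaps = np.diff(timestamps)
    return [np.mean(pkt_lengths), np.std(pkt_lengths),
            np.mean(gaps), np.std(gaps)]

rng = np.random.default_rng(5)
X = rng.random((400, 4))          # placeholder per-flow feature vectors
y = rng.integers(0, 3, size=400)  # e.g., video / voice / bulk service contexts
clf = RandomForestClassifier().fit(X, y)
print(clf.predict([flow_features([1400, 1400, 200], [0.0, 0.02, 0.04])]))
```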

Vukobratovic et al. [64] presented a novel architecture called Condense, which integrates data analysis functions into the IoT infrastructure so that data manipulations, such as aggregation and computation, can be done along the path. With Condense, data redundancy can be reduced and network bandwidth saved in IoT networks. To implement Condense, the authors proposed a function computation interface placed between data communication and analysis; SDN can be used to implement the function computation. With this enhancement, machine learning tasks can also be integrated into the Condense architecture, i.e., the learning tasks can be viewed as a series of function computations across the network (Table 5).

Table 5 Recent machine learning based SDN in IoT networks

Jagadeesan et al. [65] described a new mechanism for software fault detection in software-defined IoT networks. Although SDN brings flexibility to the control plane, and operators can control the network through high-level languages such as Java or Python, it can also make the network more vulnerable to software faults; considering the complexity and heterogeneity of IoT networks, the problem could be even more serious. The authors proposed using machine learning to classify encountered problems into software faults and other problems.

Taneja [66] proposed a framework for traffic management in IoT networks using SDN. In IoT networks, traffic management is both more necessary and more difficult because of the great differences between devices. Fortunately, most IoT communication protocols support traffic classification: for example, 802.11ah classifies devices into TIM (Traffic Indication Map) and non-TIM devices, and LoRaWAN classifies devices into classes A, B, and C. The authors put forward a new management mechanism in which SDN performs dynamic management of traffic classes and predicts transmission requirements in the near future; machine learning can be used to increase the accuracy of the prediction.

6.2 Intrusion detection

Nobakht et al. [45] proposed a host-based intrusion detection system (IDS) based on OpenFlow for smart home environments. In the system, the controller collects and analyzes data from sensors and uses machine learning to judge whether an intrusion has occurred. The authors implemented a proof-of-concept system called IoT-IDM on top of Floodlight, in which machine learning methods can be used as a pluggable module. The paper also studies a special case in a smart lighting system; the results show that the system gains flexibility from SDN and achieves high accuracy with an appropriate machine learning algorithm.

Uwagbol et al. [67] presented a pattern-driven corpus to predict SQL injection attacks. Although SQL injection is well studied, the problem arises again because IoT and SDN networks bring new opportunities for attackers, and defenders lack a ready corpus for machine learning methods to identify new attacks. The authors presented a pattern-driven corpus generation mechanism based on finite state automata; with the generated corpus, machine learning methods can be used to train a new model. Finally, two public datasets were used to evaluate the accuracy of the proposed mechanism, and the results show that it achieves high accuracy.

While most previous work tried to defend against DDoS attacks in the IoT world, Ahmed et al. [68] used machine learning methods to identify DNS query-based attacks. Unlike a DDoS attack, a DNS query-based attack can be launched with only a small number of packets, making it potentially more harmful. In the proposed system, the SDN controller collects traffic data from the network and identifies DNS query-based attack traffic using machine learning. The authors implemented a prototype based on a Dirichlet process mixture model and conducted simulations on real-world traces; the results show that the machine-learning-based method outperforms the traditional mean-shift-based method.

Bhunia et al. [69] proposed a dynamic attack detection system called SoftThings, which is based on SDN and tries to prevent attacks at the network layer rather than the device layer, so that the network can eliminate malicious traffic as early as possible. The system is divided into three layers: the device layer, cluster SDN controllers, and a centralized master controller. The distributed SDN controllers monitor and detect anomalous behaviors of IoT devices; once found, the misbehavior is reported to the master controller, which dynamically judges whether an attack is underway. The authors conducted simulations on the Mininet emulator, and the results show that SoftThings greatly improves on traditional attack detection systems.

6.3 Challenges and open issues

Overheads in control plane  Overload of the control plane is a potential problem in SDN networks. In IoT environments, the overhead could be even higher given the huge number of data sources and the high complexity of machine learning methods. It is therefore important to guarantee that the additional overhead will not crash the system.

DDoS attack on controllers  As mentioned above, the control-plane overhead can be quite high, making the controller a vulnerable point in the system. Once attackers find a pattern that imposes large overheads on the system, in both data collection and the training procedure, they can make the whole system collapse with high probability.

Controller placement  In the IoT world, controllers can be placed almost anywhere, e.g., in the edge network or in the centralized cloud. However, different placements yield different performance: controllers at the edge reduce communication latency, while controllers in the centralized cloud provide large computation capacity. Depending on the application scenario, controller placement should be carefully studied to satisfy user demands.

7 IoT applications

In recent years, IoT applications have been constantly emerging in almost all fields, e.g., health, agriculture, and industry. However, IoT applications face great challenges due to the heterogeneity and complexity of their data sources. Figure 6 shows the model of IoT applications. In the model, the potential data sources include wearable devices, mobile phone sensors (such as accelerometers), network cameras, and various kinds of wireless sensors. These sensors capture human vital signs, such as temperature and ECG, and environmental data, such as humidity, meter readings, and camera images. A variety of machine learning methods can then be used for human health monitoring, human activity recognition, fraud detection, and object detection.

There is a large body of work on IoT applications. We mainly review prior work on personal health monitoring and industrial applications: for example, IoT networks can be used to monitor human stress and recognize human activity or presence in the personal health field, or to predict agricultural diseases and detect fraudulent actions in smart cities. Table 6 lists a short summary of the reviewed works.

Fig. 6  IoT application model

7.1 Personal health applications

Asthana et al. [70] presented a recommendation system that suggests wearable IoT solutions and wearable devices for an individual. The system first collects the available user health data, including health history, demographic features, and previously collected IoT data from medical or health sensors. Then, with classification models such as decision tree, logistic regression, and LibSVM, the system makes predictions about diseases; each disease is related to certain attributes that need to be monitored. Finally, a mathematical optimization model is used to recommend the best IoT solution or wearable devices.

Walinjkar et al. [71] proposed a prognostic approach based on real-time electrocardiograph (ECG) analysis. With real-time data from continuous ECG monitoring devices, the scheme first analyzes the ECG waveforms with K-NN or another classifier, so that the system can predict arrhythmia and other ECG abnormalities. The authors also set up a monitoring IoT network, in which the analysis results are transferred to the NHS (National Health Service, UK) cloud in real time. Two classifiers, bagged tree and K-NN, were used; the test results show that the precision can reach 99.4% with K-NN.
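
A minimal sketch of the K-NN step, assuming the ECG waveforms have already been reduced to per-beat feature vectors; the synthetic features below (RR interval and QRS width) are illustrative stand-ins, not the paper's feature set.

```python
# Illustrative K-NN classification of per-beat ECG features (synthetic data).
import numpy as np
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# [RR interval (s), QRS width (s)] per heartbeat
normal = rng.normal([0.8, 0.09], [0.05, 0.01], (200, 2))
arrhythmic = rng.normal([0.5, 0.14], [0.10, 0.03], (60, 2))
X = np.vstack([normal, arrhythmic])
y = np.array([0] * 200 + [1] * 60)   # 1 = arrhythmia

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("precision:", precision_score(y_te, knn.predict(X_te)))
```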

Nguyen et al. [72] explored IoT applications in the medical field and proposed a tiered IoT architecture, which collects sensor data, analyzes them, and transforms them into clinical feedback. The architecture is divided into five layers: (1) the sensing layer, which uses sensors, actuators, and wearable devices to gather data; (2) the sending layer, where various communication mechanisms, including WiFi, Bluetooth, ZigBee, and LTE, can be used to send the data to the cloud; (3) the processing layer, which can run on smart phones, micro-controllers, and micro-processors, and which generates notifications and alerts if necessary; (4) the storing layer, where data can be stored on cloud or hosted servers; and (5) the mining and learning layer, which converts information into decisions or predictions using mining or machine learning algorithms.
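
To make the layering concrete, here is a toy rendering of the five layers as a processing chain; the field names, thresholds, and in-memory store are placeholders, not the authors' implementation.

```python
# Toy chain mirroring the five layers of the tiered architecture.
DB = []                                 # (4) storing layer stand-in

def sense():                            # (1) sensing layer: wearables, sensors
    return {"heart_rate": 112, "spo2": 93}

def send(reading):                      # (2) sending layer: WiFi/BLE/ZigBee/LTE
    return dict(reading)                # stand-in for an uplink to the cloud

def process(reading):                   # (3) processing layer: raise alerts
    reading["alert"] = reading["heart_rate"] > 100
    return reading

def store(record):                      # (4) storing layer: cloud/hosted servers
    DB.append(record)
    return DB

def learn(history):                     # (5) mining/learning layer
    return sum(r["alert"] for r in history) / len(history)

history = store(process(send(sense())))
print("fraction of alert readings:", learn(history))
```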

Madeira et al. [73] described a system that can detect human presence using IoT devices, without relying on devices, like cameras and motion detectors, that detect human presence explicitly. The system first collects interaction data, e.g., reads and writes, from a large diversity of devices. Then, using machine learning algorithms, the system can predict human presence. The system was tested on a dataset gathered over 3 days from 900 users. The authors evaluated a set of classification methods, including C4.5 decision tree, LinearSVC, and random forest. The results show that the precision ranges from 50 to 99%, depending on the algorithm selected.
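
A minimal sketch of the comparison, with per-interval read/write counts as an assumed feature set and sklearn's DecisionTreeClassifier standing in for C4.5:

```python
# Illustrative comparison of the three classifier families on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# [reads/hour, writes/hour] per time slot: devices are busier when present.
present = rng.poisson([30, 12], (300, 2))
absent = rng.poisson([2, 1], (300, 2))
X = np.vstack([present, absent]).astype(float)
y = np.array([1] * 300 + [0] * 300)   # 1 = user present

for model in (DecisionTreeClassifier(), LinearSVC(), RandomForestClassifier()):
    scores = cross_val_score(model, X, y, cv=5, scoring="precision")
    print(type(model).__name__, round(scores.mean(), 3))
```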

Pandey [74] used individual heart beats to predict whether a person is under stress in an IoT network. The author designed a WiFi-equipped board that can detect the pulse waveform and transfer the data to a server. Over time, the server assembles a fingerprint of the data across different times of the day. Using either SVM or logistic regression, the server can then predict stress. The results show that the precision can reach 68% when appropriate models are used.
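
As one plausible reading of the "fingerprint" idea, the sketch below averages pulse readings into time-of-day bins and feeds the resulting vector to logistic regression; the 4-hour binning and the synthetic data are assumptions.

```python
# Illustrative time-of-day fingerprint plus logistic regression (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

def daily_fingerprint(hours, bpm, n_bins=6):
    """Mean heart rate per 4-hour bin of the day (zeros for empty bins)."""
    idx = np.digitize(hours, np.linspace(0, 24, n_bins + 1)) - 1
    return np.array([bpm[idx == b].mean() if np.any(idx == b) else 0.0
                     for b in range(n_bins)])

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 40)                  # 1 = stressed day (synthetic)
days = [(rng.uniform(0, 24, 100),                # reading times in hours
         rng.normal(70 + 15 * s, 5, 100))        # bpm, elevated under stress
        for s in labels]

X = np.stack([daily_fingerprint(h, b) for h, b in days])
model = LogisticRegression().fit(X, labels)
print("training accuracy:", model.score(X, labels))
```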

Kwapisz et al. [75] proposed a user activity recognition mechanism based on phone accelerometers. The system first collects data from users who carry a cell phone while performing chosen activities. From the time series generated by the accelerometers, the system then extracts informative features, such as the average and standard deviation. Finally, the system uses machine learning methods, including logistic regression and multilayer perceptron, to classify the feature vectors into different activities.

The authors tested the system with data collected from twenty-nine users. The results show that the precision can exceed 90%, and that the multilayer perceptron based classification achieves higher precision than the logistic regression based classification.
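
A minimal sketch of the windowed feature extraction and the classifier comparison; the 10-second window, 20 Hz sampling rate, and two features per axis are simplifications of the paper's feature set, and the data are synthetic.

```python
# Illustrative accelerometer windowing, feature extraction, and classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def window_features(acc_xyz, fs=20, win_s=10):
    """Mean and std per axis for each fixed-length window."""
    win = fs * win_s
    feats = []
    for i in range(len(acc_xyz) // win):
        seg = acc_xyz[i * win:(i + 1) * win]           # shape (win, 3)
        feats.append(np.r_[seg.mean(axis=0), seg.std(axis=0)])
    return np.array(feats)                             # shape (n_windows, 6)

# Synthetic stand-in: 'walking' has larger variance than 'sitting'.
rng = np.random.default_rng(1)
walk = window_features(rng.normal(0, 2.0, (4000, 3)))
sit = window_features(rng.normal(0, 0.2, (4000, 3)))
X = np.vstack([walk, sit])
y = np.array([1] * len(walk) + [0] * len(sit))

for model in (LogisticRegression(), MLPClassifier(max_iter=1000)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```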

Table 6 Recent machine learning based IoT application works

7.2 Industrial applications

Patil et al. [76] proposed an agriculture system that monitors the environmental conditions of a vineyard and predicts grape diseases in their early stages. The system uses a variety of sensors to monitor the temperature, humidity, and moisture throughout the yards. Via ZigBee, the data are transmitted to servers, where a hidden Markov model is applied; in the model, each hidden state represents a certain condition. The authors have run the system in the real world since November 2015. The results show that the accuracy of the hidden Markov model is 90.9%, which greatly improves on the accuracy of statistical methods.
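
A minimal sketch of the modeling step, using hmmlearn's GaussianHMM as a stand-in for the paper's model; the three hidden states (e.g., healthy / at-risk / diseased) and the synthetic sensor readings are assumptions.

```python
# Illustrative HMM over vineyard sensor readings (synthetic data).
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(2)
# Observations per hour: [temperature (C), humidity (%), leaf wetness]
obs = rng.normal([22.0, 60.0, 0.3], [3.0, 10.0, 0.2], size=(500, 3))

model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(obs)
states = model.predict(obs)   # most likely hidden condition per hour
print(states[:24])            # inferred condition sequence for one day
```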

Siryani et al. [77] used machine learning to improve the efficiency of smart meter operations. With the tremendously increasing number of smart meters, administrators need to guarantee the cost efficiency of their operations. The authors used a variety of machine learning methods to predict whether a technician should be sent to a customer location. With higher prediction accuracy, the system can save considerable travel expenses and human resources.

The models were tested using data from a commercial network. Different classification algorithms, including Bayesian network, naive Bayes, decision tree, and random forest, were evaluated. The results show that random forest achieves the highest accuracy, 96.69%, and that the expected cost savings are about 1 million US dollars for the commercial network.
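
A minimal sketch of the dispatch decision and a savings estimate; the feature names, synthetic data, and per-dispatch cost are assumptions.

```python
# Illustrative random-forest dispatch decision with a cost-savings estimate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# [error-code count, days since last reading, signal strength (dBm)]
X = rng.normal([2, 1, -70], [2, 1, 8], (1000, 3))
y = (X[:, 0] + X[:, 1] > 4).astype(int)       # 1 = technician actually needed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
pred = rf.predict(X_te)

COST_PER_DISPATCH = 150.0                     # assumed cost per truck roll
avoided = np.sum((pred == 0) & (y_te == 0))   # correctly skipped dispatches
print("accuracy:", rf.score(X_te, y_te),
      "estimated savings: $", avoided * COST_PER_DISPATCH)
```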

Ling et al. [78] designed an IoT-based system that can detect the occupancy of parking spaces automatically. The system first uploads the collected images; then a vehicle recognition function is used to learn the parking spots. After that, a feature clustering algorithm based on Mean-shift is used to find the most frequently used parking locations.

The authors tested the system on a Raspberry Pi 3 board. Camera data were collected on a local street near the University of Washington campus, and the Raspberry Pi board was connected to AWS IoT for storage and monitoring. The results show that the real-time accuracy can reach 97%.
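
A minimal sketch of the Mean-shift step, clustering accumulated vehicle-detection coordinates to recover habitual parking spots; the synthetic detections stand in for the recognition output.

```python
# Illustrative Mean-shift clustering of detected vehicle positions.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(3)
# Detected vehicle centers (pixels) accumulated over many frames,
# scattered around two habitual parking spots.
detections = np.vstack([rng.normal([120, 340], 8, (60, 2)),
                        rng.normal([420, 335], 8, (45, 2))])

ms = MeanShift(bandwidth=25).fit(detections)
print("parking spots:", ms.cluster_centers_)
```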

Guo et al. [79] proposed an innovative method for characterizing the flowering dynamics of rice. The method first collects time series of images from rice fields; secondly, it extracts local feature points from the images; in the third step, it generates visual words for its object-recognition approach. The method then uses an SVM to classify the time series of images and detect the flowering parts. For evaluation, the authors collected image data at different times with different rice varieties. The results show that the method performs well for counting the number of flowering panicles; the classification accuracy can exceed 80% when proper training data are chosen.
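
A minimal sketch of the visual-words pipeline (vocabulary via clustering, histogram per image, SVM classification); the synthetic descriptors stand in for the local feature points extracted from field images, and k-means is an assumed vocabulary builder.

```python
# Illustrative bag-of-visual-words pipeline on synthetic descriptors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(4)
labels = rng.integers(0, 2, 30)                  # 1 = flowering parts present
# 200 local descriptors per image; flowering images shifted in feature space.
images = [rng.normal(3.0 * lab, 1.0, (200, 16)) for lab in labels]

vocab = KMeans(n_clusters=50, n_init=10).fit(np.vstack(images))

def bow_histogram(descriptors, k=50):
    """Normalized histogram of visual-word assignments for one image."""
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=k) / len(words)

X = np.array([bow_histogram(d) for d in images])
svm = SVC(kernel="linear").fit(X, labels)
print("training accuracy:", svm.score(X, labels))
```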

7.3 Challenges and open issues

Saving computing resources  Most IoT devices are equipped with low-capacity batteries and micro-controllers, and even the gateways are limited in battery and computing resources. However, some machine learning methods, e.g., DNNs, need considerable computing resources and are power hungry. Thus, how to distribute tasks among different computing nodes, so as to save power and computing resources while achieving near-optimal accuracy, is an important research direction for IoT network applications.

Unstructured data sources  Most works use structured data sources, such as sensor readings, images, and records. However, in the real IoT world, much data exists in unstructured formats. How to apply machine learning to such data is worth further investigation.

Real-time and online analysis  In many IoT applications, like health and industrial monitoring, the devices need to compute online and give feedback in real time. Thus, the requirements on security, QoS, and computational complexity are stricter than in other applications. These problems need to be further addressed in the future.

8 Conclusion

Machine learning has great potential to be a key technology for IoT, as it can provide the analytics that IoT applications need. Despite the recent wave of success of machine learning for networking, there is a scarcity of literature on its applications to IoT services and systems, which this survey aims to address. This paper differs from previously published surveys in focus, scope, and breadth; we have written it to emphasize the application of machine learning for IoT and to cover recent advances. Due to the versatility and evolving nature of IoT, it is impossible to cover each and every application. However, this paper has attempted to cover the major applications of machine learning for IoT and the relevant techniques, including traffic profiling, IoT device identification, security, edge computing infrastructure, SDN-based network management, and typical IoT applications. We have presented a thorough study of recent research on the application of machine learning for IoT, its technical progress, and its application domains. We have also presented concise research challenges and open issues, which are critical to the application of machine learning for IoT.