1 Introduction

Attacks on computer systems grow more sophisticated every day, and stronger protection is needed to stop them. Attackers use many techniques, including distributed denial of service (DDoS), man-in-the-middle, information leakage, SQL injection and remote-to-local attacks, to gain illegal access to restricted networks, websites or personal data on a device [1]. Attackers from inside or outside an organization keep finding innovative ways to steal money or sensitive information, and defenders respond with new methods meant to stop these attacks or at least reduce their impact. Cyber-security can be defined as the set of methods and techniques used to defend against cyber-attacks and to shield sensitive data from attackers. In 2016, machine learning saw multiple advances in areas such as self-driving cars, natural language processing, medicine and virtual assistants [2]. The same techniques can be applied to datasets related to intrusion detection in order to strengthen security against intrusions. First, a dataset is fed into a machine learning (ML) model. The model is trained on this dataset and is then known as the trained model; the ML algorithm is what is applied to the dataset samples during training [3]. ML algorithms play a crucial part in increasing the security offered by intrusion detection systems [3]. Machine learning algorithms are separated into two groups: unsupervised learning and supervised learning. They are distinguished by the kind of input data they are designed for.

Unsupervised learning refers to algorithms whose training data are unlabeled, so the algorithm must deduce the classes by itself. Supervised learning refers to algorithms whose training data are labeled, so the algorithm learns what differentiates the labels. Labeled data are rare, labeling data is exhausting work, and in some cases it may not even be clear whether meaningful labels exist.

2 Common Machine Learning Techniques Used in Cyber-Security

Regression

In regression, the value of a dependent attribute is estimated from the values of independent attributes by studying existing data about previous events. The resulting model is then used to handle new events. In cyber-security, regression is used for fraud detection: a model fitted to past transaction records examines the attributes of a current transaction and determines whether it is fraudulent. Regression models available in machine learning include linear regression, polynomial regression, decision trees, support vector machines and random forests, among others. Venkatesh Jaganathan used multiple regression to predict the effect of cyber-attacks. The overall common vulnerability scoring system (CVSS) score is taken as the dependent variable, with two independent variables, Y1 (vulnerability count) and Y2 (mean traffic). For privacy identification in a smart environment, Daria Lavrova suggested a multiple regression model, which helped to uncover both known and unknown attacks.
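As a rough illustration of multiple regression with two independent variables, the sketch below fits y = b0 + b1*x1 + b2*x2 by ordinary least squares using only the Python standard library. It is not the procedure from the cited work: the function names, the variable names x1 and x2 (standing in for vulnerability count and mean traffic) and the data are all made up for the example.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_two_feature_regression(rows, targets):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    xtx = [[0.0] * 3 for _ in range(3)]  # accumulates X^T X
    xty = [0.0] * 3                      # accumulates X^T y
    for (x1, x2), y in zip(rows, targets):
        v = (1.0, float(x1), float(x2))  # leading 1.0 is the intercept term
        for i in range(3):
            xty[i] += v[i] * y
            for j in range(3):
                xtx[i][j] += v[i] * v[j]
    return solve_linear_system(xtx, xty)


# Made-up observations: (vulnerability count, mean traffic) -> severity score.
observations = [(1, 100), (2, 150), (3, 120), (4, 300), (5, 250)]
scores = [2.0 + 0.5 * x1 + 0.01 * x2 for x1, x2 in observations]
b0, b1, b2 = fit_two_feature_regression(observations, scores)
```

Because the synthetic scores are generated from a known linear rule without noise, the fitted coefficients should recover that rule up to floating-point error; real incident data would of course be noisy.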

Classification

Classification is one of the most widely used supervised machine learning tasks. It is practical wherever a large collection of labeled data is available. In cyber-security, spam detection relies on ML-based classification to label email messages as spam or not spam; spam filter models separate the spam messages from the non-spam ones. Classifiers based on deep learning frameworks, which involve recurrent neural networks (RNN), convolutional neural networks (CNN), restricted Boltzmann machines (RBM) or long short-term memory (LSTM) cells and learn features through multi-layer, dense neural networks, tend to be quite effective at complicated tasks when a large collection of past data is available. The machine learning techniques used for classification include naïve Bayes, logistic regression, K-nearest neighbors, decision tree, support vector machine and random forest.
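To make one of the listed classifiers concrete, here is a minimal K-nearest neighbors sketch in standard-library Python. The two features (failed logins per minute, kilobytes transferred), the labels and the sample points are hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled samples: (failed logins per minute, kilobytes transferred).
train_points = [(0, 5), (1, 7), (0, 6), (9, 1), (8, 2), (10, 1)]
train_labels = ["benign", "benign", "benign", "attack", "attack", "attack"]

print(knn_predict(train_points, train_labels, (7, 2)))
print(knn_predict(train_points, train_labels, (1, 6)))
```

In practice the features would need scaling so that no single attribute dominates the distance; that step is omitted here.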

Clustering

Supervised learning models such as regression and classification require labeled data. Clustering, in contrast, is an unsupervised learning method that extracts general patterns from raw data even though the data are unlabeled. A set of similar events forms a cluster because the events share common attributes that define a specific behavioral pattern. In cyber-security, clustering is used for malware analysis, forensic analysis, anomaly detection and similar tasks. Self-organizing maps (SOMs) based on neural networks may be useful for cluster analysis. The ML clustering techniques used in cyber-security include K-means, K-medoids, DBSCAN, Gaussian mixture models and agglomerative clustering.
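A bare-bones K-means loop shows how clusters emerge from unlabeled points. This is a sketch under simplifying assumptions: centroids are initialized naively from the first k points (real implementations usually use a smarter initialization), and the two-dimensional sample data are invented.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain K-means; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Invented unlabeled events forming two loose groups.
events = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(events, k=2)
```

For anomaly detection, one common follow-up is to flag points that lie unusually far from every centroid.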

3 Issues in Cyber-Security

Machine learning algorithms play an important part in four areas of cyber-security: intrusion detection, malware analysis, Android malware detection and spam detection (Fig. 1).

Fig. 1 Issues in cyber-security

Intrusion Detection

Intrusion detection is used to discover exploitation of information by malicious software or violations of company policy. It can be done in many ways, but approaches fall into two main types: signature-based and anomaly-based. In signature-based intrusion detection, every received packet is first compared against stored signatures to look for similarities with a known malicious threat. In anomaly-based intrusion detection, network traffic is monitored against an established baseline of normal behavior. Biswas [4] presented machine learning-based methods that are very useful for building a better intrusion detection system; combinations of feature selection techniques gave great results. Vinaya Kumar [5] put forward the scale-hybrid-IDS-AlertNet system, which helps in analyzing network and host activity; the model was built with deep neural networks (DNNs). Zahangir Alom [6] proposed deep belief networks for intrusion detection, using features from a trained two-layer restricted Boltzmann machine (RBM). Shone et al. [7] proposed a DL model for network intrusion detection that uses features of both machine learning and deep learning.
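The difference between the two detection styles can be sketched in a few lines of Python. Everything here is illustrative: the byte patterns in `KNOWN_BAD_SIGNATURES` are hypothetical, and the anomaly check uses a simple z-score on a single traffic rate, which is only one of many possible ways to define a normality baseline.

```python
import statistics

# Hypothetical signature database; real systems use far richer rule sets.
KNOWN_BAD_SIGNATURES = {b"\x90\x90\x90\x90", b"' OR '1'='1"}


def signature_match(payload: bytes) -> bool:
    """Signature-based check: does the payload contain a known bad pattern?"""
    return any(sig in payload for sig in KNOWN_BAD_SIGNATURES)


def build_baseline(normal_rates):
    """Summarize normal traffic as (mean, sample standard deviation)."""
    return statistics.mean(normal_rates), statistics.stdev(normal_rates)


def is_anomalous(rate, baseline, threshold=3.0):
    """Anomaly-based check: flag rates far from the baseline (z-score test)."""
    mean, std = baseline
    if std == 0:
        return rate != mean
    return abs(rate - mean) / std > threshold


# Invented packets-per-second samples observed during normal operation.
baseline = build_baseline([100, 110, 95, 105, 90])

print(signature_match(b"GET /?id=1' OR '1'='1"))  # known pattern present
print(is_anomalous(400, baseline))                # far above the baseline
```

The signature check can only catch what is already in its database, while the baseline check can flag previously unseen behavior at the cost of possible false alarms.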

Malware Detection

Malware, short for malicious software, is one of the main types of cyber-threat. It is usually used for unauthorized attacks on organizations, such as stealing information, gaining control over access or damaging the organization's data. The term covers programs that are malicious in nature, such as viruses, bugs, bots, adware, rootkits, Trojan horses, worms, spyware, ransomware, keyloggers and backdoors. Most malware can be subdivided into families. For example, ransomware can be classified into the Jisut, Pletor, Simplocker, Charger, Koler, RansomBO and Svpeng families, among others. Malicious programs can be delivered concealed inside seemingly safe files and operating systems. Examples include UNIX executable and linkable format (ELF) files and Windows portable executable (PE) files (with extensions such as .exe, .dll and .efi). Malware can also be document-based, hidden inside doc, pdf and rtf files. Extensions and plug-ins for popular software platforms, such as web browsers and frameworks, can also carry malware.

Uppal et al. [8] used an n-gram method to build a classification and detection system for malware. Chowdhury et al. [9] presented a neural network-based method for malware detection. Kalash et al. [10] proposed classifying malware using a CNN: they converted malware binaries from 25 families to grayscale images and then applied CNN classification to the images.
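The two feature representations mentioned above can be sketched as follows. These helpers are assumptions for illustration only: the cited papers' exact n-gram units, image widths and preprocessing are not given here, so `byte_ngrams` simply counts overlapping byte n-grams and `bytes_to_grayscale` maps each byte to one pixel intensity (0 to 255) in rows of a chosen width.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams; the counts can serve as features."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def bytes_to_grayscale(data: bytes, width: int):
    """Lay the bytes out as rows of `width` pixels, zero-padding the last row."""
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


sample = b"MZMZ"  # toy input, not a real executable
print(byte_ngrams(sample, 2))
print(bytes_to_grayscale(b"\x00\x7f\xff\x10\x20", width=3))
```

Either output could then be fed to a classifier: n-gram counts to a conventional model, and the pixel grid to a CNN.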

Android Malware Detection

Android is heavily targeted by mobile malware authors because it is one of the most widely used mobile platforms. With the alarming increase in the volume and variety of Android malware, it has become exceptionally difficult to detect and classify mobile malware. Researchers have made many attempts at mobile malware detection. Arp et al. [11], Varsha et al. [12] and Sharma and Dash [13] extracted static features from Android apps and attained satisfactory results using machine learning algorithms such as decision tree, SVM, K-NN, random forest and naïve Bayes.

Spam Detection

Spam email comes in various flavors. Many are just annoying messages aiming to draw attention to a cause or spread false information. Some are phishing emails intended to lure the receiver into clicking on a malicious link or downloading malware. Spam detection is a supervised machine learning problem. This means the model must be trained on a set of sample spam and ham (non-spam) messages so that it can find the patterns that separate the two categories [19-23].
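As a small worked example of learning from spam and ham samples, the sketch below implements a multinomial naïve Bayes filter with add-one (Laplace) smoothing. The class name, the whitespace tokenization and the four training messages are all invented for illustration; a real filter would need far more data and better text preprocessing.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes over lowercase whitespace-separated words."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        # Assumes at least one message of each label has been trained.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word, add-one smoothed.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize click now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")

print(spam_filter.predict("free money prize"))
print(spam_filter.predict("monday team meeting"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.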

4 Real-Life Case Scenario of Cyber-Security Risk Analysis Using Machine Learning

A real-time scenario is highlighted in this section. The main goal is to assess how well machine learning classifiers distinguish the different types of responses to input malicious code [14]. We used four machine learning algorithms to classify the malicious code: naïve Bayes (NB), neural network, radial basis function and support vector machine (SVM). We took a combined dataset from four different organizations [15, 16]. The incidents that occurred in each organization were collected by a centralized hub, and the aggregated data were used to analyze how well the classifiers differentiate between the various incidents and to learn how data drawn from different organizations can help improve classification accuracy [17, 18].

The dataset in Table 1 covers four different organizations and 1900 events in total, and it was used to study how the different classifiers respond to malware. First, we calculated the precision analysis in Table 2, which shows the precision of the different classifiers and how well they react to the malware; SVM has the highest precision. Table 3 shows the recall analysis, where recall is the number of correctly detected malware samples divided by the number of malware samples that should have been detected. Table 4 is the F-score analysis (the F-score is high only when both precision and recall are high) and shows that for different functions different classifiers are better, i.e., the F-score for a given function varies across classifiers. Figure 2 shows the training and testing accuracy of the different machine learning algorithms: the neural network has the highest training and testing accuracy, naïve Bayes has the lowest testing accuracy, and SVM works much better on the training set but drops drastically on unseen data (the testing set). The radial basis function classifier behaves much like the neural network and also shows strong results on both the training and testing sets.
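The three metrics used in Tables 2 to 4 follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN) and F-score = 2 * precision * recall / (precision + recall), where TP, FP and FN are true positives, false positives and false negatives. A minimal Python sketch is given below; the label names and the six sample predictions are invented and are not taken from the study's data.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented ground truth and classifier output for six samples.
truth = ["malware", "malware", "malware", "benign", "benign", "malware"]
preds = ["malware", "malware", "benign", "malware", "benign", "malware"]

precision, recall, f1 = precision_recall_f1(truth, preds, positive="malware")
```

Here three malware samples are caught, one is missed and one benign sample is wrongly flagged, so precision and recall are both 3/4.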

Table 1 Data samples distribution among four different organizations
Table 2 Precision analysis of classifiers in identifying the different types of responses based on malware
Table 3 Recall analysis of classifiers in identifying the different types of response based on malware
Table 4 F-score analysis of classifiers in identifying the different types of responses based on malware
Fig. 2 Classification accuracy rate analysis using four classifiers

5 Conclusion

Machine learning techniques are extensively used to resolve various types of cyber-security problems. Current advancements in deep learning and machine learning provide encouraging solutions for cyber-security threats, but it is equally crucial to select the algorithm and model that suit the required application. To achieve high detection rates and to keep the solution robust against malware attacks, a multi-layered approach is required. In this paper, the authors investigated state-of-the-art mechanisms for cyber-security problems. The desired results for cyber-security can be achieved by combining machine learning techniques with human supervision.