
1 Introduction

The demand for advanced data analytics, which has led to the use of machine learning and other emerging techniques, can be attributed to the advent and subsequent development of technologies such as Big Data, Business Intelligence, and applications that require automation. As Sandhu [1] explains, machine learning is a subset of artificial intelligence that uses computerized techniques to solve problems based on historical data and information without requiring modification of the core process. Essentially, artificial intelligence involves the creation of algorithms and other computational techniques that make machines smart. It encompasses algorithms that think, act, and implement tasks using protocols that are otherwise beyond human reach.

Machine learning is a component of artificial intelligence that endeavors to solve problems based on historical or previous examples [2]. Unlike other artificial intelligence applications, machine learning involves learning hidden patterns within the data (data mining) and subsequently using those patterns to classify or predict an event related to the problem [3]. Simply put, intelligent machines depend on knowledge to sustain their functionalities, and machine learning offers such knowledge. In essence, machine learning algorithms are embedded into machines, and data streams are provided so that knowledge and information are extracted and fed into the system for faster and more efficient management of processes. It suffices to mention that all machine learning algorithms are also artificial intelligence techniques, although not all artificial intelligence methods qualify as machine learning algorithms.

Machine learning algorithms can be either supervised or unsupervised, although some authors also recognize a third class, reinforcement learning, in which techniques learn from data and identify patterns for the purpose of reacting to an environment. However, most articles recognize only supervised and unsupervised machine learning algorithms. The difference between these two main classes is the existence of labels in the training data subset. According to Kotsiantis [4], supervised machine learning involves a predetermined output attribute in addition to the input attributes. The algorithms attempt to predict or classify the predetermined attribute, and their accuracy and misclassification rates, alongside other performance measures, depend on the number of instances of the predetermined attribute that are correctly predicted or classified. It is also important to note that the learning process stops when the algorithm achieves an acceptable level of performance [5]. According to Libbrecht and Noble [2], supervised algorithms technically perform analytical tasks first using the training data and subsequently construct contingent functions for mapping new instances of the attribute. As stated previously, the algorithms require prespecification of maximum settings for the desired outcome and performance levels [2, 5]. Given this approach, it has been observed that a training subset of about 66% of the data is reasonable and helps achieve the desired result without demanding more computational time [6]; the sketch below illustrates such a workflow. Supervised learning algorithms are further classified into classification and regression algorithms [3, 4].
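To make this workflow concrete, the short sketch below (an illustrative example only; it assumes Python with the scikit-learn library and uses its bundled Iris dataset, none of which is drawn from the reviewed studies) trains a classifier on a 66% training subset and evaluates it on the held-out remainder.

# Illustrative supervised learning workflow (hypothetical example, assuming scikit-learn).
# A ~66% training subset is used, as suggested above; accuracy is measured on the rest.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # input attributes and a predetermined output attribute

# Split the labeled data: about 66% for training, the rest held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.66, random_state=42)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)  # learn patterns from the training subset
y_pred = model.predict(X_test)                                  # map new instances of the attribute

print(f"Accuracy on the test subset: {accuracy_score(y_test, y_pred):.3f}")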

Conversely, unsupervised learning involves pattern recognition without the involvement of a target attribute. That is, all the variables used in the analysis are used as inputs, and because of this, the techniques are suitable for clustering and association mining. According to Hofmann [7], unsupervised learning algorithms are suitable for creating the labels in the data that are subsequently used to implement supervised learning tasks. That is, unsupervised clustering algorithms identify inherent groupings within unlabeled data and subsequently assign a label to each data value [8, 9], as sketched below. Unsupervised association mining algorithms, on the other hand, tend to identify rules that accurately represent relationships between attributes.
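Hofmann's [7] point, that unsupervised algorithms can create the labels later consumed by a supervised task, can be sketched as follows. This is a hypothetical pipeline assuming scikit-learn and NumPy, with the particular estimators (KMeans, DecisionTreeClassifier) and the synthetic data chosen purely for illustration.

# Hypothetical sketch: let an unsupervised learner label the data,
# then train a supervised learner on the generated labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic unlabeled data: 300 observations, 4 attributes, 3 loose groupings
X = np.vstack([rng.normal(loc=m, size=(100, 4)) for m in (-2.0, 0.0, 2.0)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # inherent groupings become labels

clf = DecisionTreeClassifier().fit(X, labels)   # supervised learner trained on those labels
print(clf.predict(X[:5]))                       # predicted cluster-derived classes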

1.1 Motivation and Scope

Even though both supervised and unsupervised algorithms are widely used to accomplish different data mining tasks, discussions of the algorithms have mostly treated them singly or grouped them according to the needs of specific learning tasks. More importantly, literature reviews that account for supervised and unsupervised algorithms tend to handle either supervised or unsupervised techniques, with limited focus on both approaches in the same review. For instance, Sandhu [1] wrote a review article on machine learning and natural language processing but focused on supervised machine learning. The author did not conduct a systematic review and, as such, the article does not focus on any specific period or target any given database. Baharudin et al. [10] also conducted a literature review of machine learning techniques, though in the context of text data mining, and did not implement any known systematic review methodology. Praveena [11] likewise reviewed papers that had implemented supervised learning algorithms and, similarly, did not implement any of the known systematic review approaches. Qazi et al. [12], however, did conduct a systematic review, although with a focus on the challenges that different authors encountered while implementing different classification techniques in sentiment analysis. The authors reviewed 24 papers published between 2002 and 2014 and concluded that most review articles published during that period focused on eight standard machine learning classification techniques for sentiment analysis, along with other concept learning algorithms. Unlike these reviews, the systematic review conducted here focuses on all major stand-alone machine learning algorithms, both supervised and unsupervised, published during the 2015–2018 period.

1.2 Novelty and Review Approach

The systematic review relied on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) tool to review studies that have used different supervised and unsupervised learning algorithms to address different issues [13]. The search was designed to include papers published between 2015 and 2018 that deal with the use of machine learning algorithms as methods of data analysis. The identification and subsequent inclusion or exclusion of the reviewed articles was based on whether a paper was peer-reviewed, scholarly, available in full text, and published between 2015 and 2018 [13,14,15]. The search was conducted on the EBSCO and ProQuest Central databases using the queries listed below, which were implemented in both databases. In a conventional PRISMA review, it is a requirement to check and identify the search criteria in the title and the structure of the abstract, alongside the introduction (rationale and objectives) and methods, including information sources, data items, summary measures, and synthesis of results [16]. Here, such an approach was adopted but applied to primary published articles rather than to review articles. Table 1.1 summarizes the search queries that were run in the two databases.

Table 1.1 Summary of the queries used to search ProQuest Central and EBSCO databases

The inclusion criteria differed between the two databases, with EBSCO relying on date of publication and full-text availability to narrow the search, while the ProQuest Central search filters included Abstract (AB), Document Text (FT), Document Title (TI), and Publication Title (PUB). An instance of a search implemented in ProQuest Central with some of the above criteria is shown below.

ft(Supervised machine learning) AND ft(Unsupervised machine learning) OR ti(Supervised machine learning) AND ti(Unsupervised machine learning) OR pub(Supervised machine learning) AND pub(Unsupervised machine learning)

2 Search Results

The search and screening results based on PRISMA and elements of meta-analysis are presented in the following section. The major steps used to arrive at the final articles and subsequent analysis included screening (rapid title screening), full-text screening, and data extraction, including extraction of the characteristics of each study, followed by meta-analysis based on specific checklists and aspects of the machine learning algorithms used.

2.1 EBSCO and ProQuest Central Database Results

The search results obtained from the two databases before the commencement of the review process were as follows. The EBSCO search identified 144 articles published between 2015 and 2018. Of the 144 documents, 74 had complete information, including names of authors, date of publication, name of journal, and a structured abstract. However, only 9 of the 74 articles had full text available and, as such, were selected for inclusion in the review process. As for ProQuest Central, the initial search yielded 19,898 results, but applying the filters reduced this to 3301 articles, of which 42 were reviews, 682 covered classification techniques, and 643 covered or had information related to algorithms in general. The subject alignment of the research papers was not considered because of the wide spectrum of application of the algorithms, such that both supervised and unsupervised methods were also applied in other subjects. The distribution of the search results across the top ten journals is shown in Fig. 1.1.

Fig. 1.1 The distribution of the ProQuest Central search results across the top ten publication titles (journals)

Figure 1.1 shows that PLoS One had the highest number of articles published on supervised and unsupervised machine learning. Sensors and Scientific Reports (Nature Publisher Group) had 213 and 210 articles, respectively, while Multimedia Tools and Applications (172), Remote Sensing (150), and the International Journal of Computer Vision (124) each had over 100 articles. Even though Mathematical Problems in Engineering and Internal Computer Vision had 61 and 58 articles, respectively, the two publications were better placed to explore the mathematical and algorithmic aspects of supervised and unsupervised machine learning algorithms. The inclusion and exclusion criteria focused on the algorithms as well as their mathematical discourse and application in different fields.

Based on the PRISMA checklist, a total of 84 articles were included in the study and their content analyzed for the implementation of supervised and unsupervised machine learning techniques.

The final number of articles used in the review is 84, although 20 of them underwent meta-analysis in which each study was vetted for clarity of its objectives and study questions. Vetting for study questions and the effectiveness of the approach used to implement the chosen machine learning algorithms resulted in the exclusion of 1290 articles (Fig. 1.2). The rest (1985) met the study-question criteria but were also screened for the comprehensiveness of the literature search, data abstraction, evaluation of results, and applicability of results [17,18,19]. It is imperative to note that publication bias and disclosure of funding sources were not considered as part of the screening process. The 84 articles met these meta-analysis requirements and were subsequently included in the analysis (Fig. 1.2).

Fig. 1.2 The PRISMA flow diagram for the search conducted on ProQuest Central and EBSCO and the final number of studies included in the analysis

It is crucial to note that of the 84 articles included in the study, 3 were published in 2013 and 3 in 2014 but were not filtered out by the date of publication restriction.

2.2 Distribution of Included Articles

The articles used in the study consisted of Feature, Journal Article, General Information, Periodical, and Review types, with a distribution represented in Fig. 1.3.

As Fig. 1.3 shows, 78 articles were published between 2015 and 2018, while the remaining articles were published in 2013 [20,21,22] and 2014 [23,24,25]; their inclusion can be attributed to publication bias, which is also observed in the types of documents or studies retrieved. According to the search, inclusion, and exclusion criteria, the final results ought to have contained only journal articles, but others were features, general information, periodicals, and reviews. The six papers published in 2013 and 2014 were included because they met all the criteria required for the meta-analysis and their indexed metadata showed a 2015 publication date. From this misinformation, we can deduce that the indexed publication data had an inaccuracy of about 7.1% (6 of the 84 articles).

Fig. 1.3 Distribution of articles based on year of publication

3 Discussion

The 84 articles discussed different supervised and unsupervised machine learning techniques without necessarily making the distinction explicit. According to Praveena [11], supervised learning requires assistance born out of experience or patterns acquired from the data and, in most cases, involves a defined output variable [26,27,28,29,30]. The input dataset is segregated into training and test subsets, and several papers address the concept of training datasets based on the desired outcome [31,32,33,34]. All algorithms that use a supervised learning approach acquire patterns from the training dataset and subsequently apply them to the test subset with the objective of either predicting or classifying an attribute [35,36,37]. Most of the authors described the workflow of supervised machine learning, and, as also emerged from the review, decision trees, Naïve Bayes, and support vector machines are the most commonly used algorithms [8, 38,39,40,41,42].

3.1 Decision Tree

It is important to recall that supervised learning can be based on either a classification or a regression algorithm, and the decision tree algorithm can be used as both, although it is mainly used for classification, as noted in these articles [20, 43,44,45]. The algorithm emulates a tree: it sorts attributes through groupings based on data values [46]. Just like a conventional tree, the algorithm has nodes and branches, with nodes representing the variable groups used for classification and branches assuming the values that the attribute can take as part of the class [47, 48]. The pseudocode illustrating the decision tree algorithm is shown below. In the algorithm, D is the dataset, while x and y are the input and target variables, respectively [49, 50].

Algorithm 1.1: Decision Tree

Protocol DT Inducer (D, x, y)

 1. T = Tree Growing (D, x, y)

 2. Return Tree Pruning (D, T)

Method Tree Growing (D, x, y)

 1. Create a tree T

 2. if at least one of the stopping criteria is satisfied then

 3.  label the root node as a leaf with the most frequent value of y in D as the assigned class

 4. else

 5.  establish a discrete function f(x) of the input variables such that splitting D according to the function's outcomes produces the best splitting metric

 6.  if the best metric is greater than or equal to the threshold then

 7.   mark the root node in T as f(x)

 8.   for each outcome t_i of f(x) at the node do

 9.    Subtree = Tree Growing (\( {\delta}_{f(x)={t}_i}D,x,y \))

 10.   connect the root of T to Subtree and label the edge t_i

 11.  end for

 12. else

 13.  label the root node of T as a leaf with the most frequent value of y in D as the assigned class

 14. end if

 15. end if

 16. Return T

Protocol Tree Pruning (D, T, y)

 1. repeat

 2.  select the node t in T whose pruning maximally improves the pruning evaluation procedure

 3.  if t ≠ Ø then

 4.   T = pruned (T, t)

 5.  end if

 6. until t = Ø

 7. Return T

As illustrated in the pseudocode, the decision tree achieves classification in three distinct steps. First, the algorithm invokes both the tree growing and tree pruning functionalities [51]. Second, it grows the tree by assigning each data value to a class based on the value of the target variable that is most common at that iteration [52, 53]. The final step prunes the grown tree to optimize the performance of the resultant model [19, 53, 54]. Most of the reviewed studies applied decision trees in different domains, chiefly classification in cancer and lung cancer studies and clinical medicine, especially the diagnosis of conditions based on historical data, as well as some rarer artificial intelligence applications [40, 52, 55,56,57]. Most of the studies have also recognized decision tree algorithms to be more accurate when dealing with data generated using the same collection procedures [43, 44, 52].
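The grow-then-prune sequence of Algorithm 1.1 can be approximated in a few lines. The sketch below is a loose analogue only, assuming scikit-learn: its cost-complexity pruning parameter (ccp_alpha) stands in for the generic Tree Pruning protocol, and the bundled breast cancer dataset is an illustrative choice, not one of the reviewed studies' datasets.

# Loose analogue of Algorithm 1.1 (assuming scikit-learn): grow a full tree,
# then "prune" via cost-complexity pruning and compare held-out performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.66, random_state=0)

grown = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                    # Tree Growing
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)   # Tree Pruning analogue

print("grown tree accuracy :", grown.score(X_test, y_test))
print("pruned tree accuracy:", pruned.score(X_test, y_test))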

3.2 Naïve Bayes

The Naïve Bayes algorithm owes its fame to its foundation in Bayesian probability theory. In most texts, it is considered a semisupervised method because it can be used in either clustering or classification tasks [58, 59]. When implemented as a technique for creating clusters, Naïve Bayes does not require specification of an outcome; it uses conditional probability to assign data values to classes and, as such, is a form of unsupervised learning [47, 60,61,62]. However, when used to classify data, Naïve Bayes requires both input and target variables and, as such, is a supervised learning technique [55, 63, 64]. As a classifier, the algorithm creates Bayesian networks, which are trees generated from the conditional probability of an outcome given the probabilities imposed on it by the input variables [65, 66]. The pseudocode for the Naïve Bayes algorithm is presented below [49, 67, 68].

Algorithm 1.2: Naïve Bayes Learner

Input: training set Ts, hold-out set Hs, initial number of components Ic, and convergence thresholds ρ_EM and ρ_add

 Initialize M using one component

 I ← Ic

 repeat

  Add I components to M, initializing them with random examples drawn from the training set Ts

  Remove the I initialization examples from Ts

  repeat

   E-step: proportionally assign the examples in Ts to the mixture components using M

   M-step: calculate the maximum likelihood parameters using the input data

   if log P(Hs | M) is the best so far, then save M in Mbest

   every 5 cycles of the two steps, prune the low-weight components of M

  until P(Hs | M) fails to increase by the ratio ρ_EM

  M ← Mbest

  Prune the low-weight components of M

  I ← 2I

 until P(Hs | M) fails to increase by the ratio ρ_add

 Execute both the E-step and the M-step twice on Mbest, using examples from Hs and Ts

 Return M ← Mbest

As the pseudocode illustrates, the Naïve Bayes algorithm relies on Bayes' theorem, represented mathematically below, to assign independent variables to classes based on probability [31, 58].

$$ P\left(H|D\right)=\frac{P(H)P\left(D|H\right)}{P(D)} $$
(1.1)

In Eq. (1.1), the probability of H given that D has occurred is defined as the product of the prior probability of H and the probability of D given H, divided by the probability of D. Here H and D are events with defined outcomes; they could represent, for example, heads and tails in coin-tossing experiments [12, 45, 69, 70]. For instance, with illustrative values P(H) = 0.3, P(D|H) = 0.8, and P(D) = 0.4, the theorem gives P(H|D) = (0.3 × 0.8)/0.4 = 0.6. The extension of the theorem to supervised learning is of the form represented in Eq. (1.2).

$$ P\left(H|D\right)=P\left(H|{x}_1,\dots, {x}_n\right)\propto P(H)\prod_iP\left({x}_i|H\right) $$
(1.2)

In the above equation, x_1, …, x_n represent the input attributes, for which conditional probabilities are computed from the known probabilities of the target variable in the training dataset; the class posterior is proportional to the product of the prior and these per-attribute conditional probabilities [71,72,73]. The algorithm has been discussed in different contexts, and its application is mainly attributed to the creation of data labels for subsequent unsupervised learning verification [16, 74, 75].
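As an illustration of Eqs. (1.1) and (1.2) in a supervised setting, the sketch below (assuming scikit-learn; the Gaussian variant and the Iris data are illustrative choices, not drawn from the reviewed studies) lets GaussianNB estimate the per-attribute conditional probabilities and report class posteriors.

# Illustrative Naïve Bayes classification (assuming scikit-learn):
# GaussianNB estimates P(x_i | H) per class and combines them as in Eq. (1.2).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.66, random_state=0)

nb = GaussianNB().fit(X_train, y_train)   # learn priors P(H) and likelihoods P(x_i | H)
print("posterior for first test instance:", nb.predict_proba(X_test[:1]))
print("test accuracy:", nb.score(X_test, y_test))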

3.3 Support Vector Machine

The support vector machine (SVM) algorithm was also common among the articles in the search results. The articles that explored applications of SVM did so with the objective of evaluating its performance in different scenarios [30, 58, 73, 76]. All the reviewed applications of SVM were inclined toward classification, and the tenet of the algorithm is the computation of margins [53, 77, 78]. Simply, SVM draws margins as boundaries between the classes in the provided dataset. Its principle is to create the margins such that the distance between each class and the nearest margin is maximized, in effect leading to the minimum possible classification error [68, 78,79,80]. The margin is defined as the distance between two supporting vectors separated by a hyperplane. The pseudocode for the SVM algorithm is demonstrated below. The algorithm assumes that the data are linearly separable, so that the weights associated with the support vectors can be determined easily and the margin computed [62, 70]. This assumption also makes regularization possible [49, 81].

Algorithm 1.3: Support Vector Machine

Input: S, λ, T, k

Initialize: Choose w_1 such that \( \left\Vert {w}_1\right\Vert \le 1/\sqrt{\lambda} \)

FOR t = 1, 2, …, T

 Select \( {A}_t\subseteq S \), in which \( \left|{A}_t\right|=k \)

 Set \( {A}_t^{+}=\left\{\left(x,y\right)\in {A}_t:y\left\langle {w}_t,x\right\rangle <1\right\} \)

 Set \( {\delta}_t=\frac{1}{\lambda t} \)

 Set \( {w}_{t+0.5}=\left(1-{\delta}_t\lambda \right){w}_t+\frac{{\delta}_t}{k}\sum_{\left(x,y\right)\in {A}_t^{+}} yx \)

 Set \( {w}_{t+1}=\min \left\{1,\frac{1/\sqrt{\lambda}}{\left\Vert {w}_{t+0.5}\right\Vert}\right\}{w}_{t+0.5} \)

Output: w_{T+1}

The implementation of the algorithm, and hence its accuracy, depends on its ability to limit margin violations and the consequent misclassification of points on either side of the vectors. The margin is based on the following set of equations:

$$ {\displaystyle \begin{array}{c}{W}^{\mathrm{T}}x+b=1\\ {}{W}^{\mathrm{T}}x+b=0\\ {}{W}^{\mathrm{T}}x+b=-1\end{array}} $$
(1.3)

In Eq. (1.3), the three equations describe the separating hyperplane W^T x + b = 0 and the two linear support vectors W^T x + b = 1 and W^T x + b = −1; points lying outside the band bounded by the two support vectors are classified accurately, while those falling within it violate the margin [25, 81, 82]. Consequently, the larger the distance between the support vectors, the higher the chance that points are correctly classified.
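For readers who want to trace Algorithm 1.3 step by step, the following NumPy transcription is a sketch under simplifying assumptions: synthetic, linearly separable data centered so that no bias term is needed, and labels in {−1, +1}. It mirrors the subsample selection, weight update, and projection steps of the pseudocode.

# NumPy sketch of Algorithm 1.3 (stochastic subgradient SVM) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=2.0, size=(100, 2)),    # positive class
               rng.normal(loc=-2.0, size=(100, 2))])  # negative class
y = np.concatenate([np.ones(100), -np.ones(100)])

lam, T, k = 0.1, 1000, 20
w = np.zeros(2)                                       # satisfies ||w_1|| <= 1/sqrt(lam)
for t in range(1, T + 1):
    idx = rng.choice(len(X), size=k, replace=False)   # A_t ⊆ S with |A_t| = k
    Ax, Ay = X[idx], y[idx]
    viol = Ay * (Ax @ w) < 1                          # A_t^+: margin violators
    delta = 1.0 / (lam * t)                           # step size δ_t
    w = (1 - delta * lam) * w + (delta / k) * (Ay[viol, None] * Ax[viol]).sum(axis=0)
    norm = np.linalg.norm(w)
    if norm > 0:                                      # projection onto the ball of radius 1/sqrt(lam)
        w *= min(1.0, (1.0 / np.sqrt(lam)) / norm)

print("training accuracy:", (np.sign(X @ w) == y).mean())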

As for unsupervised learning algorithms, most of the studies either discussed, cited, or implemented k-means, hierarchical clustering, and principal component analysis, among others [20, 55, 73, 83, 84]. Unlike supervised learning, unsupervised learning extracts limited features from the data and relies on previously learned patterns to recognize likely classes within the dataset [85, 86]. As a result, unsupervised learning is suitable for feature reduction in the case of large datasets and for clustering tasks that lead to the creation of new classes in unlabeled data [80, 87, 88]. It entails the selection and importation of data into an appropriate framework, followed by selection of an appropriate algorithm, specification of thresholds, review of the model, and subsequent optimization to produce the desired outcome [89, 90]; a brief sketch of the feature-reduction case follows. Of the many unsupervised learners, k-means was widely discussed among the authors and as such is also reviewed below.
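Before turning to k-means, the workflow just outlined can be made concrete for feature reduction. The example below assumes scikit-learn, and the 95% explained-variance threshold is an illustrative choice rather than a recommendation drawn from the reviewed studies.

# Sketch of unsupervised feature reduction with PCA (assuming scikit-learn):
# import data, standardize, specify a variance threshold, review the result.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)     # labels ignored: unsupervised task
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales

pca = PCA(n_components=0.95).fit(X_scaled)     # keep enough components for 95% of the variance
print("original features:", X.shape[1], "-> retained components:", pca.n_components_)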

3.4 k-Means Algorithms

The k-means algorithm has been used in different studies to create groups or classes in unlabeled datasets based on the mean distance between classes [91, 92]. The technique initiates and originates the classes or labels that are subsequently used in other prospective analyses [69]. Pseudocode for the k-means algorithm is shown below [15, 61].

Algorithm 1.4: k-Means Learner

Function k-means ()

 Initialize k prototypes (w_1, …, w_k), for instance by setting each prototype to an observation: w_j = i_l, j ∈ {1, …, k}, l ∈ {1, …, n}

 Associate each cluster C_j with the prototype weight w_j

 Repeat

  for each input vector i_l, l ∈ {1, …, n}, do

   assign i_l to the cluster C_j∗ with the nearest prototype w_j∗

  for each cluster C_j, j ∈ {1, …, k}, do

   update the prototype w_j to be the centroid of the sample observations currently in C_j: \( {w}_j=\sum_{i_l\in {C}_j}{i}_l/\left|{C}_j\right| \)

  calculate the error function

$$ E=\sum_{j=1}^{k}\sum_{i_l\in {C}_j}{\left|{i}_l-{w}_j\right|}^2 $$

 until E becomes constant or does not change significantly.

The pseudocode demonstrates the process of assigning data values to classes based on their proximity to the nearest cluster mean while minimizing the error function [93,94,95,96]. The error function is computed as the sum of squared differences between each observation and the mean of its assigned cluster [97, 98].
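A direct NumPy transcription of Algorithm 1.4 might look as follows. This is a sketch on synthetic data, with prototypes initialized from randomly chosen observations (one common choice, not mandated by the pseudocode) and iteration stopped once E no longer changes.

# NumPy sketch of Algorithm 1.4: assign each vector to its nearest prototype,
# recompute the centroids, and iterate until the error E stops changing.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3.0, 0.0, 3.0)])

k = 3
w = X[rng.choice(len(X), size=k, replace=False)].copy()   # initialize prototypes w_j = i_l
E_prev = np.inf
while True:
    dist = np.linalg.norm(X[:, None, :] - w[None, :, :], axis=2)  # distance to each prototype
    assign = dist.argmin(axis=1)                                  # nearest cluster C_j*
    for j in range(k):                                            # centroid update
        if np.any(assign == j):                                   # guard against empty clusters
            w[j] = X[assign == j].mean(axis=0)
    E = ((X - w[assign]) ** 2).sum()                              # error function E
    if abs(E_prev - E) < 1e-9:                                    # until E does not change
        break
    E_prev = E

print("final error E:", round(E, 3))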

3.5 Semisupervised and Other Learners

Even though the search was focused and narrowed down to supervised and unsupervised learning techniques, it emerged that researchers preferred using several different methods for the purposes of comparing results and verifying the classification and prediction accuracy of the machine learning models [75, 99, 100]. Some of the studies used supervised and unsupervised machine learning approaches alongside semisupervised learning techniques such as generative models, self-training algorithms, and transductive SVM [101,102,103]. Other studies focused on ensemble learning algorithms such as boosting and bagging, while others explored different perspectives on neural networks [59, 66, 104,105,106,107]. Finally, some of the studies addressed algorithms such as k-nearest neighbors as instance-based learners but could not categorize them as either supervised or unsupervised machine learning algorithms because of the limitations of the applications [41, 108,109,110].

4 Conclusion and Future Work

Even though the search results yielded over 3300 qualifying papers, the filtering process based on title screening, abstract screening, full-text screening, and data extraction, coupled with meta-analysis, reduced the number of articles to 84. Despite narrowing the search to supervised and unsupervised machine learning as key search words, the results contained articles that addressed reinforcement learners and ensemble learners, among other techniques that the review did not focus on. The trend is understandable, because machine learning and data science are evolving and most of the algorithms are undergoing improvements, hence the emergence of categories such as reinforcement and ensemble learners. Future systematic reviews should therefore focus on these emerging aggregations of learners and assess research progress by authorship, region, and application to identify the major driving forces behind the growth.