1.1 Introduction to Machine Learning

The last two decades have seen a quiet but important revolution in computer science. Now more than ever, computers and algorithms are delivering richer and more accurate insights with software that learns from experience and adapts automatically to match the needs of its tasks [1]. Formerly, the programmer decided how the system would work by manually writing the code. Today, rather than writing every program by hand, we collect data that implicitly contain the instructions and develop algorithms that adjust as necessary to extract patterns and insights from those data. Today, we have programs that can recognize faces and fingerprints, understand speech, translate, navigate, drive a car, recommend movies, and much more [1]. This is possible because of artificial intelligence (AI) and its subfields, mainly machine learning.

Artificial intelligence reflects a computer’s ability to recognize patterns and then use and apply those patterns based on available data [2]. Artificial intelligence mimics human cognition by accessing data from a variety of sources and systems to make decisions and learn from their results and patterns [3]. Artificial intelligence was inspired by the human brain, given that computers were once known as “electronic brains” [1]. The human brain undoubtedly has incredible processing capabilities that humans have long aimed to understand in order to create an artificial version, known as artificial intelligence. Artificial intelligence has been growing rapidly since Alan Turing proposed the Turing Test (originally named the imitation game) in 1950, which suggested that computers could have the ability to think intelligently as artificial entities [3]. Alan Turing’s research revolutionized how the world perceived artificial intelligence and its use in daily life [2]. As artificial intelligence has evolved from being a purely academic field, it has become a significant part of many social and economic sectors, including speech recognition, medical diagnosis, vehicles, and voice-activated assistants [3].

Machine learning (ML) is a type of artificial intelligence that branches from computer science [4]. Machine learning is not just a data storing, processing, or training problem; it is instead a means to achieve artificial intelligence, either by training a computer program on a dataset or by using repeated trials to train it to maximize intelligent performance [1]. Machines are good at making smart decisions when enormous datasets are available. Humans, on the other hand, are much better at making decisions with limited information. Leveraging both human and machine intelligence is highly effective in creating machine learning models. Combining machine learning and human intelligence provides remarkably high levels of accuracy, leading us to artificial intelligence [3].

1.2 Origin of Machine Learning

Machine learning (ML) is often credited to a psychologist from Cornell University named Frank Rosenblatt, who, based on his theories about the workings of the human nervous system, developed a machine capable of recognizing letters of the alphabet in the 1960s [4]. The machine, called the “perceptron,” converted analog signals into discrete ones, becoming the prototype for modern artificial neural networks (ANNs). Further studies of the structures and learning abilities of neural networks took place in the 1970s and 1980s. Even so, the Novikoff theorem (1962), which states that the perceptron learning algorithm converges in a finite number of steps when the training data are linearly separable, has become more widely known and is often credited as a foundation of machine learning. In 1979, students at Stanford University created a notable invention known as the “Stanford cart,” which could navigate around different obstacles in a room [4]. The invention of the Stanford cart is an important part of the history of artificial intelligence and machine learning, as it paved the way for robotics research within the area. Today, robotic vacuum cleaners use a similar method to avoid obstacles in a room and pick up foreign materials such as dirt.
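The convergence property mentioned above can be illustrated with a minimal sketch of a perceptron-style learning rule. It is not Rosenblatt's original machine; the function names, the toy dataset (the logical AND function, which is linearly separable), and the epoch limit are assumptions chosen for illustration.

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Perceptron learning rule; labels must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: training has converged
            return weights, bias, epoch
    return weights, bias, None  # no convergence within max_epochs


# Hypothetical toy data: logical AND, encoded with labels +1 / -1.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]
weights, bias, epochs = train_perceptron(samples, labels)


def predict(x):
    """Classify a point with the learned weights."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1
```

Because the toy data are linearly separable, the loop stops after a finite number of passes; on data that are not separable, it would simply run until `max_epochs` and return `None` for the epoch count.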

1.3 Growth of Machine Learning

Machine learning has transformed the twenty-first century through its progress and growth, increasing computer competence in various fields, including the automotive industry, healthcare, commerce, banking, and manufacturing [5]. The first decade of the twenty-first century marked a turning point in machine learning history, which can be attributed to three trends that worked collaboratively [4]. The first trend is big data, which refers to a large volume of data that is complicated and requires specialized methods to process [4]. This very large and mostly accessible data includes weather data, business transactions data, medical test results, social media posts, security camera recordings, GPS locations from smartphones, sensor data, and many more. Big data tomorrow will be bigger than today, and with more data, trained models will get more intelligent [1]. The second trend is the reduced cost of parallel computing and memory, achieved by distributing the processing of high volumes of data between simple processors [4]. In addition, more complex and powerful yet affordable processors like graphics processing units (GPUs) are being produced. These resources are available to data scientists and organizations today via cloud computing without the need for huge investments in hardware. The reduced costs and increased processing power and storage capability allowed for increased data capturing and storage and more efficient and faster coding and analysis. The third trend is the development of new machine learning algorithms [4], many of which are readily available in open-source communities. The third trend is by far the most important, as it aided in the creation of artificial neural networks that support higher-level functions for data processing [4]. ANNs are crucial algorithms within machine learning (ML), serving the important purpose of solving complex problems.

There is more data than our senses or brains can handle or process. The information available online today contains massive amounts of digital text and is now so vast that manual processing is impossible. Using machine learning for this task is much more efficient and is known as machine reading. The basic advantage of machine learning is that it can be applied to a wide range of tasks without being explicitly programmed for each one. Using machine learning, we can build systems that are capable of learning and adapting to their environment on their own with minimal supervision and maximal user satisfaction [1].

1.4 How Machine Learning Works

Machine learning works by utilizing many algorithms that make intelligent predictions based on the datasets being used. These datasets can be enormous, consisting of millions of data points that cannot be processed by the human mind alone [5]. Machine learning has four variants: supervised, unsupervised, semi-supervised, and reinforcement learning [6]. In order to understand these variants, it is important to understand “labels.” A “label” in machine learning is the dependent variable and is a specified value of the outcome [6]. In supervised learning, labeled datasets are used by machine learning professionals to train algorithms by setting parameters to make accurate predictions about data [3]. Regression is one example of supervised learning [1]. On the other hand, unsupervised learning uses unlabeled datasets, in which the algorithm detects structure and patterns [3]. Data clustering is one example of unsupervised learning, and it can be much faster to set up, as the data do not need to be labeled [1]. Semi-supervised learning fits the models to both labeled and unlabeled data [6]. The main goal of semi-supervised learning is to understand how combining labeled and unlabeled data can change learning behavior and to design algorithms that use this combination [7]. Reinforcement learning is a machine learning approach that allows machines and software to automatically evaluate optimal behavior in specified contexts for improved efficiency [8]. Reinforcement learning is usually used for training complicated artificial intelligence models to increase automation. All four of these variants are important components of machine learning and play a significant role per their learning capabilities [8].
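To make the difference between labeled and unlabeled data concrete, the following sketch contrasts a simple regression (supervised) with a simple clustering routine (unsupervised). The datasets, the function names `fit_line` and `two_means`, and the choice of a least-squares line and a one-dimensional two-cluster k-means are illustrative assumptions, not methods prescribed by the text.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: least-squares fit of y = slope * x + intercept using labels ys."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def two_means(values, iterations=10):
    """Unsupervised: 1-D k-means with k = 2, started from the min and max values."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical labeled data: hours studied (input) and exam score (label).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)

# Hypothetical unlabeled data: the algorithm must find the two groups itself.
measurements = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = two_means(measurements)
```

In the supervised case the scores guide the fit; in the unsupervised case no outcome is supplied, and the structure (two groups of measurements) emerges from the data alone.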

1.5 Machine Learning Building Blocks

Statistics, data mining, analytics, business intelligence, artificial intelligence, and machine learning are concepts, methods, and techniques used to understand data and explore it to find valuable information, relationships, trends, patterns, and anomalies and ultimately to make predictions. In the following section, we will introduce data, its types, and how it is managed and explored. We will also introduce business intelligence and data analytics. Statistics will be covered in detail in Chap. 2, and in Chap. 3, we will introduce data mining and explore machine learning and the different algorithms in more depth.

1.5.1 Data Management and Exploration

1.5.1.1 Data, Information, and Knowledge

Data and information are comparable but not the same. While many believe that data and information represent the same concept and can be used interchangeably, these two terms are quite distinct. Data are streams of raw, specific, and objective facts or observations generated from events such as business transactions, inventory, or medical examinations. Standing alone, data have no intrinsic meaning. Data is generally broken into two categories, structured and unstructured. Structured data, like patient records, sales transactions, and warehouse inventory, has a predefined format and can be easily processed and analyzed. It is commonly stored and managed in a relational database. Unstructured data, like free text, videos, images, audio files, tweets, and portable medical device outputs, are complex in their form and are more difficult to manage, process, and analyze [9, 10].

Once processed (e.g., filtered, sorted, aggregated, assembled, formatted, or calculated), data becomes endowed with relevance and purpose and is put in context. Data thus turns into information. Information is a cluster of facts that are meaningful and useful to human beings in processes such as making decisions [11, 12]. For instance, patients’ IDs, names, dates of birth, home addresses, postal codes, phone numbers, emails, and diagnoses are examples of data that can be collected in a community center, clinic, or hospital, while a bar chart presenting the percentage of patients in different age groups, a pie chart representing the number of patients per type of disease, or a map representing the patients’ distribution in a geographic area are examples of information.

Consider a simple example that we can all relate to: how purchases are processed at checkout at a grocery store. Scanning the barcodes of the products at a store generates or accesses data in the form of a product number, a short description of the product, and a price. When these data are processed, an invoice is generated, and the store’s inventory is updated. This generated information helps the store determine how much to charge the customer and process the payment. This new information also lets the store manager know how much inventory is left for each product and helps them decide when to order new supplies [13].
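The checkout example can be sketched in a few lines of code. The product numbers, descriptions, prices, and stock levels below are invented for illustration, as are the names `catalogue`, `inventory`, and `checkout`.

```python
# Hypothetical raw data: product number -> short description and price.
catalogue = {
    "0001": {"description": "Milk 1 L", "price": 2.50},
    "0002": {"description": "Bread", "price": 3.00},
}
# Hypothetical stock levels before the purchase.
inventory = {"0001": 40, "0002": 25}


def checkout(scanned_barcodes):
    """Process raw scans into an invoice (information) and update the inventory."""
    lines = []
    for code in scanned_barcodes:
        product = catalogue[code]
        lines.append((product["description"], product["price"]))
        inventory[code] -= 1  # one unit leaves the shelf per scan
    total = sum(price for _, price in lines)
    return {"lines": lines, "total": total}


invoice = checkout(["0001", "0002", "0001"])
```

The scanned codes on their own carry little meaning; the invoice total and the updated stock levels are the information the cashier and the store manager act on.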

In summary, data, often described as the new oil, are simply a collection of facts. Once data are processed, organized, analyzed, and presented in a way that assists in understanding reality, they are called information. Information is ultimately used to make a decision and take a course of action.

When processed further and internalized by humans, information becomes knowledge. Knowledge can be defined as understanding, awareness, or experience. It can be learned, discovered, perceived, inferred, or understood [10]. In the grocery store example, knowledge would be the awareness of which products sell the most during specific times of the year, which translates into the decision to order additional supplies to avoid out-of-stock issues [13].

As we move from data to information and then to knowledge, we see more human contribution and greater value and, traditionally, a decreasing role of technology. Data are easily captured, generated, structured, stored, and transmitted by information and communication technology (ICT). Information, which is data endowed with relevance and purpose [14], requires analysis, a task increasingly being done by technology but also by human mediation and interpretation. Finally, knowledge, valuable information from the human mind, is difficult to capture electronically, structure, and transfer and is often tacit and personal to the source [12, 15].

1.5.1.2 Big Data

There is no unique or universal definition of big data, but there is a general agreement that there has been an explosion in data generation, storage, and usage [9]. Big data is a common term used to refer to the massive amounts of structured and unstructured data being generated, made available, and used [16]. These data come from daily business transactions at banks and retailers, for example, from sensors such as security cameras and monitoring systems, the GPS systems on every mobile phone, content posted on social media such as YouTube videos, and from many more ubiquitous sources [13]. Big data in the healthcare field comes from medical devices such as MRI scanners and X-ray machines, sensors such as heart monitors, patient electronic medical and health records, insurance providers’ records, doctors’ notes, genomic research studies, wearable devices, and many more [17]. An example of big data is what is collected by Fitbit, a manufacturer of wearable activity trackers. In 2018, it was announced that Fitbit had collected 150 billion hours’ worth of heart rate data from tens of millions of people from all over the world. These data also include sex, age, location, height, weight, activity levels, and sleep patterns. Moreover, Fitbit has 6 billion nights’ worth of sleep data [18].

There are multiple factors behind the emergence and growth of big data, and they include technological advances in the field of information and communication technology (ICT), where computing power and data storage capacity are continuously increasing while their cost is decreasing. The increased connectivity to the Internet is another major factor. Today, most people have a mobile device, and many modern pieces of equipment are connected to the Internet [13].

Big data is generally characterized by the four Vs: volume, variety, velocity (introduced originally by the Gartner Group in 2001), and veracity (added later by IBM) [19]. Multiple additional Vs were introduced later, including validity, viability, variability, vulnerability, visualization, volatility, and value [9, 19, 20]. Volume is the most defining characteristic of big data. The volume of data generated is increasing exponentially, and new units of measure have been created, such as zettabytes (10^21 bytes), to accommodate this increasing volume of data. According to IDC, a market-research firm, the data created and copied in 2013 was 4.4 zettabytes, and this number is projected to exponentially increase to 44 zettabytes in 2020 and 180 zettabytes in 2025 (Fig. 1.1) [19, 21]. Examples of large volumes of data are the 20 terabytes (10^12 bytes) of data produced by Boeing jets every hour and the 1 terabyte of data uploaded on YouTube every 4 minutes [22].

Fig. 1.1
A line graph titled big data volume in zettabytes. It plots values such as (2013, 4.4), (2020, 44), and (2025, 180). Values are approximated.

Exponential increase in the volume of big data (actual and projected) based on an IDC study (adapted from [13])
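As a rough check on what these figures imply, the compound annual growth rate between two of the quoted data points can be computed directly. The helper name `annual_growth_rate` is an assumption; the volumes and years are those cited from the IDC study.

```python
def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate implied by two values that are `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Volumes in zettabytes, as quoted in the text.
rate_2013_2020 = annual_growth_rate(4.4, 44, 2020 - 2013)
rate_2020_2025 = annual_growth_rate(44, 180, 2025 - 2020)

print(f"2013-2020: {rate_2013_2020:.1%} per year")
print(f"2020-2025: {rate_2020_2025:.1%} per year")
```

Under this simple model, the quoted volumes correspond to growth of roughly 39% per year from 2013 to 2020 and roughly 33% per year from 2020 to 2025.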

Variety refers to the different forms of big data, such as videos, pictures, social media posts, images from X-ray machines, location data from GPS systems, data from sensors like security devices and wearable wireless health monitors, and many more. Velocity refers to the very high speed at which big data are continuously being generated, for example, from medical devices and monitors in hospitals’ intensive care units or security cameras. Such data must be captured and analyzed in real time, particularly when the outcome has a direct impact on someone’s safety, as in the case of driverless cars, or on their financial situation, as in the case of the stock market. Finally, veracity represents the high level of uncertainty and low levels of reliability and truthfulness of big data [9, 10, 13, 19]. Data can be biased, incomplete, or filled with noise, and data scientists and analysts spend more than 60% of their time cleaning data [19]. These characteristics of big data represent challenges for any company or industry. Some of the challenges are technical, such as being able to analyze the large volume of data, which is generated very rapidly and in many different formats. Other challenges may be administrative, such as the reliability of the data [13].

The increasing volume and complexity of data, which is very rapidly generated in different formats, have made it practically impossible for humans to analyze without sophisticated analytics techniques. Therefore, techniques like data analytics, data mining, artificial intelligence, and machine learning are playing an increasing role in transforming data or information into knowledge and helping humans make decisions and take action (Fig. 1.2).

Fig. 1.2
A process chart has the following steps. Data, processing, information, table analytics, knowledge, decision, and action.

Data-to-action value chain

1.5.1.3 OLAP Versus OLTP

A significant amount of data is produced by daily business transactions, be it a purchase of a product, such as an airline ticket or a book; withdrawing money from a bank; admitting a patient to a hospital; generating medical imagery, such as X-rays; updating patient records after a medical examination; and so on [13]. These transactions are managed by transaction processing systems (TPS) or online transaction processing (OLTP) systems, which are computerized systems, such as payroll systems, order processing systems, reservations systems, or enterprise resource planning (ERP) systems, that perform and record the transactions that are necessary to conduct a business, such as employee record keeping, payroll, sales order entry, and shipping [11, 23]. At a bank, OLTP can be used to create new accounts, deposit and withdraw funds, process checks, transfer funds to other accounts, pay bills, calculate and apply fees, and generate a report on all transactions performed during a period of time. OLTP systems function at the operational level of an organization and are mainly responsible for acquiring and storing data related to day-to-day automated business transactions, running everyday real-time analyses, and generating reports [10]. The value of the data generated and maintained by an OLTP system goes beyond supporting an organization’s operations and generating reports.

These data, coming from multiple sources or systems, can be further analyzed to support organizational decision-making using online analytical processing (OLAP). OLAP can manipulate and analyze large volumes of data from different perspectives and answer ad-hoc inquiries by executing multidimensional analytical queries [20, 23]. An OLAP system is a computer system with advanced query and analytical functionality, such as ad-hoc and what-if analysis capabilities [20, 24]. At a bank, an OLAP system can be used to predict which customers may leave, an exercise called churn analysis. OLAP can also predict which customers are most receptive to certain new services, which makes it possible to develop targeted marketing campaigns instead of blanket or mass marketing, which is more expensive and less efficient and effective. Table 1.1 presents a brief comparison between OLTP and OLAP [10].

Table 1.1 A comparison between OLTP and OLAP (adapted from Sharda et al. (2015) [10])

1.5.1.4 Databases, Data Warehouses, and Data Marts

Today, most data is stored, organized, manipulated, and managed inside databases. A database is a collection of data formatted and organized into records that facilitates accessing, updating, adding, deleting, and querying those records [25]. A database can be perceived as a collection of files that are viewed as a single storage area of organized data records that are available to a wide range of users [22].

The most common type of database is a relational database, which consists of tables (called relations, hence the name “relational database”) that are connected via relationships (not to be confused with relations or tables). Each table, not unlike a spreadsheet, represents an entity of interest for which we collect and manage data, for example, a customer table or a student table. Each table consists of multiple fields related to the entity it represents, such as the customer’s last name, first name, social security number, phone number, and address. An example of an employee relation or table is shown in Table 1.2.

Table 1.2 An employee table divided into rows (i.e., records) and columns (i.e., fields)

In a relational database, tables are connected via relationships. Relationships are created by linking primary keys and foreign keys in different tables. In a database, each table has a primary key, which is a field or attribute that is used to uniquely identify each record in the table, such as a student ID, a patient medical health card number, a product code, or a customer phone number. A primary key can consist of multiple fields as long as their combination is unique for every record in the table, such as the combination of a shipment number and product ID. Such a key is called a composite primary key. A table also has foreign keys, which point to primary keys in other tables and confirm the presence of relationships between these tables. Figure 1.3 presents an overview of the relationships in a university database.

Fig. 1.3
A diagram shows the 4 tables of instructor, course, credit, and student. Primary, foreign, and composite primary keys are annotated on the left side.

Relationships in a relational database
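A minimal sketch of primary keys, a composite primary key, and foreign keys is given here using Python's built-in `sqlite3` module with an in-memory database. The schema is an assumption loosely inspired by a university database: the table and column names (`student`, `course`, `enrollment`) and the sample rows are hypothetical and do not reproduce Fig. 1.3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,          -- primary key
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_code TEXT PRIMARY KEY,            -- primary key
    title       TEXT NOT NULL
);
CREATE TABLE enrollment (
    student_id  INTEGER NOT NULL REFERENCES student(student_id),  -- foreign key
    course_code TEXT    NOT NULL REFERENCES course(course_code),  -- foreign key
    grade       TEXT,
    PRIMARY KEY (student_id, course_code)    -- composite primary key
);
""")
conn.execute("INSERT INTO student VALUES (1, 'Ada')")
conn.execute("INSERT INTO course VALUES ('ML101', 'Intro to ML')")
conn.execute("INSERT INTO enrollment VALUES (1, 'ML101', 'A')")

# The relationships let a query join the three tables back together.
rows = conn.execute("""
    SELECT s.name, c.title, e.grade
    FROM enrollment AS e
    JOIN student AS s ON s.student_id = e.student_id
    JOIN course  AS c ON c.course_code = e.course_code
""").fetchall()


def violates_constraint(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Same (student_id, course_code) pair twice: breaks the composite primary key.
duplicate_rejected = violates_constraint("INSERT INTO enrollment VALUES (1, 'ML101', 'B')")
# Student 99 does not exist: breaks the foreign key.
orphan_rejected = violates_constraint("INSERT INTO enrollment VALUES (99, 'ML101', 'B')")
```

The two rejected inserts show the keys doing their job: the composite key prevents the same student from being enrolled twice in one course, and the foreign key prevents an enrollment that points to a nonexistent student.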

Relational databases are designed to quickly access the data for transaction processing (via TPS/OLTP systems), such as admitting a new patient to the hospital or performing a sales transaction. In addition to daily transactions, the database can be queried for occasional reports or information, such as the account balance of a customer; the report can then be used for decision-making, such as providing a line of credit or a loan. However, as the size of a database grows due to day-to-day transactions generating additional new records in the tables, it becomes very time-consuming to generate any analytics using the data. Moreover, analytics or OLAP would slow down the system, making routine transactions handled by the OLTP too slow. Transferring funds between customer bank accounts could take minutes instead of seconds. A solution would be to extract the data from the different databases, transform it into an appropriate format, and load it into a special database specifically designed for querying and OLAP. Data that is redundant or has no value would be cleansed. This process is called “extract, transform, and load,” or ETL for short [13]. The databases suitable for querying, OLAP, and decision-making are referred to as “data warehouses” and “data marts.” A data warehouse is a physical repository where current and historical data are specifically organized to provide enterprise-wide cleansed data in a standardized format [25, 26]. The data in a warehouse is structured to be available in a form ready for OLAP, data mining, querying, reporting, and other decision support applications [26]. A data mart is a subset of a data warehouse, usually focused on a single subject or department [22, 26]. Figure 1.4 presents an overview of a data warehouse.

Fig. 1.4
An illustration of the overview of data warehouse has the following flow. Data sources, E T L process, enterprise data warehouse, data mart, and routine reports.

Data warehouse overview (adapted from Sharda et al. (2013) [26])
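The extract, transform, and load steps described above can be sketched on a very small scale. Everything in this example is hypothetical: the two source systems, their field names and date formats, and the list that stands in for the warehouse. Real ETL tools are far more elaborate.

```python
from datetime import datetime

# Extract: records from two hypothetical operational systems with different formats.
branch_a = [
    {"cust": "C1", "amount": "120.50", "date": "2021-03-01"},
    {"cust": "C2", "amount": "80.00", "date": "2021-03-02"},
    {"cust": "C2", "amount": "80.00", "date": "2021-03-02"},  # redundant duplicate
]
branch_b = [
    {"customer_id": "C3", "total_cents": 4500, "day": "03/03/2021"},
    {"customer_id": "C4", "total_cents": None, "day": "04/03/2021"},  # no usable value
]


def transform_a(rec):
    """Convert a branch A record to the standard (customer, dollars, date) format."""
    return (rec["cust"], float(rec["amount"]),
            datetime.strptime(rec["date"], "%Y-%m-%d").date())


def transform_b(rec):
    """Convert a branch B record to the standard format, or None if it has no value."""
    if rec["total_cents"] is None:
        return None
    return (rec["customer_id"], rec["total_cents"] / 100,
            datetime.strptime(rec["day"], "%d/%m/%Y").date())


def etl():
    """Transform all extracted records, cleanse them, and return the loaded rows."""
    rows = [transform_a(r) for r in branch_a] + [transform_b(r) for r in branch_b]
    cleansed = []
    for row in rows:
        if row is not None and row not in cleansed:  # drop empty and redundant rows
            cleansed.append(row)
    return cleansed  # "load": this list stands in for the warehouse table


warehouse = etl()
```

After the run, the warehouse holds three rows in one standardized format, even though the sources disagreed on field names, currency units, and date notation.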

Data inside a data warehouse or data mart is organized according to the concept of dimensional modeling, which suits environments where high-volume, complex queries are needed. The most common style of dimensional modeling is the star schema. In an OLTP environment, the database consists of tables representing entities of interest, such as patients and their attributes (name, phone, address, etc.). The star schema, in contrast, has a fact table with a large number of attributes (mainly numbers) that are most needed for analysis and queries, while the rest of the valuable data are stored in attached dimension tables [24, 26, 27]. Figure 1.5 provides a visual representation of an OLTP and an OLAP-based database structure.

Fig. 1.5
Two block diagrams for the data model of O L T P and O L A P, which have the components of the customer, product, supplier, and order details. B includes dimensions of customer, employee, time, and products.

Data models for OLTP and OLAP (adapted from Mailvaganam (2007) [27])
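A star schema can be sketched with one fact table surrounded by dimension tables. The example below uses Python's built-in `sqlite3` module; the table names (`fact_sales`, `dim_product`, `dim_region`), the columns, and the figures are assumptions made for illustration and are not taken from Fig. 1.5.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimension tables hold the descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, code TEXT);

-- The fact table holds the numeric measures plus keys to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    year       INTEGER,
    sales      INTEGER   -- thousands of dollars (hypothetical values)
);

INSERT INTO dim_product VALUES (1, 'Chair'), (2, 'Table');
INSERT INTO dim_region  VALUES (1, 'ON'), (2, 'QC');
INSERT INTO fact_sales VALUES
    (1, 1, 2020, 90), (1, 2, 2020, 100),
    (2, 1, 2021, 70), (1, 2, 2021, 110);
""")

# A typical analytical query: aggregate the facts along one dimension.
sales_by_region = db.execute("""
    SELECT r.code, SUM(f.sales)
    FROM fact_sales AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    GROUP BY r.code
    ORDER BY r.code
""").fetchall()
```

The query touches only the fact table and one small dimension table, which is the property that makes this layout convenient for summaries.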

An OLTP system is designed for operational purposes: it is detailed, is continuously and easily updated, has very current data, has to be always available, and is designed for transactional speed. An OLAP-based system, by contrast, is more informational: it has summary data, is not continuously updated, has mainly historical and integrated data, and is designed for complex queries and analytics [24, 26, 27].

To perform the analytical processing, a data cube is created, which is a multidimensional data structure that is generated out of the star schema and allows for fast analysis of data. The cube is multidimensional but is commonly represented as three-dimensional for ease of viewing and understanding. Each side in a cube represents a dimension, such as a patient, procedure, or time, and the cells are populated with data from the fact table. The cube is optimally designed for common OLAP operations, such as filtering, slicing, dicing, drilling up and down, rolling up, and pivoting [24, 26, 27], which will be explored next.

1.5.1.5 Multidimensional Analysis Techniques

Data to be reported can be manipulated in many cases with simple arithmetic and statistical operations, such as summing up (e.g., total sales in a year), counting (e.g., the number of sales transactions), calculating the mean (e.g., the average profit of sales), filtering (e.g., extracting names of customers in a certain region who made the highest purchases), sorting, ranking, and so on. To extract the data from multiple tables in a relational database of a TPS system, one can issue an SQL query command that pulls out the data from multiple tables by performing a “join” operation (i.e., joining related data from different tables). To perform OLAP on a multidimensional data structure, such as a cube in a data warehouse, several operations or techniques may be needed, such as slicing, dicing, and pivoting [9, 24, 28].
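The simple reporting operations listed above (summing, counting, averaging, filtering, and ranking) can be sketched as follows. The transaction records and customer names are hypothetical.

```python
from statistics import mean

# Hypothetical sales transactions: (customer, region, amount in dollars).
transactions = [
    ("Lee", "ON", 1200),
    ("Roy", "QC", 800),
    ("Kim", "ON", 400),
    ("Diaz", "BC", 1600),
]

amounts = [amount for _, _, amount in transactions]
total_sales = sum(amounts)            # summing up
transaction_count = len(amounts)      # counting
average_sale = mean(amounts)          # calculating the mean

# Filtering: customers in one region.
ontario_customers = [name for name, region, _ in transactions if region == "ON"]

# Sorting and ranking: largest purchase first.
ranked = sorted(transactions, key=lambda t: t[2], reverse=True)
top_customer = ranked[0][0]
```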

For simplicity, assume that we have a three-dimensional dataset of sales, where the dimensions are product, region, and year. The dataset can be represented as a cube, where each axis is a dimension, and the cells contain sales data in thousands of dollars (Fig. 1.6). In this figure, we find that the sales of chairs in QC in 2021 were worth $110,000.

Fig. 1.6
A 3 by 3 O L A P cube with year, product, and region details on the top, left, and bottom edges, respectively. The years are 2019 to 2021. The regions are O N, Q C, and B C.

Example of an OLAP cube containing sales figures of three products in three regions over a period of 3 years
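One simple way to picture such a cube in code is a three-level nested list, with one axis per dimension. In the sketch below, only the chair/QC/2021 cell (110, in thousands of dollars) comes from the text; the third product name ("Desk") and the zero placeholders in all other cells are assumptions, since the remaining figures are not reproduced here.

```python
products = ["Chair", "Table", "Desk"]   # "Desk" is a placeholder third product
regions = ["ON", "QC", "BC"]
years = [2019, 2020, 2021]

# cube[i][j][k] = sales of products[i] in regions[j] during years[k],
# in thousands of dollars. Zeros are placeholders for unknown cells.
cube = [[[0 for _ in years] for _ in regions] for _ in products]
cube[products.index("Chair")][regions.index("QC")][years.index(2021)] = 110


def cell(product, region, year):
    """Look up one cell of the cube by its three dimension values."""
    return cube[products.index(product)][regions.index(region)][years.index(year)]


chair_qc_2021 = cell("Chair", "QC", 2021)
```

Each index position plays the role of one axis of the cube, so fixing all three dimension values identifies exactly one cell.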

1.5.1.5.1 Slicing and Dicing

Slicing and dicing operations are used to make large amounts of data easier to understand and work with. Slicing is a method to filter a large dataset into smaller datasets of interest, while dicing these datasets creates even more granularly defined datasets [9]. Slicing means taking a single slice out of the cube by fixing one dimension, showing, for example, the sales of tables for each region and year (Fig. 1.7).

Fig. 1.7
A 3 by 3 O L A P cube with year, product, and region details on the top, left, and bottom edges, respectively. The second row for the product titled table is highlighted.

Representation of slicing data to extract the sales of tables in all regions and years

Another example of slicing is in Table 1.3, where the sales of each product are summed up for all regions and years.

Table 1.3 Example of slicing

Dice is a slice on more than two dimensions of the cube [28]. Dicing is putting multiple side-by-side members from a dimension on an axis with multiple related members from a different dimension on another axis, allowing the viewing and analysis of the interrelationship among different dimensions [24]. Two examples of dicing are depicted in Table 1.4, showing the sales of all products per region per year and sales per region per product for all years.
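The slicing and dicing operations described in this section can be sketched on a small cube stored as a dictionary. The sales figures (in thousands of dollars) are hypothetical apart from the chair/QC/2021 value of 110 quoted earlier, and the helper names `slice_product` and `totals_by` are assumptions.

```python
from collections import defaultdict

# Sales in thousands of dollars, keyed by (product, region, year).
sales = {
    ("Chair", "ON", 2020): 90, ("Chair", "QC", 2020): 100,
    ("Chair", "ON", 2021): 95, ("Chair", "QC", 2021): 110,
    ("Table", "ON", 2020): 60, ("Table", "QC", 2020): 70,
    ("Table", "ON", 2021): 65, ("Table", "QC", 2021): 75,
}


def slice_product(product):
    """Fix one product and keep its region-by-year cells."""
    return {(r, y): v for (p, r, y), v in sales.items() if p == product}


def totals_by(*positions):
    """Sum sales over every dimension not listed (0=product, 1=region, 2=year)."""
    out = defaultdict(int)
    for key, value in sales.items():
        out[tuple(key[i] for i in positions)] += value
    return dict(out)


table_slice = slice_product("Table")   # sales of tables per region and year
per_product = totals_by(0)             # each product summed over regions and years
per_region_year = totals_by(1, 2)      # all products per region per year
```

Here `table_slice` corresponds to taking one slice out of the cube, `per_product` to summing each product over all regions and years, and `per_region_year` to a view across two dimensions at once.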

Table 1.4 Dicing for region/year (left) and region/product (right)

1.5.1.5.2 Pivoting

A pivot table is a cross-tabulated structure (crosstab) that displays aggregated and summarized data based on the ways the columns and rows are sorted. Pivoting means swapping the axes or exchanging rows with columns and vice versa or changing the dimensional orientation of a report [9, 24, 28] (Table 1.5).
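Swapping the axes of a small report can be sketched as follows. The report values are hypothetical, and `pivot` is an illustrative helper rather than a function from any particular tool.

```python
# Hypothetical report: rows are regions, columns are years (thousands of dollars).
years = [2020, 2021]
report = {"ON": [150, 160], "QC": [170, 185]}


def pivot(rows, column_labels):
    """Swap axes: old column labels become row labels and old row labels become columns."""
    row_labels = list(rows)
    pivoted = {
        column: [rows[row][i] for row in row_labels]
        for i, column in enumerate(column_labels)
    }
    return pivoted, row_labels


pivoted, regions = pivot(report, years)
```

The values are unchanged; only the orientation of the report differs, with years now running down the side and regions across the top.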

Table 1.5 Pivoting region and year

1.5.1.5.3 Drill-Down, Roll-Up, and Drill-Across

Drilling down and rolling up allow the user to navigate among levels of the data, ranging from the most summarized (roll-up) to the most detailed (drill-down) [28]. These operations apply when there is a multilevel hierarchy in the data (e.g., country, province, city, neighborhood) and the users can move from one level to another [24]. Figure 1.8 shows an example of drilling down on the product dimension. When you roll up, the key data, such as sales, are automatically aggregated, and when you drill down, the data are automatically disaggregated [9].

Fig. 1.8
Three tables for product dimensions with columns for the store, C A, O R, L A, and total. The group at each table points to the group at the next table.

Drilling down (adapted from Ballard et al. (2012) [24])
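Rolling up and drilling down along a hierarchy can be sketched with a two-level example (province and city). The cities, sales figures, and function names are hypothetical and do not reproduce Fig. 1.8, which drills down on the product dimension.

```python
from collections import defaultdict

# Hypothetical sales (thousands of dollars) at the most detailed level.
city_sales = {
    ("ON", "Toronto"): 120,
    ("ON", "Ottawa"): 80,
    ("QC", "Montreal"): 110,
    ("QC", "Quebec City"): 60,
}


def roll_up(detail):
    """Aggregate city-level sales up to the province level."""
    summary = defaultdict(int)
    for (province, _city), amount in detail.items():
        summary[province] += amount
    return dict(summary)


def drill_down(detail, province):
    """Disaggregate one province back into its cities."""
    return {city: amount for (prov, city), amount in detail.items() if prov == province}


province_sales = roll_up(city_sales)
ontario_detail = drill_down(city_sales, "ON")
```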

Drilling across is a method where you drill from one dimension to another, but where the drill-across path must be defined [24]. Figure 1.9 shows an example of a drill-across from the store CA to the product dimension.

Fig. 1.9
Two drilling-down tables with 4 columns for sales in U S D and metrics. Rows are CA, OR, and LA in-store, and products are soda, milk, and juice. C A points to the product in the next table.

Drilling across (adapted from Ballard et al. (2012) [24])

1.5.2 The Analytics Landscape

Analytics is the science of analysis—using data for decision-making [26]. Analytics involves the use of data, analysis, and modeling to arrive at a solution to a problem or to identify new opportunities. Data analytics can answer questions such as (1) what has happened in the past and why, (2) what could happen in the future and with what certainty, and (3) what actions can be taken now to control events in the future [9, 10].

Data analytics have traditionally fallen under the umbrella of a larger concept called “business intelligence,” or BI. BI has been defined as “the integration of data from disparate source systems to optimize business usage and understanding through a user-friendly interface” [29] and as “the concepts and methods to improve business decision-making by using fact-based support systems” [30]. BI is a conceptual framework for decision support that combines a system architecture, databases and data warehouses, analytical tools, and applications [22]. BI is a mature concept that applies to many fields, despite the presence of the word “business.” While remaining a quite common term, BI is slowly being replaced by the term “analytics,” sometimes referring to the same thing. The major objective of BI is to enable interactive access to data and data manipulation, and to provide end users (e.g., managers, professionals) with the capacity to perform analysis for decision-making. BI analyzes historical and current data and transforms it into information and valuable insights (and knowledge), which lead to more informed and evidence-based decision-making [10]. BI has been very valuable in applications such as customer segmentation in marketing, fraud detection in finance, demand forecasting in manufacturing, and risk factor identification and disease prevention and control in healthcare. BI uses a set of metrics to measure past performance and report a set of indicators that can guide decision-making; it involves a set of methods such as querying structured datasets and reporting the findings, using dashboards, automated monitoring of critical situations, online analytical processing (OLAP) using cubes, slice and dice, and drilling. BI is essentially reactive and performed with much human involvement [13].

Analytics, by contrast, are more proactive and can be performed automatically by a set of algorithms (e.g., data mining and machine learning algorithms). Analytics access structured data (e.g., product code, quantity sold, and current inventory level) and unstructured data (e.g., free text describing the product or pictures of the product); they describe what happened in the past, such as how many units of a certain product were sold last year (descriptive analytics); predict what will (most likely) happen in the future, such as how many units we expect to sell next year (predictive analytics); or even prescribe what actions we should take to have certain outcomes in the future (prescriptive analytics), such as what quantity of the product we should order and when. Analytics analyze trends, recognize patterns, and possibly prescribe actions for better outcomes, and they use a multitude of methods, such as predictive modeling, data mining, text mining, statistical analysis, simulation, and optimization [13].

Some sources offer a distinction between BI and analytics using a spectrum of analytics capabilities. BI is traditional and mature and looks at the present and historical data to describe the current state of a business. It uses basic calculations to provide answers. This functionality is compatible with what is referred to as “descriptive analytics” and is at the lower end of the spectrum. Analytics, on the other hand, mines data to predict where the business is heading and prescribes actions to maximize beneficial outcomes. It uses mathematical models to determine attributes and offer predictions. These functionalities are referred to as “predictive” and “prescriptive analytics” and fall on the higher end of the analytics spectrum [13, 31]. Having clarified to a certain extent the difference between BI and analytics, we will refrain from using the term BI and rely instead on the analytics taxonomy: descriptive, diagnostic, predictive, and prescriptive analytics, which will be described in detail below.

1.5.2.1 Types of Analytics (Descriptive, Diagnostic, Predictive, Prescriptive)

Analytics are of four types: descriptive, diagnostic, predictive, and prescriptive. These types have increasing difficulty and complexity levels and provide increasing value to the users (Fig. 1.10).

Fig. 1.10
A graph of value versus difficulty has an upward line with elements of hindsight, insight, and foresight mentioned from the bottom to the top.

Types of analytics, the value they provide, and their level of difficulty (adapted from Rose Business Technologies [32])

1.5.2.1.1 Descriptive Analytics

Descriptive analytics query past or current data and report on what happened (or is happening). Descriptive analytics display indicators of past performance to assist in understanding successes and failures and provide evidence for decision-making; for instance, decisions related to the delivery of quality care and optimization of performance need to be based on evidence [13].

Using descriptive analytics, such as reports and data visualization tools (e.g., dashboards), end users can look retrospectively into past events; draw insight across different units, departments, and, ultimately, the entire organization; and collect evidence that is useful for an informed decision-making process and evidence-based actions. At the initial stages of analysis, descriptive analytics provide an understanding of patterns in data to find answers to the “What happened?” questions, for example, “Who are our best customers in terms of sales volume?” and “What are our least selling products?” Descriptive statistics, such as measures of central tendency (mean, median, and mode) and measures of dispersion (minimum, maximum, range, quartiles, and standard deviations), as well as distribution of variables (e.g., histograms), are used in descriptive analytics [13].

Descriptive analytics can quantify events and report on them and are a first step in turning data into actionable insights. Descriptive analytics, for example, can help with population health management tasks, such as identifying how many patients are living with diabetes, benchmarking outcomes against government expectations, or identifying areas for improvement in clinical quality measures or other aspects of care [33]. Descriptive analytics analyze past data to support decisions that help us achieve current and future goals. Statistical analysis is the main “tool” used to perform descriptive analytics; it includes descriptive statistics that provide simple summaries, including graphical analysis (e.g., frequency graphs), measures of central tendency (e.g., average/mean, median, mode), and measures of data variation or dispersion (e.g., standard deviation) [13].
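As a minimal sketch of the descriptive statistics named above, the following Python snippet uses only the standard library's statistics module. The wait-time values and the function name describe are assumptions made for illustration; they are not taken from any dataset used in this chapter.

```python
import statistics

# Hypothetical emergency-room wait times in minutes (illustrative values only).
wait_times = [12, 15, 15, 20, 22, 30, 45]


def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "quartiles": (q1, q2, q3),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = describe(wait_times)
for name, value in summary.items():
    print(name, value)
```

The measures of central tendency (mean, median, mode) summarize a typical value, while the range, quartiles, and standard deviation summarize how spread out the values are.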

Surveys, interviews, focus groups, web metrics data (e.g., number of hits on a webpage, number of visitors to a page), app metrics data (e.g., number of minutes spent using a feature), and health data stored in electronic records can be the source for all analytics, including descriptive analytics. Media companies and social media platforms (e.g., Facebook) use descriptive analytics to measure customer engagement; managers in hospitals can use descriptive analytics to understand the average wait times in the emergency room (ER) or the number of available beds. Descriptive analytics allow us to access information needed to make actionable decisions in the workplace. They allow decision-makers to explore trends in data (why do we have long lines in the ER?), to understand the “business” environment (who are the patients coming to the ER?), and to possibly infer an association (i.e., a correlation) between an outcome and some other variables (patients with chronic obstructive pulmonary disease tend to have more visits to the ER) [13].

Reports are the main output in descriptive analytics, where findings are presented in charts (e.g., a bar graph or pie chart), summary tables, and most interestingly, pivot tables. A pivot table is a table that summarizes data originating from another table and provides users with the functionality to sort, average, sum, and group data in a meaningful way [13] (Figs. 1.11, 1.12, and 1.13).

Fig. 1.11
A screenshot of an excel sheet with the columns for date, region, product, units sold, unit price, tax, and total. The sheet has entries in 12 rows.

Example of a data sheet in Microsoft Excel

Fig. 1.12
A snapshot of an excel sheet with a pivot table of column headers chair, light, table, and total. The table has 5 rows.

Example of a pivot table in Microsoft Excel that summarizes data from Fig. 1.11. On the right, we can notice the pivot table fields that allow users to control which summaries are being computed and displayed

Fig. 1.13
A column chart of the sum of total versus region for chair, light, and table products. The chair is the highest for regions B C, and Q C.

Example of a column chart in Microsoft Excel that visualizes data from Fig. 1.12
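To make the idea of a pivot table concrete, the sketch below computes the same kind of summary (sum of totals per region and product) using only the Python standard library. The sales rows and the function name pivot_sum are hypothetical; the values are not copied from Fig. 1.11.

```python
from collections import defaultdict

# Hypothetical sales rows (region, product, total); values are illustrative
# and are not copied from Fig. 1.11.
sales = [
    ("BC", "Chair", 200.0),
    ("BC", "Table", 150.0),
    ("QC", "Chair", 300.0),
    ("QC", "Light", 50.0),
    ("BC", "Chair", 100.0),
]


def pivot_sum(rows):
    """Sum totals for each (region, product) pair, like a pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for region, product, total in rows:
        table[region][product] += total
    return table


pivot = pivot_sum(sales)
products = sorted({product for _, product, _ in sales})

# Print one line per region with a column per product and a row total.
print("Region", *products, "Total")
for region in sorted(pivot):
    cells = [pivot[region].get(product, 0.0) for product in products]
    print(region, *cells, sum(cells))
```

In pandas, a comparable summary can be obtained with the DataFrame.pivot_table method, using the region as the index, the product as the columns, and sum as the aggregation function.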

1.5.2.1.2 Diagnostic Analytics

Descriptive analytics give us insight into the past but do not answer the question, “Why did it happen?” Diagnostic analytics aim to answer that type of question. They focus on enhancing processes by identifying why something happened and what the relationships are between the event and other variables that could constitute its causes [34]. They involve trend analysis, root cause analysis [35], cause and effect analysis [36, 37], and cluster analysis [38]. They are exploratory and provide users with interactive data visualization tools [39]. An organization can monitor its performance indicators through diagnostic analytics.

1.5.2.1.3 Predictive Analytics

Predictive analytics use past data to create a model that answers the question, “What will happen?”; they analyze trends in historical data and identify what is likely to happen in the future. Using predictive analytics, users can prepare plans and proactively implement corrective actions in advance of the occurrence of an event [39]. Some of the techniques used are what-if analysis, predictive modeling [40, 41, 42], machine learning algorithms [43, 44, 45], and neural network algorithms [46, 47]. Predictive analytics can be used for forecasting and resource planning. Predictive analytics share many basic concepts and techniques, such as algorithms, with machine learning, which is covered in detail later in this textbook.
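As one very simple illustration of building a model from past data, the sketch below fits a straight-line trend by ordinary least squares and extrapolates it one year ahead. This is only one of many possible predictive techniques; the yearly sales figures and the function name fit_line are assumptions made for the example.

```python
# Hypothetical yearly unit sales (illustrative values only).
years = [2018, 2019, 2020, 2021, 2022]
units = [100, 110, 120, 130, 140]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, units)

# Use the fitted trend to estimate sales for a year not in the data.
forecast_2023 = slope * 2023 + intercept
print(round(forecast_2023, 1))
```

Real data rarely follow a perfect line, so in practice the quality of such a forecast would need to be evaluated before it is used for planning.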

1.5.2.1.4 Prescriptive Analytics

While predictive analytics estimate what may happen in the future, prescriptive analytics go a step further by prescribing a certain action plan to address the problems revealed by diagnostic analytics and increase the likelihood of the occurrence of the desired outcome (which may not have been forecasted by predictive analytics) [39, 48, 49, 50]. Prescriptive analytics encompass simulating and evaluating several what-if scenarios and advising how to maximize the likelihood of the occurrence of desired outcomes. Some of the techniques used in prescriptive analytics are graph analysis, simulation [51, 52, 53], stochastic optimization [54, 55, 56], and nonlinear programming [57, 58, 59]. Prescriptive analytics are beneficial for advising a course of action to reach a desirable goal.

Prescriptive analytics go beyond prediction to prescribe an optimal course of action to reach a certain goal based on predictions of future events. A simple example would be an app that predicts the duration of a journey from a current location to certain destinations; if the app is equipped with prescriptive analytics, then it can prescribe the shortest path to reach the destination after comparing several alternative routes [13] (Fig. 1.14).

Fig. 1.14
A table titled evolution of analytics since the 1980s has 4 columns and 3 rows. Row headers are questions, process focus, and tools and techniques. The column headers are descriptive, diagnostic, predictive, and prescriptive analytics.

Analytics: questions, focus, and tools (adapted from Podolak [60])
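The journey example given earlier can be sketched in a few lines of Python: given predicted durations for several alternative routes, the prescriptive step recommends the route with the smallest predicted duration. The route names, the durations, and the function name prescribe_route are hypothetical.

```python
# Hypothetical predicted travel times in minutes for alternative routes;
# in a real app these would come from a predictive model.
predicted_minutes = {
    "highway": 42.0,
    "downtown": 35.5,
    "ring road": 38.0,
}


def prescribe_route(predictions):
    """Return the route with the smallest predicted duration."""
    return min(predictions, key=predictions.get)


best = prescribe_route(predicted_minutes)
print(best, predicted_minutes[best])
```

The prediction (how long each route will take) and the prescription (which route to take) are separate steps: the second depends on the first.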

1.6 Conclusion

Machine learning has proven itself to be a sustainable and useful technology in today’s world, and its use is increasing every single day. Everything from smart devices to sophisticated automated systems such as self-driving cars uses machine learning in order to operate. Our progressively complex world is better understood with machine learning because we are currently exposed to more information than ever before and it will only continue growing [1]. In this chapter, we introduced the concept of machine learning and its origins, applications, and building blocks. In the following chapters, we elaborate more on the concept and explore in depth its different algorithms.

1.7 Key Terms

  1. Machine learning
  2. Artificial intelligence
  3. Parallel computing
  4. Distributed computing
  5. Graphics processing units (GPU)
  6. Big data
  7. Transaction processing systems (TPS)
  8. Online transaction processing systems (OLTP)
  9. Online analytical processing (OLAP)
  10. Data variety
  11. Data velocity
  12. Data veracity
  13. Databases
  14. Data warehouses
  15. Data marts
  16. Data slicing
  17. Data dicing
  18. Analytics
  19. Descriptive analytics
  20. Diagnostic analytics
  21. Predictive analytics
  22. Prescriptive analytics

1.8 Test Your Understanding

  1. Write a definition of analytics.
  2. How are descriptive analytics different than diagnostic analytics?
  3. How are diagnostic analytics different than predictive analytics?
  4. What is data slicing?
  5. When do we use data dicing?
  6. Which system is focused on daily business processes: OLTP or OLAP?
  7. Enumerate five advantages of big data in healthcare.
  8. Choose a sector of society and specify five advantages of the use of AI and machine learning in that sector.
  9. Choose a sector of society and specify five disadvantages of the use of AI and machine learning in that sector.
  10. Can AI be a source of bias? How? Search for examples in the literature.

1.9 Read More

  1. Biswas, R. (2021). Outlining Big Data Analytics in Health Sector with Special Reference to Covid-19. Wirel Pers Commun, 1–12. https://doi.org/10.1007/s11277-021-09446-4

  2. Clark, C. R., Wilkins, C. H., Rodriguez, J. A., Preininger, A. M., Harris, J., DesAutels, S., Karunakaram, H., Rhee, K., Bates, D. W., & Dankwa-Mullan, I. (2021). Health Care Equity in the Use of Advanced Analytics and Artificial Intelligence Technologies in Primary Care. J Gen Intern Med, 36(10), 3188–3193. https://doi.org/10.1007/s11606-021-06846-x

  3. El Morr, C., & Ali-Hassan, H. (2019). Analytics in Healthcare: A Practical Introduction. Springer.

  4. IBM. (2022). What are healthcare analytics? IBM. Retrieved May 10, 2022, from https://www.ibm.com/topics/healthcare-analytics

  5. Khalid, S., Yang, C., Blacketer, C., Duarte-Salles, T., Fernández-Bertolín, S., Kim, C., Park, R. W., Park, J., Schuemie, M. J., Sena, A. G., Suchard, M. A., You, S. C., Rijnbeek, P. R., & Reps, J. M. (2021). A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data. Comput Methods Programs Biomed, 211, 106394. https://doi.org/10.1016/j.cmpb.2021.106394

  6. Lopez, L., Chen, K., Hart, L., & Johnson, A. K. (2021). Access and Analytics: What the Military Can Teach Us About Health Equity. Am J Public Health, 111(12), 2089–2090. https://doi.org/10.2105/ajph.2021.306535

  7. Moreno-Fergusson, M. E., Guerrero Rueda, W. J., Ortiz Basto, G. A., Arevalo Sandoval, I. A. L., & Sanchez-Herrera, B. (2021). Analytics and Lean Health Care to Address Nurse Care Management Challenges for Inpatients in Emerging Economies. J Nurs Scholarsh, 53(6), 803–814. https://doi.org/10.1111/jnu.12711

  8. Mukherjee, S., Frimpong Boamah, E., Ganguly, P., & Botchwey, N. (2021). A multilevel scenario based predictive analytics framework to model the community mental health and built environment nexus. Sci Rep, 11(1), 17548. https://doi.org/10.1038/s41598-021-96801-x

  9. Qiao, S., Li, X., Olatosi, B., & Young, S. D. (2021). Utilizing Big Data analytics and electronic health record data in HIV prevention, treatment, and care research: a literature review. AIDS Care, 1–21. https://doi.org/10.1080/09540121.2021.1948499

1.10 Lab

All instructions are written for Windows users; Mac users can follow largely the same instructions.

1.10.1 Introduction to R

R is an open-source programming language and software environment used for statistical analysis. This section gives step-by-step instructions to download and install R v4.1.0 and the RStudio IDE v1.4.1717.

R can be downloaded and installed from the following location: https://www.r-project.org/. Below are instructions for R’s download and installation.

  1. Go to the following mirror location and download R (The Comprehensive R Archive Network (sfu.ca)). For Windows users, click on “Download R for Windows”; for other operating systems, click on the corresponding link (Fig. 1.15).

Fig. 1.15
A screenshot of the R installation has 3 sections under the comprehensive R archive network for download, source code, and questions.

R Installation for Windows users and Mac users

  2. While we will demonstrate the installation for Windows, the installation for macOS is similar. For Windows, click on the “install R for the first time” link as shown in Fig. 1.16:

Fig. 1.16
A snapshot of the R installation has subdirectories details under R for windows. The left margin includes various options under C R A N, about R, software, and documentation.

First-time R installation

  3. Click on the “Download R 4.1.0 for Windows” link (Fig. 1.17):

Fig. 1.17
A screenshot has the following links. Download R 4 dot 1 dot 0 for windows, installation and other instructions, and new features.

R v4.1.0 for Windows installation

  4. R-4.1.0-win.exe will be downloaded into the Downloads folder. Click on the R-4.1.0-win.exe file to start the installation and continue by clicking the “Next” button to complete the installation (Fig. 1.18).

Fig. 1.18
A snapshot of a window with the heading setup, R for windows 4 dot 1 dot 0. The screen has a set of information. The next button at the bottom is selected.

R setup for Windows user

  5. If a shortcut for R was not automatically created on your desktop, you can always create one. On Windows, go to the following location C:\Program Files\R-4.1.0 (on a Mac, go to the Applications folder) (Fig. 1.19), right-click on R.exe, and choose “Create Shortcut” (or “Make Alias” for Mac users). A new shortcut will be created; move it to your desktop. This will enable you to launch R easily from the desktop.

Fig. 1.19
A screenshot of a window depicts the folder locations. The file named R in windows C is selected.

R installation local directory

  6. Double-click on the “R” icon and launch the R software; a new command prompt appears (Fig. 1.20):

Fig. 1.20
A set of commands for starting the R application.

Launching R application

1.10.2 Introduction to RStudio

RStudio is an IDE for the R language; this lab uses v1.4.1717. It includes a workspace for coding, debugging, plotting, etc. It can be installed from the “Download the RStudio IDE” page of the RStudio website. Below are instructions for RStudio’s download and installation.

1.10.2.1 RStudio Download and Installation

  1. Download RStudio: Click on “Download RStudio for Windows.” RStudio is available for Mac users as well on the same webpage (Fig. 1.21).

Fig. 1.21
A snapshot has two options under R studio desktop 1 dot 4 dot 1717 to install R and download R studio desktop.

Installing RStudio Desktop 1.4.1717

  2. Double-click on “RStudio-1.4.1717.exe” and click on “Setup RStudio v1.4.1717”; the installation will start (Fig. 1.22).

Fig. 1.22
A screenshot titled R studio setup has a bar with a percentage demonstrating installation.

RStudio setup

  3. If a shortcut for RStudio was not automatically created on your desktop, you can always create one. On Windows, go to the following location C:\Program Files\RStudio\bin (go to the Applications folder if you are using macOS) (Fig. 1.23), right-click on the rstudio file, and choose “Create Shortcut”; then move the new shortcut to your desktop.

Fig. 1.23
A snapshot of a window depicts a list of files in windows C. The rstudio file at the bottom is selected.

Creating a shortcut for the RStudio application

  4. After launching the RStudio application, its IDE will appear as shown in Fig. 1.24:

Fig. 1.24
A screenshot of the RStudio window has a set of commands under the console tab on the left and an arrow point to the package on the right.

Launching RStudio IDE

1.10.2.2 Install a Package

Packages are libraries that allow us to do specific tasks in R (e.g., load a file, display a result, do an analysis). A package can be installed through the RStudio interface using the following instructions:

  1. Click on the Packages tab and click on the Install button (Fig. 1.25):

Fig. 1.25
A screenshot of the RStudio window has the environment tab and console tab selected. An arrow points to the packages that consist of the system library. Base and datasets, graphics, and grDevices are enabled.

Navigating RStudio Packages tab

  2. Install the readr package by typing “readr” and clicking on the Install button (Fig. 1.26).

Fig. 1.26
A screenshot of the install packages dialog box with entry fields. The install and cancel buttons are at the bottom.

Installing readr package for loading files

1.10.2.3 Activate Package

  1. To activate the readr library used to read csv and txt files, type “library(readr)” in the console (Fig. 1.27):

Fig. 1.27
A screenshot of RStudio has a set of commands under the console tab on the left. The environment tab on the top right is empty. At the bottom right, four components under the library system of packages are selected.

Activating readr package for use

1.10.2.4 Use readr to Load Data

Datasets in different file formats, such as txt, csv, and xlsx, can be imported into RStudio (Fig. 1.28).

  1. Download diabetes.csv: Go to the following link https://www.kaggle.com/uciml/pima-indians-diabetes-database/version/1 (or go to kaggle.com and search for “Pima Indians Diabetes Database”).

  2. Next, you will load the diabetes.csv file so that it can be plotted as a histogram. Under the main menu, click on File > Import Dataset > From Text (readr). If you are prompted to download a library, accept the download. Choose the input file from the folder where you have saved it and click Import.

Fig. 1.28
A snapshot of a table titled diabetes has 8 columns and entries in 15 rows.

Importing dataset using the readr package

1.10.2.5 Run a Function

The hist function can be used to visualize data in a histogram. Here it presents the blood pressure values from the dataset imported earlier (Fig. 1.29). Type the following:

hist(diabetes$BloodPressure, main = "Blood Pressure Histogram", xlab = "Blood Pressure", ylab = "Count", las = 1)

Fig. 1.29
An RStudio window has 4 sections, a table for diabetes, a set of commands, the environment tab, and a histogram.

Visualize data in a histogram

1.10.2.6 Save Status

To save your progress, use the Ctrl + S shortcut or click the blue Save button in the main menu (Fig. 1.30):

Fig. 1.30
A screenshot headed rstudio has a table for diabetes with 9 columns and 9 rows. An arrow at the top points to the save icon.

Save progress status in RStudio

1.10.3 Introduction to Python and Jupyter Notebook IDE

Python is an interpreted programming language. It is characterized by human readability; it is an important programming language in machine learning and artificial intelligence due to its flexible libraries and ease of use.

In this book, Jupyter Notebook IDE is used for Python labs and examples.

1.10.3.1 Python Download and Installation

The Python programming language can be installed from the following location: Python.org. Below are instructions for the download and installation of Python v3.9.6.

  1. Download the Windows installer to your computer (Fig. 1.31).

  2. Install Python v3.9.6 (64-bit) using the “Customize installation” option; please follow the screenshots carefully and check the checkboxes as indicated (Figs. 1.32, 1.33, and 1.34).

  3. To validate the installation, open a terminal (open “cmd” in Windows; open a terminal in macOS) and type “python --version”; the installed version number should be displayed.

Fig. 1.31
A screenshot with the files section of details in a table with 6 rows and 7 columns. The windows installer of 64-bit at the bottom is indicated by an arrow.

Install Python v3.9.6 Windows installer

Fig. 1.32
A screenshot of a window titled python 3 dot 9 dot 6 setup has options for installation and customized installation.

Setting up Python 3.9.6

Fig. 1.33
A snapshot of a window headed python 3 dot 9 dot 6 setup, has 4 optional features on the screen. The next button at the bottom is selected.

Configuring Python 3.9.6 features

Fig. 1.34
A screenshot of a window titled python 3 dot 9 dot 6 setup has seven advanced options, out of that 3 are selected. The install, back, and cancel buttons are at the bottom.

Configuring more Python v3.9.6 installation options

1.10.3.2 Jupyter Download and Installation

  1. To install the “Jupyter Notebook” IDE, open a terminal and type “pip install jupyter”; if the command does not work, type “pip3 install jupyter” (Fig. 1.35).

  2. We need a useful Python package called “pandas” to manipulate data. Before we can use pandas, we need to install it with the pip install command.

    On Windows, open a terminal (cmd) as an administrator; you can do so by right-clicking on cmd and choosing Run As Administrator. In the terminal, type “pip install pandas” (Fig. 1.36).

    On macOS, open the terminal and type “sudo pip install pandas”.

    macOS will ask you for your password; this assumes that your account has administrator privileges. Enter the password and the package will be installed (Fig. 1.37).

  3. Using the same strategy, install the matplotlib and openpyxl libraries (Fig. 1.38):

Fig. 1.35
A snapshot of the command prompter window with a set of commands for the installation of Jupyter.

Installing Jupyter Notebook IDE

Fig. 1.36
A screenshot of the administrator command prompt window with a set of commands for the installation of panda packages.

Installing pandas package to work with data frames for Windows users

Fig. 1.37
A snapshot of the command prompt window with a warning message and set of commands for the installation of panda packages.

Installing pandas package to work with data frames for Mac users

Fig. 1.38
A snapshot of the command prompt window exhibits a warning message and set of commands for the installation of the matplotlib package.

Installing matplotlib and openpyxl packages

pip install matplotlib
pip install openpyxl
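Once the packages are installed, a quick way to confirm that Python can find them is sketched below. It uses only the standard library; the list of names and the function name check_packages are assumptions made for this example, and the names listed are the import names of the three packages installed in this section.

```python
import importlib.util

# Import names of the packages installed in this section.
REQUIRED = ["pandas", "matplotlib", "openpyxl"]


def check_packages(names):
    """Map each package name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, found in status.items():
    print(name, "OK" if found else "missing: run pip install " + name)
```

If a package is reported as missing, repeating the corresponding pip install command in the terminal is the usual remedy.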

On macOS, launch Jupyter Notebook by typing the following in the terminal: “jupyter notebook”. On Windows, type “python -m jupyter notebook”. Then, click the “New” button and choose “Python 3” to create a notebook (Figs. 1.39 and 1.40).

Fig. 1.39
A screenshot of the command prompt window depicts a set of commands for installing the Jupyter notebook I D E.

Launching Jupyter Notebook IDE

Fig. 1.40
A Jupyter webpage has 11 folders under the files tab with the last modified date. The logout and quit options are at the top.

Notebook webpage

1.10.3.3 Load Data and Plot It Visually

Python code can be added to a notebook in the Jupyter Notebook IDE, and each cell can be executed using the “Run” button. All cells can also be run at once: under the Cell menu, click “Run All.” This is shown below in Fig. 1.41.

Fig. 1.41
A snapshot of a window titled Jupyter depicts a list of options under the cell tab. The cell tab and the option, run all, under the cell tab are annotated.

Load data in Jupyter Notebook and execute code

The code below reads the diabetes.xlsx file and plots its columns, including the blood pressure measurements, in a histogram. Open diabetes.csv and save it as diabetes.xlsx, then follow the instructions below (Fig. 1.42):

Type: import pandas as pd
Type: import matplotlib.pyplot as plt
Type: df = pd.read_excel("diabetes.xlsx")
Type: dfplt = df.plot(kind="hist")

Fig. 1.42
A screenshot of a window headed Jupyter with some codes at the top and a stacked column chart at the bottom.

Visualizing blood pressure measurements in a graph
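To plot only the blood pressure column rather than every column, one option is sketched below. It assumes the column is named BloodPressure, as in the R example earlier in this lab; the other column names and all the values in the sample are made up for illustration. pandas can also read the CSV file directly with read_csv, which avoids the conversion to xlsx.

```python
import io

import pandas as pd

# A few made-up rows standing in for diabetes.csv (illustrative values only).
# With the real file, use instead: df = pd.read_csv("diabetes.csv")
sample_csv = io.StringIO(
    "Glucose,BloodPressure,Age\n"
    "120,70,40\n"
    "95,64,28\n"
    "150,82,55\n"
    "110,76,33\n"
)
df = pd.read_csv(sample_csv)

# Select only the blood pressure column instead of plotting every column.
blood_pressure = df["BloodPressure"]


def plot_histogram(series):
    """Draw a histogram of one column; requires matplotlib."""
    import matplotlib.pyplot as plt

    ax = series.plot(kind="hist", title="Blood Pressure Histogram")
    ax.set_xlabel("Blood Pressure")
    ax.set_ylabel("Count")
    plt.show()


# In a notebook cell, call: plot_histogram(blood_pressure)
```

Selecting a single column before plotting gives one histogram with labeled axes, comparable to the R hist example.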

1.10.3.4 Save the Execution

All the work done so far can be saved in a file with the extension “.ipynb”. This file can be loaded later in the Jupyter Notebook IDE so that you can make updates or continue working from where you left off. Click on “File,” then choose “Save,” or click on the save icon.

1.10.3.5 Load a Saved Execution

Go to your workspace folder in the command line and launch Jupyter Notebook. Then, double-click on the file that you need to continue working on.

1.10.3.6 Upload a Jupyter Notebook File

You can also upload files into Jupyter Notebook through the application interface. After launching the program, click on the Upload button and select the files that you want to upload, as shown in Fig. 1.43.

Fig. 1.43
A snapshot of a window depicts an open tab with a file under the Jupyter projects in windows C is annotated.

Upload Jupyter Notebook files into the application

1.10.4 Do It Yourself

Try the following exercises on your own:

  1. Weka is a machine learning software package. Install Weka from the following link: https://waikato.github.io/weka-wiki/downloading_weka/

  2. JupyterLab is the next-generation Jupyter Notebook interface. Install JupyterLab from https://jupyter.org/. We suggest that you use JupyterLab instead of Jupyter Notebook.

  3. Download any dataset from Kaggle (kaggle.com) and plot the data visually in R.

  4. Download any dataset from Kaggle (kaggle.com) and plot the data visually in Jupyter Notebook or JupyterLab.

  5. Try Colaboratory, also known as Colab, the online Python development environment provided by Google: https://colab.research.google.com/. We strongly advise you to use either Colab or JupyterLab for your projects.