This chapter presents some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 22 provides additional optimization details. We show format conversion and working with XML, SQL, JSON, CSV, SAS and other data objects. In addition, we illustrate SQL server queries, describe protocols for managing, classifying and predicting outcomes from data streams, and demonstrate strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing.

The Internet of Things (IoT) leads to a paradigm shift of scientific inference – from static data interrogated in a batch or distributed environment to on-demand service-based Cloud computing. Here, we will demonstrate how to work with specialized data, data-streams, and SQL databases, as well as develop and assess on-the-fly data modeling, classification, prediction and forecasting methods. Important examples to keep in mind throughout this chapter include high-frequency data delivered in real time in hospital ICUs (e.g., microsecond electroencephalography signals, EEGs), dynamically changing stock market data (e.g., the Dow Jones Industrial Average Index, DJI), and weather patterns.

We will present (1) format conversion of XML, SQL, JSON, CSV, SAS and other data objects, (2) visualization of bioinformatics and network data, (3) protocols for managing, classifying and predicting outcomes from data streams, (4) strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing, and (5) processing of very large datasets.

16.1 Working with Specialized Data and Databases

Unlike the case studies we saw in the previous chapters, some real-world data may not always be nicely formatted, e.g., as CSV files. We must collect, arrange, wrangle, and harmonize scattered information to generate computable data objects that can be further processed by various techniques. Data wrangling and preprocessing may take over 80% of the time researchers spend interrogating complex multi-source data archives. The following procedures will enhance your skills in collecting and handling heterogeneous real-world data. Multiple examples of handling long-and-wide data, messy and tidy data, and data-cleaning strategies can be found in the JSS Tidy Data article by Hadley Wickham.

16.1.1 Data Format Conversion

The R package rio imports and exports various types of file formats, e.g., tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta), SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and SAS (.sas7bdat and .xpt).

rio provides three important functions import(), export() and convert(). They are intuitive, easy to understand, and efficient to execute. Take Stata (.dta) files as an example. First, we can download 02_Nof1_Data.dta from our datasets folder.

figure a
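
A minimal sketch of this import step, assuming 02_Nof1_Data.dta has already been saved to the current working directory (the download call shown in the figure is omitted):

# install.packages("rio")    # if rio is not yet installed
library(rio)
nof1 <- import("02_Nof1_Data.dta")   # assumes the file is in the working directory
str(nof1)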

The data are automatically stored as a data frame. Note that rio sets stringsAsFactors = FALSE by default.

rio can help us export files into any other format we choose. To do this we have to use the export() function.

figure b
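
A hedged sketch of the export step, reusing the nof1 data frame imported above:

export(nof1, "02_Nof1_Data.xlsx")   # writes an Excel workbook to the working directory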

This line of code exports the Nof1 data in xlsx format to the R working directory. Mac users may have a problem exporting *.xlsx files using rio because of the lack of a zip tool, but they can still export other formats such as ".csv". An alternative strategy for saving an xlsx file is to use the xlsx package with the default row.names=TRUE.

rio also provides a one step process to convert and save data into alternative formats. The following simple code allows us to convert and save the 02_Nof1_Data.dta file we just downloaded into a CSV file.

figure c
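
A one-line sketch of the conversion, assuming the .dta file sits in the working directory:

convert("02_Nof1_Data.dta", "02_Nof1_Data.csv")   # import + export in a single call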

You can see a new CSV file popup in the current working directory. Similar transformations are available for other data formats and types.

16.1.2 Querying Data in SQL Databases

Let’s use as an example the CDC Behavioral Risk Factor Surveillance System (BRFSS) Data, 2013-2015. This file for the combined landline and cell phone data set was exported from SAS V9.3 in the XPT transport format. This file contains 330 variables and can be imported into SPSS or STATA. Please note: some of the variable labels get truncated in the process of converting to the XPT format.

Be careful – this compressed (ZIP) file is over 315MB in size!

figure d
figure e

Let’s try to use logistic regression to find out if self-reported race/ethnicity predicts the binary outcome of having a health care plan.

figure f
figure g
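
A minimal sketch of such a model fit; brfss denotes the imported BRFSS data frame, and the variable names (hlthpln, race) and their codings are hypothetical stand-ins for the actual BRFSS column names:

brfss$has_plan <- as.numeric(brfss$hlthpln == 1)   # assumed coding: 1 = has a health care plan
fit <- glm(has_plan ~ as.factor(race), data = brfss, family = binomial(link = "logit"))
summary(fit)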

Next, we’ll examine the odds (or rather, the log odds ratio, LOR) of having a health care plan (HCP) by race (R). The LORs are calculated over the two array dimensions, separately for each race level (the presence of a health care plan (HCP) is binary, whereas race (R) has 9 levels, R1, R2, …, R9). For example, the odds ratio of having a HCP for R1 : R2 is:

$$ OR\left(R1:R2\right)=\frac{\frac{P\left( HCP\mid R1\right)}{1-P\left( HCP\mid R1\right)}}{\frac{P\left( HCP\mid R2\right)}{1-P\left( HCP\mid R2\right)}}. $$
figure h
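
A small sketch computing this odds ratio (and its LOR) directly from a contingency table, using the hypothetical has_plan and race variables introduced above:

tab  <- table(brfss$race, brfss$has_plan)   # rows = race levels, columns = 0/1
p    <- tab[, "1"] / rowSums(tab)           # P(HCP | race level)
odds <- p / (1 - p)
odds[1] / odds[2]                           # OR(R1:R2)
log(odds[1] / odds[2])                      # the corresponding LOR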

Now, let’s see an example of querying a database containing a structured relational collection of data records. A query is a machine instruction (typically represented as text) sent by a user to a remote database requesting a specific database operation (e.g., search or summary). One database communication protocol relies on SQL (Structured Query Language). MySQL is an instance of a database management system that supports SQL communication and is utilized by many web applications, e.g., YouTube, Flickr, Wikipedia, and biological databases like GO and Ensembl. Below is an example of an SQL query using the package RMySQL. An alternative way to interface an SQL database is using the package RODBC.

figure i
figure j
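
A hedged sketch of a remote MySQL query; the host, user, database and table names below refer to the public UCSC Genome server and are assumptions that may change over time:

library(DBI); library(RMySQL)
ucscDB <- dbConnect(MySQL(), user = "genome",
                    host = "genome-mysql.soe.ucsc.edu", dbname = "hg38")
head(dbListTables(ucscDB))                              # list some available tables
dbGetQuery(ucscDB, "SELECT COUNT(*) FROM knownGene")    # a simple SQL query (table name assumed)
dbDisconnect(ucscDB)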

Completing the above SQL commands requires access to the remote UCSC SQL Genome server and user-specific credentials. You can see this functional example on the DSPA website. Below is another example that all readers can run, as it relies only on local services.

figure k
figure l
figure m
figure n
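
For readers without a local MySQL installation, here is a self-contained sketch using RSQLite and an in-memory database (an assumption; the local example in the figures above may use a different SQL engine):

library(DBI); library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")      # transient, in-memory database
dbWriteTable(con, "mtcars", mtcars)         # register a built-in data frame as a table
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)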

16.1.3 Real Random Number Generation

We are already familiar with (pseudo) random number generation (e.g., rnorm(100, 10, 4) or runif(100, 10, 20)), which algorithmically generates values following specified distributions. There are also web services, e.g., random.org, that can provide true random numbers based on atmospheric noise, rather than using a pseudo-random number generation protocol. Below is one example of generating a total of 300 numbers arranged in 3 columns, each of 100 rows of random integers (in decimal format) between 100 and 200.

figure o
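
A hedged sketch using the plain-text HTTP interface of the random.org integer generator; the query parameters follow the service's documented API and requests are subject to daily quotas:

site <- paste0("https://www.random.org/integers/",
               "?num=300&min=100&max=200&col=3&base=10&format=plain&rnd=new")
trueRandom <- read.table(site)    # 100 rows x 3 columns of true random integers
head(trueRandom)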

16.1.4 Downloading the Complete Text of Web Pages

The RCurl package provides powerful tools for extracting and scraping information from websites. Let’s install it and extract information from a SOCR website.

figure p
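
A minimal sketch of such an extraction; the SOCR Data wiki URL below is an assumption:

# install.packages("RCurl")
library(RCurl)
web <- getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
str(web, nchar.max = 200)    # one long character string of raw HTML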

The web object looks incomprehensible. This is because most websites are wrapped in XML/HTML hypertext or include JSON formatted metadata. RCurl deals with special HTML tags and website metadata.

To work with the web page content only, the httr package may be a better choice than RCurl: it returns a list that is much easier to interpret.

figure q
figure r
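
A short sketch of the httr alternative for the same page (URL assumed as above):

library(httr)
resp <- GET("http://wiki.socr.umich.edu/index.php/SOCR_Data")
status_code(resp)                      # 200 indicates success
headers(resp)$`content-type`           # response metadata
web_txt <- content(resp, as = "text")  # the page body as a character string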

16.1.5 Reading and Writing XML with the XML Package

A combination of the RCurl and the XML packages could help us extract only the plain text in our desired webpages. This would be very helpful to get information from heavy text-based websites.

figure s

Here we extracted all plain text between the starting and ending paragraph HTML tags, <p> and </p>.
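
A hedged sketch of this RCurl + XML combination (same assumed SOCR URL as above):

library(RCurl); library(XML)
html <- getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
doc  <- htmlParse(html, asText = TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)   # text of every <p>...</p> node
head(plain.text)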

More information about extracting text from XML/HTML to text via XPath is available online.

16.1.6 Web-Page Data Scraping

The process of extracting data from complete web pages and storing it in a structured data format is called scraping. However, before scraping data from a website, we need to understand the underlying HTML structure of that specific website. Also, we have to check the site’s terms of use to make sure that scraping is allowed.

The R package rvest is a very good place to start “harvesting” data from websites.

To start with, we use read_html() to store the SOCR data website as an xml_node object.

figure t

From the summary structure of SOCR, we can see that there are two important hypertext section markups, <head> and <body>. Also, notice that the SOCR data website uses <title> and </title> tags to mark the page title within the <head> section. Let’s use html_node() to extract the title information based on this knowledge.

figure u

Here we used the %>% (pipe) operator to connect two functions. The above line of code creates a chain of functions operating on the SOCR object. The first function in the chain, html_node(), extracts the title node from the <head> section. Then, html_text() extracts the plain text from the HTML markup. More on R piping can be found in the magrittr package documentation.
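
A compact sketch of this pipeline (the SOCR URL is an assumption):

library(rvest)
SOCR <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data")
SOCR %>% html_node("head title") %>% html_text()   # extract and de-markup the page title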

Another function, rvest::html_nodes(), can be very helpful in scraping. Similar to html_node(), html_nodes() can help us extract multiple nodes from an xml_node object. Assume that we want to obtain the meta elements (usually the page description, keywords, author of the document, last modified date, and other metadata) from the SOCR data website. We apply html_nodes() to the SOCR object to extract the hypertext data, e.g., lines starting with <meta> in the <head> section of the HTML page source. Optionally, html_attrs() can then be used to extract the attributes, text and tag names from the HTML.

figure v
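
A one-line sketch of the corresponding call, continuing with the SOCR object from above:

SOCR %>% html_nodes("head meta") %>% html_attrs()   # attributes of every <meta> element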

16.1.7 Parsing JSON from Web APIs

Application Programming Interfaces (APIs) allow web-accessible functions to communicate with each other. Today most APIs exchange data in JSON (JavaScript Object Notation) format.

JSON represents a plain text format used for web applications, data structures or objects. Online JSON objects could be retrieved by packages like RCurl and httr. Let’s see a JSON formatted dataset first. We can use 02_Nof1_Data.json in the class file as an example.

figure w
figure x

We can see that JSON objects are very simple. The data structure is organized hierarchically using square brackets, and each piece of information is formatted as a {key: value} pair.

The package jsonlite is a very useful tool for importing online JSON formatted datasets directly into a data frame. Its syntax is very straightforward.

figure y
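
A minimal sketch, assuming 02_Nof1_Data.json has been saved locally (a URL to the file would work the same way):

library(jsonlite)
nof1_json <- fromJSON("02_Nof1_Data.json")
class(nof1_json)    # jsonlite flattens the records into a data.frame
head(nof1_json)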

16.1.8 Reading and Writing Microsoft Excel Spreadsheets Using XLSX

We can convert an xlsx dataset into CSV and use read.csv() to load this kind of dataset. However, R provides an alternative read.xlsx() function in the xlsx package to simplify this process. Take our 02_Nof1_Data.xls data in the class files as an example. We need to download the file first.

figure z
figure aa
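
A minimal sketch, assuming the workbook has been downloaded to the working directory:

# install.packages("xlsx")
library(xlsx)
nof1_xlsx <- read.xlsx("02_Nof1_Data.xls", 1)   # 1 = the first sheet
str(nof1_xlsx)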

The last argument, 1, stands for the first Excel sheet, as any Excel file may include multiple sheets. Also, we can download the xls or xlsx file into our R working directory so that it is easier to specify the file path.

Sometimes more complex protocols may be necessary to ingest data from XLSX documents, for instance, when the XLSX document is large, includes many sheets, and is only accessible via the HTTP protocol from a web server. Below is an example of downloading the second sheet, ABIDE_Aggregated_Data, from the multi-sheet Autism/ABIDE XLSX dataset:

figure ab
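
A hedged sketch of this protocol; the URL below is only a placeholder for the actual ABIDE XLSX location, and the second sheet is assumed to be ABIDE_Aggregated_Data:

tmp <- tempfile(fileext = ".xlsx")
download.file("https://example.org/ABIDE_dataset.xlsx", destfile = tmp, mode = "wb")  # placeholder URL
abide <- xlsx::read.xlsx(tmp, 2)   # read the second sheet (ABIDE_Aggregated_Data)
dim(abide)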

16.2 Working with Domain-Specific Data

Powerful machine-learning methods have already been applied in many applications. Some of these techniques are very specialized and some applications require unique approaches to address the corresponding challenges.

16.2.1 Working with Bioinformatics Data

Genetic data are stored in widely varying formats and usually have more feature variables than observations. They could have 1,000 columns and only 200 rows. One of the commonly used pre-processing steps for such datasets is variable selection. We will talk about this in Chap. 17.

The Bioconductor project created powerful R functionality (packages and tools) for analyzing genomic data, see Bioconductor for more detailed information.

16.2.2 Visualizing Network Data

Social network data and graph datasets describe the relations between nodes (vertices) using connections (links or edges) joining the node objects. If we have N objects, there can be up to N(N − 1) directed links establishing paired associations between the nodes. Let’s use an example with N = 4 to demonstrate a simple graph potentially modeling the node linkage, Table 16.1.

Table 16.1 Schematic matrix representation of network connectivity

If we change the a → b entries to indicator variables (0 or 1) capturing whether we have an edge connecting a pair of nodes, then we get the graph adjacency matrix.

Edge lists provide an alternative way to represent network connections. Every line in the list contains a connection between two nodes (objects) (Table 16.2).

Table 16.2 List-based representation of network connectivity

The edge list in Table 16.2 lists three network connections: object 1 is linked to object 2; object 1 is linked to object 3; and object 2 is linked to object 3. Note that edge lists can represent both directed and undirected networks or graphs.

We can imagine that if N is very large, e.g., social networks, the data representation and analysis may be resource intense (memory or computation). In R, we have multiple packages that can deal with social network data. One user-friendly example is provided using the igraph package. First, let’s build a toy example and visualize it using this package (Fig. 16.1).

figure ac
Fig. 16.1
figure 1

A simple example of a social network as a graph object

Here c(1, 2, 1, 3, 2, 3, 3, 4) is an edge list with 4 rows and n=10 indicates that we have 10 nodes (objects) in total. The small arrows in the graph show the directed network connections. We may notice that nodes 5–10 are scattered (isolated) in the graph. This is because they are not included in the edge list, so there are no network connections between them and the rest of the network.
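
A minimal sketch consistent with this description:

library(igraph)
g <- graph(c(1, 2, 1, 3, 2, 3, 3, 4), n = 10)   # 4 directed edges among 10 nodes
plot(g)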

Now let’s examine the co-appearance network of Facebook circles. The data contain anonymized circles (friends lists) from Facebook collected from survey participants using a Facebook app. The dataset includes the 88,234 edges connecting pairs of the 4,039 nodes (users) in the member social networks.

The values on the connections represent the number of links/edges within a circle. We have a huge edge list made of scrambled Facebook user IDs. Let’s load this dataset into R first. The data are stored in a text file. Unlike CSV files, text files in table format need to be imported using read.table(). We are using the header=F option to let R know that the file has no header and contains only tab-separated node pairs (indicating the social connections, edges, between Facebook users).

figure ad

Now the data is stored in a data frame. To make this dataset ready for igraph processing and visualization, we need to convert soc.net.data into a matrix object.

figure ae

By using ncol=2, we made a matrix with two columns. The data is now ready and we can apply graph.edgelist().

figure af
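
A hedged sketch of these three steps; the file name is a placeholder for the downloaded Facebook edge list, and the ID shift is only needed if the raw user IDs are 0-based:

soc.net.data <- read.table("facebook_combined.txt", header = F)   # placeholder file name
soc.net.data.mat <- as.matrix(soc.net.data, ncol = 2)
if (min(soc.net.data.mat) == 0) soc.net.data.mat <- soc.net.data.mat + 1   # igraph needs positive IDs
graph_fb <- graph.edgelist(soc.net.data.mat, directed = FALSE)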

Before we display the social network graph we may want to examine our model first.

figure ag

This is an extremely brief yet informative summary. The first line, U--- 4038 87887, includes up to four letters and two numbers. The first letter is either U or D, indicating undirected or directed edges. A second letter N would mean that the vertex set has a “name” attribute. A third letter W indicates a weighted graph; since we did not add weights in our analysis, the third letter is empty (“-“). A fourth character, B, is an indicator for bipartite graphs, whose vertices can be divided into two disjoint sets such that each edge connects a vertex from one set to a vertex in the other set. The two numbers following the four letters represent the number of nodes and the number of edges, respectively. Now let’s render the graph (Fig. 16.2).

figure ah
Fig. 16.2
figure 2

Social network connectivity of Facebook users

This graph is very complicated. We can still see that some nodes are connected to many more neighbors than others. To obtain such information we can use the degree() function, which lists the number of edges for each node.

figure ai

Skimming the table, we find that the 107th user has as many as 1,044 connections, which makes this user a highly connected hub. This node likely has higher social relevance.
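
A short sketch of this step, using the graph_fb object from the sketch above:

deg_fb <- degree(graph_fb)
summary(deg_fb)
which.max(deg_fb)    # index of the most highly connected node (hub)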

Some nodes may be more important than others because they serve as bridges linking different parts of the network. To compare their importance, we can use the betweenness centrality measure, which quantifies how often a node lies on the shortest paths between other nodes; a high value for a specific node indicates influence. The betweenness() function can help us calculate this measure.

figure aj

Again, the 107th node has the highest betweenness centrality (3.556221e+06).
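
A corresponding sketch for betweenness centrality:

btw_fb <- betweenness(graph_fb)
which.max(btw_fb)    # node with the highest betweenness centrality
max(btw_fb)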

We can try another example using SOCR hierarchical data, which is also available for dynamic exploration as a tree graph. Let’s read its JSON data source using the jsonlite package (Fig. 16.3).

Fig. 16.3
figure 3

Live demo: a dynamic graph representation of the SOCR resources

figure ak

This generates a list object representing the hierarchical structure of the network. Note that this is quite different from an edge list. There is one root node, its sub-nodes are called children nodes, and the terminal nodes are called leaf nodes. Instead of presenting the relationships between nodes in pairs, this hierarchical structure captures the level of each node. To draw the social network graph, we need to convert it to a Node object. We can utilize the as.Node() function in the data.tree package to do so.

figure al
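
A hedged sketch of this conversion; the JSON URL is a placeholder for the actual SOCR hierarchical data source:

library(jsonlite); library(data.tree)
tree.json <- fromJSON("https://example.org/SOCR_Resources.json",   # placeholder URL
                      simplifyDataFrame = FALSE)
tree.Node <- as.Node(tree.json, mode = "explicit")
print(tree.Node, limit = 20)    # inspect the first few levels of the hierarchy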

Here we use the mode="explicit" option to allow “children” nodes to have their own “children” nodes. Now, the tree.json object has been separated into four different node structures – "About SOCR", "SOCR Resources", "Get Started", and "SOCR Wiki". Let’s plot the first one using the igraph package (Fig. 16.4).

figure am
Fig. 16.4
figure 4

The SOCR resourceome network plotted as a static R graph

In this graph, "About SOCR", which is located at the center, represents the root node of the tree graph.

16.3 Data Streaming

The proliferation of Cloud services and the emergence of modern technology in all aspects of human experience lead to a tsunami of data, much of which is streamed in real time. The interrogation of such voluminous data is an increasingly important area of research. Data streams are ordered, often unbounded sequences of data points created continuously by a data generator. All of the data mining, interrogation and forecasting methods we discuss here are also applicable to data streams.

16.3.1 Definition

Mathematically, a data stream is an ordered sequence of data points

$$ Y=\left\{{y}_1,{y}_2,{y}_3,\cdots, {y}_t,\cdots \right\}, $$

where the (time) index t reflects the order of the observation/record; the data points may be single numbers, vectors in multidimensional space, or structured objects, e.g., the structured Ann Arbor Weather (JSON) data and its corresponding tabular form. Some data are streamed because they are too large to be downloaded all at once, and some because they are continually generated and serviced. This presents the potential problem of dealing with data streams that may be unbounded.

Notes:

  • Data sources: Real or synthetic stream data can be used. Random simulation streams may be created by rstream. Real stream data may be piped from financial data providers, the WHO, World Bank, NCAR and other sources.

  • Inference Techniques: Many of the data interrogation techniques we have seen can be employed for dynamic stream data, e.g., factas for PCA, and rEMM and birch for clustering. Clustering and classification methods capable of processing data streams have been developed, e.g., Very Fast Decision Trees (VFDT), time window-based Online Information Network (OLIN), On-demand Classification, and the APRIORI streaming algorithm.

  • Cloud distributed computing: Hadoop/HadoopStreaming, Spark, and Storm/RStorm provide environments to expand batch/script-based R tools to the Cloud.

16.3.2 The stream Package

The R stream package provides data stream mining algorithms using fpc, clue, cluster, clusterGeneration, MASS, and proxy packages. In addition, the package streamMOA provides an rJava interface to the Java-based data stream clustering algorithms available in the Massive Online Analysis (MOA) framework for stream classification, regression and clustering.

If you need a deeper exposure to data streaming in R, we recommend you go over the stream vignettes.

16.3.3 Synthetic Example: Random Gaussian Stream

This example shows the creation and loading of a mixture of 5 random 2D Gaussians, centered at (x_coords, y_coords) with pairwise correlations rho_corr, representing a simulated data stream.

Generate the stream:

figure an
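
A minimal sketch of such a stream generator; the specific cluster centers and correlations used in the figure are omitted here:

library(stream)
set.seed(1234)
stream_5G <- DSD_Gaussians(k = 5, d = 2)   # mixture of 5 random 2D Gaussian clusters
stream_5G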

16.3.3.1 k-Means Clustering

We will now try k-means and the density-based data stream clustering algorithm D-Stream, where micro-clusters are formed by grid cells of size gridsize and a grid cell is considered dense if its density (Cm) is at least 1.2 times the average cell density. The model is updated with the next 500 data points from the stream.

figure ao

First, let’s run the k-means clustering with k = 5 clusters and plot the resulting micro- and macro-clusters (Fig. 16.5).

figure ap
Fig. 16.5
figure 5

Micro and macro clusters of a 5-means clustering of the first 500 points of the streamed simulated 2D Gaussian kernels

In this clustering plot, micro-clusters are shown as circles and macro-clusters are shown as crosses and their sizes represent the corresponding cluster weight estimates.
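
A hedged sketch of this clustering step, reusing the stream_5G generator defined above (in typical stream workflows, k-means is often applied as a re-clustering step on micro-clusters; the direct update shown here follows the description above):

kmc <- DSC_Kmeans(k = 5)               # k-means clusterer
update(kmc, stream_5G, n = 500)        # cluster the next 500 points from the stream
plot(kmc, stream_5G, type = "both")    # micro- (circles) and macro- (crosses) clusters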

Next, try the density-based data stream clustering algorithm D-Stream. Prior to updating the model with the next 1,000 data points from the stream, we specify the grid cells used as micro-clusters, the grid cell size (gridsize=0.1), and the threshold (Cm=1.2) that defines a dense grid cell as a multiple of the average cell density.

figure aq
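
A corresponding sketch of the D-Stream step, using the parameters quoted above:

dsc_D <- DSC_DStream(gridsize = 0.1, Cm = 1.2)   # grid-based micro-clustering
update(dsc_D, stream_5G, n = 1000)               # update with the next 1,000 points
dsc_D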

We can re-cluster the data using k-means with 5 clusters and plot the resulting micro- and macro-clusters (Fig. 16.6).

figure ar
Fig. 16.6
figure 6

Micro- and macro- clusters of a 5-means clustering of the next 1,000 points of the streamed simulated 2D Gaussian kernels

Note the subtle changes in the clustering results between kmc and km_G5.

16.3.4 Sources of Data Streams

16.3.4.1 Static Structure Streams

  • DSD_BarsAndGaussians generates two uniformly filled rectangular and two Gaussian clusters with different density.

  • DSD_Gaussians generates randomly placed static clusters with random multivariate Gaussian distributions.

  • DSD_mlbenchData provides streaming access to machine learning benchmark data sets found in the mlbench package.

  • DSD_mlbenchGenerator interfaces the generators for artificial data sets defined in the mlbench package.

  • DSD_Target generates a ball in circle data set.

  • DSD_UniformNoise generates uniform noise in a d-dimensional (hyper) cube.

16.3.4.2 Concept Drift Streams

  • DSD_Benchmark provides a collection of simple benchmark problems including splitting and joining clusters, and changes in density or size, which can be used as a comprehensive benchmark set for algorithm comparison.

  • DSD_MG is a generator to specify complex data streams with concept drift. The shape as well as the behavior of each cluster over time can be specified using keyframes.

  • DSD_RandomRBFGeneratorEvents generates streams using radial base functions with noise. Clusters move, merge and split.

16.3.4.3 Real Data Streams

  • DSD_Memory provides a streaming interface to static, matrix-like data (e.g., a data frame, a matrix) in memory which represents a fixed portion of a data stream. Matrix-like objects also include large objects potentially stored on disk like ff::ffdf.

  • DSD_ReadCSV reads data line by line in text format from a file or an open connection and makes it available in a streaming fashion. This way data that is larger than the available main memory can be processed.

  • DSD_ReadDB provides an interface to an open result set from a SQL query to a relational database.

16.3.5 Printing, Plotting and Saving Streams

For DSD objects, some basic stream functions include print(), plot(), and write_stream(); the latter can save part of a data stream to disk. DSD_Memory and DSD_ReadCSV objects also include member functions like reset_stream() to reset the position in the stream to its beginning.

To request a new batch of data points from the stream we use get_points(). This chooses a random cluster (based on the probability weights in p_weight) and draws a point from the multivariate Gaussian distribution (mean = mu, covariance matrix = Σ) of that cluster. Below, we pull n = 10 new data points from the stream (Fig. 16.7).

figure as
Fig. 16.7
figure 7

Scatterplot of the next batch of 700 random Gaussian points in 2D
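
A one-line sketch of pulling a small batch from the stream, per the n = 10 request described above:

new_points <- get_points(stream_5G, n = 10)
new_points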

Note that if you add noise to your stream, e.g., stream_Noise <- DSD_Gaussians(k = 5, d = 4, noise = .1, p = c(0.1, 0.5, 0.3, 0.9, 0.1)), then the noise points that are not classified as part of any cluster will have an NA class label.

16.3.6 Stream Animation

Clusters can be animated over time by animate_data(). Use reset_stream() to start the animation at the beginning of the stream, and note that this method is not implemented for streams of class DSD_Gaussians, DSD_R, DSD_data.frame, and DSD. We’ll create a new DSD_Benchmark data stream (Fig. 16.8).

figure at
Fig. 16.8
figure 8

Discrete snapshots of the animated stream clustering process

This benchmark generator creates two clusters moving in 2D. One moves from top-left to bottom-right, the other from bottom-left to top-right. When they meet at the center of the domain, the two clusters overlap and then split again.

Concept drift in the stream can be depicted by requesting 300 data points from the stream 10 times and animating the plot. Fast-forwarding the stream can be accomplished by requesting, but ignoring, 2,000 points in between the 10 plots. The output of the animation below is suppressed to save space.

figure au
figure av

Streams can also be saved locally by write_stream(stream_Bench, "dataStreamSaved.csv", n = 100, sep=",") and loaded back in R by DSD_ReadCSV().

16.3.7 Case-Study: SOCR Knee Pain Data

These data represent the X and Y spatial knee-pain locations for over 8,000 patients, along with labels indicating the knee location (Front, Back, Left or Right). Let’s try to read the SOCR Knee Pain Dataset as a stream.

figure aw

We can use the DSD::DSD_Memory class to get a stream interface for matrix or data frame objects, like the knee pain location dataset. The number of true clusters in this dataset is k = 4.

figure ax
figure ay
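
A hedged sketch of wrapping the knee-pain data as a stream; the data frame name and the coordinate column names (x, y) are assumptions about the imported file:

kneeDF <- data.frame(x = kneeRawData$x, y = kneeRawData$y)     # column names assumed
streamKnee <- DSD_Memory(kneeDF, k = 4, loop = TRUE)           # recycle the data as a stream
streamKnee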

16.3.8 Data Stream Clustering and Classification (DSC)

Let’s demonstrate clustering using DSC_DStream, which assigns points to cells in a grid. First, initialize the clustering as an empty cluster, and then use the update() function to implicitly alter the mutable DSC object (Fig. 16.9).

figure az
Fig. 16.9
figure 9

Data stream clustering and classification of the SOCR knee-pain dataset (n=500)

figure ba
figure bb

The purity metric represents an external evaluation criterion of cluster quality, which is the proportion of the total number of points that were correctly classified:

$$ 0\le Purity=\frac{1}{N}{\sum}_{i=1}^k{\mathit{\max}}_j\left|{c}_i\cap {t}_j\right|\le 1, $$

where N is the number of observed data points, k is the number of clusters, ci is the ith cluster, and tj is the true class label that shares the maximum number of points with ci. High purity suggests that we correctly label the points (Fig. 16.10).

Fig. 16.10
figure 10

5-Means stream clustering of the SOCR knee pain data
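
The purity defined above can also be computed directly from cluster assignments and true labels; a generic sketch (variable names hypothetical):

purity <- function(cluster_ids, true_labels) {
  tab <- table(cluster_ids, true_labels)          # clusters x true classes
  sum(apply(tab, 1, max)) / length(true_labels)   # fraction of points in the majority class of each cluster
}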

Next, we can use K-means clustering.

figure bc

Again, the graphical output of the animation sequence of frames is suppressed; however, readers are encouraged to run the commands and inspect the graphical output (Figs. 16.11 and 16.12).

figure bd
Fig. 16.11
figure 11

Animated continuous 5-means stream clustering of the knee pain data

figure be
Fig. 16.12
figure 12

Continuous stream clustering and purity index across iterations

figure bf

16.3.9 Evaluation of Data Stream Clustering

Figure 16.13 shows the average clustering purity as we evaluate the stream clustering across the streaming points.

figure bg
Fig. 16.13
figure 13

Average clustering purity

figure bh
figure bi

The dsc_streamKnee object represents the result of the stream clustering, where n is the number of data points taken from the streamKnee stream. The evaluation measure can be specified as a vector of character strings. Points are assigned to clusters in dsc_streamKnee using get_assignment(), and the assignments can be used to assess the quality of the classification. By default, points are assigned to micro-clusters; they can instead be assigned to macro-cluster centers with assign = "macro". Also, new points can be assigned to clusters by the rule used in the clustering algorithm via assignmentMethod = "model", or by nearest-neighbor assignment ("nn"), Fig. 16.14.

Fig. 16.14
figure 14

Continuous k-means stream clustering with classification purity
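
A hedged sketch of such an evaluation call; the argument names follow the stream package version used in this chapter and may differ in newer releases (which split evaluate() into evaluate_static() and evaluate_stream()):

evaluate(dsc_streamKnee, streamKnee,
         measure = c("purity", "SSQ"),    # evaluation measures
         n = 500,                         # number of stream points to use
         assign = "micro", assignmentMethod = "auto")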

16.4 Optimization and Improving the Computational Performance

Here and in previous chapters, e.g., Chap. 15, we notice that R may sometimes be slow and memory-inefficient. These problems may be severe, especially for datasets with millions of records or when using complex functions. There are packages for processing large datasets and memory optimization – bigmemory, biganalytics, bigtabulate, etc.

16.4.1 Generalizing Tabular Data Structures with dplyr

We have also seen long execution times when running processes that ingest, store or manipulate huge data.frame objects. The dplyr package, created by Hadley Wickham and Romain François, provides a faster way to manage such large datasets in R. It creates an object called tbl, similar to a data frame, which has an in-memory column-like structure. R reads these objects much faster than data frames.

To make a tbl object we can either convert an existing data frame to tbl or connect to an external database. Converting from data frame to tbl is quite easy. All we need to do is call the function as.tbl().

figure bj
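
A minimal sketch of this conversion, reusing the nof1 data frame from earlier in the chapter:

library(dplyr)
nof1_tbl <- as.tbl(nof1)    # newer dplyr versions prefer tibble::as_tibble()
nof1_tbl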

This looks like a normal data frame. If you are using R Studio, displaying the nof1_tbl will show the same output as nof1.

16.4.2 Making Data Frames Faster with data.table

Similar to tbl, the data.table package provides another alternative to the data frame object representation. data.table objects are processed much faster than standard data frames. Also, all functions that accept a data frame can be applied to data.table objects as well. The function fread() is able to read a local CSV file directly into a data.table.

figure bk

Another amazing property of data.table is that we can use subscripts to access a specific location in the dataset, just like dataset[row, column]. It also allows the selection of rows with Boolean expressions and the direct application of functions to those selected rows. Note that column names can be used directly to reference a specific column in data.table, whereas with data frames we have to use the dataset$columnName syntax.

figure bl
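
A hedged sketch of fread() and the subscript syntax; the CSV is the one produced by rio::convert() earlier, and the column names (ID, Day) are assumptions:

library(data.table)
nof1_dt <- fread("02_Nof1_Data.csv")
nof1_dt[Day > 10, .N, by = ID]    # Boolean row filter, then count rows per ID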

This useful functionality can also help us run complex operations with only a few lines of code. One of the drawbacks of using data.table objects is that they are still limited by the available system memory.

16.4.3 Creating Disk-Based Data Frames with ff

The ff (fast-files) package allows us to overcome the RAM limitations of finite system memory. For example, it helps with operating on datasets with billions of rows. ff creates objects in the ffdf format, which is essentially a map pointing to the location of the data on disk. However, this makes ffdf objects incompatible with most built-in R functions. The only way to address this problem is to break the huge dataset into small chunks. After processing a batch of these small chunks, we have to combine the results to reconstruct the complete output. This strategy is also relevant in parallel computing, which will be discussed in detail in the next section. First, let’s download one of the large datasets in our datasets archive, UQ_VitalSignsData_Case04.csv.

figure bm

As mentioned earlier, we cannot apply functions directly on this object.

figure bn

For basic calculations on such large datasets, we can use another package, ffbase. It allows simple operations on ffdf objects, such as mathematical operations, query functions, summary statistics, and bigger regression models via packages like biglm, which will be mentioned later in this chapter.

figure bo
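
A hedged sketch of the ff/ffbase workflow; the CSV is assumed to be in the working directory and the Pulse column name is an assumption:

library(ff); library(ffbase)
vitalsigns_ff <- read.csv.ffdf(file = "UQ_VitalSignsData_Case04.csv", header = TRUE)
class(vitalsigns_ff)         # a disk-backed ffdf object
mean(vitalsigns_ff$Pulse)    # ffbase supplies basic summaries for ff vectors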

16.4.4 Using Massive Matrices with bigmemory

The previously introduced packages include alternatives to data frames. For instance, the bigmemory package creates alternative objects to 2D matrices (second-order tensors). It can store huge datasets that can be divided into small chunks convertible to data frames. However, we cannot directly apply machine-learning methods on this type of object. More detailed information about the bigmemory package is available online.
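
A hedged sketch of loading the same CSV as a big.matrix; read.big.matrix() requires a single numeric storage type, so this assumes the file's columns are numeric:

library(bigmemory)
vitals_bm <- read.big.matrix("UQ_VitalSignsData_Case04.csv", header = TRUE, type = "double")
dim(vitals_bm)
head(vitals_bm[, 1:3])    # extracted chunks are ordinary R matrices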

16.5 Parallel Computing

In previous chapters, we saw various machine-learning techniques applied as serial computing tasks. The traditional protocol involves: First, applying function 1 to our raw data. Then, using the output from function 1 as an input to function 2. This process may be iterated over a series of functions. Finally, we have the terminal output generated by the last function. This serial or linear computing method is straightforward but time consuming and perhaps sub-optimal.

Now we introduce a more efficient way of computing: parallel computing, which provides a mechanism to deal with different tasks at the same time and to combine the outputs from all processes to obtain the final answer faster. However, parallel algorithms may require special conditions and cannot be applied to all problems. If two tasks have to be run in a specific order, the problem cannot be parallelized.

16.5.1 Measuring Execution Time

To measure how much time can be saved for different methods, we can use function system.time().

figure bp

This means that calculating the mean of the Pulse column in the vitalsigns dataset takes less than 0.001 seconds. These values will vary between computers, operating systems, and operational states.
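
A one-line sketch of this timing call (the vitalsigns object and Pulse column are as described above):

system.time(mean(vitalsigns$Pulse))    # user/system/elapsed times in seconds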

16.5.2 Parallel Processing with Multiple Cores

We will introduce two packages for parallel computing, multicore and snow (their core components are included in the base package parallel). They use different multitasking approaches. However, to run these packages, you need a relatively modern multicore computer. Let’s check how many cores your computer has; the function parallel::detectCores() provides this information. parallel is a base package, so there is no need to install it prior to use.

figure bq

So, there are eight (8) cores in my computer. I will be able to run up to 6-8 parallel jobs on this computer.

The multicore package simply uses the multitasking capabilities of the kernel, the computer’s operating system, to “fork” additional R sessions that share the same memory. Imagine that we open several R sessions in parallel and let each of them do part of the work. Now, let’s examine how this can save time when running complex protocols or dealing with large datasets. To start with, we can use the mclapply() function, which is similar to lapply(); it applies a function to each element of a vector and returns a list. Instead of applying the function sequentially, mclapply() divides the complete computational task and delegates portions of it to each available core. To demonstrate this procedure, we will construct a simple, yet time consuming, task of generating random numbers. Also, we can use the system.time() function to track the execution time.

figure br

The unlist() is used at the end to combine results from different cores into a single vector. Each line of code creates 10,000,000 random numbers. The c1 call took the longest time to complete. The c2 call used two cores to finish the task (each core handled 5,000,000 numbers) and used less time than c1. Finally, c4 used all four cores to finish the task and successfully reduced the overall time. We can see that when we use more cores the overall time is significantly reduced.
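
A hedged sketch of this comparison (forking via mc.cores > 1 is not supported on Windows):

library(parallel)
system.time(c1 <- rnorm(10000000))                                                    # serial
system.time(c2 <- unlist(mclapply(1:2, function(x) rnorm(5000000), mc.cores = 2)))    # 2 cores
system.time(c4 <- unlist(mclapply(1:4, function(x) rnorm(2500000), mc.cores = 4)))    # 4 cores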

The snow package allows parallel computing on multicore or multiprocessor machines, or on a network of multiple machines. It might be more difficult to use, but it is also certainly more flexible. First, we can set how many cores we want to use via the makeCluster() function.

figure bs

This call might cause your computer to pop up a message warning about access through the firewall. To do the same task we can use the parLapply() function in the snow package. Note that we have to pass the cluster object we created with the previous makeCluster() call.

figure bt

While using parLapply(), we have to specify the matrix and the function that will be applied to this matrix. Remember to stop the cluster we made after completing the task, to release back the system resources.

figure bu
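
A compact sketch of the snow workflow described above:

library(snow)
cl <- makeCluster(2)     # start a 2-worker cluster
system.time(c2 <- unlist(parLapply(cl, c(5000000, 5000000), function(x) rnorm(x))))
stopCluster(cl)          # release the workers and system resources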

16.5.3 Parallelization Using foreach and doParallel

The foreach package provides another option for parallel computing. It relies on a loop-like process, applying a specified function to each item in a set, similar to apply(), lapply() and other regular functions. The interesting thing is that these loops can be computed in parallel, saving substantial amounts of time. The foreach package alone cannot provide parallel computing; we have to combine it with other packages like doParallel. Let’s re-examine the task of creating a vector of 10,000,000 random numbers. First, register the 4 compute cores using registerDoParallel().

figure bv

Then we can examine the time saving foreach command.

figure bw

Here we used four items (each item runs on a separate core); .combine=c tells foreach to combine the results using the function c(), generating the aggregate result vector.

Also, don’t forget to close the doParallel by registering the sequential backend.

figure bx
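
A compact sketch of the foreach/doParallel workflow described above:

library(foreach); library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)     # register the 4-core parallel backend
system.time(c4 <- foreach(i = 1:4, .combine = c) %dopar% rnorm(2500000))
stopCluster(cl)
registerDoSEQ()            # fall back to the sequential backend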

16.5.4 GPU Computing

Modern computers have graphics cards, GPUs (Graphical Processing Units), that consist of thousands of cores; however, these cores are very specialized, unlike those of a standard CPU chip. If we can use this hardware for parallel computing, we may reach substantial performance improvements, at the cost of complicating the processing algorithms and increasing the constraints on the data format. Specific disadvantages of GPU computing include reliance on proprietary manufacturer (e.g., NVIDIA) frameworks and the Compute Unified Device Architecture (CUDA) programming language. CUDA allows programming of GPU instructions in a common computing language. This paper provides one example of using GPU computation to significantly improve the performance of advanced neuroimaging and brain mapping processing of multidimensional data.

The R package gputools was created for parallel computing using NVIDIA CUDA. Detailed information on GPU computing in R is available online.

16.6 Deploying Optimized Learning Algorithms

As we mentioned earlier, some tasks can be parallelized more easily than others. In real world situations, we can pick the algorithms that lend themselves well to parallelization. Some of the R packages that allow parallel computing with ML algorithms are listed below.

16.6.1 Building Bigger Regression Models with biglm

biglm allows training regression models with data from SQL databases or large data chunks obtained from the ff package. The output is similar to the standard lm() function that builds linear models. However, biglm operates efficiently on massive datasets.
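
A hedged sketch of chunked model fitting with biglm; the formula variables and the chunk1/chunk2 data frames are hypothetical:

library(biglm)
fit <- biglm(y ~ x1 + x2, data = chunk1)    # fit on the first chunk
fit <- update(fit, moredata = chunk2)       # update the same model with the next chunk
summary(fit)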

16.6.2 Growing Bigger and Faster Random Forests with bigrf

The bigrf package can be used to train random forests by combining the foreach and doParallel packages. In Chap. 15, we presented random forests as ensembles of multiple tree learners. With parallel computing, we can split the task of creating thousands of trees into smaller tasks that can be outsourced to each available compute core. We only need to combine the results at the end to obtain essentially the same output in a shorter amount of time.

16.6.3 Training and Evaluating Models in Parallel with caret

Combining the caret package with foreach, we can obtain a powerful method to deal with time-consuming tasks like building a random forest learner. Utilizing the same example we presented in Chap. 15, we can see the time difference when utilizing the foreach package.

figure by

It took more than a minute to finish this task in the standard execution mode, relying purely on the regular caret function. Below, the same model training completes much faster using parallelization (less than half the time) compared to the standard call above.

figure bz
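
A hedged sketch of the parallelized caret call; the data frame (dat) and outcome (y) are hypothetical placeholders for the Chap. 15 example:

library(caret); library(doParallel)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
system.time(
  rf_par <- train(y ~ ., data = dat, method = "rf",
                  trControl = trainControl(allowParallel = TRUE))
)
stopCluster(cl); registerDoSEQ()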

16.7 Practice Problem

Try to analyze the co-appearance network in the novel “Les Miserables”. The data contains the weighted network of co-appearances of characters in Victor Hugo’s novel “Les Miserables”. Nodes represent characters as indicated by the labels and edges connect any pair of characters that appear in the same chapter of the book. The values on the edges are the number of such co-appearances.

miserables <- read.table("https://umich.instructure.com/files/330389/download?download_frd=1", sep = "", header = F)
head(miserables)

Also, try to interrogate some of the larger datasets we have by using alternative parallel computing and big data analytics.

16.8 Assignment: 16. Specialized Machine Learning Topics

16.8.1 Working with Website Data

  • Download the Main SOCR Wiki Page and compare RCurl and httr.

  • Read and write XML code for the SOCR Main Page.

  • Scrape the data from the SOCR Main Page.

16.8.2 Network Data and Visualization

  • Download 03_les_miserablese_GraphData.txt

  • Visualize this undirected network.

  • Summarize the graph and explain the output.

  • Calculate degree and the centrality of this graph.

  • Identify some of the important characters.

  • Will the result change or not if we assume the graph is directed?

16.8.3 Data Conversion and Parallel Computing

  • Download CaseStudy12_AdultsHeartAttack_Data.xlsx or access it online.

  • Load these data as a data frame.

  • Use export() or write.xlsx() to regenerate the xlsx file.

  • Use the rio package to convert this .xlsx file to .csv.

  • Generate generalized tabular data structures (tbl).

  • Generate data.table.

  • Create disk-based data frames and perform basic calculations.

  • Perform basic calculations on the last 5 columns as a big matrix.

  • Use DIAGNOSIS, SEX, DRG, CHARGES, LOS and AGE to predict DIED with randomForest, setting ntree=20000. Note: sample without replacement to obtain as large a balanced dataset as possible.

  • Run train() in caret and measure the execution time.

  • Detect the number of cores and create a cluster of an appropriate size.

  • Rerun train() in parallel and compare the execution time.

  • Use foreach and doMC to design a parallelized random forest with a total of ntree=20000 and compare the execution time with the sequential execution.