1 Getting Started

In this chapter, we cover just enough of the R language to get you going. If you’re new to programming, this chapter will get you started well enough to be productive and we’ll call out ways to learn more at the end. R is a great place to learn to program because its environment is clean and much simpler than traditional programming languages such as Java or C++. If you’re an experienced programmer in another language, you should skim this chapter to learn the essentials.

We recommend you work through this chapter hands-on and be patient; it will prepare you for marketing analytics applications in later chapters.

1.1 Initial Steps

If you haven’t already installed R, please do so. We’ll skip the installation details except to say that you’ll want at least the basic version of R (known as “R base”) from the Comprehensive R Archive Network (CRAN): http://cran.r-project.org. If you are using:

  • Windows or Mac OS X: Get the compiled binary version from CRAN.

  • Linux: Use your package installer to add R. This might be a GUI installer as in Ubuntu’s Software Center or a terminal command such as sudo apt-get install r-base. (See CRAN for more options.)

In either case, you don’t need the source code version for purposes of this book.

After installing R, we also recommend installing RStudio [172], an integrated environment for writing R code, viewing plots, and reading documentation. RStudio is available for Windows, Mac OS X, and Linux at http://www.rstudio.com. Most users will want the desktop version. RStudio is optional and this book does not assume that you’re using it, although many R users find it convenient. Some companies may have questions about RStudio’s Affero General Public License (AGPL) terms; if relevant, ask your technology support group whether they allow AGPL open source software.

There are other variants of R available, including options that will appeal to experienced programmers who use Emacs, Eclipse, or other development environments. For more information on various R environments, see Appendix A.

1.2 Starting R

Once R is installed, run it; or if you installed RStudio, launch that. The R command line starts by default and is known as the R console. When this book was written, the R console looked like Fig. 2.1 (where some details depend on the version and operating system).

Fig. 2.1 The R console.

The “>” symbol at the bottom of the R console shows that R is ready for input from you. For example, you could type:

figure a

As we show commands with “>”, you should try them for yourself. So, right now, you should type “x <- c(2, 4, 6, 8)” into the R console followed by the Enter key.

This is a simple assignment command using the assignment operator “<-” to create a named object x that comprises a vector of numbers, (2, 4, 6, 8). The assignment operator <- can be pronounced as “gets” and is the way to assign values to R variables (“objects”).

In reading our code listings, a few notes might help those who are new to programming. We list commands to R preceded by the “>” symbol just as you would see in R. Sometimes a command is longer than one line and in those cases it continues with a “+” symbol that you don’t type (R adds it automatically). Everything else in the code listings is output from R.

In code listings, we abbreviate long output with ellipses (“...”) and sometimes add comments, which are anything on a line after “#”. When we refer to code outside a listing box, we set it in monospace font so you will know it’s an R command or object. In short, anything after “>” or “+” is something for you to type.

For some commands, R responds by printing something in the console. For example, when you type the name of a variable into the console like this:

figure b

R responds by printing out the value of x. In this case, we defined x above as a vector of numbers:

figure c

We’ll explain more about these results and the preceding “[1]” below.
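As a recap of the console interaction described in this section, here is a minimal sketch that assigns the vector and then prints it by name. The prompt characters are omitted so the lines can be pasted directly into the console; lines starting with # are comments.

```r
# Assign a vector of four numbers to the object x ("x gets ...")
x <- c(2, 4, 6, 8)

# Typing the object's name prints its value
x
# [1] 2 4 6 8
```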

2 A Quick Tour of R’s Capabilities

Before we dive into the details of programming, we’d like to start with a tour of a relatively powerful analysis in R. This is a partial preview of other parts of this book, so don’t worry if you don’t understand the commands. We explain them briefly here to give you a sense of how an R analysis might be conducted. In this and later chapters, we explain all of these steps and many more analyses.

To begin, we install some add-on packages that we’ll need:

figure d

Most analyses require one or more packages in addition to those that come with R. After you install a package once, you don’t have to install it again unless there is an update.

Now we load a data set from this book’s website and examine it:

figure e

This data set exemplifies observations from a simple sales and product satisfaction survey. Such data might be gathered from a satisfaction survey answered by customers after purchasing a product, such as high-end electronics or an automobile. The data set has 500 (simulated) consumers’ answers to a survey with four items asking about satisfaction with a product (iProdSAT), satisfaction with the sales experience (iSalesSAT), and likelihood to recommend the product and salesperson (iProdREC and iSalesREC, respectively).

The four satisfaction items have been answered on a 7 point rating scale that ranges from extremely dissatisfied (“1”) to extremely satisfied (“7”). Each respondent is also assigned to a numerically coded segment (Segment). In the second line of R code above, we set Segment to be a categorical factor variable (a nominal value, because we don’t want to model segments in terms of the arbitrary mathematical values). The segment membership was assigned by a clustering algorithm applied to the consumers’ responses, such as one of the methods we explore in Chap. 11.

Next we chart a correlation matrix for the satisfaction responses, omitting the categorical Segment variable in column 3:

figure f

The library() command here is one we’ll see often; it loads an add-on library of additional functions for R. The resulting chart is shown in Fig. 2.2. The lower triangle in Fig. 2.2 shows the correlations between item pairs, while the upper triangle visualizes those with circle size and color. The satisfaction items are highly correlated with one another, as are the likelihood-to-recommend items.

Fig. 2.2 A plot visualizing correlation between satisfaction and likelihood to recommend variables in a simulated consumer data set, N = 500. All items are positively correlated with one another, and the two satisfaction items are especially strongly correlated with one another, as are the two recommendation items. Chapter 4 discusses correlation analysis in detail.

Does product satisfaction differ by segment? We compute the mean satisfaction for each segment using the aggregate() function, which we will discuss in Sect. 3.4.5:

figure g

Segment 4 has the highest level of satisfaction, but are the differences statistically significant? We perform a one-way analysis of variance (ANOVA) and see that satisfaction differs significantly by segment:

figure h

We plot the ANOVA model to visualize confidence intervals for mean product satisfaction by segment:

figure i

The resulting chart is shown in Fig. 2.3. It is easy to see that Segments 1, 2, and 3 differ modestly while Segment 4 is much more satisfied than the others. We will learn more about comparing groups and doing ANOVA analyses in Chap. 5.

Fig. 2.3 Mean and confidence intervals for product satisfaction by segment. The X axis represents a Likert rating scale ranging 1–7 for product satisfaction. Chapter 5 discusses methods to compare groups.
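The group comparison described above can be sketched on made-up data. The data frame below is simulated here and is not the book's data set; only the column names iProdSAT and Segment come from the text. The use of aov(), base R's ANOVA function, is an assumption, since the authors' exact call is not reproduced in this section.

```r
# Hypothetical stand-in for the survey data: 4 segments, 7-point ratings.
# These values are simulated and are NOT the book's data set.
set.seed(1234)
sat.df <- data.frame(
  Segment  = factor(rep(1:4, each = 25)),
  iProdSAT = sample(1:7, size = 100, replace = TRUE)
)

# Mean product satisfaction for each segment
seg.mean <- aggregate(iProdSAT ~ Segment, data = sat.df, FUN = mean)
seg.mean

# One-way ANOVA: does mean satisfaction differ by segment?
sat.anova <- aov(iProdSAT ~ Segment, data = sat.df)
anova(sat.anova)
```

Because the ratings here are random, the ANOVA on this toy data should not be expected to reproduce the significant difference reported for the book's data.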

R’s open source platform has promoted a proliferation of powerful capabilities in advanced statistical methods. For example, many marketing analysts are interested in structural equation models, and R has multiple packages to fit structural equation models.

Let’s fit a structural equation model to the satisfaction data. We define a model with latent variables—which we discuss in Chaps. 8 and 10—for satisfaction (“SAT”) and likelihood-to-recommend (“REC”). We propose that the SAT latent variable is manifest in the two satisfaction items, while REC is manifest in the two likelihood-to-recommend items. As marketers, we expect and hope that the latent likelihood-to-recommend variable (REC) would be affected by the latent satisfaction (SAT).

This latent variable model is simpler to express in R than in English (note that the following is a single command, where the + at the beginning of lines is generated by R, not typed):

figure j

This model might be paraphrased as “Latent SATisfaction is observed as items iProdSAT and iSalesSAT. Latent likelihood to RECommend is observed as items iProdREC and iSalesREC. RECommendation varies with SATisfaction.”
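To make the paraphrase concrete, here is a sketch of how such a model can be written in lavaan's model syntax, which is an ordinary character string. The object name satModel and the data frame name in the commented lines are assumptions for illustration; the fitting call is left as a comment so that the block runs without the lavaan package installed.

```r
# lavaan model syntax is written as a character string:
#   =~  "is measured by" (defines a latent variable from observed items)
#   ~   "is regressed on"
satModel <- "SAT =~ iProdSAT + iSalesSAT
             REC =~ iProdREC + iSalesREC
             REC ~  SAT"

# Show the model definition
cat(satModel, "\n")

# Fitting would then look roughly like this (requires the lavaan package
# and a data frame that contains the four item columns):
# library(lavaan)
# sat.fit <- sem(satModel, data = sat.df)   # sat.df is a hypothetical name
```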

Next we fit that model to the data using the lavaan package:

figure k

The model converged and reported many statistics that we omit above, but we note that the model fits the data well with a Comparative Fit Index near 1.0 (see Chap. 10).

We visualize the structural model using the semPlot package:

figure l

This produces the chart shown in Fig. 2.4. Each proposed latent variable is highly loaded on its manifest (observed) survey items. With an estimated coefficient of 0.76, customers’ latent satisfaction is shown to have a strong association with their likelihood to recommend. See Chap. 10 for more on structural models and how to interpret and compare them.

Fig. 2.4 A structural model with path loadings for a model of product satisfaction and likelihood-to-recommend, using the lavaan and semPlot packages. Satisfaction has a strong relationship to likelihood-to-recommend (coefficient = 0.76) in the simulated consumer data. Chapter 10 discusses structural models.

That ends the tour. If this seems like an impressive set of capabilities, it is only the tip of the iceberg. Apart from loading packages, those analyses and visualizations required a total of only 15 lines of R code!

There is a price to pay for this power: you must learn about the structure of the R language. At first this may seem basic or even dull, but we promise that understanding the language will pay off. You will be able to apply the analyses we present in this book and understand how to modify the code to do new things.

3 Basics of Working with R Commands

Like many programming languages, R is case sensitive. Thus, x and X are different. If you assigned x as in Sect. 2.1.2 above, try this:

figure m
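A small sketch of case sensitivity, assuming a fresh session in which no object named X has been defined. The error is captured with tryCatch() so that the block runs to completion; at the console you would simply see the error message.

```r
x <- c(2, 4, 6, 8)

# Lower-case x exists and prints normally
x

# Upper-case X is a different name; in a fresh session it is undefined,
# so referring to it raises an error, which is captured here as text
tryCatch(X, error = function(e) conditionMessage(e))
```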

When working with the R console, you’ll find it convenient to use the keyboard up and down arrow keys to navigate through previous commands that you’ve typed. If you make a small error, you can recall the command and edit it without having to type it all over. It’s also possible to copy from and paste into the console when using other sources such as a help file.

Tip: although you could type directly into the R console, another option is to use a separate text editor such as the one built into R (select File | New Script from the R GUI menu in Windows, File | New Document in Mac OS X, or File | New File | R Script in RStudio).

With code in a separate file, you can easily edit or repeat commands. To run a command from a text file, you can copy and paste into the console, or use a keyboard shortcut to run it directly from R: use CTRL+R in standard R on Windows, CTRL+Enter in RStudio on Windows, or Command+Enter in standard R or RStudio on a Mac. (See Appendix A for other suggestions about R editors.) You do not have to highlight an entire line to run it; just type CTRL+Enter or Command+Enter anywhere on the line.

When you put code into a file, it is helpful to add comments. The “#” symbol signifies a comment in R, and everything on a line after it is ignored. For example:

figure n

In this book, you don’t need to type any of those comments; they just make the code more readable.

The command above defines x and ends with a comment. One might instead prefer to comment a whole line; R doesn’t care:

figure o

Our code includes comments wherever we think it might help. As a politician might say about voting, we say comment early and comment often. It is much easier to document your code now than later.

4 Basic Objects

Like most programming languages, R differentiates between data and functions that perform actions. We’ll spend a bit of time first looking at common data types in R, and then examine functions. We describe the three most important R data types: vectors, lists, and data frames. Later we introduce the process of writing functions. Sometimes we also use the term object; in R, “object” is a generic term that refers to data, functions, or anything else that the R system processes. (Experienced programmers: R is a functional language; although it is similar in some ways to procedural languages such as C++ and Visual Basic, in more important ways it is similar to Scheme and Lisp. For details, see the references in Sect. 2.10.)

4.1 Vectors

The simplest R object is a vector, a one-dimensional collection of data points of a similar kind (such as numbers or text). For instance, in the following code

figure p

...we tell R to create a vector of 4 numbers and name it x. The command c() indicates to R that you are entering the elements of a vector. Vectors commonly comprise numeric data, logical values, or character strings. Each of the following statements defines a vector with 4 items as members (and if you’re not typing along in R, now is the time to start):

figure q

The fourth element of xMix is the character string Hello, world!. The comma inside that string falls inside quotation marks and thus does not cause separation between elements as do the other commas. These four objects, xNum, xLog, xChar, and xMix, have different types of data. We’ll say more about that in a moment.
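Here is a sketch of four such vector definitions. The object names and the string Hello, world! come from the text; all other values are illustrative assumptions.

```r
# Four vectors of four elements each; the specific values are illustrative
xNum  <- c(1, 3.14159, 5, 7)              # numeric
xLog  <- c(TRUE, FALSE, TRUE, TRUE)       # logical
xChar <- c("foo", "bar", "boo", "far")    # character
xMix  <- c(1, TRUE, 3, "Hello, world!")   # mixed types

# The comma inside the quotation marks is part of the string,
# so xMix still has exactly four elements
xMix[4]
length(xMix)
```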

Vectors may be combined with one another (appended end to end) using c():

figure r

An overall view of an object can be obtained with the  summary() function, whose results depend on the object type. For vectors of numerics,  summary() gives range and central tendency statistics, whereas for vectors of characters it reports counts of the most frequent unique values—in this case, that each word occurs exactly once:

figure s
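A sketch of summarizing vectors, using illustrative values. How summary() presents a plain character vector can differ from what it reports for a factor, so this example converts the text to a factor (and also uses table()) to make the per-value counts explicit.

```r
xNum  <- c(1, 3.14159, 5, 7)             # illustrative values
xChar <- c("foo", "bar", "boo", "far")   # illustrative values

# Numeric vector: minimum, quartiles, median, mean, and maximum
summary(xNum)

# Counts for each unique value; each word occurs exactly once
summary(factor(xChar))
table(xChar)
```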

Indexing denotes particular elements of a data structure. Vectors are indexed with square brackets, [ and ]. For instance, the second element of xNum is:

figure t

We discuss indexing in depth below (Sect. 2.4.3).

At its core, R is a mathematical language that understands vectors, matrices, and other structures, as well as common mathematical functions and constants. When you need to write a statistical algorithm from scratch, many optimized mathematical functions are readily available. For example, R automatically applies operators across entire vectors:

figure u

The last example shows something to watch out for: when working with vectors, R recycles the elements to match a longer set. In the last command, x2 has 8 elements, while x has only 4. R will line them up and multiply x[1] * x2[1], x[2] * x2[2], and so forth. When it comes to x2[5], there is no matching element in x, so it goes back to x[1] and starts again. This can be a source of subtle and hard-to-find bugs. When in doubt, check the length() of vectors as one of the first steps in debugging:

figure v
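The following sketch illustrates vectorized operators and recycling. It assumes x as defined earlier in the chapter and builds x2 by combining x with itself, which matches the lengths (4 and 8) described in the text.

```r
x  <- c(2, 4, 6, 8)
x2 <- c(x, x)        # eight elements: x followed by x again

x2 + 1               # adds 1 to every element
x2 * pi              # multiplies every element by the constant pi

# x (length 4) is recycled to match x2 (length 8)
x * x2
# [1]  4 16 36 64  4 16 36 64

# Checking lengths is a quick first debugging step
length(x)
length(x2)
```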

To keep things clear, matrix math uses different operators than vector math. For instance, %*% is used to multiply matrices instead of *. We do not cover math operations in detail here; see Sect. 2.4.6 below if you want to learn details about math operators in R.

When you create a vector, R automatically assigns a data type or class to all elements in the vector. Some common data types are logical (TRUE/FALSE), integer (0, 1, 2, ...), double (real numbers such as 1.1, 3.14159, etc.), and character (“a”, “hello, world!”, etc.).

When types are mixed in a vector, it holds values in the most general format. Thus, the vector “c(1, 2, 3.5)” is coerced to type double because the real number 3.5 is more general than an integer such as 1:

figure w

This may lead to surprises. When we defined the vector xMix above, it was coerced to a character type because only a character type can preserve the basic values of types as diverse as TRUE and “Hello, world!”:

figure x
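Type coercion can be inspected with typeof(). In this sketch the name xSeq is hypothetical and the xMix values are illustrative. The L suffix, which marks an integer literal in R, is used to show integers being promoted to double.

```r
typeof(c(1L, 2L))            # integer literals only
# [1] "integer"

xSeq <- c(1L, 2L, 3.5)       # hypothetical name; mixing in a real number
typeof(xSeq)
# [1] "double"

xMix <- c(1, TRUE, 3, "Hello, world!")   # illustrative values
typeof(xMix)
# [1] "character"

# Every element is now stored as text, including the number and the logical
xMix
```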

When operating on these, R tries to figure out what to do in a sensible way, but sometimes needs help. Consider the following operations:

figure y

When we attempt to add 1 to xNum and xMix, xNum[1]+1 succeeds while xMix[1]+1 returns an error that one of the arguments is not a number. We can explicitly force it to be numeric by coercion with the  as.numeric() function:

figure z
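A sketch of the failing operation and the explicit coercion, with illustrative values. The error is captured with tryCatch() so the block runs to completion; typed directly at the console, the same expression would stop with an error message.

```r
xNum <- c(1, 3.14159, 5, 7)               # illustrative values
xMix <- c(1, TRUE, 3, "Hello, world!")    # illustrative values

xNum[1] + 1
# [1] 2

# xMix[1] is the string "1", so arithmetic fails; capture the error text
tryCatch(xMix[1] + 1, error = function(e) conditionMessage(e))

# Explicit coercion turns the string "1" back into the number 1
as.numeric(xMix[1]) + 1
# [1] 2
```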

It would be tedious to go through all of R’s rules for coercing from one type to another, so we simply caution you always to check variable types when debugging because confusion about types is a frequent source of errors. The str() (“structure”) function is a good way to see detailed information about an object:

figure aa

In these results, we see that xNum is a numeric vector (abbreviated “num”) with elements that are indexed 1:4, while xChar and xMix are character vectors (abbreviated “chr”).

4.2 Help! A Brief Detour

This is a good place to introduce help in R. R and its add-on packages form an enormous system and even advanced R users regularly consult the help files.

How to find help depends on your situation. If you know the name of a command or related command, use “?”. For instance, now that you know the as.numeric() command, you may wonder whether there are similar commands for other types. Looking at help for a command you know is a good place to start:

figure ab

This calls up the R help system, as shown in Fig. 2.5.

Fig. 2.5 R help for the as.numeric() command, using ?as.numeric.

R help files are arranged according to a specific structure that makes it easier for experienced R users to find information. Novice R users sometimes dislike help files because they can be very detailed, but once you grow accustomed to the structure, help files are a valuable reference.

Help files are organized into sections titled Description, Usage, Arguments, Details, Value, References, See Also, and Examples. We often find it helpful to go directly to the Examples section. These examples are designed to be pasted directly into the R console to demonstrate a function. If there isn’t an example that matches your use case, you can go back to the Usage and Arguments sections to understand more generally how to use a function. The Value section explains what type of object the function returns. If you find that the function you are looking at doesn’t do quite what you want, it can be helpful to check out the See Also section, where you will find links to other related functions.

Now suppose you do not know the name of a specific command, but wish to find something related to a concept. The “??” command searches the Help system for a phrase. For example, the command ??anova finds many references to ANOVA models and utility functions, as shown in Fig. 2.6.

Fig. 2.6 Searching R help with ??anova, as shown in RStudio. The exact results depend on packages you have installed.

The ? and  ?? commands understand quotation marks. For instance, to get help on the ? symbol itself, put it inside quotation marks (R standard is the double quote character: "):

figure ac

Note that the help file for ? has the same subject headings as any other help file. It doesn’t tell you how to get help; it tells you how to use the ? function. This way of thinking about help files may be foreign at first, but as you get to know the language the consistency across the help files will make it easy for you to learn new functions as the need arises.
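The help commands discussed in this section are summarized in the sketch below. The ? and ?? forms are meant for interactive use, so they appear as comments; the executable lines use help(), the function form of ?, and assign the result so that no help window opens until the object is printed. The names h1 and h2 are hypothetical.

```r
# Interactive forms, typed at the console:
#   ?as.numeric     help page for a known function
#   ??anova         search the installed help for a phrase
#   ?"?"            help on the ? operator itself

# Function-call equivalents of the ? lookups. Assigning the result
# defers display until the object is printed.
h1 <- help("as.numeric")   # same lookup as ?as.numeric
h2 <- help("?")            # same lookup as ?"?"

class(h1)

# help.search("anova") is the function-call form of ??anova
```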

There are other valuable resources besides the built-in help system. If you are looking for something related to a general area of investigation, such as regression models or econometrics, and are not sure what exists, CRAN is very useful. CRAN Task Views (http://cran.r-project.org/web/views/) provide annotated lists of packages of interest in high-level areas such as Bayesian statistics, machine learning, and econometrics.

When working with an add-on package, you can check whether the authors have provided a vignette, a PDF file that describes its usage. They are often linked from a package’s help file, but an especially convenient way to find them is with the command browseVignettes(), which lists all vignettes for the packages you’ve installed in a browser window.

If you run into a problem with something that seems it ought to work but doesn’t, try the official R-help mailing list (https://stat.ethz.ch/mailman/listinfo/r-help) or the R forums on StackOverflow (http://stackoverflow.com/tags/r/info). Both are frequented by R contributors and experts who are happy to help if you provide a complete and reproducible example of a problem.

Google web search understands “R” in many contexts, such as searching for “R anova table”.

Finally, there is a wealth of books covering specific R topics. At the end of each chapter, we note books and sites that present more detail about the chapter’s topics.

4.3 More on Vectors and Indexing

Now that you can find help when needed, let’s look at vectors and indexing again. Whereas  c() defines arbitrary vectors, integer sequences are commonly defined with the  : operator. For example:

figure ad

When applying math to : sequences, be careful of operator precedence; “:” is applied before many other math operators. Use parentheses when in doubt and always double-check math on sequences:

figure ae
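A short sketch of the precedence issue: the same characters with and without parentheses give different sequences.

```r
1:5
# [1] 1 2 3 4 5

# ":" binds more tightly than "*", so this builds 1:5 and then doubles it
1:5 * 2
# [1]  2  4  6  8 10

# Parentheses force the multiplication first, giving the sequence 1 to 10
1:(5 * 2)
# [1]  1  2  3  4  5  6  7  8  9 10
```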

Sequences are useful for indexing and you can use sequences inside [ ]:

figure af

For complex sequences, use  seq() (“sequence”) and  rep() (“replicate”). We won’t cover all of their options, but here is a preview. Read this, try to predict what the commands do, and then run them:

figure ag

With the last example, deconstruct it by looking first at the inner expression  seq(from=-3, to=13, by=4). Each element of that vector will be replicated a certain number of times as specified in the second argument to  rep(). More questions? Try ?rep.
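A sketch of seq() and rep() along the lines described above. The inner expression seq(from=-3, to=13, by=4) comes from the text; the other arguments, including the repeat counts, are illustrative assumptions.

```r
seq(from = -5, to = 28, by = 4)
# [1] -5 -1  3  7 11 15 19 23 27

rep(c(1, 2, 3), each = 3)
# [1] 1 1 1 2 2 2 3 3 3

# Inner expression first: five values from -3 to 13 in steps of 4
seq(from = -3, to = 13, by = 4)
# [1] -3  1  5  9 13

# The second argument gives a repeat count for each of those five values
# (these counts are illustrative)
rep(seq(from = -3, to = 13, by = 4), c(1, 2, 3, 2, 1))
# [1] -3  1  1  5  5  5  9  9 13
```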

Exclude items by using negative indices:

figure ah

In all of the R output, we’ve seen “[1]” at the start of the row. That indicates the vector position index of the first item printed on each row of output. Try these:

figure ai

The result of an R vector operation is itself a vector. Try this:

figure aj

The new object xSub is created by selecting the elements of xNum. This may seem obvious, yet it has profound implications because it means that the results of most operations in R are fully-formed, inspectable objects that can be passed on to other functions. Instead of just output, you get an object you can reuse, query, manipulate, update, save, or share.

Indexing also works with a vector of logical variables (TRUE/FALSE) that indicate which elements you want to select:

figure ak

This allows you to use logical expressions—which evaluate as a vector of logical values—to select subsets of data based on specific criteria. We discuss this more in later chapters and will use it frequently. Here is an example:

figure al

When we index using the logical expression xNum> 3, R selects elements that correspond to TRUE values of that expression.
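The indexing forms covered in this section (ranges, negative indices, and logical vectors) can be sketched together as follows. The values of xNum are illustrative.

```r
xNum <- c(1, 3.14159, 5, 7)        # illustrative values

xNum[2:4]                          # a range of positions
xNum[-1]                           # everything except the first element
xSub <- xNum[2:4]                  # the result is itself a vector object

# Logical indexing: TRUE keeps an element, FALSE drops it
xNum[c(FALSE, TRUE, TRUE, TRUE)]

# A logical expression yields such a vector automatically
xNum > 3
# [1] FALSE  TRUE  TRUE  TRUE
xNum[xNum > 3]
# [1] 3.14159 5.00000 7.00000
```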

4.4 aaRgh! A Digression for New Programmers

At about this point when learning R, some students become incredulous. “I’ve got to type the name of a data set over and over?!” Yes. “I have to manually pick which rows or columns to include?!” Yes, sometimes, but you’ll learn code approaches that are more general. “I can’t just point and click on the data I want?!” No, you can’t, at least not in this book or most R books. (Limited point-and-click options and menus are available as add-ons in R; see Appendix A. However, we strongly believe you’ll be better served by learning the power of the command line from the beginning.)

Thousands of analysts before you have felt the same way. What’s different this time? They gave up but you won’t! Seriously, R is not simple and yes, it demands a bit of effort. Our job is to help you through the difficulty so the effort pays off.

R reminds us of a mountain town, Holden, Washington. Holden is a remote village in the North Cascades; to get there requires a three hour ferry ride followed by an hour-long bus trip. Each bus up the mountain has a sign that declares, “The ride up is free. The trip down is costly.” In other words, everyone is welcomed ... but after one settles in, the place may become beloved and difficult to leave. Some people intend to make a short visit, yet end up staying for months or years.

R is similar to that mountain village: although it takes time and effort to arrive, after you settle in and know your way around, you might not want to leave. It has been many years since we have had a reason to use a statistics environment other than R.

4.5 Missing and Interesting Values

In statistics, missing values are important, and as a statistics environment, R understands them and includes a special constant for a missing value: NA. This is not a character object ("NA") but a constant in its own right. It is useful in several contexts. For instance, you might create a data object that will be filled in with values later:

figure am

Any math performed on a value of NA becomes NA:

figure an

This may not be what you want, and you may tell R to ignore NA data rather than calculating on it. Many commands include an argument that instructs them to ignore missing values: na.rm=TRUE:

figure ao

A second approach is to remove NA values explicitly before calculating on them or assigning them elsewhere. This may be done most easily with the function na.omit():

figure ap

A third and more cumbersome alternative is to test for NA using the is.na() function, and then index data for the values that are not NA by adding the ! (“not”) operator:

figure aq
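The three approaches to missing values described above can be sketched side by side. The object name my.test.scores appears later in the text; the values used here are illustrative assumptions.

```r
my.test.scores <- c(91, NA, NA)     # illustrative values

mean(my.test.scores)                # math on NA yields NA
# [1] NA

# 1. Ask the function to drop missing values
mean(my.test.scores, na.rm = TRUE)
# [1] 91

# 2. Remove NA values first with na.omit()
mean(na.omit(my.test.scores))
# [1] 91

# 3. Index the values for which is.na() is FALSE
is.na(my.test.scores)
# [1] FALSE  TRUE  TRUE
my.test.scores[!is.na(my.test.scores)]
# [1] 91
```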

One thing never to do in R is to use an actual numeric value such as -999 to indicate missing data. That will cause headaches at best and wrong answers at worst. Instead, as soon as you load such data into R, replace those values with NA using indices:

figure ar

The third command tells R to select the elements of my.test.scores whose value is lower than −900 and replace those elements with NA.
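A sketch of replacing a numeric missing-data code with NA, using illustrative scores. The threshold of -900 follows the text.

```r
my.test.scores <- c(91, -999, -999)   # -999 was used as a "missing" code

mean(my.test.scores)                  # badly wrong: the codes are averaged in

# Select the coded values and overwrite them with NA
my.test.scores[my.test.scores < -900] <- NA

mean(my.test.scores, na.rm = TRUE)
# [1] 91
```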

R also handles infinity and undefined numbers, with constants Inf and  NaN (“not a number”). For example, if we take the natural logarithm of positive and negative numbers:

figure as

We get a warning because log() is undefined for negative numbers and log(-1) gives a value of NaN. Note also that log(0) = −∞ (-Inf).
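A sketch of how these special values arise. The warning is captured as text with tryCatch(), and the values are then computed with the warning suppressed so that the block runs quietly. The name res is hypothetical.

```r
# Capture the warning text raised by the negative input
tryCatch(log(-1), warning = function(w) conditionMessage(w))

# The values themselves (warning suppressed so only results are shown)
res <- suppressWarnings(log(c(-1, 0, 1)))
res
# [1]  NaN -Inf    0

is.nan(res[1])        # TRUE: log(-1) is undefined
is.infinite(res[2])   # TRUE: log(0) is negative infinity
```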

R tries to be helpful by watching out for such issues, warning you, and carrying on as best it can. You should watch for “Warning message” and clean up your data or math when it appears.

4.6 Using R for Mathematical Computation

As a programming environment for computational statistics, R has powerful capabilities for mathematics. In particular, it is highly optimized for vector and matrix operations, which include everything from indexing and iteration to complex operations such as matrix inversion and decomposition. This makes R an attractive alternative to software like Matlab for computation, simulation and optimization.

We do not cover such math in detail here for several reasons: it is tedious to read, many operations are obvious or easy to find, and advanced math is not necessarily used in day to day marketing analytics. Instead, we use math commands and operators with minor explanations as needed, trusting that you may use ? to learn more.

If you are interested in using R for mathematical computation, remember that ? understands quotation marks so you can read about operators using a help command such as ?"*". An entry point to matrix math is the matrix multiplication operator, %*%. If you need especially high performance, we have pointers on enhancing R’s computation power in Appendix C.

4.7 Lists

Lists are collections of objects of any type. They are useful on their own, and are especially important to understand how R stores data sets, the topic of the following section.

Let’s look at two of the objects we defined above, inspecting their structures with the  str() command:

figure at

We see that these vectors are of type “numeric” and “character,” respectively. All the elements in a vector must be the same type. We can combine these two vectors into a list using  list():

figure au

Using str(), we see that objects inside the list retain the types that they had as separate vectors:

figure av

Lists are indexed with double brackets ([[ and ]]) instead of the single brackets that vectors use, and thus xList comprises two objects that are indexed with [[1]] and [[2]]. We might index the objects and find summary information one at a time, such as:

figure aw
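A sketch of building a list from two vectors and indexing its members. The vector values are illustrative; the names xNum, xChar, and xList come from the text.

```r
xNum  <- c(1, 3.14159, 5, 7)             # illustrative values
xChar <- c("foo", "bar", "boo", "far")   # illustrative values

# Combine vectors of different types into one list
xList <- list(xNum, xChar)
str(xList)

# Double brackets extract one member of the list
summary(xList[[1]])
xList[[2]]
```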

It is often more convenient to run such a command on all members of the list at once. We can do that with the lapply() or “list apply” command.

With lapply() we must pay special attention to the argument order: lapply(OBJECT, FUNCTION). We use lapply() to produce a summary() for each member of the list:

figure ax

What this did was to separate xList into its separate list elements, [[1]] and [[2]]. Then it ran  summary() on each one of those.
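A sketch of lapply() applied to a two-element list with illustrative contents. Note the argument order: the object comes first and the function second. The name xSummaries is hypothetical.

```r
xList <- list(c(1, 3.14159, 5, 7),               # illustrative values
              c("foo", "bar", "boo", "far"))

# Run summary() on each member; the result is itself a list
xSummaries <- lapply(xList, summary)
xSummaries

length(xSummaries)
# [1] 2
```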

Using lapply() to iterate in this way saves a lot of work, especially with lists that may comprise dozens or hundreds of objects. It demonstrates that lists have two advantages: they keep data in one place regardless of constituent types, and they make it possible to apply operations automatically to diverse parts of that data.

Each element in a list may be assigned a name, which you can access with the   names() function. You may set the names() when a list is created or at a later time. The following two list creation methods give the same result:

figure ay

A list may be indexed using its names rather than a numeric index. You can use $name or  [["name"]] as you prefer:

figure az

List names are character strings and may include spaces and various special characters. Putting the name in quotes is useful when names include spaces.
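A sketch of naming list elements and indexing by name. The element names itemnum, itemchar, and "my item" are assumptions for illustration, as are the vector values.

```r
xNum  <- c(1, 3.14159, 5, 7)             # illustrative values
xChar <- c("foo", "bar", "boo", "far")   # illustrative values

# Method 1: create the list, then assign names
xList <- list(xNum, xChar)
names(xList) <- c("itemnum", "itemchar")

# Method 2: assign names while creating the list
xList2 <- list(itemnum = xNum, itemchar = xChar)

identical(xList, xList2)
# [1] TRUE

# Two equivalent ways to index by name
xList$itemnum
xList[["itemnum"]]

# Quoting is needed when a name contains a space
spaced <- list("my item" = 1:3)
spaced[["my item"]]
```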

This brings us to the most important object type in R: data frames.

5 Data Frames

Data frames are the workhorse objects in R, used to hold data sets and to provide data to statistical functions and models. A data frame’s general structure will be familiar to any analyst: it is a rectangular object composed of columns of varying data types (often referred to as “variables”) and rows that each have a value (or missing value, NA) in each column (“observations”).

You may construct a data frame with the  data.frame() function, which takes as input a set of vectors of the same length:

figure ba

In this code, we use dot notation with a suffix .df that helps to clarify that x.df is a data frame. The .df is just part of the name as far as R is concerned—it doesn’t enforce any special rules or type checking—and we use it only as a reminder.

In the resulting data frame we find three named columns that inherit their names from the contributing vectors. Each row is numbered sequentially starting from 1. Elements of a data frame may be indexed using [ROW, COLUMN] notation:

figure bb

The latter example shows us something new: R can convert character data in data frames to nominal factors. In versions of R before 4.0.0, data.frame() did this by default; since R 4.0.0, character columns stay as character strings unless you set stringsAsFactors=TRUE. When xChar is converted in this way, its values become the levels of a categorical (nominal) data type. Marketing analysts often work with categorical data such as gender, region, or different treatments in an experiment. In R, such values are stored internally as a vector of integers and a separate list of labels naming the categories. The latter are called levels and are accessed with the levels() function.

Converting character strings to factors is a good thing for data that you might use in a statistical model because it tells R to handle it appropriately in the model, but it’s inconvenient when the data really is simple text such as an address or comments on a survey. You can prevent the conversion to factors by adding an option to data.frame() that sets stringsAsFactors=FALSE:

figure bc

The value of x.df[1, 3] is now a character string and not a factor.
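A brief sketch of this option, again with assumed values, shows that the text column stays as character data:

```r
# Same illustrative data, keeping text as character strings
x.df <- data.frame(xNum=c(1, 3.14159, 5, 7),
                   xChar=c("foo", "bar", "boo", "far"),
                   stringsAsFactors=FALSE)

x.df[1, 2]             # a character string, not a factor
class(x.df$xChar)      # the column's type
```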

Indices can be left blank, which selects all of that dimension:

figure bd

Index data frames by using vectors or ranges for the elements you want. Use negative indices to omit elements:

figure be

Indexing a data frame returns an object. The object will have whatever type suits that data: choosing a single element (row + column) yields a singular object (a vector of length one); choosing a column returns a vector; and choosing rows or multiple columns yields a new data frame. We can see this by using the str() inspector, which tells you more about the structure of the object:

figure bf
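The indexing patterns and return types described above can be explored with a sketch like the following, which uses a small data frame whose values are assumptions for demonstration:

```r
# Illustrative data frame (values are assumptions)
x.df <- data.frame(xNum=c(1, 3.14159, 5, 7),
                   xLog=c(TRUE, FALSE, TRUE, TRUE),
                   xChar=c("foo", "bar", "boo", "far"),
                   stringsAsFactors=FALSE)

x.df[2, ]       # blank column index: all of row 2
x.df[ , 3]      # blank row index: all of column 3
x.df[2:3, ]     # a range of rows
x.df[ , 1:2]    # a range of columns
x.df[-3, ]      # negative index: omit row 3
x.df[ , -2]     # negative index: omit column 2

str(x.df[2, 1])         # single element: a vector of length one
str(x.df[ , 2])         # single column: a vector
str(x.df[c(1, 3), ])    # multiple rows: a new data frame
str(x.df[ , 1:2])       # multiple columns: a new data frame
```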

As with lists, data frames may be indexed by using the names of their columns:

figure bg

In short, data frames are the way to work with a data set in R. R users encounter data frames all the time, and learning to work with them is perhaps the single most important set of skills in R.

Let’s create a new data set that is more representative of data in marketing research. We’ll clean up our workspace and then create new data:

figure bh

Notice that we specified that store number is a nominal factor, to tell R that it looks like a number but really isn’t. We’ll discuss that more in Sect. 3.1.1.

In the final command above, by putting parentheses around the whole expression, we tell R to assign the result of data.frame(store.num, store.rev, ...) to store.df and then evaluate the resulting object (store.df). This has the same effect as assigning the object and then typing its name again to see its contents. This trick sometimes saves typing.
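A sketch of such a data set follows. The column names come from the text, but the specific store numbers, revenues, visit counts, and manager names are assumptions for illustration; we also omit the workspace clean-up step here:

```r
# Illustrative store data (all values are assumptions)
store.num     <- factor(c(3, 14, 21, 32, 54))        # store id, a nominal factor
store.rev     <- c(543, 654, 345, 678, 234)          # store revenue
store.visits  <- c(45, 78, 32, 56, 34)               # store visits
store.manager <- c("Annie", "Bert", "Carla", "Dave", "Ella")

# Outer parentheses: assign the data frame and then print it
(store.df <- data.frame(store.num, store.rev, store.visits,
                        store.manager, stringsAsFactors=FALSE))
```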

We can now get a list of our store managers by selecting that column using the same $ notation that we used with lists:

figure bi

We can easily pass columns from the data frame to statistical functions using $ and a column name. For example, we can compute the average of store.rev from the store.df data frame using mean():

figure bj

Similarly, we could use the cor() function, which computes the Pearson product-moment correlation coefficient (aka Pearson’s r), to gauge the association between store visits and revenue in our data:

figure bk
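As a sketch of passing columns to statistical functions, assuming the illustrative revenue and visit values below (they are our own assumptions, not results reported elsewhere):

```r
# Illustrative store data (values are assumptions)
store.df <- data.frame(store.rev=c(543, 654, 345, 678, 234),
                       store.visits=c(45, 78, 32, 56, 34))

mean(store.df$store.rev)                         # average revenue
cor(store.df$store.rev, store.df$store.visits)   # Pearson's r
summary(store.df)                                # basic statistics by column
```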

We discuss correlation analysis in depth in Chap. 4.

You can obtain basic statistics for a data frame with summary():

figure bl

This shows us the frequency counts for the factor variable (store number), arithmetic summaries of the numeric variables, and the overall length of the text variable. Chapter 3 says much more about describing and summarizing data. (Note: the store.manager column might be summarized slightly differently, depending on the versions of packages loaded earlier in this chapter.)

6 Loading and Saving Data

There are many ways to load and save data in R. In this section, we focus on the methods for storing data that are common in typical projects, including how to save and read native R objects, how to save entire R sessions, and how to read and write CSV formats to move data in and out of other environments like Microsoft Excel.

Native (“binary”) R objects are representations of objects in an R-specific format. If you need to save an object exclusively for R then this format will be useful to you. Use save() to write a binary object to disk and load() to read it.

Let’s back up the store.df object to disk using save(OBJECT, FILE). Then we’ll delete it from memory and use load(FILE) to restore it:

figure bm

save() can also take a group of objects as an argument; just replace the single object name with list=c() and fill in c() with a character vector. For instance:

figure bn

When a file is loaded, its objects are placed into memory with the same names that they had when saved. Important: when a file is loaded, its objects silently overwrite any objects in memory with the same names! Consider the following:

figure bo

In the example above, store.df is first assigned a new, simple value of 5 but this is overwritten by load() with no warning. When loading objects from files, we recommend beginning from a clean slate with no other objects in memory in order to reduce unexpected side effects.
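The save, remove, load, and overwrite sequence can be sketched as follows. The file name and the data values are hypothetical, and we write to R’s temporary directory (an assumption made so that the sketch does not leave files in your working directory):

```r
# Small illustrative object to back up (values are assumptions)
store.df <- data.frame(store.num=factor(c(3, 14)), store.rev=c(543, 654))

# Hypothetical file name, placed in the session's temporary directory
backup.file <- file.path(tempdir(), "store-df-backup.RData")

save(store.df, file=backup.file)    # write the object to disk
rm(store.df)                        # delete it from memory
load(backup.file)                   # restore it under its original name
exists("store.df")

store.df <- 5                       # replace it with a simple value ...
load(backup.file)                   # ... which load() silently overwrites
class(store.df)

# Saving a group of objects by giving their names as a character vector
store.rev.copy <- store.df$store.rev
save(list=c("store.df", "store.rev.copy"), file=backup.file)
```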

Filenames may be specified with just the file name as above, in which case they are saved to the current R working directory, or as full paths in the format appropriate to your system. Note that Microsoft Windows uses \ to denote folders, which doesn’t work in R (which expects Unix-style directory names using “/”). You must convert \ to either \\ or /, or else R will give an error.

Assuming the appropriate “R” folder exists, and replacing user to match your system, you could try:

figure bp

The standard file suffix for native data files in R is .RData and we recommend using it.

If specifying full paths seems cumbersome, you may change the R working directory. getwd() reports the working directory while setwd(PATH) sets it to a new location:

figure bq

These commands do not create directories; you should do that in the operating system.
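A minimal sketch of changing and restoring the working directory follows. It assumes that R’s session temporary directory, given by tempdir(), is an acceptable stand-in for a folder of your own, because that directory always exists:

```r
old.wd <- getwd()      # remember the current working directory
setwd(tempdir())       # switch to an existing directory
getwd()                # confirm the new location
setwd(old.wd)          # restore the original working directory
```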

6.1 Image Files

The memory image of an entire session can be saved with the command save.image(FILE). If FILE is excluded, then it defaults to a file named ".RData". Standard R and RStudio both prompt you to save a memory image on closing, but you can also do it yourself by typing:

figure br

It can be useful to save the contents of working memory if you wish to back up work in progress, although care is needed (Sect. 2.8). Do not let this substitute for creating reproducible scripts; a best practice is to create a script file as you work that can always reproduce an analysis up to the current point. By default, images save to the working directory as set above.

Workspace images are re-loaded with the general load() command, not with a special “image” version; an image is a collection of objects and no different than other files produced by save(). As we warned above, loading an image will silently overwrite current memory objects that have the same names as objects in the image, but does not remove other objects. In other words, loading an image does not restore memory to a snapshot of a previous state, but rather adds those contents to current memory.

figure bs

You can view files with the list.files() command, and delete them with file.remove() which accepts any number of file names. If you wish to clean up the files we made above (assuming you have not changed working directory):

figure bt

The status returned by file.remove() is a vector noting whether each file was removed (if so, then its status is TRUE) or not (FALSE, if it doesn’t exist or is currently in use and cannot be removed).
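To see the status vector in action, here is a sketch that creates one file and then tries to remove it together with a file that does not exist. The file names are hypothetical and we again assume the session’s temporary directory:

```r
# Hypothetical file names in the session's temporary directory
f.real    <- file.path(tempdir(), "example-object.RData")
f.missing <- file.path(tempdir(), "no-such-file.RData")

x <- 1:3
save(x, file=f.real)                        # create one real file

list.files(tempdir(), pattern="RData$")     # view matching files

# Removing a missing file produces a warning and a status of FALSE
status <- file.remove(f.real, f.missing)
status                                      # one status value per file
```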

6.2 CSV Files

Many analysts save data in delimited files such as comma-separated value (CSV) files and tab-separated value (TSV) files to move data between tools such as R, databases, and Microsoft Excel. We focus on CSV files; TSV and other delimited files are handled similarly.

First, let’s create a CSV by writing store.df to a file. This works similarly to the save() command above, with the syntax write.csv(OBJECT, file="FILENAME"). We strongly recommend adding the option row.names=FALSE to eliminate an extra, unnamed column containing labels for each row; those mostly get in the way when interchanging CSV files with other programs.

A handy way to test CSV files is to use the command without a file name, which sends the output to the console just as it would be written to a file:

figure bu

R automatically includes a header row with variable names and puts quotation marks around character data.

Now let’s write a real file and then read it using read.csv(file=...):

figure bv

By default, read.csv() prints the CSV contents to the R console formatted as a data frame. To assign the data to an object, use the assignment operator (<-). Let’s read the CSV file and assign its data to a new object:

figure bw

After reading the CSV file, we recreate store.num as a factor variable. One of the problems with CSV files is that they lose such distinctions because they are written out in plain text.

Now we check that the values are identical to the original data frame:

figure bx

The operator == tells R to test whether the two data frames are the same, element-by-element. Although == confirms equality, in general the function all.equal(X, Y) is more useful because it ignores tiny differences due to binary rounding error (there is an infinity of real numbers, which computers store as finite approximations). Also, the output of all.equal() is more compact:

figure by
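The full CSV round trip described in this section might look like the following sketch. The data values and the file name are assumptions for illustration, and the file is written to the session’s temporary directory:

```r
# Illustrative data frame (values are assumptions)
store.df <- data.frame(store.num=factor(c(3, 14, 21)),
                       store.rev=c(543, 654, 345),
                       store.manager=c("Annie", "Bert", "Carla"),
                       stringsAsFactors=FALSE)

write.csv(store.df, row.names=FALSE)        # no file name: preview on console

csv.file <- file.path(tempdir(), "store-df.csv")    # hypothetical file name
write.csv(store.df, file=csv.file, row.names=FALSE)

# Read the file back and assign it to a new object
store.df2 <- read.csv(csv.file, stringsAsFactors=FALSE)

# The CSV stores plain text, so recreate the factor
store.df2$store.num <- factor(store.df2$store.num)

store.df == store.df2                # element-by-element comparison
all.equal(store.df, store.df2)       # compact, tolerant comparison
```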

R can handle many other file formats that we do not discuss in this book. These include fixed format files, databases, and binary files from other software such as Microsoft Excel, MATLAB, SAS, and SPSS. If you need to work with such data, we describe some of the options in Appendix C. A more general overview of options for data exchange is provided by the R Data Import/Export manual [157].

We’ll clean up the unneeded object, “store.df2” (see Sect. 2.8 below):

figure bz

7 Writing Your Own Functions*

The asterisk (*) in the title indicates that this is an optional section. We examine the basics of writing reusable functions, a fundamental programming skill. If you are new to programming, you might wish to skip this section for now and refer back to it when you encounter functions again in later chapters.

Many analyses in R are repetitive: compute statistics across slices of data such as different sales regions, produce analyses from new data sets such as successive calendar quarters, and so forth. R provides functions to let you write a set of commands once and reuse it with new data.

We can create a function in R quite easily. A common function we write is to compute the standard error of the mean for a vector of observed data. Such a function already exists in R, but is so simple that we sometimes write our own. In the infinite population version, the standard error is computed as the standard deviation of the data (sd()) divided by the square root (sqrt()) of the sample size, which is the length of the vector holding the data. We can declare a function to do this in one line:

figure ca

The new function se() can then be used just like any other built-in function in R:

figure cb

A function’s results can also be assigned to other variables or used in additional functions. For example, we might compute the upper bound of the 95% confidence interval as the mean + 1.96 standard errors:

figure cc

This tells us that, if the present data are a good random sample from a larger set, we could expect the mean of other such samples to be 65.51 or less in 97.5% of the samples (97.5% because the 95% confidence interval is symmetric around 50%, extending from 2.5% to 97.5%). In other words, we can be highly confident from these data that the mean number of store visits is less than 65.52.
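The one-line function and the upper-bound calculation can be sketched as follows, assuming an illustrative vector of store visits (the values are our own assumptions):

```r
# One-line standard error function: sd divided by the square root of N
se <- function(x) { sd(x) / sqrt(length(x)) }

# Illustrative visit counts (values are assumptions)
store.visits <- c(45, 78, 32, 56, 34)

se(store.visits)                                          # standard error
upper.bound <- mean(store.visits) + 1.96 * se(store.visits)
upper.bound                                               # upper end of the 95% interval
```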

A schematic for a new function is: FUNCTIONNAME <- function(INPUTS) EXPR. In most cases, EXPR is a set of multiple lines that operate on the inputs. When there are multiple lines, they must be enclosed with braces { and }. By default, the return value of the function is the output of the last command in the function declaration.

As for the inputs to functions (such as x in se() above), there are a few things to know. First, you can name them with any legal variable name in R. They can accept any type of input. We use the term argument for inputs in this book (instead of parameter, which we reserve for statistical models). An argument has meaning only within its function; in programming jargon, it is scoped to the function. Thus, if you declare x as an argument, then x has a value inside that function as assigned when the function is called; outside the function it could have another value or not be declared. It is good practice in a function to use only variables that have been declared as arguments to the function; don’t refer to global workspace variables whose existence is unpredictable.

If you’ve programmed in other languages, you may find it unusual that R does not specify types for function arguments. It allows an argument to be of any type and will try to use it as is, issuing warnings and errors as necessary. (Pay attention to them!) For example, if we try to compute the standard error of the character vector store.df$store.manager, we get a return value of NA along with a warning:

figure cd

In Sect. 12.3.3 we introduce ways to identify object types when you need to determine them.

When writing a function, we recommend four conventions:

  • Put braces around the body using { and }, even if it’s just a one line function

  • Create temporary values to hold results along the way inside the function

  • Comment the function profusely

  • Use the keyword return() to show the explicit value returned by the function.

Putting those recommendations together, the se function above might be rewritten as follows:

figure ce
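One way to apply the four conventions is sketched below. The temporary variable names are our own choices, not a required style:

```r
se <- function(x) {
  # Computes the standard error of the mean for a numeric vector x
  tmp.sd <- sd(x)                    # standard deviation of the data
  tmp.N  <- length(x)                # sample size
  tmp.se <- tmp.sd / sqrt(tmp.N)     # standard error of the mean
  return(tmp.se)                     # explicit return value
}

se(c(45, 78, 32, 56, 34))            # usage with illustrative values
```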

Perhaps this is overkill for such a simple function. However, when your functions get longer and you or your colleagues refer to them years later, you’ll be glad that they are clean and well-documented.

A function is an object in memory just like data, and may be inspected, listed, and deleted in the same ways. In particular, one may inspect a function simply by typing its name (without the parentheses):

figure cf

This makes it possible to examine what a function is doing and works for many functions in R and add-on packages.

7.1 Language Structures*

This optional section is for experienced programmers and describes how the R language controls a sequence of commands in a script or function.

If you program in a language such as C or Java, the control structures in R will be familiar. Using TEST to indicate a Boolean value (or value coercible to Boolean) and EXPR for any language expression—which may include a block of expressions inside { and }—R provides:

figure cg

Of these, we only use if() and for() in this book. We describe for() in more detail in Sect. 5.12, and cover if() in Sect. 5.1.3.
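For readers who want to see these structures in running code, here is a brief sketch with made-up values; it is an illustration of R’s standard control structures rather than code used elsewhere in the book:

```r
x <- c(3, 8, 5)        # illustrative values
total <- 0

# for() with if() / else: add values above 4, subtract the others
for (val in x) {
  if (val > 4) {
    total <- total + val
  } else {
    total <- total - val
  }
}
total                  # 8 + 5 - 3

# while(): repeat as long as the test is TRUE
i <- 0
while (i < 3) {
  i <- i + 1
}

# repeat: loop until break is called
n <- 0
repeat {
  n <- n + 1
  if (n >= 5) break
}

# switch(): choose an expression by name
choice <- switch("b", a="first", b="second")
```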

There is a caveat to these control structures. On the surface, R syntax appears similar to imperative programming languages (such as C, C++, and Java) but underneath it is a functional language whose approach more closely resembles Lisp, Clojure, or in particular, Scheme. To advance as an R programmer, you will wish to learn more about functional programming and the object models that underlie it. See Sect. 2.10 for pointers on advanced programming skills.

In addition to the standard if() statement, R provides a vectorized version: ifelse(TEST, YES, NO). ifelse() applies TEST to every element in a vector and returns the value of the expression YES for elements that pass the test as TRUE and the value of the expression NO for those that do not pass.

For example, here’s how we can use ifelse() to test each number in a vector before applying a math function to it, and thus avoid a common error:

figure ch
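As a sketch of this idea, assuming the math function is log() and the vector holds a few integers around zero (both are our own choices for illustration), we can replace non-positive values with NA before taking the logarithm:

```r
x <- -2:2                             # includes values where log() is undefined

ifelse(x > 0, x, NA)                  # keep positive values, NA otherwise
safe.log <- log(ifelse(x > 0, x, NA)) # log() is applied only to valid input
safe.log
```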

7.2 Anonymous Functions*

Another useful feature is an anonymous function (also known as a lambda expression) which can substitute for a general expression and does not need to be declared separately as a named function. (We use the apply() function here, which is similar to lapply() that we saw above, but works on non-list data such as data frames; for full details, see Sect. 3.3.4.)

Suppose for some reason we want the median divided by 2 for columns of data. One solution is to take the median() of each column using the apply() function on the data’s 2nd dimension (the columns), and then divide the result by 2:

figure ci

The second command here applies the median() function to each column of data (because the MARGIN is given the value 2), and then divides the resulting vector by 2.

A second solution is a function with a name such as halfmedian, with apply():

figure cj

This now applies our custom halfmedian() function to each column.

However, creating such a function adds clutter to the namespace. Unless you want to use such a function in multiple places, that is inefficient. A third way to solve the problem is to create an anonymous function that does the work in place with no function name:

figure ck

If you find yourself creating a short function that is only used once, consider whether an anonymous function might be simpler and clearer.
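The three solutions can be compared side by side in a sketch such as the following, which uses a small hypothetical matrix so that the results are easy to check by hand:

```r
# Hypothetical data: two columns of four values each
my.data <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol=2)

# Solution 1: median of each column (MARGIN=2), then divide by 2
res1 <- apply(my.data, 2, median) / 2

# Solution 2: a named function passed to apply()
halfmedian <- function(x) { median(x) / 2 }
res2 <- apply(my.data, 2, halfmedian)

# Solution 3: an anonymous function defined in place
res3 <- apply(my.data, 2, function(x) { median(x) / 2 })

res1
```

All three approaches return the same vector; they differ only in where the logic lives.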

This example reveals a truth about R: there are often many ways to solve a problem, and the best way in general is the one that makes sense to you. As you learn more about R, your opinion of what is best will change and your code will become more elegant and efficient. R analysts are thus like economists in the famous joke: “if you ask five economists, you’ll get six different opinions.”

For further reference (without jokes), a formal outline of the R language is available in the R Language Definition, http://cran.r-project.org/doc/manuals/R-lang.pdf [158].

Because this book is about analytics, not programming, we don’t cover the complete details of functions but just use them as necessary. To learn more about R’s programming model, see the Learning More Sect. 2.10 and a longer example in Chap. 12.

8 Clean Up!

R keeps everything in memory by default, and when you exit (use the command q(), for quit) R offers to save the memory workspace to disk to be loaded next time. That is convenient but means that your workspace will become crowded unless you keep it clean. This can lead to subtle and irreproducible bugs in your analyses, when you believe an object has one value but in reality it has been kept around with some other, forgotten value.

We recommend a few steps to keep your workspace clean. Use the ls() (list objects) command periodically to see what you have in memory. If you don’t recognize an object, use the rm() command to remove it. You can remove a single object by using its name, or a group of them with the list= argument plus a character vector of names, or a whole set following a pattern with list=ls(pattern="STRING") (tip: don’t use "*" because it will match more than you expect):

figure cl
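The three styles of removal can be sketched with a few throwaway objects. The object names below are hypothetical, and the pattern is a regular expression that matches names beginning with "scratch":

```r
# Throwaway objects (hypothetical names)
scratch.a <- 1
scratch.b <- 2
scratch.c <- 3
scratch.d <- 4
keep.me   <- 5

rm(scratch.a)                            # a single object by name
rm(list=c("scratch.b", "scratch.c"))     # a group, as a character vector
rm(list=ls(pattern="^scratch"))          # everything matching a pattern

ls()                                     # see what remains in memory
```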

It’s better to start every session clean instead of saving a workspace. And as we’ve said, it’s a good idea to keep all important and reproducible code in a working script file. This will make it easy to recreate an analysis and keep a workspace clean and reproducible.

To clean out memory and ensure you’re starting from scratch at a given time, first you will wish to remove old data and other objects. In RStudio, you can do this by clicking the small “broom” icon in the environment window, or selecting Session | Clear workspace from the menu. Or, at the command line:

figure cm

A good second step is to restart the R interpreter. In RStudio, select Session | Restart R from the menu. This recovers memory and resets the workspace for subsequent analyses.

Alternatively, you may accomplish both steps by exiting without saving the workspace, and then restarting R or RStudio.

9 Key Points

Most of the present chapter is foundational to R, yet there are a few especially important points:

  • For work that you want to preserve or edit, use a text editor and run commands from there (Sect. 2.3).

  • Create vectors using c() for enumerated values, seq() for sequences, and rep() for repeated values (Sects. 2.4.1 and 2.4.3).

  • Use the constant NA for missing values, not an arbitrary value such as −999 (Sect. 2.4.5).

  • In R, data sets are most commonly data.frame objects created with a command such as my.df <- data.frame(vector1, vector2, ...) (Sect. 2.5) or by reading a data file.

  • Vectors and data frames are most often indexed with specific numbers (x[1]), ranges (x[2:4]), negative indices (x[-3]) to omit data, and by boolean selection (x[x>3]) (Sects. 2.5 and 2.4.3).

  • Data frames are indexed by [ROW, COLUMN], where a blank value means “all of that dimension” such as my.df[2, ] for row 2, all columns (Sect. 2.5).

  • You can also index a data frame with $ and a column name, such as my.df$id (Sect. 2.5).

  • Read and write data in CSV format with read.csv() and write.csv() (Sect. 2.6.2).

  • Functions are straightforward to write and extend R’s capabilities. When you write a function, organize the code well and comment it profusely (Sect. 2.7).

  • Clean up your workspace regularly to avoid clutter and bugs from obsolete variables (Sect. 2.8).

10 Learning More*

In this chapter, we have described enough of the R language to get you started for the applications in this book. Later chapters include additional instruction on the language as needed for their problems, often presented as separate Language Brief sections. If you wish to delve more deeply into the language itself, the following books can also help.

If you are new to statistics, programming, and R, Dalgaard’s An Introduction to R [40] gives well-paced grounding in R and basic statistics commands. It is a great complement to this book for more practice with the R language.

For those who are experienced with statistics, A Beginner’s Guide to R by Zuur et al. [209] dives into R broadly at a more accelerated pace.

If you are an experienced programmer or want to learn the R language in detail, Matloff’s The Art of R Programming [135] is a readable and enjoyable exposition of the language from a computer science perspective. John Chambers’s Software for Data Analysis [29] is an advanced description of the R language model and its implementation. Wickham’s Advanced R [197] focuses on functional programming in R and how to write more effective and reusable code.

Whereas this book focuses on teaching R at a conceptual level, it is also helpful to have more examples in a cookbook format. Albert and Rizzo approach that task from a largely regression-oriented perspective in R by Example [4]. A code-oriented collection that is lighter on statistics but deeper on programming is Teetor’s R Cookbook [186]. Lander (2017) presents a mix of both approaches, language and statistics, applied to a variety of analytic problems in R for Everyone [124].

11 Exercises

11.1 Preliminary Note on Exercises

The exercises in each chapter are designed to reinforce the material. They are provided primarily for classroom usage but are also useful for self-study. On the book’s website, we provide R files with example solutions at http://r-marketing.r-forge.r-project.org/exercises.

We strongly encourage you to complete exercises using a tool for reproducible results, so the code and R results will be shown together in a single document. If you are using RStudio, an easy solution is to use an R Notebook; see Appendix B for a brief overview of R Notebooks and other options. A simple R Notebook for classroom exercises is available at the book’s website noted above.

For each answer, do not simply determine the answer and report it; instead write R code to find the answer. For example, suppose a question could be answered by copying two or more values from a summary command, and pasting them into the R console to compute their difference. Better programming practice is to write a command that finds the two values and then subtracts them with no additional requirement for you to copy or retype them. Why is that better? Although it may be more difficult to do once, it is more generalizable and reusable if you need to do the same procedure again. At this point, that is not so important, but as your analyses become complex, it will be important to eliminate manual steps that may lead to errors.

Before you begin, we would reemphasize a point noted in Sect. 2.7.2: there may be many ways to solve a problem in R. As the book progresses, we will demonstrate progressively better ways to solve some of the same problems. And R programmers may differ as to what constitutes “better.” Some may prefer elegance while others prefer speed or ease of comprehension. At this point, we recommend that you consider whether a solution seems optimal, but don’t worry too much about it. Getting a correct answer in any one of multiple possible ways is the most important outcome.

In various chapters the exercises build on one another sequentially; you may need to complete previous exercises in the chapter to answer later ones. Exercises preceded by an asterisk (*) correspond to one of the optional sections in a chapter.

11.2 Exercises

  1. Create a text vector called Months with names of the 12 months of the year.

  2. Create a numeric vector Summer, with Calendar month index positions for the summer months (inclusive, with 4 elements in all).

  3. Use vector indexing to extract the text values of Months, indexed by Summer.

  4. Multiply Summer by 3. What are the values of Months, when indexed by Summer multiplied by 3? Why do you get that answer?

  5. What is the mean (average) summer month, as an integer value? Which value of Months corresponds to it? Why do you get that answer?

  6. Use the floor() and ceiling() functions to return the upper and lower limits of Months for the average Summer month. (Hint: to find out how a function works, use R help if needed.)

  7. Using the store.df data from Sect. 2.5, how many visits did Bert’s store have?

  8. It is easy to make mistakes in indexing. How can you confirm that the previous answer is actually from Bert’s store? Show this with a command that produces no more than 1 row of console output.

  9. *Write a function called PieArea that takes the length of a slice of pie and returns the area of the whole pie. (Assume that the pie is cut precisely, and the length of the slice is, in fact, the radius of the pie.) Note that ^ is the exponentiation operator in R.

  10. *What is PieArea for slices with lengths 4.0, 4.5, 5.0, and 6.0?

  11. *Rewrite the previous command as one line of code, without using the PieArea() function. Which of the two solutions do you prefer, and why?