
1 Introduction

Multiple regression analysis is a dependence technique in which the researcher analyzes the relationship between a single dependent (criterion) variable and several independent variables. In multiple regression analysis, we use independent variables whose values are known or fixed (non-stochastic) to predict a single dependent variable whose values are random (stochastic). The dependent and independent variables are metric in nature; however, in some situations it is possible to use non-metric data as independent variables (as dummy variables).

Gujarati and Sangeetha (2008) defined regression as:

‘It is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating and/or predicting the (population) mean or average value of the former in terms of the known or fixed (in repeated sampling) values of the latter’.

2 Important Assumptions of Multiple Regression

1. Linearity—the relationship between the predictors and the outcome variable should be linear.

2. Normality—the errors should be normally distributed. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients only requires that the errors be identically and independently distributed.

3. Homogeneity of variance (homoscedasticity)—the error variance should be constant.

4. Independence (no autocorrelation)—the errors associated with one observation are not correlated with the errors of any other observation.

5. There is no multicollinearity or perfect correlation between the independent variables.

Additionally, there are issues that can arise during the analysis that, while not strictly assumptions of regression, are nonetheless of great concern in regression analysis. These are:

1. Outliers: an outlier is an observation whose dependent-variable value is unusual given its values on the predictor (independent) variables.

2. Leverage: an observation with an extreme value on a predictor variable is called a point with high leverage.

3. Influence: an observation is said to be influential if removing it substantially changes the estimate of the coefficients. Influence can be thought of as the product of leverage and outlierness.

3 Multiple Regression Model with Three Independent Variables

One of the well-known supermarket chains in the country (the ABC group) has adopted an aggressive marketing strategy over the last 19 months, particularly to increase the sales of its own private brands. Recently, the company decided to investigate its product sales during this period. Over these 19 months, the company invested a large amount of money in three strategic areas: advertisement, marketing (excluding advertisement and distribution), and its distribution network. The company decided to run a multiple regression analysis to predict the impact of advertisement, marketing, and distribution expenses on its sales (Table 8.1a).

Table 8.1a Sales, advertising, marketing, and distribution expenses

4 Multiple Regression Equation

A multiple regression equation with three independent variables is given below:

$$ Y_{t} = \beta_{1} + \beta_{2} x_{2t} + \beta_{3} x_{3t} + \beta_{4} x_{4t} + u^{\prime}_{t} $$
(1)
$$ \begin{aligned} Sales_{t} & = \beta_{1} \left( {constant} \right) + \beta_{2} \left( {Advertisement\;Ex.} \right)_{t} + \beta_{3} \left( {Marketing\;Ex.} \right)_{t} \\ & \, + \beta_{4} \left( {Distribution\;Ex.} \right)_{t} \\ & \, + u^{\prime}_{t} \\ \end{aligned} $$
(2)

Here, \( Y_{t} \) is the value of the dependent variable (sales) in time period t, and \( \beta_{1} \) is the intercept, i.e., the average value of the dependent variable when all the independent variables are zero. \( \beta_{2} \), \( \beta_{3} \), and \( \beta_{4} \) are the slopes of sales (partial regression coefficients) with respect to the independent variables advertisement expenses, marketing expenses, and distribution expenses, holding the other variables constant. For example, the coefficient \( \beta_{2} \) implies that a one-unit change (increase or decrease) in advertisement expenses will lead to a \( \beta_{2} \)-unit change (increase or decrease) in sales, holding the other variables constant. \( u^{\prime}_{t} \) is the random error in Y for time period t.
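To make the reading of the partial regression coefficients concrete, the short Python sketch below evaluates the regression function of Eq. 2 for a baseline expense pattern and again with advertisement expenses raised by one unit; the predicted sales differ by exactly \( \beta_{2} \). All coefficient and expense values in the sketch are made up for illustration and are not estimates from the supermarket data.

    # Illustration of Eq. (2): the effect of a one-unit change in one predictor.
    # All beta values and expense figures below are made up for illustration only.
    def predict_sales(adv, mkt, dist, b1=5000.0, b2=30.0, b3=12.0, b4=8.0):
        """Predicted sales = b1 + b2*advertising + b3*marketing + b4*distribution."""
        return b1 + b2 * adv + b3 * mkt + b4 * dist

    base = predict_sales(adv=10.0, mkt=8.0, dist=20.0)      # baseline expense pattern
    plus_one = predict_sales(adv=11.0, mkt=8.0, dist=20.0)  # advertising up by one unit

    print(plus_one - base)  # equals b2 (30.0): the partial effect of advertising on sales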

5 Regression Analysis Using SPSS

Step 1:

Open the data file named Supermarket.sav (Fig. 8.1).

Fig. 8.1 SPSS data view window

Step 2:

Go to Analyze => Regression => Linear to get the Linear Regression window as given in Fig. 8.2.

Fig. 8.2 SPSS linear regression window

Step 3:

Move the dependent variable Sales from the left panel of the Linear Regression window into the Dependent box (right panel) and the other three variables into the Independent(s) box (Fig. 8.3).

Fig. 8.3 SPSS linear regression window

Step 4:

Click the Statistics option and select Estimates, Model fit, and Descriptives, then click on Continue to get the main window of Linear Regression (Fig. 8.4).

Fig. 8.4 Linear regression statistics window

Step 5:

Go to the main window of linear regression and click OK (Fig. 8.5).

Fig. 8.5 SPSS linear regression window
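For readers who want to reproduce Steps 1–5 with syntax rather than menus, the sketch below fits the same model in Python using the statsmodels library. Reading the .sav file requires the pyreadstat package, and the variable names used here (Sales, Advertising, Marketing, Distribution) are assumptions that should be matched to the actual names in Supermarket.sav.

    # A sketch of the regression of Steps 1-5 run programmatically instead of via the menus.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_spss("Supermarket.sav")   # needs pyreadstat; column names below are assumed
    X = sm.add_constant(df[["Advertising", "Marketing", "Distribution"]])
    fit = sm.OLS(df["Sales"], X).fit()

    print(fit.summary())  # analogous to the Model Summary, ANOVA, and Coefficients tables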

6 Output Interpretation for Regression Analysis

Table 8.1b in the SPSS regression output shows the model summary, which provides the value of R (multiple correlation), R² (coefficient of determination), and adjusted R² (R² adjusted for degrees of freedom). In this model, R has a value of 0.970, which represents the multiple correlation between the dependent and independent variables. The value of R² shows that the three independent variables together account for about 94 % of the variation in sales; in other words, R² quantifies the combined contribution of the three expense heads to sales. The remaining 6 % or so of the variation in sales cannot be explained by these expenses, so it can be concluded that there must be other variables that also influence sales.

Table 8.1b Model summary
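As a quick check on how the model summary quantities are related, the fragment below recomputes adjusted R² from R² using the standard degrees-of-freedom correction, with n = 19 observations and k = 3 predictors; the exact figure in Table 8.1b may differ slightly because of rounding.

    # Adjusted R-squared from R-squared: R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1).
    # R2 of about 0.941 corresponds to the reported R of 0.970 (0.970 ** 2).
    r2, n, k = 0.941, 19, 3
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(round(r2_adj, 3))  # R-squared penalised for the number of predictors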

Table 8.2 reports an analysis of variance (ANOVA). This table shows all the sums of squares associated with the regression. The regression sum of squares is the part of the variation explained by the model, i.e., by all the independent variables together. The residual sum of squares is the unexplained part, and the total sum of squares is the total variation of the dependent variable. The third column shows the degrees of freedom associated with each sum of squares. The mean squares for the regression and the residual are calculated by dividing the respective sums of squares by their degrees of freedom. The most important entry in this table is the F value, which is the ratio of the regression mean square to the residual mean square. For this model, the F value is 78.742, which is significant (p < 0.01). This tells us that an F-ratio this large would be very unlikely if the null hypothesis were true. Therefore, from the ANOVA table we can infer that our regression model results in significantly better prediction of sales.

Table 8.2 ANOVA
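The F value itself is a simple ratio of mean squares, as the sketch below shows; the sums of squares used here are placeholders to be replaced by the figures reported in Table 8.2.

    # Sketch: the ANOVA F value is the regression mean square divided by the residual mean square.
    from scipy import stats

    def anova_f(ss_reg, ss_res, df_reg, df_res):
        ms_reg, ms_res = ss_reg / df_reg, ss_res / df_res   # mean squares
        f = ms_reg / ms_res
        return f, stats.f.sf(f, df_reg, df_res)             # F and its p-value

    # 19 observations and 3 predictors give df_reg = 3 and df_res = 19 - 3 - 1 = 15.
    # The sums of squares below are placeholders, not the values from Table 8.2.
    print(anova_f(ss_reg=1.0e6, ss_res=6.3e4, df_reg=3, df_res=15))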

Looking at the ANOVA in Table 8.2 alone, we cannot make inferences about the predictive ability of the individual independent variables. Table 8.3 provides details about the model parameters. Looking at the coefficient values and their significance, one can interpret the contribution of each predictor to the dependent variable. The value 6908.926 is the constant term, \( \beta_{1} \) in Eqs. 1 and 2. It can be interpreted as follows: when no money is spent on any of the three areas (advertising, marketing, and distribution), i.e., \( X_{2} = X_{3} = X_{4} = 0 \), the model predicts that average sales would be 6908.92 (remember that our unit of measurement is lakhs). The coefficient for advertising expenses, 33.56, is the partial regression coefficient \( \beta_{2} \). It represents the change in the outcome associated with a one-unit change in that predictor while the other variables are held constant. Therefore, if advertising expenses increase by one unit, the model predicts a 33.56-unit increase in the dependent variable (sales), holding marketing and distribution expenses constant. Since both sales and advertising expenses are measured in lakhs, an increase of Rs. 1 lakh in advertising expenses is predicted to increase sales by about Rs. 33.56 lakhs, holding the other expenses constant. The other coefficients can be interpreted in the same fashion. A negative sign on a coefficient would indicate an inverse relationship between the dependent variable and that independent variable.

Table 8.3 Coefficients

The Standard Error column gives the standard error associated with each coefficient estimate. The Standardized Coefficients column shows the standardized coefficient (beta) for each estimate, expressed in a common unit of measurement. These coefficients can be used to judge the relative importance of the independent variables when they are measured in different units. Looking at these coefficients, one can infer that advertising expense is the most important predictor, followed by distribution expenses.

The last two columns show the t-value and the associated probability. The t-value is calculated as the unstandardized coefficient divided by its standard error. The t-test tells us whether the β-value is significantly different from 0. The last column of Table 8.3 shows the probability that the observed value of t would occur if the value of β in the population were 0. If this probability is less than 0.05, the researcher concludes that the result reflects a genuine effect, i.e., that β is different from 0. From the table, it is evident that for all three independent variables the probability is less than the assumed 0.05 level, so we can say that in all three cases the coefficients are significantly different from zero and contribute significantly to the model.
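The arithmetic behind these two columns can be sketched as follows; the standard error used here is a placeholder, since the exact value from Table 8.3 is not reproduced in the text.

    # Sketch: t = unstandardized coefficient / standard error, with a two-sided p-value
    # on n - k - 1 = 19 - 3 - 1 = 15 residual degrees of freedom.
    from scipy import stats

    b = 33.56          # advertising coefficient from Table 8.3
    se = 4.0           # placeholder standard error (take the value from Table 8.3)
    df_resid = 15
    t = b / se
    p = 2 * stats.t.sf(abs(t), df_resid)
    print(t, p)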

7 Examination of Major Assumptions of Multiple Regression Analysis

7.1 Examination of Residual

Examining the residuals provides useful insights into the appropriateness of the underlying assumptions and of the fitted regression model. A residual is the difference between the observed value \( Y_{i} \) and the value \( \hat{Y}_{i} \) predicted by the regression equation. Residuals are used in the calculation of several statistics associated with regression. Without verifying that the data meet the regression assumptions, the results may be misleading.

7.2 Test of Linearity

When we do linear regression, we assume that the relationship between the response variable and the predictors is linear; this is the assumption of linearity. If this assumption is violated, the linear regression will try to fit a straight line to data that do not follow a straight line. Checking the linearity assumption in the case of simple regression is straightforward, since there is only one predictor: a scatter plot of the response variable (dependent variable) against the predictor (independent variable) reveals whether nonlinearity is present, such as a curved band or a big wave-shaped curve. With several predictors, the response can be plotted against each predictor in turn, as in the sketch below.
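A minimal Python sketch of this visual check is given below; as before, the variable names are assumptions to be matched to the actual data file.

    # Sketch: scatter plots of the response (Sales) against each predictor to check linearity.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_spss("Supermarket.sav")   # needs pyreadstat; column names are assumed
    for col in ["Advertising", "Marketing", "Distribution"]:
        plt.figure()
        plt.scatter(df[col], df["Sales"])
        plt.xlabel(col)
        plt.ylabel("Sales")
        plt.title("Sales vs " + col)
    plt.show()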

7.3 Test of Normality

The assumption of a normally distributed error term can be examined by constructing a histogram of the residuals; a visual check reveals whether the distribution is approximately normal. It is also useful to examine the normal probability plot of the standardized residuals against the expected standardized residuals from a normal distribution: if the observed residuals are normally distributed, they will fall on the 45-degree line. Additional evidence can be obtained by determining the percentage of residuals falling within ±2 SE or ±2.5 SE. More formal assessments can be made by running the Shapiro–Wilk, Kolmogorov–Smirnov, Cramér–von Mises, and Anderson–Darling tests.
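A hedged sketch of some of these formal tests applied to the residuals is given below; the model is refitted as in the earlier sketch, and the variable names are again assumptions.

    # Sketch: formal normality tests on the regression residuals.
    import pandas as pd
    import statsmodels.api as sm
    from scipy import stats

    df = pd.read_spss("Supermarket.sav")
    X = sm.add_constant(df[["Advertising", "Marketing", "Distribution"]])
    resid = sm.OLS(df["Sales"], X).fit().resid

    print(stats.shapiro(resid))                # Shapiro-Wilk
    z = (resid - resid.mean()) / resid.std()
    print(stats.kstest(z, "norm"))             # Kolmogorov-Smirnov on standardised residuals
    print(stats.anderson(resid, dist="norm"))  # Anderson-Darling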

7.4 Test of Homogeneity of Variance (Homoscedasticity)

The assumption of constant variance of the error term can be examined by plotting the residuals against the predicted values of the dependent variable, \( \hat{Y}_{i} \). If the pattern is not random, the variance of the error term is not constant.
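A minimal sketch of this residuals-versus-fitted plot, under the same assumed setup as before:

    # Sketch: residuals plotted against predicted values to check for constant variance.
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    df = pd.read_spss("Supermarket.sav")
    X = sm.add_constant(df[["Advertising", "Marketing", "Distribution"]])
    fit = sm.OLS(df["Sales"], X).fit()

    plt.scatter(fit.fittedvalues, fit.resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Predicted sales")
    plt.ylabel("Residual")
    plt.show()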

7.5 Test of Autocorrelation

A plot of the residuals against time, or against the sequence of observations, throws some light on the assumption that the error terms are uncorrelated (no autocorrelation). A random pattern should be seen if this assumption is true. A more formal procedure for examining correlation between the error terms is the Durbin–Watson test (applicable only to time series data).
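A sketch of the Durbin–Watson statistic computed on the residuals follows; values near 2 suggest no first-order autocorrelation, and the setup and variable names are the same assumptions as before.

    # Sketch: Durbin-Watson test for first-order autocorrelation of the residuals.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    df = pd.read_spss("Supermarket.sav")
    X = sm.add_constant(df[["Advertising", "Marketing", "Distribution"]])
    fit = sm.OLS(df["Sales"], X).fit()
    print(durbin_watson(fit.resid))   # roughly 2 when there is no autocorrelation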

7.6 Test of Multicollinearity

The presence of multicollinearity, that is, a strong (or even perfect) linear relationship between the independent variables, can be identified using several methods. These methods are:

1. VIF (variance-inflating factor): as a rule of thumb, if the VIF of a variable exceeds 10, which happens when the R² obtained by regressing that variable on the other independent variables exceeds 0.90, that variable is said to be highly collinear (Gujarati and Sangeetha 2008). A computational sketch of VIF and tolerance follows this list.

2. TOL (tolerance): the closer TOL is to zero, the greater the degree of collinearity of that variable with the other independent variables (Gujarati and Sangeetha 2008).

3. Condition index (CI): if the CI exceeds 30, there is severe multicollinearity (Gujarati and Sangeetha 2008).

4. Partial correlations: high partial correlations between independent variables also indicate the presence of multicollinearity.
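The sketch below computes VIF and tolerance for each predictor, using the same assumed variable names as the earlier sketches.

    # Sketch: VIF and tolerance (TOL = 1/VIF) for each predictor.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_spss("Supermarket.sav")
    X = sm.add_constant(df[["Advertising", "Marketing", "Distribution"]])
    for i, name in enumerate(X.columns):
        if name == "const":
            continue                                   # skip the intercept column
        vif = variance_inflation_factor(X.values, i)
        print(name, "VIF =", round(vif, 2), "TOL =", round(1 / vif, 2))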

7.7 Questions

Examine the following fictitious data:

Model    R        R²       Adjusted R²    Std. error of the estimate
1        0.863    0.849    0.850          13.8767

1. Which of the following statements can we not say?

   (a) The standard error is an estimate of the variance of y, for each value of x.

   (b) In order to obtain a measure of explained variance, you need to square the correlation coefficient.

   (c) The correlation between x and y is 86 %.

   (d) The correlation is good here as the data points cluster around the line of fit quite well. So prediction will be good.

   (e) The correlation between x and y is 85 %.

2. The slope of the line:

   (a) Gives us a measure of how much y changes as x changes.

   (b) Is the point where the regression line cuts the vertical axis.

   (c) Indicates the variability of the points around the regression line in the scatter diagram.

   (d) None of the above.

   (e) Is the average value of the dependent variable.

3. Using some fictitious data, we wish to predict the musical ability of a person who scores 8 on a test of mathematical ability. We know the relationship is positive. We know that the slope is 1.63 and the intercept is 8.41. What is their predicted score on musical ability?

   (a) 80.32

   (b) −4.63

   (c) 21.45

   (d) 68.91

   (e) 54.55

4. We have a negative relationship between number of drinks consumed and number of marks in a driving test. One individual scores 3 on number of drinks consumed, another individual scores 5 on number of drinks consumed. What will be their respective scores on the driving test if the intercept is 18 and the slope is 3?

   (a) It is not possible to predict from negative relationships.

   (b) Driving test scores (Y-axis) will be 51 and 87 [individual who scored 5 on drink consumption].

   (c) Driving test scores (Y-axis) will be 27 [individual who scored 3 on drink consumption] and 33 [individual who scored 5 on drink consumption].

   (d) Driving test scores (Y-axis) will be 9 [individual who scored 3 on drink consumption] and 3 [individual who scored 5 on drink consumption].

   (e) None of these.

5. You are still interested in whether problem-solving ability can predict the ability to cope well in difficult situations, whether motivation can predict coping, and whether these two factors together predict coping even better. You produce some more results.

Dependent variable: coping skills in difficult situations

             Unstandardized coefficients    Standardized coefficients    t        Sig.
             B          Std. error          Beta
Constant     −0.466     0.241                                            1.036    0.302
Problem       0.200     0.048               0.140                        2.082    0.030
Motivation    0.950     0.087               0.740                        10.97    0.000

Which of the following statements is incorrect?

   (a) As motivation increases by one standard deviation, coping skills increase by almost three quarters of a standard deviation (0.74). Thus, motivation appears to contribute more to coping skills than problem solving.

   (b) As motivation increases by one unit, coping skills increase by 0.95.

   (c) The t-value for problem solving is 2.082 and the associated probability is 0.03. This tells us the likelihood of such a result arising by sampling error, assuming the null hypothesis is true, is 97 in 100.

   (d) Problem solving has a regression coefficient of 0.20. Therefore, as problem solving increases by one unit, coping skills increase by 0.20.

   (e) None of these.