
3.1 Basic Statistics

Statistics is the science, pure and applied, of creating, developing, and applying techniques to evaluate the uncertainty of inductive inferences. It helps to answer questions about different hypotheses, it models the role of chance in experiments in a quantitative way, and it gives estimates together with their errors. The propagation of error in input values can also be determined with statistics. The history of statistics goes back to the experience of gambling (seventeenth century), which led to the concept of probability. Afterwards the concepts of the normal curve and the normal curve of error were introduced. Charles Darwin's (1809–1882) work was largely biostatistical in nature. Karl Pearson (1857–1936) founded the journal Biometrika and a school of statistics. Pearson was mainly concerned with large data sets, and his student W. S. Gosset (1876–1937), writing under the pseudonym Student, presented Student's t-test, a basic tool of statisticians and experimenters throughout the globe. Genichi Taguchi (1924–2012) promoted the use of experimental designs.

Observations in the form of numbers are essential for performing statistical analyses. In crop production, observations can include phenology, leaf area, crop biomass, and yield. These numbers constitute data, whose common characteristic is variability, or variation. Variables may be quantitative or qualitative, and observations on quantitative variables may be further classified as discrete or continuous. The probability of occurrence of a value of a trait such as blondeness is measured by a probability function or probability density function (PDF); the terms chance variable and random variable are used for variables possessing a PDF. A population comprises all possible values of a variable, while a part of a population is called a sample. The concept of randomness is used to obtain a sample that truly represents the population. Collected data can be characterized using tables, charts (pie charts, bars, etc.), and pictures (histograms). Data are then presented in frequency tables, and a measure of central tendency is used to locate the center, which in turn helps in measuring the spread of the observations. The mean or average (μ) is the most common measure of central tendency. For a die, μ can be calculated using the following equation

$$ \mu =\frac{1+2+3+4+5+6}{6}=3\frac{1}{2} $$
(3.1)

If a sample with four observations (3, 5, 7, 9) is taken from the population, then the sample mean \( \overline{Y} \) for these observations is

$$ \overline{Y}=\frac{3+5+7+9}{4}=6 $$
(3.2)

This can be further symbolized by

$$ \overline{Y}=\frac{Y_1+{Y}_2+{Y}_3+{Y}_4}{4} $$
(3.3)

where Y1 = value of the first observation, Y2 = value of the second observation, Y3 = value of the third observation, and Y4 = value of the fourth observation. For a sample of n observations, Yi is used to represent the ith observation, and \( \bar{Y} \) is given by

$$ \overline{Y}=\frac{Y_1+{Y}_2+{Y}_3+{Y}_4+\dots +{Y}_i+\dots +{Y}_n}{n} $$
(3.4)

This equation can be further shortened to

$$ \overline{Y}=\frac{\sum_{i=1}^n{Y}_i}{n} $$
(3.5)

The difference between an observation (Yi) and the sample mean (\( \overline{Y} \)) is called a sample deviation, \( \left({Y}_i-\overline{Y}\right) \), and the sum of the sample deviations is zero: \( \sum \left({Y}_i-\overline{Y}\right)=0 \).

When means are based on different numbers of observations, it is better to use weights that depend on the number of observations in each mean; the result is called the weighted mean, defined as follows:

$$ {\overline{Y}}_w=\frac{\sum {w}_i{Y}_i}{\sum {w}_i} $$
(3.6)
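A minimal numeric sketch of these formulas (the sample 3, 5, 7, 9 is the one used above; the weighted-mean inputs are hypothetical):

```python
# Sketch: sample mean, sample deviations, and weighted mean.
def mean(values):
    return sum(values) / len(values)

y = [3, 5, 7, 9]
y_bar = mean(y)                          # (3 + 5 + 7 + 9) / 4 = 6.0
deviations = [yi - y_bar for yi in y]
print(y_bar, sum(deviations))            # 6.0 0.0 -- deviations sum to zero

# Weighted mean of group means with unequal group sizes (hypothetical values).
group_means = [6.0, 10.0]
weights = [4, 8]                         # observations behind each mean
y_w = sum(w * m for w, m in zip(weights, group_means)) / sum(weights)
print(y_w)                               # (4*6 + 8*10) / 12 = 8.666...
```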

Another measure that supplements the mean is the median, the value for which 50% of the observations lie on each side. If the number of values is even, the median is the average of the two middle values; e.g., for 3, 6, 8, and 11 the median is (6 + 8)/2 = 7. If the data are not symmetrical, the mean and median can differ and the data may be skewed in one direction; in that case the arithmetic mean may not be a good criterion of the central value. The mode (the most frequent value) is another measure of central tendency. Central tendency summarizes the data but provides no information about variation. The variance, the mean of the squared deviations (Yi − μ)2, and its square root, the standard deviation, are used to measure variation or dispersion about the mean. Variance is represented by two symbols: (i) σ2 (sigma squared, for the population) and (ii) s2 (for the sample). The population variance is the sum of squared deviations divided by the total number of values, and it is given by the following equation if we intend to sample this population with replacement:

$$ {\sigma}^2=\frac{{\left({Y}_1-\mu \right)}^2+{\left({Y}_2-\mu \right)}^2+{\left({Y}_3-\mu \right)}^2+\dots +{\left({Y}_N-\mu \right)}^2}{N} $$
(3.7)
$$ =\frac{\sum_i{\left({Y}_i-\mu \right)}^2}{N} $$
(3.8)

However, when sampling is without replacement, the divisor is N − 1, and the variance is represented by the following equation:

$$ {S}^2=\frac{{\left({Y}_1-\mu \right)}^2+{\left({Y}_2-\mu \right)}^2+{\left({Y}_3-\mu \right)}^2+\dots +{\left({Y}_N-\mu \right)}^2}{N-1} $$
(3.9)
$$ =\frac{\sum_i{\left({Y}_i-\mu \right)}^2}{N-1} $$
(3.10)

The sample variance (mean square) can be computed using the following formulas:

$$ {s}^2=\frac{{\left({Y}_1-\overline{Y}\right)}^2+{\left({Y}_2-\overline{Y}\right)}^2+{\left({Y}_3-\overline{Y}\right)}^2+\dots +{\left({Y}_n-\overline{Y}\right)}^2}{n-1} $$
(3.11)
$$ {s}^2=\frac{\sum_i{\left({Y}_i-\overline{Y}\right)}^2}{n-1} $$
(3.12)
$$ \left(n-1\right){s}^2={\sum}_i{\left({Y}_i-\overline{Y}\right)}^2 $$
(3.13)

The quantity (n − 1)s2, i.e., \( {\sum}_i{\left({Y}_i-\overline{Y}\right)}^2 \), is called the sum of squares (SS). For example, for the numbers 3, 5, 7, and 9, the SS is

$$ {\left(3-6\right)}^2+{\left(5-6\right)}^2+{\left(7-6\right)}^2+{\left(9-6\right)}^2={\left(-3\right)}^2+{\left(-1\right)}^2+{(1)}^2+{(3)}^2=9+1+1+9=20 $$

The variance for this data set is 20/3 = 6.67, and the square root of the sample variance is called the standard deviation (s). For the above example, it is calculated as follows:

$$ s=\sqrt{\frac{20}{3}}=2.58 $$

Thus the numerator of Eq. (3.12) can be represented as follows:

$$ SS={\sum}_i{\left({Y}_i-\overline{Y}\right)}^2 $$
(3.14)

Eq. (3.14) can be further modified into a computing formula as follows:

$$ {\sum}_i{\left({Y}_i-\overline{Y}\right)}^2=\sum \limits_i{Y_i}^2-\frac{{\left({\sum}_i{Y}_i\right)}^2}{n} $$
(3.15)

The term \( {\left({\sum}_i{Y}_i\right)}^2/n \) is called the correction factor (CF), correction term, or adjustment for the mean. Eq. (3.15) can easily be validated using the data set in Table 3.1.

Table 3.1 Data set for the validation of sum of squares equation
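Eq. (3.15), and the variance and standard deviation derived from it, can also be checked numerically (a minimal sketch using the sample 3, 5, 7, and 9):

```python
# Sketch: sum of squares by definition and by the computing formula (Eq. 3.15),
# then the sample variance and standard deviation.
import math

y = [3, 5, 7, 9]
n = len(y)
y_bar = sum(y) / n

ss_def = sum((yi - y_bar) ** 2 for yi in y)           # definition: 20.0
ss_comp = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n  # 164 - 24**2/4 = 20.0
s2 = ss_def / (n - 1)                                 # sample variance: 6.67
s = math.sqrt(s2)                                     # standard deviation: 2.58
print(ss_def, ss_comp, round(s2, 2), round(s, 2))
```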

Thus, \( SS={\sum}_i{\left({Y}_i-\overline{Y}\right)}^2=20 \), and by the computing formula, \( \sum \limits_i{Y_i}^2-{\left({\sum}_i{Y}_i\right)}^2/n=164-\frac{(24)^2}{4}=20 \) (Table 3.1). Another commonly used term is degrees of freedom (df), the number of values in the calculation that are free to vary, which here equals n − 1. The absolute mean deviation or average deviation is calculated as:

$$ \mathrm{Average}\ \mathrm{deviation}\ \mathrm{or}\ \mathrm{Absolute}\ \mathrm{mean}\ \mathrm{deviation}=\frac{\sum_i\left|{Y}_i-\overline{Y}\right|}{n} $$
(3.16)

The absolute mean deviation or average deviation for the values 3, 5, 7, and 9 is 2; the vertical bars indicate that all deviations are taken as positive. The variance \( \left({\sigma^2}_{\overline{Y}}\right) \) of the sample mean \( \overline{Y} \) can be calculated by the following equation:

$$ {\sigma^2}_{\overline{Y}}=\frac{\sigma^2}{n} $$
(3.17)

The corresponding standard deviation, \( {\sigma}_{\overline{Y}} \), for the population can be computed from the following expressions:

$$ {\sigma}_{\overline{Y}}=\sqrt{\frac{\sigma^2}{n}} $$
(3.18)
$$ {\sigma}_{\overline{Y}}=\frac{\sigma }{\sqrt{n}} $$
(3.19)

The standard deviation of the sample mean is called the standard error (SE). The variance of the sample mean \( \left({s^2}_{\overline{Y}}\right) \) and the SE can be calculated with the following equations:

$$ {s^2}_{\overline{Y}}=\frac{s^2}{n} $$
(3.20)
$$ {\mathrm{SE}}_{\overline{Y}}=\sqrt{\frac{s^2}{n}} $$
(3.21)
$$ {\mathrm{SE}}_{\overline{Y}}=\frac{s}{\sqrt{n}} $$
(3.22)

For the numbers 3, 5, 7, and 9 used above to calculate the standard deviation, the SE is:

$$ \mathrm{SE}=\sqrt{\frac{s^2}{n}}=\sqrt{\frac{6.67}{4}}=\sqrt{1.67}=1.29 $$

Variation can also be measured with the coefficient of variability (CV), also known as the relative standard deviation (RSD), a widely used indicator (see the example data in Table 3.2). It is a measure of relative variability: the ratio of the standard deviation (σ) to the mean (μ), usually multiplied by 100 and expressed as a percentage:

$$ \mathrm{coefficient}\ \mathrm{of}\ \mathrm{variation}\ \left(\mathrm{CV}\right)=\frac{\sigma }{\mu } $$
(3.23)
For the grain yield data of Table 3.2 (n = 5), the statistics are calculated as follows:

$$ \overline{Y}=\frac{\sum {Y}_i}{5}=\frac{7680}{5}=1536\;\mathrm{kg}\ {\mathrm{ha}}^{-1} $$
$$ {s}^2=\frac{\sum {Y_i}^2-{\left(\sum {Y}_i\right)}^2/5}{4}=\frac{\mathrm{12,045,400}-{(7680)}^2/5}{4}=\mathrm{62,230} $$
$$ s=\sqrt{62,230}=249.45\;\mathrm{kg}\ {\mathrm{ha}}^{-1} $$
$$ {s^2}_{\overline{Y}}=\frac{s^2}{5}=\frac{\mathrm{62,230}}{5}=\mathrm{12,446}\kern0.75em $$
$$ {\mathrm{SE}}_{\overline{Y}}=\sqrt{\frac{s^2}{5}}=\sqrt{\frac{\mathrm{62,230}}{5}}=\sqrt{\mathrm{12,446}}=111.56\;\mathrm{kg}\ {\mathrm{ha}}^{-1} $$
$$ \mathrm{CV}=\frac{249.45}{1536}\times 100=16\% $$
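A minimal sketch of the SE and CV calculations; the five yield values below are hypothetical stand-ins (they are not the Table 3.2 data), so the printed numbers differ from the example above:

```python
# Sketch: mean, standard deviation, standard error, and CV of a sample.
import math
import statistics

yields = [1250, 1400, 1520, 1680, 1800]    # kg/ha, hypothetical sample
n = len(yields)
y_bar = statistics.mean(yields)
s = statistics.stdev(yields)               # sample SD (divisor n - 1)
se = s / math.sqrt(n)                      # standard error of the mean
cv = s / y_bar * 100                       # CV as a percentage
print(f"mean={y_bar:.1f}, s={s:.2f}, SE={se:.2f}, CV={cv:.1f}%")
```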
Table 3.2 Example data set for the calculation of the above concepts

3.2 Statistical Models

A model is an abstract, quantitative representation of a system: a way of describing a real system with mathematical functions or diagrams. It can also represent simplifications of the processes of a biological system and can summarize the factors affecting the different processes in a system. Mathematical models use notation and expressions from mathematics to describe a process, while a statistical model is a mathematical model that allows for variability in the process. This variability may arise for a number of reasons, such as sampling, biological variation, inaccuracies in measurement, or influential variables omitted from the model. Thus, statistical models can quantify the uncertainty associated with a process. Statistical models fall in the category of empirical models, where the principle of correlation is used to build a simple equation describing the relationship with different explanatory variables. If the explanatory variables are numeric (quantitative), they are referred to as variates; if they are qualitative, they are considered factors, with the distinct groups as factor levels. For example, the qualitative trait height can be classified as short, medium, or tall. Linear models are the most widely used statistical models.

3.3 The Linear Additive Model

Natural phenomena in science, such as the earth's rotation, can be explained by models. The linear additive model (LAM) is a commonly used model that describes an observation as a mean plus an error. Its assumptions are that the population of Y is sampled at random and that the errors are random. The model can be used to make inferences about population means and variances. The simple LAM is represented by the following equation:

$$ {Y}_i=\mu +{\varepsilon}_i $$

where μ = mean and εi= sampling error.

The sampling errors have mean zero. Their effect on the sample mean can be examined by drawing a sample from the population at random and averaging:

$$ \overline{Y}=\frac{\sum_i{Y}_i}{n}=\frac{\sum_i\left(\mu +{\varepsilon}_i\right)}{n}=\mu +\frac{\sum_i{\varepsilon}_i}{n} $$

Under random sampling, the error of the sample mean is \( \overline{Y}-\mu =\frac{\left({\sum}_i{\varepsilon}_i\right)}{n} \), and it is expected to become smaller as the sample size increases, because positive and negative errors cancel. In general, the variance of the mean of a large sample is small. An individual error can be estimated as \( \left({Y}_i-\overline{Y}\right) \).
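A small simulation of the LAM illustrates this (μ and the error standard deviation are arbitrary choices for illustration):

```python
# Sketch: Y_i = mu + eps_i with random errors of mean zero; the error of the
# sample mean, sum(eps)/n, gets smaller as the sample size increases.
import random

random.seed(1)
mu, sigma = 50.0, 10.0
for n in (10, 100, 10_000):
    errors = [random.gauss(0, sigma) for _ in range(n)]
    y_bar = mu + sum(errors) / n          # sample mean = mu + average error
    print(n, round(y_bar - mu, 3))        # average error shrinks with n
```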

3.4 Probability

Probability is a numerical description of how likely an event is to occur or how likely it is that a proposition is true. Probability is a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. The best example for understanding probability is flipping a coin: There are two possible outcomes—heads (H) or tails (T). What’s the probability of the coin landing on heads? We can find out using the equation

$$ \mathrm{probability}\ \mathrm{of}\ \mathrm{head}\ {P}_H=\frac{1}{2} $$

or

$$ \mathrm{Probability}\ \mathrm{of}\ \mathrm{an}\ \mathrm{event}=\frac{\mathrm{number}\ \mathrm{of}\ \mathrm{ways}\ \mathrm{it}\ \mathrm{can}\ \mathrm{happen}}{\mathrm{total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{outcomes}} $$

Similarly, in the case of rolling a die, there are six possible outcomes (1, 2, 3, 4, 5, and 6), and the probability of rolling a one is:

$$ {P}_1=\frac{1}{6} $$

The probability of getting a 1 or a 6 is calculated as follows:

$$ {P}_{1\ \mathrm{or}\ 6}=\frac{2}{6}=\frac{1}{3} $$

The probability of rolling an even number (2, 4, or 6) is:

$$ {P}_{2,4\ \mathrm{or}\ 6}=\frac{3}{6}=\frac{1}{2} $$
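These event probabilities follow directly from counting outcomes, as in this minimal sketch:

```python
# Sketch: probability of an event = favourable outcomes / total outcomes.
outcomes = [1, 2, 3, 4, 5, 6]             # one roll of a fair die

def prob(event):
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

print(prob(lambda o: o == 1))             # 1/6
print(prob(lambda o: o in (1, 6)))        # 2/6 = 1/3
print(prob(lambda o: o % 2 == 0))         # 3/6 = 1/2
```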

For many experiments there are only two possible outcomes: a tossed coin falls heads or tails, a student fails or passes, a plant is tall or short. Such outcomes are referred to as binomial, and the sample space consists of two points only. The sample space is made up of sample points (represented by E and, if the event does not occur, by −E, Ē, or Ɇ), as shown in Fig. 3.1. The probability associated with each value of the random variable is called the binomial probability function or binomial distribution. The formula giving the probability associated with each chance event, e.g., for a fair coin with Y = 0 for tails and Y = 1 for heads, is:

$$ {P}_{Y={Y}_i}=\frac{1}{2}\ {Y}_i=0\ \mathrm{and}\ 1 $$
Fig. 3.1

Illustration of sample space and sample point

For rolling a fair die, the probability distribution is:

$$ {P}_{Y={Y}_i}=\frac{1}{6}\ {Y}_i=1,2,3,4,5\ \mathrm{and}\ 6 $$

A table of ten thousand random digits is a very large sample from its population, and the probability distribution for the table is

$$ {P}_{Y={Y}_i}=\frac{1}{10}\ {Y}_i=0,1,2,3,4,5\dots 9 $$

If we consider only odd and even numbers, the ten thousand random digit table can be related to \( {P}_{Y={Y}_i}=\frac{1}{2} \), Yi = 0 and 1; it likewise relates to \( {P}_{Y={Y}_i}=\frac{1}{6} \), Yi = 1, 2, 3, 4, 5 and 6, and \( {P}_{Y={Y}_i}=\frac{1}{10} \), Yi = 0, 1, 2, 3, 4, 5…9, but a distribution with more than two outcomes is multinomial rather than binomial. The probabilities of the binomial distribution can be elaborated in a single statement by generating one equation. Consider an experiment that contains n independent trials. Let PE = P1 = p; then \( {P}_{\bar{E}}={P}_0=1-p \), since \( p=\frac{\mathrm{number}\ \mathrm{of}\ \mathrm{successes}}{\mathrm{total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{events}\ \left(\mathrm{successes}+\mathrm{failures}\right)} \), the probability of an event Ei lies between 0 and 1 \( \left(0\le {P}_{E_i}\le 1\right) \), and the sum of the probabilities of events in a mutually exclusive set is 1 \( \left(\sum \limits_i{P}_{E_i}=1\right) \). Five tosses of a coin could result in (0, 0, 1, 1, 0), that is, two tails followed by two heads and a final tail. Since the trials are independent, the probability of this outcome is found by multiplying the probabilities at each stage, i.e., \( (1-p)(1-p)pp(1-p)={p}^2{\left(1-p\right)}^3 \). If p = 0.5, then \( {(0.5)}^5=0.03125 \), or about 3%. The random variable Y associates a unique value with each sample point; e.g., for the sample vector (0, 0, 1, 1, 0) we have Y = 2, and there are 10 possible sequences with Y = 2. Thus the probability that Y = 2 is \( 10{p}^2{\left(1-p\right)}^3 \). The number of such sequences can be calculated directly with the equation:

$$ \left(\begin{array}{c}n\\ {}Y\end{array}\right)=\frac{n!}{Y!\left(n-Y\right)!} $$

where n! = n factorial = n(n − 1)(n − 2)⋯(2)(1). Thus, for Y = 2, i.e., two 1s in n = 5 trials, the calculation is:

$$ \left(\begin{array}{c}5\\ {}2\end{array}\right)=\frac{5\cdot 4\cdot 3\cdot 2\cdot 1}{\left(2\cdot 1\right)\left(3\cdot 2\cdot 1\right)}=10 $$

Combining the formula that counts the sample points with the same Y and the formula that assigns a probability to each sample point gives the binomial probability distribution:

$$ P\left(Y={Y}_i|n\right)=\left(\begin{array}{c}n\\ {}{Y}_i\end{array}\right){p}^{Y_i}{\left(1-p\right)}^{n-{Y}_i} $$

where \( P\left(Y={Y}_i|n\right) \) is the probability that the random variable Y takes the particular value Yi in a random experiment with n trials. For the coin illustration above, this equation gives:

$$ P\left(Y=2|5\right)=\left(\begin{array}{c}5\\ {}2\end{array}\right){\left(\frac{1}{2}\right)}^2{\left(\frac{1}{2}\right)}^3 $$

The mean and variance of a random variable with a binomial distribution can be calculated using the following equations:

$$ \mathrm{Mean}:\mu = np $$
$$ \mathrm{Variance}:{\sigma}^2= np\left(1-p\right) $$
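A quick check of these formulas (a minimal sketch; math.comb is available in the Python standard library from version 3.8):

```python
# Sketch: binomial probability function, mean, and variance, checked against
# the coin example in the text (n = 5, p = 1/2, Y = 2).
from math import comb

def binom_pmf(y, n, p):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n, p = 5, 0.5
print(binom_pmf(2, n, p))                 # 10 * (1/2)**5 = 0.3125
print(n * p, n * p * (1 - p))             # mean = 2.5, variance = 1.25
```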

3.5 Normal Distribution

The normal distribution is the most important and widely used probability distribution, as it fits many natural quantities such as height, blood pressure, IQ score, and measurement error. It is also called the bell curve or Gaussian distribution and is a standard reference for probability-related problems. The normal distribution has two parameters, the mean (μ) and the standard deviation (σ) (Fig. 3.2). Its characteristics are as follows: (i) X lies between −∞ and ∞ (−∞ ≤ X ≤ ∞); (ii) it is symmetric; (iii) its density function is \( f\left(x;\mu, {\sigma}^2\right)=\frac{1}{\sqrt{2\pi {\sigma}^2}}{e}^{-{\left(x-\mu \right)}^2/2{\sigma}^2} \); (iv) about 2/3 of cases lie within one σ of μ, i.e., P(μ − σ ≤ X ≤ μ + σ) = 0.6826; and (v) about 95% of cases lie within two σ of μ, i.e., P(μ − 2σ ≤ X ≤ μ + 2σ) = 0.9544.
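A short sketch of these properties using only the standard library; the 1σ and 2σ probabilities follow from the error function:

```python
# Sketch: normal density function and the 1-sigma / 2-sigma probabilities.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(round(normal_pdf(0.0), 4))          # peak of the standard normal: 0.3989
for k in (1, 2):
    p = normal_cdf(k) - normal_cdf(-k)
    print(k, round(p, 4))                 # 0.6827 and 0.9545, as quoted above
```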

Fig. 3.2

Normal distribution curve

3.6 Comparison of Means

Statistical concepts are used everywhere in daily life. For example, a honey bottle purchased from the market may be labelled 500 g; to check this claim, we take a random sample from the population and report the probability of obtaining a sample at least this unusual if the true mean is 500 g. This is a problem of hypothesis testing. In such cases testing is done with Student's t-test or the F-test; if more than two means are compared, the analysis of variance (ANOVA) F-test is used. Sample size should also be considered when selecting a test. Hypothesis tests and confidence intervals (CI) are interlinked. The formulas for Student's t-test are

$$ t=\frac{\overline{Y}-\mu }{S_{\overline{Y}}} $$
$$ t=\frac{\overline{Y}-\mu }{\sqrt{\frac{s^2}{n}}} $$
$$ t=\frac{\overline{Y}-\mu }{\frac{s}{\sqrt{n}}} $$

For data with two means, the t-test equation is:

$$ t=\frac{\overline{Y_1}-\overline{Y_2}}{S_{\overline{Y_1}-\overline{Y_2}}} $$

where \( \overline{Y} \) is a sample mean, s is the sample standard deviation, n is the sample size, and \( {S}_{\overline{Y_1}-\overline{Y_2}} \) is the standard error of the difference between the two means.

Consider a null hypothesis Ho : μ = μo and an alternative hypothesis H1 : μ ≠ μo. If |t| exceeds the critical value t0.025, Ho is rejected. Rejecting Ho when it is in fact true is called a type I error; accepting Ho when H1 is true is called a type II error.
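A minimal sketch of the one-sample t-test for the honey example; the nine bottle weights are hypothetical, and the critical value for 8 df is taken from a t table:

```python
# Sketch: one-sample Student's t-test of H0: mu = 500 g.
import math
import statistics

weights = [498, 502, 495, 497, 501, 493, 499, 496, 494]  # g, hypothetical
mu0 = 500.0
n = len(weights)
y_bar = statistics.mean(weights)
s = statistics.stdev(weights)
t = (y_bar - mu0) / (s / math.sqrt(n))    # Student's t statistic
t_crit = 2.306                            # t_{0.025} for df = n - 1 = 8
print(round(t, 3), abs(t) > t_crit)       # reject H0 if |t| > t_crit
```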

3.7 Analysis of Variance (ANOVA)

It is an undeniable fact that agronomic research has improved quality of life and the sustainability of the planet. The principles and procedures of analysis of variance (ANOVA) are fundamental tools in agronomic research. ANOVA is an established statistical procedure for testing hypotheses by partitioning the sources of variation (SOV), estimating variance components, explaining and reducing residual variation, and determining the significance of effects. Its application in agronomic field research and plant breeding trials goes back to the early twentieth century, when the main goal of research was a better understanding of the effects of treatments, e.g., fertilizers, cultivars, planting dates, soil amendments, and their interactions. Early trials focused mainly on yield, and ANOVA was widely used to gain better scientific understanding of treatment effects and to guide farmers; it gave field agronomic trials considerable credibility in the early twentieth century. Significant differences between treatment and check plots could be evaluated by ANOVA; however, there were issues across years, as the random effects of years could not be replicated (Loughin 2006). Fisher pioneered the introduction of ANOVA and applied it in the 1920s to long-term (more than half a century) wheat yield experiments examining responses to soil amendments (Fisher 1921). Fisher used ANOVA to disentangle the large variability in average yield from other changes and to evaluate significant differences between treatments. The basis of ANOVA is that the variance (the mean squared deviation of a variate from its mean, i.e., the square of its standard deviation) produced by all causes operating at once is the sum of the variances produced by each cause individually. Thus, with ANOVA the total variation can be partitioned into separate and independent SOV. To implement ANOVA correctly, treatment plots (experimental units) must be replicated and randomized. The basic assumptions of ANOVA are that (i) treatment and environment effects are additive and (ii) experimental errors are random, independently and normally distributed about a zero mean with a common variance. In his work on experimental design, Fisher documented that systematic arrangements of treatments resulted in biased estimates of treatment averages, over- or underestimation of error variation, and correlated errors. Replication is therefore needed to estimate experimental error, and randomization to obtain correct probabilities or levels of significance. Generally, ANOVA divides the total variation into two independent sources: (i) variation among treatments and (ii) variation within treatments (experimental error/residual error/error mean square/error variance). Given that the data are normally and independently distributed, the F-ratio \( \left(F=\frac{\mathrm{variation}\ \mathrm{between}\ \mathrm{sample}\ \mathrm{means}}{\mathrm{variation}\ \mathrm{within}\ \mathrm{the}\ \mathrm{samples}}\right) \) is used to test the null hypothesis that the treatment means are equal. A one-way ANOVA example is the best way to understand this ratio. ANOVA was first used for fixed effect models (Model I, where the specific treatments or treatment levels are of interest) and later also for random effect models (Model II).
It was later proposed that ANOVA should also be used for mixed effect models (with both fixed and random treatment factors) (Gbur et al. 2012; West and Galecki 2012). The importance of mixed effect models was shown in experiments where the use of a fixed model instead of a mixed model produced misleading results (Acutis et al. 2012; Bolker et al. 2009; Moore and Dixon 2015; Yang 2010). Fisher's ANOVA remains the most frequently used method for determining whether differences among means are significant. His preference was to declare significance when P ≤ 0.05 (the P value), with reference to the F table. The components of ANOVA include the sources of variation (SOV), degrees of freedom, sums of squares, mean squares, F values, and P values (Tables 3.3, 3.4 and 3.5). The importance and applications of ANOVA in earlier work are presented in Table 3.6. While Fisher was developing his ANOVA framework, Neyman and Pearson presented the concept of types of error: type I (rejecting a true null hypothesis) and type II (failing to reject a false null hypothesis) (McIntosh 2015).

Table 3.3 One-way analysis of variance with equal replication
Table 3.4 Analysis of variance in randomized complete block
Table 3.5 Analysis of variance for Latin square
Table 3.6 ANOVA importance and applications in different earlier work

3.7.1 Calculation of the F-Test

The F-ratio for a one-way ANOVA can be calculated with the following equations; a representative layout is given in Table 3.7.

$$ {\sigma}^2=\frac{\sum {\left({x}_i-\overline{x}\right)}^2}{n-1} $$

where σ2 = variance, xi = an observation, \( \overline{\mathrm{x}} \) = the sample mean, and n = the number of observations.

Table 3.7 Representative table for F-test calculation

The sum of squares (SS) in ANOVA is the sum of the squared deviations of observations from the mean. The total sum of squares (SST) can be calculated with the following equation:

$$ {\mathrm{SS}}_{\mathrm{T}}=\sum {\left({x}_{ij}-\overline{x}\right)}^2 $$

where xij is the ith observation in the jth group. The formula can be rewritten as:

$$ {\mathrm{SS}}_{\mathrm{T}}=\sum {\left({x}_{ij}-\overline{x}\right)}^2=\sum {x}_{ij}^2-\frac{{\left(\sum {x}_{ij}\right)}^2}{n} $$

The between-group SS (SSB) and within-group SS (SSW) can be calculated with the following equations:

$$ {\mathrm{SS}}_{\mathrm{B}}=\sum \limits_j{n}_j{\left({\overline{x}}_j-\overline{x}\right)}^2=\sum \limits_j\frac{{T_j}^2}{n_j}-\frac{{\left(\sum {x}_{ij}\right)}^2}{n} $$

where Tj is the total and nj the number of observations of group j.
$$ {\mathrm{SS}}_{\mathrm{W}}=\sum \limits_j\sum \limits_i{\left({x}_{ij}-{\overline{x}}_j\right)}^2 $$

The total SS obeys the following identity, which can be rearranged to obtain SSW:

$$ {\mathrm{SS}}_{\mathrm{T}}={\mathrm{SS}}_{\mathrm{B}}+{\mathrm{SS}}_{\mathrm{W}} $$
$$ {\mathrm{SS}}_{\mathrm{W}}={\mathrm{SS}}_{\mathrm{T}}-{\mathrm{SS}}_{\mathrm{B}} $$

The mean square (MS), the average squared deviation, is calculated next: a sum of squares divided by its degrees of freedom (df). For the total SS (SST), the df is n − 1. The mean square between groups (MSB) is calculated with the following equation, and the within-group mean square (MSW) is obtained analogously as SSW divided by its df:

$$ {\mathrm{MS}}_{\mathrm{B}}=\frac{{\mathrm{SS}}_{\mathrm{B}}}{{\mathrm{df}}_{\mathrm{B}}} $$

Finally, the F ratio is calculated with the following equation:

$$ F=\frac{{\mathrm{MS}}_{\mathrm{B}}}{{\mathrm{MS}}_{\mathrm{W}}} $$
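A minimal sketch of these one-way ANOVA computations; the three small groups are hypothetical illustration data:

```python
# Sketch: between-group SS, within-group SS, mean squares, and the F ratio.
groups = [[3, 5, 7, 9], [6, 8, 10, 12], [2, 3, 5, 6]]    # hypothetical

all_obs = [x for g in groups for x in g]
n, k = len(all_obs), len(groups)
grand_mean = sum(all_obs) / n

ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_w = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
ms_b = ss_b / (k - 1)                     # mean square between groups
ms_w = ss_w / (n - k)                     # mean square within groups (error)
F = ms_b / ms_w
print(round(ss_b, 2), round(ss_w, 2), round(F, 2))
```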

3.8 Experimental Design and Its Principles

New knowledge is obtained through careful planning, analysis, and interpretation of data. Designing an efficient experiment benefits from consultation with a statistician, who can help choose an appropriate design that gives unbiased estimates of treatment means and experimental error. An experiment is a planned inquiry to obtain new facts or to confirm earlier findings. Experiments are generally designed to answer questions or to test hypotheses, and the objectives of an experiment should be clear before it is designed. The unit of material or place to which one application of a treatment is applied is called the experimental unit or experimental plot. Variation is characteristic of all experimental material, and experimental error measures the variation among experimental units. Variation can arise for a number of reasons, such as inherent variability or lack of uniformity in the physical conduct of the experiment. Replication is another important component of experimental design. Its main functions are to (i) estimate experimental error, (ii) improve the precision of the experiment by reducing the standard deviation of treatment means, (iii) control the error variance, and (iv) increase the scope of inference of the experiment. Error in an experiment can be controlled by selecting an appropriate experimental design, using parallel observations, and choosing suitable size and shape of the experimental units. Furthermore, an unbiased estimate of experimental error is obtained through randomization.

3.8.1 Completely Randomized Design (CRD)

A completely randomized design is used when the experimental units are homogeneous and little is to be gained by grouping them into blocks, since their responses are similar. For example, a variety trial in a greenhouse can use a CRD because of the uniformity of the soil; similarly, CRD is used in laboratory experiments, where variability is easy to control and the experimental units are homogeneous. The advantages of CRD are that the number of replicates can vary from treatment to treatment and that the loss of information due to missing data is small. The precision of the experiment is high because the degrees of freedom (df) for estimating experimental error are at a maximum. In this design, treatments are assigned at random so that each experimental unit has the same chance of receiving any treatment. The randomization procedure and layout for a pot experiment with four treatments (A, B, C, and D) replicated four times involve the following steps:

  1. Determination of the total number of plots or experimental units (n): multiply the number of treatments (t) by the number of replications (R); n = Rt = 4 × 4 = 16. If the replications are not equal, n is obtained as the sum of the replications of each treatment.

  2. Assignment of plot numbers.

  3. Assignment of treatments to plots using the random number method and subsequent ranking, as shown in Table 3.8. Group numbers are then assigned based on the random number ranking (Table 3.9), and the treatments are placed in the experimental units as shown in the layout (Fig. 3.3); a code sketch of this randomization follows Fig. 3.3.

Table 3.8 Random ranking of experimental unit
Table 3.9 Group numbers based on random numbers ranking
Fig. 3.3

A layout of completely randomized design with four treatments (A, B, C, and D) replicated four times
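The randomization itself can be sketched in code (one possible implementation of the random number method; the seed is an arbitrary choice that makes the layout repeatable):

```python
# Sketch: CRD randomization -- four treatments (A-D), four replicates,
# assigned at random to 16 pots.
import random

random.seed(42)
treatments = [t for t in "ABCD" for _ in range(4)]   # 4 treatments x 4 reps
random.shuffle(treatments)                           # random assignment
for plot, trt in enumerate(treatments, start=1):
    print(f"plot {plot:2d}: treatment {trt}")
```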

To carry out the ANOVA for the treatments in Table 3.10, we first obtain the treatment totals \( {X}_{i.} \) and \( \sum \limits_j{X}_{ij}^2 \) (Table 3.10, points 1 and 2). Each treatment total is then squared and divided by r = 5 to give \( {\left({X}_{i.}\right)}^2/r \), used for the treatment sum of squares. The correction factor (CF) is calculated by squaring the grand total and dividing by the total number of observations (rt):

$$ \mathrm{CF}=\frac{X^2..}{rt}=\frac{{\left(\sum \limits_{i,j}{X}_{ij}\right)}^2}{rt}=\frac{(670.6)^2}{(5)(6)}=\mathrm{14,990.15} $$
$$ \mathrm{SS}\ \left(\mathrm{total}\right)={\sum}_{i,j}{X}_{ij}^2-\mathrm{CF}=\mathrm{16,093.56}-\mathrm{14,990.15}=1103.41 $$
$$ \mathrm{SS}\ \left(\mathrm{treatment}\right)\ \left(\mathrm{between}\ \mathrm{or}\ \mathrm{among}\ \mathrm{groups}\right)=\frac{X_{1.}^2+\cdots +{X}_{t.}^2}{r}-\mathrm{CF} $$
$$ =\frac{(148.1)^2+{(132.8)}^2+\cdots +{(100.9)}^2}{5}-\mathrm{14,990.15}=\frac{\mathrm{77,880.08}}{5}-\mathrm{14,990.15}=\mathrm{15,576.02}-\mathrm{14,990.15}=585.87 $$
Table 3.10 Nitrogen contents of Lucerne plants inoculated with Rhizobium trifolii strains (RTS) (mg)

The sum of squares among individuals within treatments is called the within-group SS, residual SS, error SS, or discrepancy SS, and it can be obtained from the following equation:

$$ {\displaystyle \begin{array}{c}{\mathrm{SS}}_{\mathrm{error}}={\mathrm{SS}}_{\mathrm{Total}}-{\mathrm{SS}}_{\mathrm{Treatment}}\\ {}=1103.41-585.87\\ {}=517.54\end{array}} $$

The error SS (SSerror) can also be calculated by pooling the within-treatment SS, as shown below:

$$ {\displaystyle \begin{array}{c}{\mathrm{SS}}_{\mathrm{error}}=\sum \limits_i\left(\sum \limits_j{X}^2 ij-\frac{X^2i.}{r}\right)\\ {}=\left(4593.45-\frac{148.1^2}{5}\right)+\left(3623.34-\frac{132.8^2}{5}\right)+\left(1980.28-\frac{95.8^2}{5}\right)\\ {}+\left(2406.37-\frac{109.3^2}{5}\right)+\left(1435.61-\frac{83.7^2}{5}\right)+\left(2054.51-\frac{100.9^2}{5}\right)\\ {}=517.54\end{array}} $$
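These sums of squares can be reproduced from the treatment totals alone (a minimal sketch; the totals and the overall sum of squared observations are the values quoted above from Table 3.10):

```python
# Sketch: CRD ANOVA sums of squares from treatment totals (r = 5, t = 6).
totals = [148.1, 132.8, 95.8, 109.3, 83.7, 100.9]    # treatment totals X_i.
r, t = 5, 6
sum_sq_obs = 16_093.56                               # sum of X_ij**2 (from text)

grand = sum(totals)                                  # 670.6
cf = grand ** 2 / (r * t)                            # 14,990.15
ss_total = sum_sq_obs - cf                           # 1,103.41
ss_trt = sum(x ** 2 for x in totals) / r - cf        # 585.87
ss_err = ss_total - ss_trt                           # 517.54
print(round(cf, 2), round(ss_total, 2), round(ss_trt, 2), round(ss_err, 2))
```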

These numerical results are presented in an ANOVA table (Table 3.11), which shows a significant difference among treatments. The standard error of a treatment mean \( \left({\mathrm{SE}}_{\overline{X}}\right) \), the standard error of the difference between treatment means, the CV, and the least significant difference (LSD) are calculated with the following equations:

$$ {\mathrm{SE}}_{\overline{X}}=\sqrt{\frac{s^2}{r}}=\sqrt{\frac{21.56}{5}}\mathrm{mg}=\sqrt{4.312}=2.07\ \mathrm{mg} $$
$$ {\mathrm{SE}}_{{\overline{X}}_{i.}-{\overline{X}}_{i^{\prime}.}}=\sqrt{\frac{2{s}^2}{r}}=\sqrt{\frac{2(21.56)}{5}}=\sqrt{\frac{43.12}{5}}=\sqrt{8.62}=2.93\ \mathrm{mg} $$
$$ \mathrm{CV}\ \left(\mathrm{Coefficient}\ \mathrm{of}\ \mathrm{variability}\right)=\frac{\sqrt{S^2}}{\overline{X}}\times 100=\frac{\sqrt{21.56}}{22.4}\times 100=\frac{4.64}{22.4}\times 100=20.7\% $$
$$ \mathrm{LSD}={t}_{\alpha /2}{S}_{{\overline{X}}_{i.}-{\overline{X}}_{i^{\prime}.}}={t}_{\alpha /2}\,S\sqrt{\frac{2}{r}}\ \left(\mathrm{for}\ \mathrm{equal}\ r\right) $$
$$ {\mathrm{LSD}}_{0.05}={t}_{0.025}{S}_{{\overline{X}}_{i.}-{\overline{X}}_{i^{\prime}.}}=2.064\sqrt{\frac{2(21.56)}{5}}=2.064\sqrt{8.62}=2.064\times 2.93=6.06\ \mathrm{mg} $$
$$ {\mathrm{LSD}}_{0.01}={t}_{0.005}{S}_{{\overline{X}}_{i.}-{\overline{X}}_{i^{\prime}.}}=2.797\sqrt{\frac{2(21.56)}{5}}=8.21\ \mathrm{mg} $$
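A quick numeric check of these quantities; the error mean square, grand mean, and tabulated t values (df = 24) are taken from the text:

```python
# Sketch: SE of a mean, SE of a difference, CV, and LSD values.
import math

s2, r, grand_mean = 21.56, 5, 22.4
se_mean = math.sqrt(s2 / r)                  # about 2.07 mg
se_diff = math.sqrt(2 * s2 / r)              # about 2.93 mg
cv = math.sqrt(s2) / grand_mean * 100        # about 20.7 %
lsd_05 = 2.064 * se_diff                     # 6.06 mg
lsd_01 = 2.797 * se_diff                     # 8.21 mg
print(round(se_mean, 2), round(se_diff, 2), round(cv, 1),
      round(lsd_05, 2), round(lsd_01, 2))
```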
Table 3.11 Analysis of variance for data of Table 3.10

The observed differences are \( {\overline{X}}_{1.}-{\overline{X}}_{2.}=29.62-26.56=3.06 \); \( {\overline{X}}_{3.}-{\overline{X}}_{4.}=19.16-21.86=-2.7 \); and \( {\overline{X}}_{5.}-{\overline{X}}_{6.}=16.74-20.18=-3.44 \). Now rank the means from smallest to largest, as shown below (rank in parentheses):

RTS1: 29.62 (6)
RTS2: 26.56 (5)
RTS3: 19.16 (2)
RTS4: 21.86 (4)
RTS5: 16.74 (1)
Composite: 20.18 (3)

Next, calculate the differences between the ranked means and test their significance with the LSD test at the 5% level.

  • 6 − 1 = 29.62 − 16.74 = 12.88 > 6.06: significant

  • 6 − 2 = 29.62 − 19.16 = 10.46 > 6.06: significant

  • 6 − 3 = 29.62 − 20.18 = 9.44 > 6.06: significant

  • 6 − 4 = 29.62 − 21.86 = 7.76 > 6.06: significant

  • 6 − 5 = 29.62 − 26.56 = 3.06 < 6.06: nonsignificant

  • 5 − 1 = 26.56 − 16.74 = 9.82 > 6.06: significant

  • 5 − 2 = 26.56 − 19.16 = 7.40 > 6.06: significant

  • 5 − 3 = 26.56 − 20.18 = 6.38 > 6.06: significant

  • 5 − 4 = 26.56 − 21.86 = 4.70 < 6.06: nonsignificant

  • 4 − 1 = 21.86 − 16.74 = 5.12 < 6.06: nonsignificant

  • 4 − 2 = 21.86 − 19.16 = 2.70 < 6.06: nonsignificant

  • 4 − 3 = 21.86 − 20.18 = 1.68 < 6.06: nonsignificant

  • 3 − 1 = 20.18 − 16.74 = 3.44 < 6.06: nonsignificant

  • 3 − 2 = 20.18 − 19.16 = 1.02 < 6.06: nonsignificant

  • 2 − 1 = 19.16 − 16.74 = 2.42 < 6.06: nonsignificant

3.8.2 Randomized Complete Block Design (RCBD)

The randomized complete block design (RCBD) is one of the most widely used designs in agronomic field research. In this design the experimental units can be meaningfully grouped, the number of units in a group being equal to the number of treatments; such a group is called a block or replication. The objective of grouping into blocks is to minimize error and to ensure that observed differences are due to treatments only. The RCBD has advantages over the CRD because blocking, followed by randomization within blocks, gives more precision. The main purpose of blocking is to gain accuracy by removing known sources of variation (SOV) among the experimental units from the experimental error. Grouping is done so that variability within each block is minimized while variability among blocks is maximized; variation within a block becomes part of the experimental error, so blocking is most effective when the experimental area has a predictable pattern of variability. Typical known SOV that can serve as a basis for blocking include soil heterogeneity in nitrogen fertilizer experiments, varietal trials at multiple sites, or sowing date experiments.

Thus, the basis of blocking is the main SOV. The size and shape of the blocks are selected so that variability among blocks is at a maximum. To block, first identify the gradient and orient the blocks perpendicular to it; if gradients occur in two directions (one strong and one weak), block against the stronger one, e.g., a fertility gradient. If the fertility gradient is strong in both directions and the two gradients are perpendicular to each other, use square blocks and choose a Latin square design, as elaborated by Gomez and Gomez (1980). Whenever blocking is done, the identity and purpose of the blocks should be clear. Similarly, if a SOV is beyond control, ensure that the variation occurs among blocks rather than within blocks. For example, the application of herbicides or the collection of data might not be possible in one day; in such a scenario it is recommended that the work first be completed for all plots of the same block. In this way, variation due to data collection by multiple observers, or to treatment application over more than one day, becomes part of the block variation and is excluded from the experimental error. The following steps are used to design a layout for the RCBD.

  1. Division of the experimental area into R equal blocks (R = replications). Here the experimental area is divided into four blocks, as shown in Fig. 3.4.

  2. Subdivision of the blocks into experimental plots based on the number of treatments. For example, if there are six treatments, A, B, C, D, E, and F, divide each block into six subplots and assign each treatment to a subplot using random numbers (Fig. 3.5).

  3. Repetition of step 2 for the remaining blocks (Fig. 3.6).

Fig. 3.4

Layout for the RCBD (division of experimental area into four blocks)

Fig. 3.5

Subdivision of blocks into experimental plots based on number of treatments and randomization of treatments (A, B, C, D, E, and F)

Fig. 3.6

A randomized layout for the RCBD (six treatments and four replications)

Let us apply the RCBD concepts to the data provided in Table 3.12 to generate the ANOVA table and test for significant differences in oil content among the canola cultivars. Step 1 is to arrange the raw data as in Table 3.12 and to calculate ∑X2 together with the treatment totals \( {X}_{i.} \) (i = 1, 2, …, t) and block totals \( {X}_{.j} \) (j = 1, 2, …, r). Step 2 is to calculate the sums of squares using the following formulas:

$$ \mathrm{Correction}\ \mathrm{factor}=\mathrm{CF}=\frac{{Y^2}_{..}}{rt}=\frac{(1085.5)^2}{24}=\frac{\mathrm{1,178,310.25}}{24}=\mathrm{49,096.26} $$
$$ {\mathrm{SS}}_{\mathrm{total}}=\sum \limits_{i,j}{X}^2 ij-\mathrm{CF} $$
$$ {\mathrm{SS}}_{\mathrm{total}}=\mathrm{49,150.77}-\mathrm{49,096.26}=54.51 $$
$$ {\mathrm{SS}}_{\mathrm{block}}=\frac{\sum \limits_j{Y^2}_{.j}}{t}-\mathrm{CF} $$
$$ {\mathrm{SS}}_{\mathrm{block}}=\frac{(269.8)^2+{(268.8)}^2+{(274.2)}^2+{(272.7)}^2}{6}-\mathrm{49,096.26} $$
$$ {\mathrm{SS}}_{\mathrm{block}}=\mathrm{49,099.4}-\mathrm{49,096.26}=3.14 $$
$$ {\mathrm{SS}}_{\mathrm{treatment}}=\frac{\sum \limits_i{Y^2}_{i.}}{r}-\mathrm{CF} $$
$$ {\mathrm{SS}}_{\mathrm{treatment}}=\frac{(179.2)^2+{(176.0)}^2+{(185.6)}^2+{(174.8)}^2+{(183.0)}^2+{(186.9)}^2}{4}-\mathrm{49,096.26} $$
$$ {\mathrm{SS}}_{\mathrm{treatment}}=\frac{\mathrm{196,511.70}}{4}-\mathrm{49,096.26} $$
$$ {\mathrm{SS}}_{\mathrm{treatment}}=\mathrm{49,127.91}-\mathrm{49,096.26}=31.65 $$
$$ {\mathrm{SS}}_{\mathrm{error}}={\mathrm{SS}}_{\mathrm{total}}-{\mathrm{SS}}_{\mathrm{block}}-{\mathrm{SS}}_{\mathrm{treatment}} $$
$$ {\mathrm{SS}}_{\mathrm{error}}=54.51-3.14-31.65=19.72 $$
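The RCBD sums of squares can be verified from the block and treatment totals quoted above (a minimal sketch; the overall sum of squared observations is the value from the text):

```python
# Sketch: RCBD ANOVA sums of squares from totals (r = 4 blocks, t = 6 cultivars).
block_totals = [269.8, 268.8, 274.2, 272.7]
trt_totals = [179.2, 176.0, 185.6, 174.8, 183.0, 186.9]
r, t = 4, 6
sum_sq_obs = 49_150.77                                 # sum of X_ij**2 (from text)

cf = sum(trt_totals) ** 2 / (r * t)                    # 49,096.26
ss_total = sum_sq_obs - cf                             # 54.51
ss_block = sum(b ** 2 for b in block_totals) / t - cf  # 3.14
ss_trt = sum(x ** 2 for x in trt_totals) / r - cf      # 31.65
ss_err = ss_total - ss_block - ss_trt                  # 19.72
print(round(ss_total, 2), round(ss_block, 2),
      round(ss_trt, 2), round(ss_err, 2))
```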
Table 3.12 Oil content (%) data of different canola cultivars with analysis of variance table

3.8.3 Missing Values Estimation

Sometimes, due to poor germination, climatic conditions, etc., data may be missing for an experimental unit. A missing value can be estimated using the following equation:

$$ y=\frac{r{B}_o+{tT}_o-{G}_o}{\left(r-1\right)\left(t-1\right)} $$

where y = estimate of the missing value; t = number of treatments; r = number of replications; Bo = total of the block containing the missing value; To = total of the treatment containing the missing value; and Go = grand total of all observed values.
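A minimal sketch of this estimator as a function; the totals in the example call are hypothetical:

```python
# Sketch: missing-value estimate for an RCBD, using the formula given above.
def missing_value(r, t, B0, T0, G0):
    """r: replications, t: treatments, B0: total of the block with the missing
    value, T0: total of the treatment with the missing value, G0: grand total
    of all observed values."""
    return (r * B0 + t * T0 - G0) / ((r - 1) * (t - 1))

# Hypothetical example: 4 blocks, 6 treatments.
print(round(missing_value(r=4, t=6, B0=200.5, T0=140.2, G0=1050.7), 2))  # 39.5
```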

3.8.4 Latin Square Design

In a Latin square design, treatments are arranged in rows and columns. Each of the t treatments appears exactly once in each row and each column, and the treatments are traditionally denoted by Latin letters, hence the name. The main purpose of this design is to remove systematic error associated with both rows and columns (n × n). Its advantage appears in field experiments where two major SOV exist; e.g., in the case of soil differences in two directions, this design helps to remove that variation. Its disadvantage is that the numbers of rows, columns, and treatments must be equal. A Latin square design for six treatments, A, B, C, D, E, and F, is shown in Fig. 3.7. The analysis of variance for an r × r (6 × 6) Latin square data set of oil yield (kg ha−1) of canola cultivars is given in Table 3.13. The calculation involves the following steps:

  1. Calculation of the row totals (Xi.), column totals (X.j), treatment totals (Xt), and grand total (X..). Similarly, calculate \( \sum \limits_j{X^2}_{ij} \) and \( \sum \limits_i{X^2}_{ij} \) for each row and column (Table 3.13).

  2. Calculation of the correction factor and sums of squares (SS):

    $$ \mathrm{CF}=\frac{X^2..}{r^2}=\frac{{\left(\mathrm{40,380}\right)}^2}{6^2}=\mathrm{45,292,900} $$
Fig. 3.7

Layout for Latin square design

Table 3.13 Oil yield (kg ha−1) of different canola cultivars with analysis of variance table under Latin square design
$$ {\mathrm{SS}}_{\mathrm{total}}=\sum \limits_{i,j}{X^2}_{ij}-\mathrm{CF}=\mathrm{45,982,806}-\mathrm{45,292,900}=\mathrm{689,906} $$
$$ {\mathrm{SS}}_{\mathrm{row}}=\frac{\sum \limits_i{X^2}_{i.}}{r}-\mathrm{CF}=\frac{(6669)^2+{(6732)}^2+{(6781)}^2+{(6757)}^2+{(6718)}^2+{(6723)}^2}{6}-\mathrm{45,292,900}=\mathrm{45,294,108}-\mathrm{45,292,900}=1208 $$
$$ {\mathrm{SS}}_{\mathrm{column}}=\frac{\sum \limits_j{X^2}_{.j}}{r}-\mathrm{CF}=\frac{(6592)^2+{(6839)}^2+{(6750)}^2+{(6749)}^2+{(6680)}^2+{(6770)}^2}{6}-\mathrm{45,292,900}=\mathrm{45,298,864}-\mathrm{45,292,900}=5964 $$
$$ {\mathrm{SS}}_{\mathrm{treatment}}=\frac{\sum \limits_t{X^2}_t}{r}-\mathrm{CF}=\frac{(8049)^2+{(5772)}^2+{(5905)}^2+{(6876)}^2+{(6322)}^2+{(7456)}^2}{6}-\mathrm{45,292,900}=\mathrm{45,968,401}-\mathrm{45,292,900}=\mathrm{675,501} $$
$$ {\mathrm{SS}}_{\mathrm{error}}={\mathrm{SS}}_{\mathrm{total}}-{\mathrm{SS}}_{\mathrm{row}}-{\mathrm{SS}}_{\mathrm{column}}-{\mathrm{SS}}_{\mathrm{treatment}}=\mathrm{689,906}-1208-5964-\mathrm{675,501}=7233 $$
$$ \mathrm{Standard}\ \mathrm{error}\ \mathrm{of}\ \mathrm{treatment}\ \mathrm{means}={S}_{\overline{X}}=\sqrt{\frac{S^2}{r}}=\sqrt{\frac{361.6}{6}}=7.76\ \mathrm{kg} $$
$$ \mathrm{Sample}\ \mathrm{standard}\ \mathrm{error}\ \mathrm{of}\ \mathrm{the}\ \mathrm{difference}\ \mathrm{between}\ \mathrm{two}\ \mathrm{treatment}\ \mathrm{means}={S}_{{\overline{X}}_i-{\overline{X}}_{i^{\prime}}}=\sqrt{\frac{2{S}^2}{r}}=\sqrt{\frac{2(361.6)}{6}}=10.97\ \mathrm{kg} $$
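The Latin square computations can be verified from the row, column, and treatment totals quoted above (a minimal sketch):

```python
# Sketch: Latin square ANOVA sums of squares from totals (r = 6).
rows = [6669, 6732, 6781, 6757, 6718, 6723]
cols = [6592, 6839, 6750, 6749, 6680, 6770]
trts = [8049, 5772, 5905, 6876, 6322, 7456]
r = 6
sum_sq_obs = 45_982_806                          # sum of X_ij**2 (from text)

cf = sum(rows) ** 2 / r ** 2                     # 45,292,900
ss_total = sum_sq_obs - cf                       # 689,906
ss_row = sum(x ** 2 for x in rows) / r - cf      # 1,208
ss_col = sum(x ** 2 for x in cols) / r - cf      # about 5,964
ss_trt = sum(x ** 2 for x in trts) / r - cf      # 675,501
ss_err = ss_total - ss_row - ss_col - ss_trt     # about 7,233
print(round(ss_total), round(ss_row), round(ss_col),
      round(ss_trt), round(ss_err))
```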

3.8.5 Factorial Experiments

Factorial experiments include several factors as treatments, with all possible combinations of the factor levels, each of equal importance. For example, an experiment involving temperature as a factor will have different levels of temperature. Similarly, if silicon (Si) fertilization is used as a factor in a pot experiment, several levels are used to evaluate it. If we use two sources of Si (potassium silicate and sodium silicate), each at two concentrations, the experiment is referred to as a 2 × 2 or 2² factorial experiment; the possible combinations of two levels of each of two factors number four, as shown in Table 3.14. Similarly, if the Si fertilization experiment uses only potassium silicate at two levels (no application, Si0, and 200 mg L−1 of potassium silicate, Si200) under two water regimes, i.e., with water (W+) and without water (W−), the design is again a 2 × 2 or 2² factorial (Table 3.14). In a factorial experiment, the term level refers to the different treatments within a factor. Capital letters denote factors, while small letters with numerical subscripts denote levels (treatment combinations and their means); e.g., a1b2 refers to the treatment combination of the first level of factor A with the second level of factor B, or to the mean of that treatment. The df and SS for the variance among the four treatment means of a 2² factorial can be divided into single-df components. A symbolic representation of the 3 × 3 or 3² factorial treatment combinations is shown in Table 3.15, and the principles involved in the partitioning can be elaborated with Table 3.16. The four differences, a2 − a1 at each level of B and b2 − b1 at each level of A, are called simple effects. The average of the simple effects of a factor is called its main effect, denoted by a capital letter, e.g., A or B. For a 2² factorial experiment, A and B can be calculated with the following equations:

$$ A=\frac{1}{2}\ \left[\left({a}_2{b}_2-{a}_1{b}_2\right)+\left({a}_2{b}_1-{a}_1{b}_1\right)\right]=\frac{1}{2}\ \left[\left({a}_2{b}_2+{a}_2{b}_1\right)-\left({a}_1{b}_2+{a}_1{b}_1\right)\right] $$
$$ B=\frac{1}{2}\ \left[\left({a}_2{b}_2-{a}_2{b}_1\right)+\left({a}_1{b}_2-{a}_1{b}_1\right)\right]=\frac{1}{2}\ \left[\left({a}_2{b}_2+{a}_1{b}_2\right)-\left({a}_2{b}_1+{a}_1{b}_1\right)\right] $$
Table 3.14 2 × 2 or 2² factorial treatment combinations
Table 3.15 Symbolic representation of 3 × 3 or 3² factorial treatment combinations
Table 3.16 Shoot dry weight (g) of sorghum plant under different silicon source as factor A and silicon concentration as factor B to illustrate simple effects, main effects, and interactions

Main effects in a factorial experiment are averaged over the levels of the other factor, in the same number of ways as any other treatment. Conditions may differ within and among blocks in factorial experiments in RCBD and Latin square designs; thus, in Table 3.16, factor A is replicated within every block, as it is present at both levels for each level of factor B. With factorially arranged treatments, the hypothesis usually tested is that there is no interaction among the factors. The data in Table 3.16 show that the simple effects under cases I and II for Si sources (A) and concentrations (B) differ, while under case III the simple effects of A and B equal the corresponding main effects. The differential response between the simple effects of a factor is called an interaction, as seen in cases I and II of Table 3.16; no interaction is present in case III. The major advantage of a factorial experiment is that it provides information about the interaction between factors. The interaction of A and B can be defined with the following equation:

$$ AB=\frac{1}{2}\ \left[\left({a}_2{b}_2-{a}_1{b}_2\right)-\left({a}_2{b}_1-{a}_1{b}_1\right)\right]=\frac{1}{2}\ \left[\left({a}_2{b}_2+{a}_1{b}_1\right)-\left({a}_1{b}_2+{a}_2{b}_1\right)\right] $$

The interaction for case I in Table 3.16, computed from the simple effects of A and of B, respectively:

$$ AB=\frac{1}{2}\ \left(6-2\right)=2\ \left(\mathrm{from}\ \mathrm{the}\ \mathrm{simple}\ \mathrm{effects}\ \mathrm{of}\ A\right) $$
$$ AB=\frac{1}{2}\ \left(10-6\right)=2\ \left(\mathrm{from}\ \mathrm{the}\ \mathrm{simple}\ \mathrm{effects}\ \mathrm{of}\ B\right) $$

The interaction for case II in Table 3.16:

$$ AB=\frac{1}{2}\ \left[\left(33.13-43.13\right)-\left(37.13-34.13\right)\right] $$
$$ AB=\frac{1}{2}\ \left[33.13-43.13-37.13+34.13\right] $$
$$ AB=\frac{1}{2}\ \left[-13\right] $$
$$ AB=-6.5 $$

The interaction for case III in Table 3.16:

$$ AB=\frac{1}{2}\ \left[\left(40.13-38.13\right)-\left(32.13-30.13\right)\right] $$
$$ AB=\frac{1}{2}\ \left[40.13-38.13-32.13+30.13\right] $$
$$ AB=\frac{1}{2}\ \left[0\right]=0\ \left(\mathrm{no}\ \mathrm{interaction}\right) $$
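A small sketch computing the main effects and the AB interaction from a 2 × 2 table of treatment means, checked against case III of Table 3.16:

```python
# Sketch: main effects and interaction for a 2 x 2 factorial from cell means.
def effects(a1b1, a2b1, a1b2, a2b2):
    A = 0.5 * ((a2b2 + a2b1) - (a1b2 + a1b1))    # main effect of A
    B = 0.5 * ((a2b2 + a1b2) - (a2b1 + a1b1))    # main effect of B
    AB = 0.5 * ((a2b2 - a1b2) - (a2b1 - a1b1))   # interaction
    return A, B, AB

# Case III means from Table 3.16: the interaction should vanish.
print(effects(a1b1=30.13, a2b1=32.13, a1b2=38.13, a2b2=40.13))  # (2.0, 8.0, 0.0)
```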

The interaction concept is further illustrated graphically in Fig. 3.8. Note that the presence or absence of main effects says nothing about the presence or absence of an interaction, and vice versa. If the interaction is nonsignificant, we can conclude that the factors act independently; however, if the interaction is large and significant, the main effects have little meaning. For large factorial experiments, the use of confounded designs has been suggested, as described by Das and Giri (1979).

Fig. 3.8

Graphical illustration of interaction

Other factorial cases are possible: e.g., with factor A as three locations, factor B as Si fertilizer with two levels, and factor C as three sorghum cultivars, the experiment is referred to as a 3 × 2 × 3 or 3² × 2 factorial (Table 3.17).

Table 3.17 Three-factor (3 × 2 × 3 or 3² × 2) factorial experiment

The ANOVA calculation for the 3 × 3 × 2 or 3² × 2 factorial experiment involves the following steps, with the results presented in the ANOVA table (Table 3.18):

  1. Calculation of the correction factor, total SS, replication SS, treatment SS, and error SS

    $$ \mathrm{Correction}\ \mathrm{factor}=\mathrm{CF}=\frac{{X^2}_{..}}{rabc}=\frac{(2903)^2}{54}=\mathrm{156,038.77} $$
Table 3.18 Analysis of variance table for the 3² × 2 factorial experiment in RCBD
$$ {\mathrm{SS}}_{\mathrm{total}}=\sum \limits_{i,j,k,r}{X^2}_{ijkr}-\mathrm{CF}=\mathrm{158,503.56}-\mathrm{156,038.77}=2464.78 $$
$$ {\mathrm{SS}}_{\mathrm{replication}}=\frac{\sum_{k=1}^r{R^2}_k}{abc}-\mathrm{CF} $$
$$ {\mathrm{SS}}_{\mathrm{replication}}=\frac{(968)^2+{(983)}^2+{(953)}^2}{18}-\mathrm{156,038.77}=\mathrm{156,100.43}-\mathrm{156,038.77}=61.65 $$
$$ {\mathrm{SS}}_{\mathrm{treatment}}=\frac{\sum_{j=1}^a{\sum}_{k=1}^b{\sum}_{i=1}^c{Tr^2}_{ijk}}{R}-\mathrm{CF} $$
$$ {\mathrm{SS}}_{\mathrm{treatment}}=\frac{(187)^2+\dots +{(134)}^2}{3}-\mathrm{156,038.77}=\mathrm{158,324.90}-\mathrm{156,038.77}=2286.15 $$
$$ {\mathrm{SS}}_{\mathrm{error}}={\mathrm{SS}}_{\mathrm{total}}-{\mathrm{SS}}_{\mathrm{replication}}-{\mathrm{SS}}_{\mathrm{treatment}}=2464.78-61.65-2286.15=116.98 $$
  2. Partitioning of the treatment sum of squares into main effects and interactions

    $$ {\mathrm{SS}}_A=\frac{\sum \limits_j\ {\left({a}_j\right)}^2}{rbc}-\mathrm{CF} $$
$$ {\mathrm{SS}}_A=\frac{(1053)^2+{(952)}^2+{(898)}^2}{18}-\mathrm{156,038.77}=\mathrm{156,726.5}\hbox{--} \mathrm{156,038.77}=687.75 $$
$$ {\mathrm{SS}}_B=\frac{\sum \limits_k\ {\left({b}_k\right)}^2}{rac}-\mathrm{CF} $$
$$ {\mathrm{SS}}_B=\frac{(1406)^2+{(1496)}^2}{27}-\mathrm{156,038.77}=\mathrm{156,188}\hbox{--} \mathrm{156,038.77}=149.25 $$
$$ {\mathrm{SS}}_C=\frac{\sum \limits_i\ {\left({c}_i\right)}^2}{rab}-\mathrm{CF} $$
$$ {\mathrm{SS}}_C=\frac{(1067)^2+{(992)}^2+{(843)}^2}{18}-\mathrm{156,038.77}=\mathrm{157,477.7}\hbox{--} \mathrm{156,038.77}=1438.93 $$
$$ {\mathrm{SS}}_{AB}=\frac{\sum \limits_{j,k}\ {\left({a}_j{b}_k\right)}^2}{rc}-\mathrm{CF}-\left({\mathrm{SS}}_A+{\mathrm{SS}}_B\right) $$
$$ {\mathrm{SS}}_{AB}=\frac{(509)^2+{(544)}^2+{(462)}^2+{(490)}^2+{(435)}^2+{(462)}^2}{9}-\mathrm{156,038.77}-687.75-149.25=2.47 $$
$$ {\mathrm{SS}}_{AC}=\frac{\sum \limits_{j,i}\ {\left({a}_j{c}_i\right)}^2}{rb}-\mathrm{CF}-\left({\mathrm{SS}}_A+{\mathrm{SS}}_C\right) $$
$$ {\mathrm{SS}}_{AC}=\frac{(387)^2+{(360)}^2+{(306)}^2+{(350)}^2+{(326)}^2+{(277)}^2+{(330)}^2+{(307)}^2+{(261)}^2}{6}-\mathrm{156,038.77}-\left(687.75+1438.93\right)=6.35 $$
$$ {\mathrm{SS}}_{BC}=\frac{\sum \limits_{k,i}\ {\left({b}_k{c}_i\right)}^2}{ra}- CF-\left({\mathrm{SS}}_B+{\mathrm{SS}}_C\right) $$
$$ {\mathrm{SS}}_{BC}=\frac{(517)^2+{(550)}^2+{(481)}^2+{(512)}^2+{(409)}^2+{(435)}^2}{9}-\mathrm{156,038.77}-\left(149.25+1438.93\right)=1.38 $$
$$ {\mathrm{SS}}_{AB C}=\frac{\sum \limits_{i,j,k}\ {\left({a}_j{b}_k{c}_i\right)}^2}{r}-\mathrm{CF}-{\mathrm{SS}}_A-{\mathrm{SS}}_B-{\mathrm{SS}}_C-{\mathrm{SS}}_{AB}-{\mathrm{SS}}_{AC}-{\mathrm{SS}}_{BC} $$
$$ {\mathrm{SS}}_{AB C}=\frac{(187)^2+\dots {(134)}^2}{3}-\mathrm{156,038.77}-\left(\ {\mathrm{SS}}_A+{\mathrm{SS}}_B+{\mathrm{SS}}_C+{\mathrm{SS}}_{AB}+{\mathrm{SS}}_{AC}+{\mathrm{SS}}_{BC}\right) $$
$$ {\mathrm{SS}}_{ABC}=\frac{(187)^2+\dots {(134)}^2}{3}-\mathrm{156,038.77}-2286.3=0.024 $$

3.8.6 Fractional Factorial Design

A fractional factorial design is used when a large number of factors must be tested. In this case, only a fraction of the total number of treatment combinations is tested, selected systematically.

3.8.7 Nested and Split Plot Design

Nested and split plot experiments are multifactor experiments. A split plot design is used for factorial experiments on the principle that whole plots are divided into subplots or subunits. Factors that are of greater importance, need greater precision, require smaller amounts of experimental material, or are expected to exhibit smaller differences are placed in the subunits. Consider an experiment testing factor A (nitrogen fertilizer) at four levels in an RCBD, with a second factor B (sorghum cultivars) at three levels placed by dividing each A unit into subunits. The layout for the split plot thus has factor A in the main plots and factor B in the subplots, as shown in Fig. 3.9.

Fig. 3.9

Layout for the split plot design

The layout design steps for the split plot are (i) division of the experimental area into three blocks (replications), each further divided into four main plots for the nitrogen fertilizer applications, and (ii) two separate randomizations, first for the main plots (N treatments) and then for the subplots (cultivars). In the split plot design of Fig. 3.9, the main plot is "c" times larger than the subplot; since c = 3 here (three cultivars in the subplots), the main plot is three times the size of the subplot. Each main plot treatment is tested, e.g., 3 times, while each subplot treatment is tested 12 times, which gives more precision for subplot treatments than for main plot treatments. The partitioning of degrees of freedom for the split plot design under different arrangements is presented in Table 3.19.

Table 3.19 Degree of freedom for split plot design under different arrangements

3.8.8 Strip Plot/Split-Block Design

A strip plot (split-block) design is used in experiments where both factors (e.g., A and B with a and b levels) require large plot areas. The whole area is divided into "a" horizontal strips and "b" vertical strips; one level of factor A is applied to each horizontal strip, while one level of B is applied to each vertical strip. The main difference from the split plot design is that the second factor is also applied in strips.

3.8.9 Split-Split Plot Design

Split-split plot designs are applicable to three-factor factorial experiments, with factor A assigned to whole plots, factor B to subplots, and factor C to sub-subplots. The ANOVA for a split-split plot design with r blocks, a levels of factor A, b levels of factor B, and c levels of factor C is shown in Table 3.20.

Table 3.20 Analysis of variance for split-split plot design

3.8.10 MANOVA (Multivariate Analysis of Variance)

Multivariate analysis of variance (MANOVA) is ANOVA with several dependent variables. It tests the difference between two or more vectors of means, e.g., evaluating students' improvement in Physics and Chemistry under different syllabuses. In this case, the response variable (students' improvement) is affected by the experimenter's manipulation of the independent variables. The assumptions of MANOVA are:

  1. The dependent variables should be normally distributed.

  2. There should be linear relationships among all pairs of dependent variables.

  3. Variances should be homogeneous.

3.9 ANCOVA (Analysis of Covariance)

Analysis of covariance (ANCOVA) combines concepts of analysis of variance and regression; it is used when one independent variable is not at predetermined levels. The uses of ANCOVA include (i) increasing precision and controlling error, (ii) estimating missing data, (iii) adjusting treatment means of the dependent variable for corresponding independent variables, (iv) assisting in data interpretation, and (v) partitioning the total covariance into parts.

3.10 Principal Component Analysis (PCA)

Principal component analysis is a multivariate statistical method used to examine variation and patterns in a data set; it is an easy way to visualize and explore data (Ahmed et al. 2020). Consider data in two dimensions first (e.g., height and weight). The data can be plotted in a scatter plot, but to see the structure of the variation, PCA constructs a new coordinate system whose axes need not have any physical meaning. PCA is thus a statistical procedure that uses an orthogonal transformation to convert a set of observations of correlated variables into values of linearly uncorrelated variables. It is the most common form of factor analysis, applied to analyze interrelationships among variables (Fig. 3.10). The main objective of PCA is to cluster variables into manageable groups, known as components (factors). The steps involved in PCA are listed below, followed by a code sketch:

  1. Standardization of the data: \( z=\frac{\mathrm{Variable}\ \mathrm{value}-\mathrm{Mean}}{\mathrm{Standard}\ \mathrm{deviation}} \)

  2. Computing the covariance matrix (identification of correlation and dependence among features in the data set)

  3. Calculation of eigenvectors and eigenvalues

  4. Computing the principal components

  5. Reducing the dimension of the data set
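The five steps can be traced directly in NumPy. The sketch below uses made-up height/weight values; in practice a library routine would typically be used instead:

```python
import numpy as np

# Illustrative two-dimensional data (e.g., height in cm, weight in kg).
X = np.array([[170, 65], [165, 59], [180, 81],
              [175, 74], [160, 55], [185, 85]], dtype=float)

# Step 1: standardize each variable, z = (value - mean) / standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh suits symmetric matrices),
# sorted by decreasing explained variance.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: compute the principal component scores.
scores = Z @ eigvecs

# Step 5: reduce the dimension by keeping only the first component.
reduced = scores[:, :1]
print("proportion of variance explained:", eigvals / eigvals.sum())
```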

Fig. 3.10 PCA flow diagram

3.11 Regression

Consider a random sample of n observations in which Y values are determined from the corresponding X values, i.e., (X1, Y1), (X2, Y2), (X3, Y3), …, (Xn, Yn). Here Y is the dependent variable and X the independent variable. The first descriptive technique that can be used to examine the relationship between X and Y is the scatter diagram, drawn by plotting X and Y in Cartesian coordinates. The pattern of the plotted points indicates whether the relationship is linear or nonlinear (Fig. 3.11). If the relationship is linear, we need to fit a model to the given data. Mathematically, the relation between X and Y can be written as:

$$ Y\propto X $$
Fig. 3.11 Scatter plot to show relationship between two variables X and Y

This shows that a relationship exists between the two variables, and a straight line drawn through the points can serve as a moving average of the Y values. The equation of the straight line is:

$$ Y=a+ bX $$

Any point (X, Y) on this line has an X coordinate (abscissa) and a Y coordinate (ordinate) whose values satisfy this equation. When X = 0, Y = a (the intercept, i.e., the value of Y when X is zero); when the intercept a is zero, the line passes through the origin. The change in Y per unit change in X is called the slope of the line and is represented by b. Thus \( b=\frac{\Delta Y}{\Delta X}=\frac{\mathrm{Unit}\ \mathrm{change}\ \mathrm{in}\ Y}{\mathrm{Unit}\ \mathrm{change}\ \mathrm{in}\ X} \). If b is positive, the two variables increase or decrease together; if b is negative, one increases while the other decreases. This is an example of a simple linear regression equation (Ahmed et al. 2011). If the number of predictor variables X is increased (X1 to Xn) against Y, the model is called multiple linear regression, with the form:

$$ Y=a+{\beta}_1{X}_1+{\beta}_2{X}_2+{\beta}_3{X}_3+\dots +{\beta}_n{X}_n+\varepsilon $$

where X1…Xn = independent non-random variables; β1, β2, …, βn = slopes (regression coefficients); and ε = a random error term with expected value zero.
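A short sketch of fitting such a multiple linear regression by ordinary least squares in NumPy; the two predictors and responses are illustrative only:

```python
import numpy as np

# Two illustrative predictors X1, X2 and a response Y; values are made up.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([4.1, 4.9, 9.2, 10.1, 14.8, 15.9])

# Prepend a column of ones so the first coefficient is the intercept a.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
a, b1, b2 = coef
print(f"Y = {a:.2f} + {b1:.2f} X1 + {b2:.2f} X2")
```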

Let’s consider the data set presented in Table 3.21 to describe the method of least square in order to fit a straight line and calculate simple regression equation and coefficient of determination (R2). The calculation involves determination of SSxx, SSxy, \( \overline{X},\overline{Y} \), and β1 as shown in the following equations:

$$ {\mathrm{SS}}_{xx}={\sum}_{i=1}^n{X^2}_i-\frac{{\left({\sum}_{i=1}^n{X}_i\right)}^2}{n}=639-\frac{(45)^2}{10}=436.5 $$
$$ {\mathrm{SS}}_{xy}={\sum}_{i=1}^n{X}_i{Y}_i-\frac{\left({\sum}_{i=1}^n{X}_i\right)\left({\sum}_{i=1}^n{Y}_i\right)}{n}=1060-\frac{(45)(55)}{10}=812.5 $$
$$ \overline{X}=4.5\ \mathrm{and}\ \overline{Y}=5.5. $$
$$ {\beta}_1=\frac{{\mathrm{SS}}_{xy}}{{\mathrm{SS}}_{xx}}=\frac{812.5}{436.5}=1.86 $$

and

$$ \overline{Y}=a+{\beta}_1\overline{X} $$
$$ a=\overline{Y}-{\beta}_1\overline{X}=5.5-(1.86)(4.5)=5.5-8.37=-2.87 $$
Table 3.21 Data set to illustrate method of least squares to fit a straight line

Hence simple regression equation for this data is:

$$ \hat{Y}=a+{\beta}_1X=-2.87+(1.86)X $$

The plot for this least square line is shown in Fig. 3.12. The quality of this fit can be measured quantitatively by using coefficient of determination (R2). The equation for R2 calculation is:

$$ {R}^2=\frac{{\mathrm{SS}}_{yy}-{\mathrm{SS}}_{\mathrm{error}}}{{\mathrm{SS}}_{yy}}=1-\frac{\sum \limits_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2}{\sum_{i=1}^n{\left({y}_i-\overline{y}\right)}^2} $$
$$ {\mathrm{SS}}_{\mathrm{error}}=\sum \limits_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2=\sum \limits_{i=1}^n{\left({y}_i-\left(a+{\beta}_1{x}_i\right)\right)}^2=\sum \limits_{i=1}^n{\left({y}_i-a-{\beta}_1{x}_i\right)}^2 $$
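Since the raw observations of Table 3.21 are not reproduced here, the following Python sketch implements the same SSxx, SSxy, and R2 formulas for an arbitrary paired sample; the x, y values shown are illustrative only:

```python
import numpy as np

# Least squares fit and coefficient of determination from the formulas above.
def least_squares_fit(x, y):
    n = len(x)
    ss_xx = np.sum(x * x) - np.sum(x) ** 2 / n
    ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    b1 = ss_xy / ss_xx                 # slope
    a = y.mean() - b1 * x.mean()       # intercept
    y_hat = a + b1 * x                 # fitted values
    ss_yy = np.sum((y - y.mean()) ** 2)
    ss_error = np.sum((y - y_hat) ** 2)
    r2 = 1 - ss_error / ss_yy          # coefficient of determination
    return a, b1, r2

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative values only
y = np.array([0.9, 2.1, 2.9, 4.2, 4.8])
a, b1, r2 = least_squares_fit(x, y)
print(f"Y_hat = {a:.2f} + {b1:.2f} X,  R2 = {r2:.3f}")
```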
Fig. 3.12 Simple linear regression line with regression equation and coefficient of determination (R2)

Another approach that can be used to test hypotheses is the ANOVA table presented in the earlier section. The ANOVA table for regression analysis is presented in Table 3.22. Furthermore, the application of multiple linear stepwise regression models is elaborated using spring wheat grain yield data with respective R2 values (Table 3.23).

Table 3.22 ANOVA table for simple regression
Table 3.23 Multiple linear stepwise regression models for spring wheat grain yield with environmental variables (E = environments (2008–09 and 2009–10), PW = planting windows, SR1 = solar radiation at anthesis, SR2 = solar radiation at maturity, T1 = mean average temperature at anthesis, T2 = mean average temperature at maturity, PTQ1 = photothermal quotient at anthesis, PTQ2 = photothermal quotient at maturity) using the stepwise method developed to predict wheat grain yield under a changing climate

3.12 Correlation

Correlation is used to measure the intensity or degree of association between variables. It is closely related to covariance: the correlation coefficient is the covariance scaled by the standard deviations of the two variables. It is a bivariate statistical technique. The simple linear correlation coefficient or simple correlation (also called total correlation or product-moment correlation) is used for descriptive purposes and can be calculated using the following equivalent equations:

$$ r=\frac{\sum \left(X-\overline{X}\right)\left(Y-\overline{Y}\right)/\left(n-1\right)}{\sqrt{\sum {\left(X-\overline{X}\right)}^2/\left(n-1\right)}\sqrt{\sum {\left(Y-\overline{Y}\right)}^2/\left(n-1\right)}} $$
$$ r=\frac{\sum \left(X-\overline{X}\right)\left(Y-\overline{Y}\right)}{\sqrt{\sum {\left(X-\overline{X}\right)}^2\sum {\left(Y-\overline{Y}\right)}^2}} $$

The correlation coefficient ranges from +1 to −1. If r = +1, there is a perfect positive correlation; if r = −1, a perfect negative correlation; and if r = 0, no correlation at all. Correlation measures co-relation, a joint property of two variables, while regression describes the change of one variable in relation to change of another. In correlation, random pairs of observations are obtained, while in regression only the dependent variable needs to be random and normally distributed. The application of correlation is illustrated in Fig. 3.13 (Ahmed 2011).
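As a sketch, the product-moment formula above can be computed directly or with scipy.stats.pearsonr; the paired values below are illustrative only:

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative paired sample (e.g., grain yield and one yield component).
x = np.array([2.1, 2.5, 3.0, 3.4, 3.9, 4.2])
y = np.array([30.0, 34.0, 41.0, 44.0, 50.0, 52.0])

# Direct implementation of the product-moment formula above.
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

r, p = pearsonr(x, y)  # same coefficient, with a p-value for H0: r = 0
print(f"r (manual) = {r_manual:.3f}, r (scipy) = {r:.3f}, p = {p:.4f}")
```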

Fig. 3.13 Correlation analysis between spring wheat yield and yield components

3.13 Analytical Tools/Software

Analytical tools that can be used for statistical analysis are listed below:

  1. R

  2. SAS

  3. SigmaPlot

  4. Statgraphics

  5. Minitab

  6. SPSS

  7. MS Excel

  8. MATLAB

  9. GraphPad Prism

  10. GenStat

  11. SigmaStat

  12. Stata

  13. Statistica