The Kolmogorov–Smirnov test is a nonparametric goodness-of-fit test used to determine whether two distributions differ or whether an underlying probability distribution differs from a hypothesized distribution. It is used when we have two samples coming from two possibly different populations. Unlike the Mann–Whitney test and the Wilcoxon test, whose goal is to detect a difference between two means or medians, the Kolmogorov–Smirnov test has the advantage of considering the distribution functions as a whole. The Kolmogorov–Smirnov test can also be used as a goodness-of-fit test. In this case, we have only one random sample obtained from a population whose distribution function is specified and known.

HISTORY

The goodness-of-fit test for one sample was introduced by Andrey Nikolaevich Kolmogorov (1933).

The Kolmogorov–Smirnov test for two samples was introduced by Nikolai Vasilyevich Smirnov (1939).

In Massey (1952) we find a Smirnov table for the Kolmogorov–Smirnov test for two samples, and in Miller (1956) we find a Kolmogorov table for the goodness-of-fit test.

MATHEMATICAL ASPECTS

Consider two independent random samples: \( { \left(X_1,X_2,\ldots,X_n\right) } \), a sample of size n from population 1, and \( { \left(Y_1,Y_2,\ldots,Y_m\right) } \), a sample of size m from population 2. We denote their unknown distribution functions by \( { F\left(x\right) } \) and \( { G\left(x\right) } \), respectively.

Hypotheses

The hypotheses to test are as follows:

A: Two-sided case:

\( { H_0 } \): \( { F\left(x\right)=G\left(x\right) } \) for each x

\( { H_1 } \): \( { F\left(x\right) \neq G\left(x\right) } \) for at least one value of x

B: One-sided case:

\( { H_0 } \): \( { F\left(x\right) \leq G\left(x\right) } \) for each x

\( { H_1 } \): \( { F\left(x\right) > G\left(x\right) } \) for at least one value of x

C: One-sided case:

\( { H_0 } \): \( { F\left(x\right) \geq G\left(x\right) } \) for each x

\( { H_1 } \): \( { F\left(x\right) < G\left(x\right) } \) for at least one value of x

In case A, we make the hypothesis that there is no difference between the distribution functions of these two populations. Both populations can then be seen as one population.

In case B, we make the hypothesis that the distribution function of population 1 is smaller than that of population 2. We sometimes say that, generally, X tends to be smaller than Y.

In case C, we make the hypothesis that, generally, X tends to be greater than Y.

We denote by \( { H_1\left(x\right) } \) the empirical distribution function of the sample \( { \left(X_1, X_2, \ldots, X_n\right) } \) and by \( { H_2\left(x\right) } \) the empirical distribution function of the sample \( { \left(Y_1,Y_2,\ldots,Y_m\right) } \). The test statistics are defined as follows:

A: Two-sided case

The test statistic \( { T_1 } \) is defined as the greatest vertical distance between the two empirical distribution functions:

$$ T_1= \sup_x|H_1\left(x\right) -H_2\left(x\right)|\:. $$

B: One-sided case

The test statistic \( { T_2 } \) is defined as the greatest vertical distance where \( { H_1\left(x\right) } \) is greater than \( { H_2\left(x\right) } \):

$$ T_2 = \sup_x[H_1\left(x\right) - H_2\left(x\right)]\:. $$

C: One-sided case

The test statistic \( { T_3 } \) is defined as the greatest vertical distance where \( { H_2\left(x\right) } \) is greater than \( { H_1\left(x\right) } \):

$$ T_3= \sup_x[H_2\left(x\right)-H_1\left(x\right)]\:. $$
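Because \( { H_1\left(x\right) } \) and \( { H_2\left(x\right) } \) are right-continuous step functions that jump only at the observed values, the suprema above are attained at the pooled sample points. The following minimal Python sketch computes all three statistics on this basis (the function name and the use of NumPy are our illustration, not part of the test's definition):

```python
import numpy as np

def ks_two_sample_stats(x, y):
    """Compute T1, T2, T3 from the two empirical distribution functions."""
    x, y = np.sort(x), np.sort(y)
    pooled = np.concatenate([x, y])
    # H1(t) = (number of X-observations <= t) / n, and likewise H2 for Y
    h1 = np.searchsorted(x, pooled, side="right") / len(x)
    h2 = np.searchsorted(y, pooled, side="right") / len(y)
    d = h1 - h2
    t1 = np.max(np.abs(d))     # T1 = sup |H1 - H2|
    # The difference is 0 below the smallest observation, so T2, T3 >= 0
    t2 = max(np.max(d), 0.0)   # T2 = sup (H1 - H2)
    t3 = max(np.max(-d), 0.0)  # T3 = sup (H2 - H1)
    return t1, t2, t3
```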

Decision Rule

We reject \( { H_0 } \) at the significance level α if the appropriate test statistic (\( { T_1 } \), \( { T_2 } \), or \( { T_3 } \)) is greater than the value of the Smirnov table with parameters n, m, and \( { 1-\alpha } \), which we denote by \( { t_{n,m,1-\alpha} } \), that is, if

$$ T_1\ (\text{or } T_2 \text{ or } T_3) > t_{n,m,1-\alpha}\:. $$

If we want to test the goodness of fit of the unknown distribution function \( { F\left(x\right) } \) of a random sample to a specified and known distribution function \( { F_o\left(x\right) } \), then the hypotheses are the same as in the two-sample case, except that \( { G\left(x\right) } \) is replaced by \( { F_o\left(x\right) } \).

If \( { H\left(x\right) } \) is the empirical distribution function of the random sample, then the test statistics \( { T_1 } \), \( { T_2 } \), and \( { T_3 } \) are defined as follows:

$$ \begin{aligned} T_1 & =\sup_x\left|F_o\left(x\right) -H\left(x\right)\right|\:, \\ T_2 & =\sup_x[F_o\left(x\right) -H\left(x\right)]\:, \\ T_3 & =\sup_x[H\left(x\right) -F_o\left(x\right)]\:. \end{aligned} $$

The decision rule is as follows: reject \( { H_0 } \) at the significance level α if \( { T_1 } \) (or \( { T_2 } \) or \( { T_3 } \)) is greater than the value of the Kolmogorov table with parameters n and \( { 1-\alpha } \), which we denote by \( { t_{n,1-\alpha} } \), that is, if

$$ T_1\ (\text{or } T_2 \text{ or } T_3) > t_{n,1-\alpha}\:. $$
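The same device yields the goodness-of-fit statistics: after sorting, the empirical distribution function takes the value \( { i/n } \) at the ith order statistic. A minimal sketch under the same assumptions as above, with \( { F_o } \) passed as a vectorized callable; like the worked example below, it evaluates the difference at the observed points:

```python
import numpy as np

def ks_goodness_of_fit_stats(x, F_o):
    """Compute T1, T2, T3 against a known distribution function F_o."""
    x = np.sort(x)
    n = len(x)
    h = np.arange(1, n + 1) / n  # H at the order statistics: i/n
    d = F_o(x) - h
    t1 = np.max(np.abs(d))       # T1 = sup |F_o - H|
    t2 = max(np.max(d), 0.0)     # T2 = sup (F_o - H)
    t3 = max(np.max(-d), 0.0)    # T3 = sup (H - F_o)
    return t1, t2, t3
```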

DOMAINS AND LIMITATIONS

To perform the Kolmogorov–Smirnov test, the following conditions must be satisfied:

1. Both samples must be taken randomly from their respective populations.

2. The two samples must be mutually independent.

3. The measurement scale must be at least ordinal.

4. For the test to be exact, the random variables must be continuous; otherwise the test is less precise.

EXAMPLES

The first example treats the Kolmogorov–Smirnov test for two samples, and the second the goodness-of-fit test.

A class has 25 pupils: 15 boys and 10 girls. We give a test of mental calculation to see whether the boys tend to be better than the girls in this domain.

The data are presented in the following table; higher scores correspond to better results on the test.

Boys (\( { X_i } \))    Girls (\( { Y_i } \))

19.8  17.5    17.7  14.1
12.3  17.9     7.1  23.6
10.6  21.1    21.0  11.1
11.3  16.4    10.7  20.3
13.3   7.7     8.6  15.7
14.0  15.2
 9.2  16.0
15.6

We test the hypothesis that the distributions of the girls' results and of the boys' results are identical. This means that the population from which the sample of X is taken has the same distribution function as the population from which the sample of Y is taken. Hence the null hypothesis:

$$ H_0:F\left(x\right) =G\left(x\right)\enskip \text{for each } x\:. $$

Applying the two-sided case here, we calculate:

$$ T_1 = \sup_x\left|H_1\left(x\right) -H_2\left(x\right)\right|\:, $$

where \( { H_1\left(x\right) } \) and \( { H_2\left(x\right) } \) are the empirical distribution functions of the samples \( { \left(X_1,X_2,\ldots,X_{15}\right) } \) and \( { \left(Y_1,Y_2,\ldots,Y_{10}\right) } \), respectively. In the following table, we have arranged the observations of the two samples in increasing order to simplify the calculation of \( { H_1 (x) - H_2 (x) } \).

\( { X_i } \)    \( { Y_i } \)    \( { H_1\left(x\right)-H_2\left(x\right) } \)

         7.1    0 - 1/10 = -0.1
 7.7            1/15 - 1/10 = -0.0333
         8.6    1/15 - 2/10 = -0.1333
 9.2            2/15 - 2/10 = -0.0667
10.6            3/15 - 2/10 = 0
        10.7    3/15 - 3/10 = -0.1
        11.1    3/15 - 4/10 = -0.2
11.3            4/15 - 4/10 = -0.1333
12.3            5/15 - 4/10 = -0.0667
13.3            6/15 - 4/10 = 0
14.0            7/15 - 4/10 = 0.0667
        14.1    7/15 - 5/10 = -0.0333
15.2            8/15 - 5/10 = 0.0333
15.6            9/15 - 5/10 = 0.1
        15.7    9/15 - 6/10 = 0
16.0            10/15 - 6/10 = 0.0667
16.4            11/15 - 6/10 = 0.1333
17.5            12/15 - 6/10 = 0.2
        17.7    12/15 - 7/10 = 0.1
17.9            13/15 - 7/10 = 0.1667
19.8            14/15 - 7/10 = 0.2333
        20.3    14/15 - 8/10 = 0.1333
        21.0    14/15 - 9/10 = 0.0333
21.1            1 - 9/10 = 0.1
        23.6    1 - 1 = 0

We then have:

$$ \begin{aligned} T_1 & =\sup_x\left|H_1\left(x\right) -H_2\left(x\right)\right| \\ & =0.2333\:. \end{aligned} $$

The value of the Smirnov table for \( { n=15 } \), \( { m=10 } \), and \( { 1-\alpha =0.95 } \) equals \( { t_{15,10,0.95}=0.5 } \).

Thus \( { T_1=0.2333<t_{15,10,0.95}=0.5 } \), and \( { H_0 } \) cannot be rejected. This means that there is no significant difference in mental calculation performance between the girls and the boys.
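As a cross-check, a widely used library implementation yields the same two-sided statistic for these data (this assumes SciPy is available; the p-value it reports comes from SciPy's own computation rather than from the Smirnov table used above):

```python
from scipy import stats

boys = [19.8, 17.5, 12.3, 17.9, 10.6, 21.1, 11.3, 16.4,
        13.3, 7.7, 14.0, 15.2, 9.2, 16.0, 15.6]
girls = [17.7, 14.1, 7.1, 23.6, 21.0, 11.1, 10.7, 20.3, 8.6, 15.7]

# Two-sided two-sample Kolmogorov-Smirnov test
result = stats.ks_2samp(boys, girls, alternative="two-sided")
print(result.statistic)  # 0.2333..., matching T1 above
```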

Consider the following random sample of size 10: \( { X_1=0.695 } \), \( { X_2=0.937 } \), \( { X_3=0.134 } \), \( { X_4=0.222 } \), \( { X_5=0.239 } \), \( { X_6=0.763 } \), \( { X_7=0.980 } \), \( { X_8=0.322 } \), \( { X_9=0.523 } \), \( { X_{10}=0.578 } \).

We want to verify with the Kolmogorov–Smirnov test whether this sample comes from a uniform distribution on \( { \left[0,1\right] } \), whose distribution function is given by:

$$ F_o\left(x\right) = \begin{cases} 0 & \text{if } x < 0\:, \\ x & \text{if } 0 \leq x < 1\:, \\ 1 & \text{if } x \geq 1\:. \end{cases} $$

The null hypothesis \( { H_0 } \) is then as follows, where \( { F\left(x\right) } \) is the unknown distribution function of the population associated with the sample:

$$ H_0 : F\left(x\right) =F_o\left(x\right)\enskip\text{for each } x\:. $$

Applying the two-sided case, we calculate:

$$ T_1= \sup_x\left|F_o\left(x\right) -H\left(x\right)\right|\:, $$

where \( { H\left(x\right) } \) is the empirical distribution function of the sample \( { \left(X_1, X_2, \ldots, X_{10}\right) } \).

In the following table, we arrange the 10 observations in increasing order to simplify the calculation of \( { F_o (x) - H(x) } \).

\( { X_i } \)    \( { F_o (x) } \)    \( { H(x) } \)    \( { F_o(x) - H(x) } \)

0.134    0.134    0.1    0.134 - 0.1 = 0.034
0.222    0.222    0.2    0.222 - 0.2 = 0.022
0.239    0.239    0.3    0.239 - 0.3 = -0.061
0.322    0.322    0.4    0.322 - 0.4 = -0.078
0.523    0.523    0.5    0.523 - 0.5 = 0.023
0.578    0.578    0.6    0.578 - 0.6 = -0.022
0.695    0.695    0.7    0.695 - 0.7 = -0.005
0.763    0.763    0.8    0.763 - 0.8 = -0.037
0.937    0.937    0.9    0.937 - 0.9 = 0.037
0.980    0.980    1.0    0.980 - 1.0 = -0.020

We then obtain:

$$ T_1= \sup_x\left|F_o\left(x\right)-H\left(x\right)\right|=0.078\:. $$

The value of the Kolmogorov table for \( { n=10 } \) and \( { 1-\alpha =0.95 } \) is \( { t_{10,0.95}=0.409 } \).

Since \( { T_1 = 0.078 } \) is smaller than \( { t_{10,0.95} = 0.409 } \), \( { H_0 } \) cannot be rejected. This means that the random sample could come from a uniformly distributed population.
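The table's computation can be reproduced in a few lines (a sketch assuming NumPy; as in the table, the difference is evaluated at the observed points, where \( { F_o\left(x\right)=x } \)):

```python
import numpy as np

x = np.sort([0.695, 0.937, 0.134, 0.222, 0.239,
             0.763, 0.980, 0.322, 0.523, 0.578])
n = len(x)
h = np.arange(1, n + 1) / n  # H(x_(i)) = i/n
d = x - h                    # F_o(x) - H(x), since F_o(x) = x here
t1 = np.max(np.abs(d))
print(round(t1, 3))          # 0.078, matching T1 above
```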

FURTHER READING

Goodness-of-fit test

Hypothesis testing

Nonparametric test