The Kolmogorov-Smirnov test, more commonly referred to as the K-S test, is a non-parametric, distribution-free statistical test used to assess whether a sample follows a given distribution, or whether two samples come from the same distribution. In addition to calculating the D statistic and p-value for the data set, the output states the alternative hypothesis and includes several graphical representations in the form of histograms, normal curves, and empirical distribution functions, all of which help in understanding the sample distribution.
The K-S test relies on the empirical cumulative distribution function (ECDF) to test the agreement between two cumulative distributions. For N ordered data points Y1, Y2, …, YN, the ECDF is defined to be
E_N = n(i) / N
where n(i) is the number of points less than Yi and the values of Yi are sorted in ascending order. The equation generates an increasing step function that grows by 1/N at each ordered data point. The K-S test operates by comparing the empirical distribution function to a theoretical distribution and calculating the maximum vertical distance between the two curves, which is the D value. The null hypothesis states that there is no difference between the two distributions. A p-value is then obtained: the probability, assuming the null hypothesis is true, of observing a D value at least as large as the one calculated. Equivalently, D can be compared with a critical value based on c(α), where c(α) is a function of the chosen significance level α that does not depend on sample size. For p < α, the null hypothesis is rejected, suggesting that the two populations are from different distributions. If p > α, the null hypothesis is not rejected, meaning the data provide no evidence that the distributions differ; this does not prove that the distributions are the same.
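The calculation of D described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names ks_statistic and normal_cdf and the sample values are made up for the example, the theoretical distribution is taken to be a standard normal, and the ECDF uses the common convention of counting points less than or equal to each value, so that it reaches 1 at the largest point.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """One-sample K-S statistic D: largest gap between the ECDF and cdf."""
    ys = sorted(sample)
    n = len(ys)
    d = 0.0
    for i, y in enumerate(ys):
        f = cdf(y)
        # The ECDF jumps from i/n to (i+1)/n at y, so the gap to the
        # theoretical curve is checked on both sides of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical data, chosen only to illustrate the call.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
d_value = ks_statistic(sample, normal_cdf)
print("D =", d_value)
```

Checking both sides of each step matters because the largest gap can occur just before the ECDF jumps as well as just after it.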
The relationship of the test statistic (D value) to the significance level (α) should also be taken into consideration. For a low α, a large difference between the distributions, and therefore a higher D value, is needed to reject the null hypothesis. A high α lowers the threshold, so that even small differences between the distributions, and therefore small D values, lead to rejecting the null hypothesis. Given enough data, the test tends to reject the null hypothesis for data sets that are not from the same continuous distribution. The K-S test is especially useful in understanding the distribution of data and distinguishing among the various distribution types, such as normal, log-normal, Weibull, exponential, and logistic.
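The effect of α on the rejection threshold can be illustrated with a short Python sketch. It assumes the widely used large-sample approximation for the two-sample test, c(α) = sqrt(-ln(α/2)/2), with the critical D value obtained by scaling c(α) by sqrt((n + m)/(n·m)) for sample sizes n and m. Neither formula is stated in the text above, and the function names and numbers are hypothetical.

```python
import math

def c_alpha(alpha):
    """Assumed large-sample coefficient: c(alpha) = sqrt(-ln(alpha / 2) / 2)."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0))

def critical_d(alpha, n, m):
    """Approximate two-sample critical D value for sample sizes n and m."""
    return c_alpha(alpha) * math.sqrt((n + m) / (n * m))

def reject_null(d_value, alpha, n, m):
    """Reject the null hypothesis when D exceeds the critical value."""
    return d_value > critical_d(alpha, n, m)

# A lower alpha gives a larger critical value, so a larger D is needed.
for alpha in (0.01, 0.05, 0.10):
    print(alpha, round(critical_d(alpha, 50, 50), 4))

# The same hypothetical D value can be rejected at one alpha but not another.
d_value = 0.25
print(reject_null(d_value, 0.10, 50, 50), reject_null(d_value, 0.05, 50, 50))
```

Under these assumptions, a D of 0.25 from two samples of 50 points exceeds the threshold at α = 0.10 but not at α = 0.05, which is the trade-off the paragraph above describes.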