6 minutes

# Kruskal–Wallis

The **Kruskal–Wallis test** by ranks, **Kruskal–Wallis H test** or **One-way ANOVA on ranks** is a non-parametric method for testing whether samples originate from the same distribution^{1}, especially when the ANOVA normality assumptions may not apply. Although this test is for identical populations, it is designed to be sensitive to unequal means. … see more

Let $ n_{i} (i = 1, 2, …, k) $ represent the sample sizes for each of the k groups (i.e., samples) in the data. Next, rank the combined sample. Then compute R_{i} = the sum of the ranks for group *i*. Then the *Kruskal Wallis test* is:

$$ H = \left [ \frac{12}{N(N+1)} \cdot \Sigma \frac{T_{c}^{2}}{\eta _{c}}\right ]-3\cdot (N+1) $$

This statistic approximates a chi-square distribution with *k-1* degrees of freedom if the null hypothesis of equal populations is true. Each of the n_{i} should be at least 5 for the approximation to be valid.

Finally, the *p-value* is approximated by $ Pr(\chi _{g-1}^{2}\geq H) $.

If some $ n_{i} $ values are small (less than 5) the probability distribution of *H* can be quite different from this chi-squared distribution. If a table of the chi-squared probability distribution is available, the critical value of chi-squared, $ \chi _{\alpha :g-1}^{2} $, can be found by entering the table at *g−1* degrees of freedom and looking under the desired significance or alpha level.

If the statistic is not significant, then there is no evidence of stochastic dominance between the samples. However, if the test is significant then at least one sample stochastically dominates another sample. Therefore, a researcher might use sample contrasts between individual sample pairs, or post hoc tests using Dunn’s test, which (1) properly employs the same rankings as the Kruskal-Wallis test, and (2) properly employs the pooled variance implied by the null hypothesis of the Kruskal-Wallis test in order to determine which of the sample pairs are significantly different^{2}.

### Example

The following example considers the *Atmospheric CO2 Concentrations [ppm]* in a given city. Three weather stations provide value for a semester and are compared among them. Then, the main station compares its values with a secondary device which stores data at a different resolution.

The comparison is made to find whether the stations provide similar information or have significant differences (i.e. *H < p-value*).

```
# setup libraries
library(ggplot2)
library(reshape2)
library(ggthemes)
library(stargazer)
```

The data from the 3 stations is loaded. The date, rather than useful to verify the source of information, is not relevant in the analysis.

```
StationsCity <- read.csv("myData/StationCity.csv", row.names=1)
dplyr::tbl_df(StationsCity)
StationsCity$Date <- NULL
```

Date | Station-GBMcity | Station-GBMnorth | Station-GBMsouth | |
---|---|---|---|---|

1 | 01/01/2013 | 377.04 | 375.71 | 338.48 |

2 | 02/01/2013 | 375.52 | 374.66 | 312.75 |

3 | 03/01/2013 | 380.69 | 367.69 | 327.79 |

4 | 04/01/2013 | 379.21 | 370.54 | 299.91 |

5 | 05/01/2013 | 377.12 | 372.72 | 307.83 |

6 | 06/01/2013 | 378.38 | 367.8 | 330.19 |

7 | 07/01/2013 | 379.44 | 355.17 | 316.36 |

8 | 08/01/2013 | 376.07 | 373.47 | 307.41 |

9 | 09/01/2013 | 377.81 | 371.22 | 331.21 |

```
StationsCity <- melt(StationsCity)
StationsCity$case <- seq(1:(dim(StationsCity)[1]/3))
colnames(StationsCity) <- c('station','value','case')
pStation <- ggplot(data=StationsCity, aes(x=case, y=value, colour=station)) +
geom_point() + geom_line(alpha=0.25) +
geom_hline(yintercept = median(StationsCity$value), linetype='dashed')
pStation + theme_wsj() + scale_colour_wsj("colors6", "")
pStationb <- ggplot(data=StationsCity, aes(value, colour=station)) +
geom_density() +
geom_vline(xintercept = median(StationsCity$value), linetype='dashed')
pStationb + theme_wsj() + scale_colour_wsj("colors6", "")
```

```
testStations <- kruskal.test(value ~ station, data = StationsCity)
testStations
```

`Kruskal-Wallis rank sum test data: value by variable Kruskal-Wallis chi-squared = 359.69, df = 2, p-value < 2.2e-16`

```
StationsCity <- read.csv("myData/StationCity.csv", row.names=1)
StationOld <- read.csv("myData/StationCityOld.csv", row.names=1)
dplyr::tbl_df(StationOld)
```

value | station | week | |
---|---|---|---|

1 | 366.86 | Station.GBMcity-PG | 07/01/2013 |

2 | 371.2 | Station.GBMcity-PG | 14/01/2013 |

3 | 375.63 | Station.GBMcity-PG | 21/01/2013 |

4 | 377.99 | Station.GBMcity-PG | 28/01/2013 |

5 | 373.34 | Station.GBMcity-PG | 04/02/2013 |

6 | 368.65 | Station.GBMcity-PG | 11/02/2013 |

7 | 360.67 | Station.GBMcity-PG | 18/02/2013 |

8 | 368.29 | Station.GBMcity-PG | 25/02/2013 |

9 | 369.09 | Station.GBMcity-PG | 04/03/2013 |

```
StationsCity <- StationsCity[,c('Date','Station.GBMcity')]
StationsCity$station <- 'GBMcity'
StationsCity$case <- rownames(StationsCity)
StationsCity$Date <- NULL
colnames(StationsCity) <- c('value','station','case')
StationOld$week <- NULL
StationOld$case <- rownames(StationOld)
StationOld$station <- 'GBMcity.B'
StationSame <- rbind(StationsCity,StationOld)
StationSame$value <- as.numeric(StationSame$value)
StationSame$station <- as.factor(StationSame$station)
StationSame$case <- as.integer(StationSame$case)
pStation <- ggplot(data=StationSame, aes(x=case, y=value, colour=station)) +
geom_point() + geom_line(alpha=0.25) +
geom_hline(yintercept = median(StationSame$value), linetype='dashed')
pStation + theme_wsj() + scale_colour_wsj("colors6", "")
pStationSec <- ggplot(data=StationSame, aes(value, colour=station)) +
geom_density() +
geom_vline(xintercept = median(StationSame$value), linetype='dashed')
pStationSec + theme_wsj() + scale_colour_wsj("colors6", "")
```

```
testStations <- kruskal.test(value ~ station, data = StationSame)
testStations
```

`Kruskal-Wallis rank sum test data: value by variable Kruskal-Wallis chi-squared = 0.0017473, df = 1, p-value = 0.9667`

To sum up…

- Since the sample is larger than 10 elements,
*H*may be used as*p-value* - H
_{1}is 2.2e-16 - H
_{2}is 0.9667 - Therefore, we may say that
`Test 2`

maintains the relationship among the different stations so the hypothesis*H*is accepted, whereas_{0}`Test 1`

is significantly different and so*H*is rejected—despite it can be seen from the graphic that is caused by one of the stations which is significantly different._{0}

Now, **how the test is made?**

```
StationsCity
maxValue <- max(StationsCity$value)
minValue <- min(StationsCity$value)
sizeVal <- dim(StationsCity)[1]
StationsCityRanked <- StationsCity
StationsCityRanked$value <- rank(StationsCity$value)
StationsCityRanked <- dcast(StationsCityRanked, case ~ station)
StationsCityRanked$case <- NULL
```

station | value | case | Station.GBMcity | Station.GBMnorth | Station.GBMsouth | |
---|---|---|---|---|---|---|

1 | Station.GBMcity | 377.04 | 1 | 485 | 465 | 158 |

2 | Station.GBMcity | 375.52 | 2 | 463 | 444 | 45 |

3 | Station.GBMcity | 380.69 | 3 | 525 | 312 | 107 |

4 | Station.GBMcity | 379.21 | 4 | 512 | 376 | 14 |

5 | Station.GBMcity | 377.12 | 5 | 486 | 409 | 32 |

6 | Station.GBMcity | 378.38 | 6 | 506 | 315 | 120 |

178 | Station.GBMcity | 375.82 | 178 | 467 | 541 | 161 |

179 | Station.GBMcity | 376.83 | 179 | 482 | 360 | 1 |

180 | Station.GBMcity | 374.03 | 180 | 431 | 253.5 | 40 |

181 | Station.GBMcity | 366.1 | 181 | 277 | 309 | 53 |

182 | Station.GBMcity | 373.9 | 182 | 427 | 499 | 207 |

```
sumRanks <- colSums(StationsCityRanked)
meanRanks <- colMeans(StationsCityRanked)
SumCombined <- sum(sumRanks)
MeanCombinded <- sum(meanRanks)
CountCombinded <- table(StationsCity$station)
H <- ( 12 / (dim(StationsCity)[1]*(dim(StationsCity)[1]+1))) *
( (sumRanks[1]^2)/CountCombinded[1] +
(sumRanks[2]^2)/CountCombinded[2] +
(sumRanks[3]^2)/CountCombinded[3] ) -
3 * (dim(StationsCity)[1]+1)
H
df <- 3 -1
lim.p <- qchisq(0.05, df)
lim.p
H < lim.p
```

Similarly, with the other dataset:

```
StationSame
maxValue <- max(StationSame$value)
minValue <- min(StationSame$value)
sizeVal <- dim(StationSame)[1]
StationSameRanked <- StationSame
StationSameRanked$value <- rank(StationSame$value)
StationSameRanked <- dcast(StationSameRanked, case ~ station)
StationSameRanked$case <- NULL
sumRanks <- colSums(StationSameRanked, na.rm = T)
meanRanks <- colMeans(StationSameRanked, na.rm = T)
SumCombined <- sum(sumRanks)
MeanCombinded <- sum(meanRanks)
CountCombinded <- table(StationSame$station)
H <- ( 12 / (dim(StationSame)[1]*(dim(StationSame)[1]+1))) *
( (sumRanks[1]^2)/CountCombinded[1] +
(sumRanks[2]^2)/CountCombinded[2] ) -
3 * (dim(StationSame)[1]+1)
H
df <- 2 -1
lim.p <- qchisq(0.05, df)
lim.p
H < lim.p
```

Finally, the following figure plots density values for the *chi-square* distribution with different degrees of freedom. In the examples, H_{1} is significantly less than p with *df=3-1*, and H_{2}, on the other hand, is more than p with *df=2-1* and so *H _{0}* is accepted.

```
dfDist <- c(1, 2, 5, 10)
labels <- paste('df =',dfDist)
colpal <- c('tomato','green','cyan','gold')
j = 1
plot(x=-1, xlim=c(0,5), ylim=c(0,1), xlab='chi-sq', ylab='density')
for(i in dfDist){
curve(dchisq(x,i), xlim=c(0,10), col=colpal[j], lwd=1, add=TRUE)
j <- j + 1
}
legend("topright", bty='n', text.width=1, labels, lwd=2.5,
lty=c(1, 1, 1, 1), col=colpal)
abline(h=0,lty=3)
abline(v=0,lty=3)
```

- Kruskal; Wallis (1952). “Use of ranks in one-criterion variance analysis”. Journal of the American Statistical Association. 47 (260): 583–621
^{[return]} - Dunn, Olive Jean (1964). “Multiple comparisons using rank sums”. Technometrics. 6 (3): 241–252
^{[return]}