We will consider data on economic indicators for EU countries from Greenacre (2012). The variables correspond to: CPI, consumer price index (index = 100 in 2005); UNE, unemployment rate in 15–64 age group; INP, industrial production (index = 100 in 2005); BOP, balance of payments; PRC, private final consumption expenditure; UN_perc, annual change in unemployment rate.
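#set the plot size for the notebook
options(repr.plot.width=6, repr.plot.height=6)
#the data file is space-separated
eu<-read.csv('eu_indicators.csv',sep=' ')
head(eu) #inspect the data
#standardise the six indicators (columns 3 to 8) and label the rows by country
dat<-scale(eu[,3:8])
rownames(dat)<-eu$Countries
#PCA biplot of the standardised data
biplot(princomp(dat))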
Countries | abbr | CPI | UNE | INP | BOP | PRC | UN_perc |
---|---|---|---|---|---|---|---|
Belgium | BE | 116.03 | 4.77 | 125.59 | 908.6 | 6716.5 | -1.6 |
Bulgaria | BG | 141.20 | 7.31 | 102.39 | 27.8 | 1094.7 | 3.5 |
CzechRep. | CZ | 116.20 | 4.88 | 119.01 | -277.9 | 2616.4 | -0.6 |
Denmark | DK | 114.20 | 6.03 | 88.20 | 1156.4 | 7992.4 | 0.5 |
Germany | DE | 111.60 | 4.63 | 111.30 | 499.4 | 6774.6 | -1.3 |
Estonia | EE | 135.08 | 9.71 | 111.50 | 153.4 | 2194.1 | -7.7 |
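The height at which two branches of the dendrogram merge is the dissimilarity between the two clusters being joined, so examples that merge low in the tree are more similar to each other than examples that merge near the top.
#perform hierarchical clustering (Euclidean distance and complete linkage are the defaults) and show the cluster dendrogram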
hc<-hclust(dist(dat))
plot(hc,hang=-1)
#additional visualisation tools available in library ape
library(ape)
plot(as.phylo(hc), type = "fan")
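To make the meaning of the dendrogram levels concrete, here is a minimal sketch on a made-up set of five points on a line (not the EU data; the names `x`, `hc_toy` and `groups` are only illustrative). It inspects the merge heights stored by `hclust`, the dissimilarities implied by the tree via `cophenetic`, and the clusters obtained by cutting the tree at a chosen height.

```r
# Toy data: five points on a line, chosen so the arithmetic is easy to follow
x <- matrix(c(0, 1, 3, 7, 9), ncol = 1,
            dimnames = list(c("a", "b", "c", "d", "e"), NULL))

hc_toy <- hclust(dist(x))        # complete linkage is the default method

# Heights at which successive merges happen (one per merge, n - 1 in total)
hc_toy$height

# Dissimilarity implied by the tree: the height at which two points first
# end up in the same cluster
coph <- as.matrix(cophenetic(hc_toy))
coph["a", "c"]
coph["a", "e"]

# Cutting the tree at height 2.5 keeps only the merges made below that level
groups <- cutree(hc_toy, h = 2.5)
groups
```

Cutting at a height rather than asking for a fixed number of clusters is a design choice: it fixes the largest within-cluster dissimilarity we are willing to tolerate and lets the number of clusters follow from the data.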
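To join clusters $C_i$ and $C_j$ into super-clusters, we need a way to measure the dissimilarity $D(C_i,C_j)$ between them. Writing $d(x,y)$ for the dissimilarity between two examples, the three most common choices are:
- Single linkage: $D(C_i,C_j)=\min_{x\in C_i,\,y\in C_j} d(x,y)$
- Complete linkage: $D(C_i,C_j)=\max_{x\in C_i,\,y\in C_j} d(x,y)$
- Average linkage: $D(C_i,C_j)=\frac{1}{|C_i||C_j|}\sum_{x\in C_i}\sum_{y\in C_j} d(x,y)$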
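As a small illustration of the single, complete and average linkage rules, the sketch below computes all three between-cluster dissimilarities by hand for two tiny, made-up clusters in the plane. The clusters `Ci` and `Cj` and the helper `cross_dist` are hypothetical and exist only for this example; Euclidean distance is assumed.

```r
# Two small hypothetical clusters in the plane (one point per row)
Ci <- rbind(c(0, 0), c(0, 1))
Cj <- rbind(c(3, 0), c(4, 1))

# Helper (illustrative only): Euclidean distances between every member of A
# and every member of B, returned as an nrow(A) x nrow(B) matrix
cross_dist <- function(A, B) {
  n <- nrow(A)
  full <- as.matrix(dist(rbind(A, B)))
  full[seq_len(n), n + seq_len(nrow(B)), drop = FALSE]
}

D <- cross_dist(Ci, Cj)

single_link   <- min(D)    # distance between the closest pair
complete_link <- max(D)    # distance between the farthest pair
average_link  <- mean(D)   # mean over all between-cluster pairs

c(single = single_link, average = average_link, complete = complete_link)
```

By construction the average-linkage value always lies between the single-linkage and complete-linkage values.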
Single linkage generally results in elongated, loosely connected clusters: it tends to place in the same cluster items that are linked by a series of close intermediate observations. This is called chaining. Clusters in complete linkage tend to be more compact and have smaller diameters (the diameter being the largest distance between cluster members). Average linkage provides a balance between the two. Let us visualise the behaviour of these different methods on an artificial dataset generated from a mixture of three 2-dimensional Gaussians.
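library(cluster) #provides the xclara dataset
#draw a random subsample of 500 points (a different sample on every run)
dat<-xclara[sample(nrow(xclara),500),]
#plot the data
plot(dat)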
#single linkage: dendrogram, then the points coloured by cluster for 6 and 3 clusters
hcs<-hclust(dist(dat),method="single")
plot(hcs,hang=-1,labels = FALSE)
clusterlab <- cutree(hcs, 6)
plot(dat,col=clusterlab)
clusterlab <- cutree(hcs, 3)
plot(dat,col=clusterlab)
#complete linkage: dendrogram, then the points coloured by cluster for 6 and 3 clusters
hcc<-hclust(dist(dat),method="complete")
plot(hcc,hang=-1,labels = FALSE)
clusterlab <- cutree(hcc, 6)
plot(dat,col=clusterlab)
clusterlab <- cutree(hcc, 3)
plot(dat,col=clusterlab)
#average linkage: dendrogram, then the points coloured by cluster for 6 and 3 clusters
hca<-hclust(dist(dat),method="average")
plot(hca,hang=-1,labels = FALSE)
clusterlab <- cutree(hca, 6)
plot(dat,col=clusterlab)
clusterlab <- cutree(hca, 3)
plot(dat,col=clusterlab)
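Besides looking at the scatter plots, one way to compare the three linkage methods is to tabulate how many points fall into each cluster when every dendrogram is cut into three clusters. The sketch below is self-contained and draws its own subsample; the seed value and the names `pts`, `methods` and `sizes` are arbitrary choices made for this illustration.

```r
library(cluster)                        # provides the xclara dataset

set.seed(1)                             # arbitrary seed, only for reproducibility
pts <- xclara[sample(nrow(xclara), 500), ]
d <- dist(pts)                          # Euclidean distances, computed once

methods <- c("single", "complete", "average")

# For each linkage method, cut the tree into 3 clusters and record the
# cluster sizes, largest first (one column per method)
sizes <- sapply(methods, function(m) {
  labs <- cutree(hclust(d, method = m), k = 3)
  sort(tabulate(labs, nbins = 3), decreasing = TRUE)
})
sizes
```

If chaining occurs, single linkage would be expected to show one very large cluster alongside a few very small ones, whereas complete and average linkage should give more balanced sizes. The exact counts depend on the random subsample.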