An R package for comparing two classifications or clustering solutions that have different structures - i.e. the two classifications have a different number of classes, or one classification has soft membership and one classification has hard membership. You can create a confusion matrix (error matrix) and then calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. The helper functions also help you to do things like make a soft classification into a hard one, or turn a set of class labels into a binary classification matrix.
The basic premise is that you already have two (or perhaps more) classifications that you would like to compare - these could be from a clustering algorithm, extracted from a remote sensing map, a set of classes assigned manually etc. There already exist a number of tools and packages to calculate cluster diagnostics or accuracy metrics, but they are usually focused on comparing clustering solutions that are hard (i.e. each observation has only one class) and have the same number of classes (e.g. clustering solution vs. the ‘truth’). c2c is designed to allow you to compare classifications that do not fit into this scenario. The motivating problem was the need to compare a probabilistic clustering of vegetation data to an existing hard classification of that data (which had a hierarchy with different numbers of classes), without losing the probabilistic component that the clustering algorithm produces.
In this vignette we will work through a simple, but hopefully useful,
example using the iris data set. We will use a fuzzy clustering
algorithm from the e1071 package.
Load the iris data set, and prep for clustering
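A minimal sketch of this step, assuming the preparation amounts to dropping the species column so that only the four numeric measurements go into the clustering. The name `iris_dat` matches the object used later in this vignette; the exact prep is an assumption.

```r
# iris ships with base R (datasets package), so nothing extra needs installing
data(iris)

# Assumption: "prep" means keeping only the numeric measurement columns
iris_dat <- iris[, -5]

# Quick checks on what will be clustered
dim(iris_dat)
sapply(iris_dat, class)
```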
Let’s start with a cluster analysis with 3 groups, since we know that’s where we’re headed, and extract the soft classification matrix
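One way this step might look, assuming `cmeans()` from e1071 is called as it is later in this vignette and that the soft classification matrix is the `membership` element of its result. The object names `fcm` and `fcm_probs` and the seed are illustrative choices, not taken from the original.

```r
# Fuzzy c-means needs the e1071 package; skip quietly if it is not installed
if (requireNamespace("e1071", quietly = TRUE)) {
  iris_dat <- iris[, -5]   # numeric columns only (assumed prep)

  set.seed(42)             # hypothetical seed; cluster numbering can differ between runs
  fcm <- e1071::cmeans(x = iris_dat, centers = 3)

  # One row per observation, one column per cluster; each row sums to 1
  fcm_probs <- fcm$membership
  print(dim(fcm_probs))
  print(range(rowSums(fcm_probs)))
}
```

Because the cluster numbers are arbitrary, a rerun may assign setosa to a different row than in the output shown in this vignette.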
Now we want to compare that soft matrix to a set of hard labels;
we’ll use the species names. get_conf_mat produces the
confusion matrix, and it takes two inputs, each of which can be a
matrix or a set of labels:
##       setosa versicolor virginica
## 1  1.2050738  39.687779 13.101972
## 2  0.5697467   7.670532 35.835365
## 3 48.2251796   2.641689  1.062664
The output confusion matrix shows us the number of shared sites between our clustering solution and the set of labels (species in this case), accounting for the probabilistic memberships. We can see here that our 3 clusters have very clear fidelity to the species. We can also see what the relationship is like if we degrade the clustering to hard labels (this is the case of a traditional error matrix/accuracy assessment):
##   setosa versicolor virginica
## 1      0         47        13
## 2      0          3        37
## 3     50          0         0
Nice, a little confusion between versicolor and virginica. Let’s try more clusters and see if we can tease it apart
fcm6 <- cmeans(x = iris_dat, centers = 10)
fcm6_probs <- fcm6$membership
get_conf_mat(fcm6_probs, iris$Species)
##         setosa versicolor  virginica
## 1   0.15266154  6.0867736 12.9719594
## 2   9.58057173  0.5162824  0.2705114
## 3   0.25354479 14.9000057  2.6927787
## 4   0.10803818  1.8013189 17.4345266
## 5  16.64910522  0.5646783  0.3037981
## 6   9.44835761  0.5148264  0.3076202
## 7   0.17866330 12.6373412  4.7953482
## 8   0.07529018  0.6758506  9.3986275
## 9  13.22515593  0.6026032  0.3000124
## 10  0.32861152 11.7003196  1.5248174
get_conf_mat(fcm6_probs, iris$Species, make.A.hard = TRUE)
##    setosa versicolor virginica
## 1       0          3        15
## 2      10          0         0
## 3       0         17         1
## 4       0          0        23
## 5      20          0         0
## 6       8          0         0
## 7       0         17         0
## 8       0          0        11
## 9      12          0         0
## 10      0         13         0
This cleans things up somewhat, but note that the uncertainty is hidden
when you compare hard clusterings. As an aside, when you set
make.A.hard = TRUE, the function get_hard is
being used; it might be useful elsewhere. Similarly, when you pass a
vector of labels to get_conf_mat, the function
labels_to_matrix makes the binary classification
matrix.
##      1 2 3
## [1,] 0 0 1
## [2,] 0 0 1
## [3,] 0 0 1
## [4,] 0 0 1
## [5,] 0 0 1
## [6,] 0 0 1
##   setosa versicolor virginica
## 1      1          0         0
## 2      1          0         0
## 3      1          0         0
## 4      1          0         0
## 5      1          0         0
## 6      1          0         0
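To show what these two conversions involve, here is a base R sketch. The helpers `soft_to_hard` and `labels_to_indicator` are hypothetical stand-ins written for illustration; they are not the c2c functions, and their tie handling (first maximum wins) is an assumption.

```r
# Hypothetical helper: turn a soft membership matrix into a 0/1 matrix
soft_to_hard <- function(A) {
  H <- matrix(0, nrow = nrow(A), ncol = ncol(A), dimnames = dimnames(A))
  # Put a 1 in the column with the highest membership for each row
  H[cbind(seq_len(nrow(A)), max.col(A, ties.method = "first"))] <- 1
  H
}

# Hypothetical helper: turn a vector of labels into a binary indicator matrix
labels_to_indicator <- function(labels) {
  f <- as.factor(labels)
  # outer() compares every label with every level: TRUE where they match
  M <- outer(as.character(f), levels(f), "==") * 1
  colnames(M) <- levels(f)
  M
}

# Toy soft matrix: 4 observations, 3 clusters
A <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.1, 0.8,
              0.3, 0.6, 0.1,
              0.2, 0.3, 0.5),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("1", "2", "3")))

soft_to_hard(A)
head(labels_to_indicator(iris$Species))
```

Both results have exactly one 1 per row, which is what makes a hard classification a special case of a soft one.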
You can also compare two soft matrices; for example, we could compare the 3- and 10-class classifications we just made:
##            1         2          3          4          5         6          7
## 1 10.4668215 0.7411042 14.3854264  3.3839788  0.7390052 0.8466115 10.6450044
## 2  8.0773782 0.3868322  2.6365973 15.5552240  0.3955298 0.4683992  6.2723499
## 3  0.6671948 9.2394292  0.8243055  0.4046808 16.3830466 8.9557936  0.6939984
##          8          9        10
## 1 1.879188  0.7703416 10.137343
## 2 7.893729  0.3967090  1.992894
## 3 0.376851 12.9607208  1.423512
or we could directly compare two vectors of labels, which is a different way of doing what we already did above.
##   setosa versicolor virginica
## 1      0         47        13
## 2      0          3        37
## 3     50          0         0
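The opening description notes that the calculations come down to matrix multiplication. A minimal base R sketch of that idea follows, assuming the confusion matrix is the cross-product of two membership matrices with one row per observation. `soft_conf_mat` is a hypothetical name, the toy labels are made up, and this is not claimed to be how get_conf_mat is implemented.

```r
# Hypothetical helper: rows of the result are classes of A, columns are classes of B
soft_conf_mat <- function(A, B) crossprod(A, B)   # same as t(A) %*% B

# Two hard label vectors for six observations
a <- c("x", "x", "y", "y", "y", "z")
b <- c("p", "p", "p", "q", "q", "q")

# Binary indicator matrices (one column per class)
A <- outer(a, sort(unique(a)), "==") * 1
B <- outer(b, sort(unique(b)), "==") * 1
dimnames(A) <- list(NULL, sort(unique(a)))
dimnames(B) <- list(NULL, sort(unique(b)))

# For hard labels the product is just the ordinary contingency table
cm_hard <- soft_conf_mat(A, B)
cm_hard
all(cm_hard == table(a, b))

# A soft version of A: observation 3 is split between classes x and y
A_soft <- A
A_soft[3, c("x", "y")] <- c(0.4, 0.6)
cm_soft <- soft_conf_mat(A_soft, B)
cm_soft
```

The soft matrix still sums to the number of observations; the split observation simply contributes fractions to two rows instead of a whole count to one.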
Examining the confusion matrix can be enlightening just by itself,
but it can be useful to have some more quantitative metrics,
particularly if you’re comparing lots of classifications. For example
you may be trying to optimise clustering parameters or maybe you’re
comparing lots of different clustering solutions.
calculate_clustering_metrics does this
## Percentage agreement WILL be calculated: it will only make sense if the confusion matrix diagonal corresponds to matching classes (i.e. rows and columns are in the same class order)
## $percentage_agreement
## [1] 0.06625513
##
## $overall_purity
## [1] 0.8249888
##
## $class_purity
## $class_purity$row_purity
##         1         2         3
## 0.7350293 0.8130424 0.9286658
##
## $class_purity$col_purity
##     setosa versicolor  virginica
##  0.9645036  0.7937556  0.7167073
##
##
## $overall_entropy
## [1] 0.4504491
##
## $class_entropy
## $class_entropy$row_entropy
##         1         2         3
## 0.9446236 0.7628754 0.4325617
##
## $class_entropy$col_entropy
##     setosa versicolor  virginica
##  0.2533893  0.9035512  0.9687942
Purity and entropy are as defined in Manning et al. (2008). Overall and per-class metrics are included, as both have uses in different situations. See Lyons et al. (2017) and Foster et al. (2017) for use on a model-based vegetation clustering example. Finally, note the message about percentage agreement: as it says, only use it if the two classifications have the same classes in the same order (numeric class labels that stay in order, for example). In the output above the cluster numbers do not line up with the species order, which is why the value is so low (0.066). For a decent classification with matching class order, it shouldn’t differ much from purity anyway.
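To make the metrics concrete, here is a sketch that recomputes some of them in base R from the 3-cluster soft confusion matrix printed earlier. It assumes purity is the largest cell of each row (or column) divided by that row's (or column's) total, and that class entropy is the base-2 Shannon entropy of each row's proportions. These assumptions are an illustration only, not the implementation of calculate_clustering_metrics.

```r
# Soft confusion matrix copied from the 3-cluster output earlier in the vignette
cm <- matrix(c( 1.2050738, 39.687779, 13.101972,
                0.5697467,  7.670532, 35.835365,
               48.2251796,  2.641689,  1.062664),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("1", "2", "3"),
                             c("setosa", "versicolor", "virginica")))

# Purity: share of each row (or column) held by its largest cell
row_purity     <- apply(cm, 1, max) / rowSums(cm)
col_purity     <- apply(cm, 2, max) / colSums(cm)
overall_purity <- sum(apply(cm, 1, max)) / sum(cm)

# Shannon entropy (base 2) of a vector of non-negative weights
entropy2 <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log2(p))
}
row_entropy <- apply(cm, 1, entropy2)

# Percentage agreement: only meaningful if the diagonal pairs matching classes
pct_agreement <- sum(diag(cm)) / sum(cm)

round(row_purity, 4)
round(col_purity, 4)
round(overall_purity, 4)
round(row_entropy, 4)
round(pct_agreement, 4)
```

Under these assumptions the purity, row entropy and percentage agreement values agree with the printed output to within rounding. The overall entropy is left out because its weighting and normalisation are not described in the text.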
Foster, Hill and Lyons (2017). Ecological grouping of survey sites when sampling artefacts are present. Journal of the Royal Statistical Society: Series C (Applied Statistics). DOI: http://dx.doi.org/10.1111/rssc.12211
Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.
Manning, Raghavan and Schütze (2008). Introduction to Information Retrieval (Vol. 1, No. 1). Cambridge: Cambridge University Press.