An R package for comparing two classifications or clustering solutions that have different structures - i.e. the two classifications have a different number of classes, or one classification has soft membership and one has hard membership. You can create a confusion matrix (error matrix) and then calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. Helper functions are also included for tasks like turning a soft classification into a hard one, or turning a set of class labels into a binary classification matrix.
The basic premise is that you already have two (or perhaps more) classifications that you would like to compare - these could be from a clustering algorithm, extracted from a remote sensing map, a set of classes assigned manually, etc. There already exist a number of tools and packages to calculate cluster diagnostics or accuracy metrics, but they are usually focused on comparing clustering solutions that are hard (i.e. each observation has only one class) and have the same number of classes (e.g. clustering solution vs. the ‘truth’). c2c is designed to allow you to compare classifications that do not fit into this scenario. The motivating problem was the need to compare a probabilistic clustering of vegetation data to an existing hard classification of that data (a hierarchy with different numbers of classes at each level), without losing the probabilistic component that the clustering algorithm produces.
In this vignette we will work through a simple, but hopefully useful, example using the iris data set. We will use a fuzzy clustering algorithm from the e1071 package.
First, load the iris data set and prep it for clustering:
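A minimal version of that chunk, assuming the c2c and e1071 packages are installed (the object name iris_dat is taken from the cmeans() call shown later):

library(c2c)
library(e1071)

# use the four numeric measurement columns of iris as the clustering data
iris_dat <- as.matrix(iris[, 1:4])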
Let’s start with a cluster analysis with 3 groups, since we know that’s where we’re headed, and extract the soft classification matrix:
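Something along these lines, mirroring the 10-cluster chunk shown later (the names fcm3 and fcm3_probs are assumptions):

# fuzzy c-means clustering with 3 centers
fcm3 <- cmeans(x = iris_dat, centers = 3)

# soft classification matrix: one row per observation, one membership column per cluster
fcm3_probs <- fcm3$membership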
Now we want to compare that soft matrix to a set of hard labels; we’ll use the species names. get_conf_mat produces the confusion matrix, and it takes two inputs - each can be a matrix or a set of labels:
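For example:

get_conf_mat(fcm3_probs, iris$Species)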
## setosa versicolor virginica
## 1 0.5697909 7.67279 35.83887
## 2 1.2051993 39.68606 13.09839
## 3 48.2250098 2.64115 1.06274
The output confusion matrix shows us the number of shared sites between our clustering solution and the set of labels (species in this case), accounting for the probabilistic memberships. We can see here that our 3 clusters have very clear fidelity to the species. We can also see what the relationship is like if we degrade the clustering to hard labels (this is the case of a traditional error matrix/accuracy assessment):
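Presumably via the make.A.hard argument discussed below:

get_conf_mat(fcm3_probs, iris$Species, make.A.hard = TRUE)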
## setosa versicolor virginica
## 1 0 3 37
## 2 0 47 13
## 3 50 0 0
Nice - just a little confusion between versicolor and virginica. Let’s try more clusters and see if we can tease it apart:
fcm10 <- cmeans(x = iris_dat, centers = 10)
fcm10_probs <- fcm10$membership
get_conf_mat(fcm10_probs, iris$Species)
## setosa versicolor virginica
## 1 0.09675354 0.5975077 7.7686907
## 2 0.16077013 2.7167726 12.1892524
## 3 0.24319463 12.3036332 3.4208791
## 4 0.45590127 10.5293573 1.1540685
## 5 19.52092426 0.5343066 0.2399371
## 6 0.34546635 14.7324950 2.0334194
## 7 0.20928333 6.1665284 10.2558311
## 8 11.95174056 0.4940715 0.2439597
## 9 0.13486822 1.4033071 12.4686145
## 10 16.88109772 0.5220208 0.2253475
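And again the hard version of the same comparison, presumably produced as before:

get_conf_mat(fcm10_probs, iris$Species, make.A.hard = TRUE)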
## setosa versicolor virginica
## 1 0 0 9
## 2 0 0 13
## 3 0 18 0
## 4 0 13 0
## 5 19 0 0
## 6 0 15 1
## 7 0 4 13
## 8 13 0 0
## 9 0 0 14
## 10 18 0 0
That cleans things up somewhat, but note that the uncertainty is hidden when you compare hard clusterings. As an aside, when you set make.A.hard = TRUE, the function get_hard is being used; it might be useful elsewhere. Similarly, when you pass a vector of labels to get_conf_mat, the function labels_to_matrix makes the binary classification matrix.
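For example (using head() to show just the first few rows is an assumption here):

head(get_hard(fcm3_probs))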
## 1 2 3
## [1,] 0 0 1
## [2,] 0 0 1
## [3,] 0 0 1
## [4,] 0 0 1
## [5,] 0 0 1
## [6,] 0 0 1
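And the binary classification matrix made from the species labels:

head(labels_to_matrix(iris$Species))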
## setosa versicolor virginica
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
You can also compare two soft matrices; for example, we could compare the 3- and 10-class classifications we just made:
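A call like this would produce the matrix below:

get_conf_mat(fcm3_probs, fcm10_probs)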
## 1 2 3 4 5 6 7
## 1 6.4629980 10.3617949 4.9431427 1.674114 0.3635332 2.1488317 6.003589
## 2 1.6359202 4.2497560 10.3090383 8.919396 0.7526916 14.0544379 9.956113
## 3 0.3640338 0.4552442 0.7155258 1.545817 19.1789432 0.9081111 0.671941
## 8 9 10
## 1 0.4399697 11.2765060 0.4069717
## 2 0.8733611 2.3771720 0.8617637
## 3 11.3764409 0.3531118 16.3597306
Or we could directly compare two vectors of labels, which is a different way of doing what we already did above:
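A sketch of one way to do that, collapsing the soft memberships to their most likely cluster first (fcm3_labels is a hypothetical name):

# hard cluster labels: the cluster with the highest membership for each observation
fcm3_labels <- apply(fcm3_probs, 1, which.max)
get_conf_mat(fcm3_labels, iris$Species)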
## setosa versicolor virginica
## 1 0 3 37
## 2 0 47 13
## 3 50 0 0
Examining the confusion matrix can be enlightening by itself, but it can also be useful to have some more quantitative metrics, particularly if you’re comparing lots of classifications - for example, if you’re trying to optimise clustering parameters or comparing many different clustering solutions. calculate_clustering_metrics does this:
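For example, on the soft 3-cluster comparison from earlier (the object name conf_mat is an assumption):

conf_mat <- get_conf_mat(fcm3_probs, iris$Species)
calculate_clustering_metrics(conf_mat)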
## Percentage agreement WILL be calculated: it will only make sense if the confusion matrix diagonal corresponds to matching classes (i.e. rows and columns are in the same class order)
## $percentage_agreement
## [1] 0.2754573
##
## $overall_purity
## [1] 0.8249996
##
## $class_purity
## $class_purity$row_purity
## 1 2 3
## 0.8130147 0.7350679 0.9286738
##
## $class_purity$col_purity
## setosa versicolor virginica
## 0.9645002 0.7937212 0.7167774
##
##
## $overall_entropy
## [1] 0.4504395
##
## $class_entropy
## $class_entropy$row_entropy
## 1 2 3
## 0.7629341 0.9445773 0.4325302
##
## $class_entropy$col_entropy
## setosa versicolor virginica
## 0.2534083 0.9036161 0.9686980
Purity and entropy are as defined in Manning et al. (2008). Overall and per-class metrics are included, as both are useful in different situations. See Lyons et al. (2017) and Foster et al. (2017) for their use in a model-based vegetation clustering example. Finally, note the message above about percentage agreement - as it says, only use it if the two classifications have their classes in the same order (e.g. numbered classes, which should stay in order). For a decent classification it shouldn’t differ much from purity anyway.
Foster, Hill and Lyons (2017). Ecological grouping of survey sites when sampling artefacts are present. Journal of the Royal Statistical Society: Series C (Applied Statistics). DOI: http://dx.doi.org/10.1111/rssc.12211
Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.
Manning, Raghavan and Schütze (2008). Introduction to Information Retrieval (Vol. 1, No. 1). Cambridge: Cambridge University Press.