What is c2c?

c2c is an R package for comparing two classifications or clustering solutions that have different structures, i.e. the two classifications have a different number of classes, or one classification has soft membership and the other has hard membership. You can create a confusion matrix (error matrix) and then calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but the package provides a handy tool for users unfamiliar with matrix multiplication. The helper functions also let you do things like turn a soft classification into a hard one, or turn a set of class labels into a binary classification matrix.
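To make the matrix multiplication concrete, here is a minimal base R sketch of the idea. It assumes the confusion matrix is the cross-product of two membership matrices; the toy data and the names `A`, `B` and `labels` are made up for illustration, and this is not c2c's own code.

```r
# Toy soft classification: 4 observations, 2 classes; each row sums to 1
A <- matrix(c(0.9, 0.1,
              0.6, 0.4,
              0.2, 0.8,
              0.5, 0.5),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("c1", "c2")))

# Hypothetical hard labels for the same 4 observations
labels <- factor(c("x", "x", "y", "y"))

# Binary indicator matrix: one column per label, a single 1 in each row
B <- outer(as.integer(labels), seq_len(nlevels(labels)), "==") * 1
colnames(B) <- levels(labels)

# Cross-product: cell [i, j] sums the class-i membership of every
# observation that carries label j
conf <- t(A) %*% B
print(conf)
```

Each column of `conf` sums to the number of observations with that label, so the soft memberships are redistributed across classes rather than lost.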

How to use c2c

The basic premise is that you already have two (or perhaps more) classifications that you would like to compare. These could be from a clustering algorithm, extracted from a remote sensing map, a set of classes assigned manually, etc. A number of tools and packages already exist to calculate cluster diagnostics or accuracy metrics, but they usually focus on comparing clustering solutions that are hard (i.e. each observation has only one class) and have the same number of classes (e.g. clustering solution vs. the ‘truth’). c2c is designed to let you compare classifications that do not fit this scenario. The motivating problem was the need to compare a probabilistic clustering of vegetation data to an existing hard classification of that data (which had a hierarchy with differing numbers of classes), without losing the probabilistic component that the clustering algorithm produces.

An example with the iris data

In this vignette we will work through a simple, but hopefully useful, example using the iris data set. We will use a fuzzy clustering algorithm from the e1071 package.

library(c2c)
library(e1071)

Load the iris data set, and prep for clustering

data(iris)
iris_dat <- iris[, -5]  # drop the Species column (column 5)

Let’s start with a cluster analysis with 3 groups, since we know that’s where we’re headed, and extract the soft classification matrix

fcm3 <- cmeans(x = iris_dat, centers = 3)
fcm3_probs <- fcm3$membership

Now we want to compare that soft matrix to a set of hard labels; we’ll use the species names. get_conf_mat produces the confusion matrix, and it takes two inputs, each of which can be a matrix or a set of labels

get_conf_mat(fcm3_probs, iris$Species)

##       setosa versicolor virginica
## 1  0.5697909    7.67279  35.83887
## 2  1.2051993   39.68606  13.09839
## 3 48.2250098    2.64115   1.06274

The output confusion matrix shows us the number of shared sites between our clustering solution and the set of labels (species in this case), accounting for the probabilistic memberships. We can see here that our 3 clusters have very clear fidelity to the species. We can also see what the relationship looks like if we degrade the clustering to hard labels (this is the case of a traditional error matrix/accuracy assessment)

get_conf_mat(fcm3_probs, iris$Species, make.A.hard = TRUE)

##   setosa versicolor virginica
## 1      0          3        37
## 2      0         47        13
## 3     50          0         0

Nice, a little confusion between versicolor and virginica. Let’s try more clusters and see if we can tease it apart

# note: despite the name fcm6, this object holds a 10-cluster solution
fcm6 <- cmeans(x = iris_dat, centers = 10)
fcm6_probs <- fcm6$membership
get_conf_mat(fcm6_probs, iris$Species)

##         setosa versicolor  virginica
## 1   0.09675354  0.5975077  7.7686907
## 2   0.16077013  2.7167726 12.1892524
## 3   0.24319463 12.3036332  3.4208791
## 4   0.45590127 10.5293573  1.1540685
## 5  19.52092426  0.5343066  0.2399371
## 6   0.34546635 14.7324950  2.0334194
## 7   0.20928333  6.1665284 10.2558311
## 8  11.95174056  0.4940715  0.2439597
## 9   0.13486822  1.4033071 12.4686145
## 10 16.88109772  0.5220208  0.2253475

get_conf_mat(fcm6_probs, iris$Species, make.A.hard = TRUE)

##    setosa versicolor virginica
## 1       0          0         9
## 2       0          0        13
## 3       0         18         0
## 4       0         13         0
## 5      19          0         0
## 6       0         15         1
## 7       0          4        13
## 8      13          0         0
## 9       0          0        14
## 10     18          0         0

This cleans things up somewhat, but note that the uncertainty is hidden when you compare hard clusterings. As an aside, when you set make.A.hard = TRUE, the function get_hard is being used; it might be useful elsewhere. Similarly, when you pass a vector of labels to get_conf_mat, the function labels_to_matrix makes the binary classification matrix.

head(get_hard(fcm3_probs))

##      1 2 3
## [1,] 0 0 1
## [2,] 0 0 1
## [3,] 0 0 1
## [4,] 0 0 1
## [5,] 0 0 1
## [6,] 0 0 1

head(labels_to_matrix(iris$Species))

##   setosa versicolor virginica
## 1      1          0         0
## 2      1          0         0
## 3      1          0         0
## 4      1          0         0
## 5      1          0         0
## 6      1          0         0
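For readers curious about what these two helpers do conceptually, the following base R sketch mimics both conversions on made-up data. The functions `harden` and `to_indicator` are hypothetical stand-ins written for this example, not the c2c implementations, and ties are assumed to go to the first maximum.

```r
# Hypothetical soft membership matrix: 3 observations, 3 classes
soft <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.3, 0.6,
                 0.4, 0.5, 0.1),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("1", "2", "3")))

# Harden: put a 1 in the column with the largest membership, 0 elsewhere
harden <- function(m) {
  winner <- max.col(m, ties.method = "first")
  hard <- matrix(0, nrow(m), ncol(m), dimnames = dimnames(m))
  hard[cbind(seq_len(nrow(m)), winner)] <- 1
  hard
}

# Labels to binary matrix: one column per distinct label
to_indicator <- function(labels) {
  f <- factor(labels)
  ind <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  colnames(ind) <- levels(f)
  ind
}

hard <- harden(soft)
ind <- to_indicator(c("b", "a", "b"))
print(hard)
print(ind)
```

Both results have exactly one 1 per row, which is what makes a hard classification a special case of a soft one.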

You can also compare two soft matrices. For example, we could compare the 3- and 10-class classifications we just made

get_conf_mat(fcm3_probs, fcm6_probs)

##           1          2          3        4          5          6        7
## 1 6.4629980 10.3617949  4.9431427 1.674114  0.3635332  2.1488317 6.003589
## 2 1.6359202  4.2497560 10.3090383 8.919396  0.7526916 14.0544379 9.956113
## 3 0.3640338  0.4552442  0.7155258 1.545817 19.1789432  0.9081111 0.671941
##            8          9         10
## 1  0.4399697 11.2765060  0.4069717
## 2  0.8733611  2.3771720  0.8617637
## 3 11.3764409  0.3531118 16.3597306

or we could directly compare two vectors of labels, which is a different way of doing what we already did above.

get_conf_mat(fcm3$cluster, iris$Species)

##   setosa versicolor virginica
## 1      0          3        37
## 2      0         47        13
## 3     50          0         0
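When both inputs are hard labels, the confusion matrix reduces to an ordinary contingency table. The sketch below checks this on two made-up label vectors (`cluster` and `species` are hypothetical), using only base R rather than c2c.

```r
# Two hypothetical hard labelings of the same six observations
cluster <- c(1, 1, 2, 2, 3, 3)
species <- c("a", "a", "a", "b", "b", "b")

# Cross-tabulation: cell [i, j] counts observations in cluster i with label j
tab <- table(cluster, species)
print(tab)

# The same counts via indicator matrices and a cross-product:
# indexing an identity matrix by class number gives one indicator row each
ind_cluster <- diag(3)[as.integer(factor(cluster)), ]
ind_species <- diag(2)[as.integer(factor(species)), ]
via_product <- crossprod(ind_cluster, ind_species)
print(via_product)
```

The two printouts hold the same counts, which is why hard-versus-hard comparison is just the familiar error matrix.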

Examining the confusion matrix can be enlightening by itself, but it can be useful to have some more quantitative metrics, particularly if you’re comparing lots of classifications. For example, you may be trying to optimise clustering parameters, or you may be comparing lots of different clustering solutions. calculate_clustering_metrics does this

conf_mat <- get_conf_mat(fcm3_probs, iris$Species)
calculate_clustering_metrics(conf_mat)

## Percentage agreement WILL be calculated: it will only make sense if the confusion matrix diagonal corresponds to matching classes (i.e. rows and columns are in the same class order)

## $percentage_agreement
## [1] 0.2754573
## 
## $overall_purity
## [1] 0.8249996
## 
## $class_purity
## $class_purity$row_purity
##         1         2         3 
## 0.8130147 0.7350679 0.9286738 
## 
## $class_purity$col_purity
##     setosa versicolor  virginica 
##  0.9645002  0.7937212  0.7167774 
## 
## 
## $overall_entropy
## [1] 0.4504395
## 
## $class_entropy
## $class_entropy$row_entropy
##         1         2         3 
## 0.7629341 0.9445773 0.4325302 
## 
## $class_entropy$col_entropy
##     setosa versicolor  virginica 
##  0.2534083  0.9036161  0.9686980
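The purity and per-row entropy figures can be reproduced by hand from the confusion matrix printed earlier. The sketch below is a base R approximation, assuming purity is the share of weight in each row's largest cell and entropy uses base-2 logarithms; the overall entropy is left out because its exact weighting is not spelled out here.

```r
# Confusion matrix copied from the soft 3-cluster vs. species output
conf <- matrix(c( 0.5697909,  7.67279, 35.83887,
                  1.2051993, 39.68606, 13.09839,
                 48.2250098,  2.64115,  1.06274),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("1", "2", "3"),
                               c("setosa", "versicolor", "virginica")))

# Purity: share of the weight that sits in each row's largest cell
overall_purity <- sum(apply(conf, 1, max)) / sum(conf)
row_purity <- apply(conf, 1, max) / rowSums(conf)

# Entropy (base 2) of each row's distribution across the columns
row_entropy <- apply(conf, 1, function(r) {
  p <- r[r > 0] / sum(r)  # drop zero cells so 0 * log2(0) counts as 0
  -sum(p * log2(p))
})

print(overall_purity)
print(row_purity)
print(row_entropy)
```

Under these assumptions the results agree with the row_purity, row_entropy and overall_purity values in the output to about three decimal places; swapping `1` for `2` in the `apply` calls gives the column versions.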

Purity and entropy are as defined in Manning et al. (2008). Overall and per-class metrics are included, as both have uses in different situations. See Lyons et al. (2017) and Foster et al. (2017) for use on a model-based vegetation clustering example. Finally, note the message there about percentage agreement: as it says, only use it if the two classifications have the same class order (for example, if the classes are numbers, which should stay in order). For a decent classification, it shouldn’t differ much from purity anyway.
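To see why class order matters, the sketch below recomputes percentage agreement from the same confusion matrix, first as printed and then with the rows reordered to match the species columns. It assumes percentage agreement is the diagonal weight divided by the total, and the reordering step only works when each column's largest cell falls in a different row.

```r
# Confusion matrix copied from the soft 3-cluster vs. species output
conf <- matrix(c( 0.5697909,  7.67279, 35.83887,
                  1.2051993, 39.68606, 13.09839,
                 48.2250098,  2.64115,  1.06274),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("1", "2", "3"),
                               c("setosa", "versicolor", "virginica")))

# Agreement with rows in their arbitrary cluster order
agreement <- sum(diag(conf)) / sum(conf)

# For each species column, find the cluster row that holds most of it,
# then reorder the rows so those cells land on the diagonal
best_row <- apply(conf, 2, which.max)
stopifnot(!anyDuplicated(best_row))  # must be a one-to-one matching
aligned <- conf[best_row, ]
agreement_aligned <- sum(diag(aligned)) / sum(aligned)

print(agreement)
print(agreement_aligned)
```

The unaligned value is low only because cluster numbers are arbitrary; once the rows are aligned, agreement comes out essentially equal to the overall purity reported above.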

References

Foster, Hill and Lyons (2017). Ecological grouping of survey sites when sampling artefacts are present. Journal of the Royal Statistical Society: Series C (Applied Statistics). DOI: http://dx.doi.org/10.1111/rssc.12211

Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.

Manning, Raghavan and Schütze (2008). Introduction to Information Retrieval (Vol. 1, No. 1). Cambridge: Cambridge University Press.
