Title: | Compare Two Classifications or Clustering Solutions of Varying Structure |
---|---|
Description: | Compare two classifications or clustering solutions that may or may not have the same number of classes, and that may have hard or soft (fuzzy, probabilistic) membership. Calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. This package is not geared towards traditional accuracy assessment for classification/mapping applications - the motivating use case is comparing a probabilistic clustering solution to a set of reference or existing class labels that could have any number of classes (that is, without having to degrade the probabilistic clustering to hard classes). |
Authors: | Mitchell Lyons [aut, cre] |
Maintainer: | Mitchell Lyons <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2025-03-07 05:04:36 UTC |
Source: | https://github.com/mitchest/c2c |
Calculate a range of clustering metrics on a confusion matrix, usually from get_conf_mat.
calculate_clustering_metrics(conf_mat)
conf_mat |
a confusion matrix, as produced by get_conf_mat |
Entropy is calculated via overall_entropy and class_entropy, purity via overall_purity and class_purity, and percentage agreement via percentage_agreement (only for confusion matrices of equal dimensions and matching class order).
A list containing the metrics that can be calculated, see details.
Mitchell Lyons
Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.
get_conf_mat
, labels_to_matrix
, get_hard
# meaningless data, but you get the idea
# compare two soft classifications
my_soft_mat1 <- matrix(runif(50, 0, 1), nrow = 10, ncol = 5)
my_soft_mat2 <- matrix(runif(30, 0, 1), nrow = 10, ncol = 3)

# make the confusion matrix and calculate stats
conf_mat <- get_conf_mat(my_soft_mat1, my_soft_mat2)
conf_mat; calculate_clustering_metrics(conf_mat)

# compare a soft classification to a vector of hard labels
my_labels <- rep(c("a", "b", "c"), length.out = 10)
# utilising labels_to_matrix(my_labels)
conf_mat <- get_conf_mat(my_soft_mat1, my_labels)
conf_mat; calculate_clustering_metrics(conf_mat)

# make one of the soft matrices hard
# utilising get_hard(my_soft_mat2)
conf_mat <- get_conf_mat(my_soft_mat1, my_soft_mat2, make.B.hard = TRUE)
conf_mat; calculate_clustering_metrics(conf_mat)

# two classifications with the same number of classes enables percentage agreement
conf_mat <- get_conf_mat(my_soft_mat1, my_soft_mat1)
conf_mat; calculate_clustering_metrics(conf_mat)
Used to calculate cluster entropy from a confusion matrix, for each class (i.e. each row and column of the confusion matrix).
class_entropy(conf_mat)
conf_mat |
A confusion matrix from get_conf_mat |
Metrics per class are useful when you are comparing two classifications with different numbers of classes, when an overall measure might not be useful or sensible. Entropy as defined in Manning (2008).
A data frame with two columns, the first corresponding to the confusion matrix rows, the second corresponding to the confusion matrix columns.
Manning, C. D., Raghavan, P., & Schütze, H. (2008) Introduction to information retrieval (Vol. 1, No. 1). Cambridge: Cambridge university press.
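The per-class calculation can be sketched directly in base R. This is an illustrative reimplementation following the Manning et al. (2008) definition, not the package's own code; `entropy_of` is a hypothetical helper name:

```r
# Hypothetical sketch (not the package implementation): per-class entropy.
# For each row (and column) of the confusion matrix, treat the counts as a
# distribution and compute its Shannon entropy.
entropy_of <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # drop zero cells so they contribute nothing
  -sum(p * log(p))
}

conf_mat <- matrix(c(8, 2, 1, 9), nrow = 2)
apply(conf_mat, 1, entropy_of)  # entropy per row class
apply(conf_mat, 2, entropy_of)  # entropy per column class
```

A class whose sites all fall in one opposing class has entropy 0; an even split across k classes gives the maximum, log(k).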
Used to calculate cluster purity from a confusion matrix, for each class (i.e. each row and column of the confusion matrix).
class_purity(conf_mat)
conf_mat |
A confusion matrix from get_conf_mat |
Metrics per class are useful when you are comparing two classifications with different numbers of classes, when an overall measure might not be useful or sensible. Purity as defined in Manning (2008).
A data frame with two columns, the first corresponding to the confusion matrix rows, the second corresponding to the confusion matrix columns.
Manning, C. D., Raghavan, P., & Schütze, H. (2008) Introduction to information retrieval (Vol. 1, No. 1). Cambridge: Cambridge university press.
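The per-class purity calculation can likewise be sketched in base R. This is an illustrative reimplementation of the Manning et al. (2008) definition, not the package's own code; `purity_of` is a hypothetical helper name:

```r
# Hypothetical sketch (not the package implementation): per-class purity is
# the largest cell in a row (or column) divided by that row's (column's) total.
purity_of <- function(counts) max(counts) / sum(counts)

conf_mat <- matrix(c(8, 2, 1, 9), nrow = 2)
apply(conf_mat, 1, purity_of)  # purity per row class
apply(conf_mat, 2, purity_of)  # purity per column class
```

Purity ranges from 1/k (sites spread evenly over k opposing classes) to 1 (all sites in a single opposing class).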
get_conf_mat
takes two classifications or clustering solutions and creates a confusion matrix representing the number of shared sites between them.
get_conf_mat(A, B, make.A.hard = F, make.B.hard = F)
A |
A matrix or data.frame (or something that can be coerced to a matrix) of class membership or a vector of class labels (character or factor). |
B |
A matrix or data.frame (or something that can be coerced to a matrix) of class membership or a vector of class labels (character or factor). |
make.A.hard |
logical (defaults to FALSE). If TRUE, and if A is a matrix of soft membership, it will be degraded to a hard binary matrix, taking the highest value and breaking ties at random. |
make.B.hard |
logical (defaults to FALSE). If TRUE, and if B is a matrix of soft membership, it will be degraded to a hard binary matrix, taking the highest value and breaking ties at random. |
Takes inputs A and B (converting labels to matrices if required) and combines them via matrix multiplication (t(A) %*% B). Soft classifications will necessarily be matrices. Hard classifications can be given as a binary matrix of membership or a vector of labels. For matrix inputs, rows should represent individual sites, observations, cases etc., and columns should represent classes. For class label inputs, the vector should be ordered similarly by site, observation, case etc.; it will be converted to a binary matrix (see
labels_to_matrix
). Classes from matrix A are represented by the rows of the output, and classes from matrix B by the columns. Class names are inherited from names() or colnames(); if at least one of the inputs has names, interpretation will be much easier. Ties in membership probability are broken at random - if you do not want this to happen, break the tie manually before proceeding.
A confusion matrix
Mitchell Lyons
Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.
calculate_clustering_metrics
, labels_to_matrix
, get_hard
# meaningless data, but you get the idea
# compare two soft classifications
my_soft_mat1 <- matrix(runif(50, 0, 1), nrow = 10, ncol = 5)
my_soft_mat2 <- matrix(runif(30, 0, 1), nrow = 10, ncol = 3)

# make the confusion matrix and calculate stats
conf_mat <- get_conf_mat(my_soft_mat1, my_soft_mat2)
conf_mat; calculate_clustering_metrics(conf_mat)

# compare a soft classification to a vector of hard labels
my_labels <- rep(c("a", "b", "c"), length.out = 10)
# utilising labels_to_matrix(my_labels)
conf_mat <- get_conf_mat(my_soft_mat1, my_labels)
conf_mat; calculate_clustering_metrics(conf_mat)

# make one of the soft matrices hard
# utilising get_hard(my_soft_mat2)
conf_mat <- get_conf_mat(my_soft_mat1, my_soft_mat2, make.B.hard = TRUE)
conf_mat; calculate_clustering_metrics(conf_mat)

# two classifications with the same number of classes enables percentage agreement
conf_mat <- get_conf_mat(my_soft_mat1, my_soft_mat1)
conf_mat; calculate_clustering_metrics(conf_mat)
Used in get_conf_mat
but might be useful separately.
get_hard(x)
x |
A matrix or data frame (or something coercible to a matrix) containing memberships; rows are sites (observations, cases etc.) and columns are classes |
Binary matrix of class membership. Class names are inherited from names() or colnames().
my_mat <- matrix(runif(20, 0, 1), nrow = 4)
get_hard(my_mat)
Used in get_conf_mat
but might be useful separately.
labels_to_matrix(x)
x |
Character or factor vector of class labels |
Binary matrix of class membership.
my_labels <- rep(c("a", "b", "c", "d"), 5)
labels_to_matrix(my_labels)
Used to calculate overall cluster entropy from a confusion matrix.
overall_entropy(conf_mat)
conf_mat |
A confusion matrix from get_conf_mat |
A scalar: cluster entropy as defined in Manning (2008)
Manning, C. D., Raghavan, P., & Schütze, H. (2008) Introduction to information retrieval (Vol. 1, No. 1). Cambridge: Cambridge university press.
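The overall measure can be sketched as a size-weighted average of per-cluster entropies, per the Manning et al. (2008) definition. This is an illustrative sketch (not the package's code) that assumes clusters are in the rows; `entropy_of` is a hypothetical helper:

```r
# Hypothetical sketch: overall entropy as the cluster-size-weighted average
# of per-cluster (row) entropies.
entropy_of <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # drop zero cells
  -sum(p * log(p))
}

conf_mat <- matrix(c(8, 2, 1, 9), nrow = 2)
weights <- rowSums(conf_mat) / sum(conf_mat)  # cluster sizes as proportions
sum(weights * apply(conf_mat, 1, entropy_of))
```

Lower values indicate that clusters are more concentrated in single opposing classes; 0 means every cluster maps to exactly one class.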
Used to calculate overall cluster purity from a confusion matrix.
overall_purity(conf_mat)
conf_mat |
A confusion matrix from get_conf_mat |
A scalar: cluster purity as defined in Manning (2008)
Manning, C. D., Raghavan, P., & Schütze, H. (2008) Introduction to information retrieval (Vol. 1, No. 1). Cambridge: Cambridge university press.
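The overall purity calculation can be sketched in one line, per the Manning et al. (2008) definition: the sum of each cluster's largest cell over the grand total. An illustrative sketch (not the package's code), assuming clusters are in the rows:

```r
# Hypothetical sketch: overall purity = sum of row maxima / total count.
conf_mat <- matrix(c(8, 2, 1, 9), nrow = 2)
sum(apply(conf_mat, 1, max)) / sum(conf_mat)  # (8 + 9) / 20 = 0.85
```

A value of 1 means every cluster falls entirely within a single opposing class.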
Used to calculate overall percentage agreement for a confusion matrix. The confusion matrix must have equal dimensions, and the diagonal must represent 'matching' class pairs (percentage agreement does not make sense otherwise).
percentage_agreement(conf_mat)
conf_mat |
A confusion matrix from get_conf_mat |
A scalar: percentage agreement (sometimes referred to as overall accuracy)
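The calculation can be sketched as the diagonal sum over the grand total. An illustrative sketch (not the package's code), valid only under the square-matrix, matching-diagonal conditions above:

```r
# Hypothetical sketch: percentage agreement for a square confusion matrix
# whose diagonal pairs matching classes.
conf_mat <- matrix(c(8, 1, 2, 9), nrow = 2)
sum(diag(conf_mat)) / sum(conf_mat)  # (8 + 9) / 20 = 0.85
```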