This function takes a square matrix dmx that contains item by item
distances, and a factor classes (with as many items as there are rows,
and thus columns, in dmx) that assigns a class to each item.
The function returns a measure q of how well the distances in dmx
'capture' the classification in classes,
where distances are taken to 'capture a classification' to the extent
that items are (immediately) surrounded by other items from the same class,
and not by items from some other class.
In addition to an overall cluster quality for all the data taken together,
the function also returns the cluster quality of individual points
and the cluster quality of individual classes
(as well as the mean cluster quality over classes).
All these measures are called q in the output of the function.
Usage
separationkNN(dmx, classes, k = NULL, weights = c("linear", "s-curve", "none"))
Arguments
- dmx
A square matrix containing item by item distances
- classes
A factor of the same length as the number of rows and columns in dmx; the class in position i in classes is the class assigned to the item of row i and column i in dmx
- k
The value of k that is to be used to identify the k nearest neighbours. If k is not specified, then k is taken to be the smaller of two values: the total number of items divided by ten, or the size of the smallest class minus one. This default is only a very crude attempt at guessing a sensible value for k. Most of the time you will probably want to override it. If you specify k explicitly, all values from one up to the total number of items minus one are allowed.
- weights
The weights argument determines how exactly the cluster quality of a point is derived from the class membership of its k nearest neighbours. This cluster quality is 'the weighted mean of class membership values of these neighbours (1=same class as target item; 0=different class)', with the weights being determined by the weights argument. The weights are k numbers, the first of which indicates the weight of the closest neighbour, the second of which indicates the weight of the second closest neighbour, etc. The sum of the weights always is one. When weights is "linear", which is the default, weights decrease linearly as one progresses through the set of neighbours (starting from the one that is closest to the target item). When weights is "s-curve", weights decrease as one progresses through the set of neighbours (starting from the one that is closest to the target item) according to the s-shape of y<-(40:-40)/10; plot(1:81, exp(y) / (1 + exp(y)), type="l"), but with the actual weights rescaled so that they add up to one. Finally, when weights is "none", all k neighbours have equal weight. The actual weights that are used in a call to separationkNN() can be found in the weights component of its output.
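The default for k and the three weighting schemes described above can be sketched in a few lines of R. This is only an illustration, not the code of separationkNN(): the helper names default_k() and knn_weights() are hypothetical, the rounding of 'number of items divided by ten' is assumed to be floor() with a lower bound of one, and the s-curve is assumed to be sampled at k evenly spaced points between 4 and -4.

```r
# Hypothetical helper: default k as the smaller of (number of items / 10)
# and (size of the smallest class - 1). floor() and the lower bound of 1
# are assumptions; the documentation does not state how rounding is done.
default_k <- function(classes) {
  n_items <- length(classes)
  smallest <- min(table(classes))
  max(1, min(floor(n_items / 10), smallest - 1))
}

# Hypothetical helper: k weights that sum to one, closest neighbour first.
knn_weights <- function(k, weights = c("linear", "s-curve", "none")) {
  weights <- match.arg(weights)
  raw <- switch(weights,
    "linear"  = k:1,                                # k, k-1, ..., 1
    "s-curve" = plogis(seq(4, -4, length.out = k)), # exp(y) / (1 + exp(y))
    "none"    = rep(1, k)                           # equal weights
  )
  raw / sum(raw)                                    # rescale to sum to one
}

classes <- factor(rep(c("a", "b"), c(50, 50)))
default_k(classes)          # 10 under the assumptions above
round(knn_weights(4), 2)    # 0.4 0.3 0.2 0.1
round(knn_weights(3), 2)    # 0.50 0.33 0.17
```

With 100 items in two classes of 50, this sketch gives k = 10, which matches the length of the weights component shown in the Examples.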
Value
An object of the class clustqualkNN, which is a list containing at least the following components:
- globqual
The global cluster quality q
- meanclassqual
The mean of all class-specific cluster quality values q
- classqual
A table with, for each class, its class-specific cluster quality q
- pointqual
A numeric vector with for each item its item-specific cluster quality q
- weights
A numeric vector with the weights that were used
- k
A number indicating which k was used
Details
The q measures are calculated as follows: first, for each item an item-specific cluster quality is calculated. Roughly speaking, this is the proportion of 'same class items' among its k nearest neighbours: the higher the measure, the better the cluster quality for that item. More precisely, what is calculated is not the plain proportion but the weighted mean of the values of the k nearest neighbours, where a 'same class neighbour' has value one, a 'different class neighbour' has value zero, and the weights of the neighbours can have different settings (see the weights argument). In the default settings, weights decrease linearly with the neighbours' rank of 'distance from the item', and all weights add up to one. For instance, if k is 1, then the weight is 1. If k is 2, then the weights, starting from the closest neighbour, are .67 and .33. If k is 3, then the weights are .5, .33, and .17. If k is 4, they are .4, .3, .2, and .1. Etc.
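The item-specific calculation described above can be illustrated with a small sketch. The helper point_quality() is hypothetical and is not the implementation used by separationkNN(); it assumes the default linear weights, excludes each item from its own neighbourhood, and breaks ties in distance by item order, which the documentation does not specify.

```r
# Hypothetical helper, not the package's implementation: item-specific
# quality as the weighted mean of 0/1 'same class' values of the k nearest
# neighbours, using the default linear weights.
point_quality <- function(dmx, classes, k) {
  dmx <- as.matrix(dmx)
  w <- (k:1) / sum(k:1)                    # linear weights, closest first
  sapply(seq_len(nrow(dmx)), function(i) {
    d <- dmx[i, ]
    d[i] <- Inf                            # an item is not its own neighbour
    nn <- order(d)[seq_len(k)]             # indices of the k nearest items
    sum(w * (classes[nn] == classes[i]))   # weighted mean (weights sum to 1)
  })
}

# Toy data: five items on a line, two classes
x <- c(0, 1, 2.5, 10, 11)
classes <- factor(c("a", "a", "b", "b", "b"))
dmx <- as.matrix(dist(x))
round(point_quality(dmx, classes, k = 2), 2)
# 0.67 0.67 0.00 1.00 1.00
```

In this toy example the third item belongs to class "b" but its two nearest neighbours are both of class "a", so its quality is zero, while the first two items each have one 'same class' neighbour in the closest position and therefore score .67.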
The overall cluster quality of the data is then calculated as the
mean cluster quality of all items. Additionally, the cluster quality
for every class in classes is calculated as the mean cluster
quality of the items belonging to that class. The mean class quality,
finally, is the mean of all class-specific class quality measures.
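The aggregation step can be sketched with base R alone. The values in pointqual below are made up for illustration and merely stand in for the item-specific qualities; the variable names mirror the components listed under Value, but the code is not taken from the package.

```r
# Made-up item-specific qualities standing in for the pointqual component
pointqual <- c(0.8, 0.6, 1.0, 0.2, 0.4)
classes <- factor(c("a", "a", "a", "b", "b"))

globqual <- mean(pointqual)                    # mean over all items
classqual <- tapply(pointqual, classes, mean)  # mean per class
meanclassqual <- mean(classqual)               # mean of the class means

globqual        # 0.6
classqual       # a = 0.8, b = 0.3
meanclassqual   # 0.55
```

When classes differ in size, as here, the global quality and the mean class quality differ, because the latter gives every class the same influence regardless of its size. When all classes have the same size, the two values coincide.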
Examples
# we create a 'point cloud', with points belonging to two classes
points <- rbind(matrix(rnorm(100, 2, 2), ncol=2),
                matrix(rnorm(100, 4, 2), ncol=2))
dst <- dist(points, diag=TRUE, upper=TRUE)
classes <- as.factor(rep(c("a","b"), c(50, 50)))
# we analyse the cluster quality, letting the procedure choose k
fitkNN <- separationkNN(dst, classes)
summary(fitkNN)
#>               Length Class  Mode
#> classfreqs      2    table  numeric
#> classqual       2    -none- numeric
#> pointclass    100    factor numeric
#> pointqual     100    -none- numeric
#> globqual        1    -none- numeric
#> meanclassqual   1    -none- numeric
#> k               1    -none- numeric
#> weights        10    -none- numeric
fitkNN$globqual # global cluster quality
#> [1] 0.5727273
fitkNN$meanclassqual # mean class quality
#> [1] 0.5727273
fitkNN$classqual # class-specific quality
#>         a         b
#> 0.5865455 0.5589091
# we analyse the cluster quality, setting k to 25
fitkNN <- separationkNN(dst, classes, k=25)
summary(fitkNN)
#>               Length Class  Mode
#> classfreqs      2    table  numeric
#> classqual       2    -none- numeric
#> pointclass    100    factor numeric
#> pointqual     100    -none- numeric
#> globqual        1    -none- numeric
#> meanclassqual   1    -none- numeric
#> k               1    -none- numeric
#> weights        25    -none- numeric