K-means algorithm

A popular clustering algorithm is k-means. It is an unsupervised method that divides data points into a predetermined number of groups, k. The algorithm starts by placing k centroids at random.

For instance, if you have samples of venom glands from a differential gene expression experiment, the k centroids are hypothetical samples; if you work with cancer samples, the initial centroids are hypothetical patients, and so on. Each sample is then assigned to its closest centroid and the centroids are recomputed, over multiple iterations.

It is important to highlight that the initial centroids are chosen randomly, so results may differ between runs. This is the algorithm's biggest disadvantage.
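
To see this in action, here is a minimal sketch on toy data (the toy matrix is made up for illustration; it is not part of the leukemia example below):

set.seed(1)
toy <- matrix(rnorm(100), ncol = 2)   # 50 made-up points in 2 dimensions
res1 <- kmeans(toy, centers = 3)      # two runs with different random
res2 <- kmeans(toy, centers = 3)      # starts can partition differently
table(res1$cluster, res2$cluster)     # compare the two labelings
# nstart = 25 runs 25 random initializations and keeps the best solution
res3 <- kmeans(toy, centers = 3, nstart = 25)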

In this example, we have samples from multiple types of leukemia and we are going to cluster them and check whether the clusters match the original labels (if you want more information about the dataset and the methods, I recommend reading Altuna Akalin's book Computational Genomics with R).

First, we have to import the data. You can download it here:

Download example data

Or use the original source by installing the compGenomRData package.

library(cluster)
library(ggplot2)
library(ggpubr)

# read the expression matrix; the first column (X) holds the gene identifiers
mat <- read.csv("~/Documentos/R/R_personal_website_2.0/static/uploads/leukemiaExpressionSubset.csv")
rownames(mat) <- mat$X
mat <- mat[,2:ncol(mat)] |> as.matrix()  # drop the identifier column, keep a numeric matrix
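
The matrix has genes in rows and samples in columns, which is why the clustering calls below transpose it with t(): we want to cluster samples, not genes. A quick check (output not shown):

dim(mat)             # genes in rows, samples in columns
head(colnames(mat))  # sample names start with the leukemia type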

Then we perform k-means clustering with the function kmeans() (it comes with base R's stats package; the cluster package loaded above will be used later for k-medoids). As said before, the initial centroids are selected randomly, which is why we call set.seed() so that the same pseudo-random numbers are used every run.

The number of clusters is arbitrary; in this case we use 5 and store the result in the object kclu. Inside this object, kclu$cluster contains, for each sample, the cluster it belongs to. Finally, we use table() to summarize the results.

set.seed(101)
kclu=kmeans(t(mat),centers=5) # samples are columns of mat, so we cluster t(mat)

type2kclu = data.frame(LeukemiaType =substr(colnames(mat),1,3),
                       cluster=kclu$cluster) 
table(type2kclu)
LeukemiaType  cluster  Freq
ALL           1          12
AML           1           0
CLL           1           0
CML           1           0
NoL           1           0
ALL           2           0
AML           2           1
CLL           2           0
CML           2           1
NoL           2          12
ALL           3           0
AML           3           0
CLL           3           0
CML           3          11
NoL           3           0
ALL           4           0
AML           4           0
CLL           4          12
CML           4           0
NoL           4           0
ALL           5           0
AML           5          11
CLL           5           0
CML           5           0
NoL           5           0

K-medoids

Another approach to clustering samples with unsupervised machine learning is k-medoids. It is similar to k-means, but it uses real data points (medoids) as cluster centers instead of hypothetical centroids, and it can work with different distance metrics; here we use the Manhattan distance.

kmclu=cluster::pam(t(mat),k=5,metric = "manhattan") # cluster using k-medoids
# make a data frame with Leukemia type and cluster id
type2kmclu = data.frame(LeukemiaType=substr(colnames(mat),1,3),
                        cluster=kmclu$cluster)
table(type2kmclu)
LeukemiaType  cluster  Freq
ALL           1          12
AML           1           0
CLL           1           0
CML           1           0
NoL           1           0
ALL           2           0
AML           2          10
CLL           2           0
CML           2           0
NoL           2           0
ALL           3           0
AML           3           2
CLL           3           0
CML           3           0
NoL           3          12
ALL           4           0
AML           4           0
CLL           4          12
CML           4           0
NoL           4           0
ALL           5           0
AML           5           0
CLL           5           0
CML           5          12
NoL           5           0
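
A nice consequence of using real data points as cluster centers is that you can inspect which samples were chosen as medoids; the pam object stores their indices in id.med:

colnames(mat)[kmclu$id.med]  # the five samples acting as cluster centers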

Visualizing results with ggplot

You can use different tools from R or Python to visualize the results. In this case, I wrote a function to draw the two plots together with ggpubr's ggarrange(). But remember, the cluster labels are assigned randomly, so the two algorithms can be consistent even when the labels don't match; in those cases it's better to compare the tables above.

# calculate distances between samples
dists=dist(t(mat))
# classical multidimensional scaling (MDS) to get two coordinates per sample
mds=cmdscale(dists)
ggplot(data = as.data.frame(mds),
       aes(x = V1,y=V2, color = as.factor(kclu$cluster))) +
        geom_point() +
        theme_classic()+
        xlab("mds 1")+
        ylab("mds 2")+
        scale_color_discrete(name = "Cluster")+
        ggtitle("Clusters with k-means")
km_plot <- function(m,k) {
        p <- ggplot(data = as.data.frame(m),
               aes(x = V1,y=V2, color = as.factor(k$cluster))) +
                geom_point() +
                theme_classic()+
                xlab("mds 1")+
                ylab("mds 2")+
                scale_color_discrete(name = "Cluster")
        # kmeans objects have class "kmeans"; pam objects have class c("pam", "partition")
        if (inherits(k, "kmeans")) {
                p <- p + ggtitle("Clustering with k-means")
        } else {
                p <- p + ggtitle("Clustering with k-medoids")
        }
return(p)
}
ggarrange(km_plot(mds,kclu),km_plot(mds,kmclu))
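
Since the labels are arbitrary, a quick way to check that the two algorithms agree is to cross-tabulate their assignments; each k-means cluster should map almost entirely onto a single k-medoids cluster (clustering is the pam component holding the assignments):

table(kmeans = kclu$cluster, kmedoids = kmclu$clustering)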

Did I choose the correct number of clusters?

As I said before, the number of clusters, k, is chosen completely arbitrarily by the data scientist or researcher. Is there some criterion to choose the best k?

Yes, there are different approaches: one quick heuristic is sketched below, and the silhouette method is described next.
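
For instance, the "elbow" method plots the total within-cluster sum of squares against k and looks for the bend where adding more clusters stops helping. A minimal sketch on our data:

set.seed(101)
# total within-cluster sum of squares for a range of k values
wss <- sapply(2:7, function(i) kmeans(t(mat), centers = i)$tot.withinss)
plot(2:7, wss, xlab = "k", ylab = "total within-cluster SS", type = "b", pch = 19)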

Silhouette

The silhouette metric can take values from -1 to +1: values near +1 mean the point is well matched to its own cluster, while negative values suggest the point may have been assigned to the wrong cluster.

This metric compares how similar a point is to its own cluster versus the other clusters, which implies that a silhouette value is calculated for each point. For a point i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its cluster and b(i) is the mean distance from i to the points of the nearest other cluster.

Here we test whether 5 is a "good" value for k.

pamclu=cluster::pam(t(mat),k=5)
plot(cluster::silhouette(pamclu),main=NULL)
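
To make the definition above concrete, here is a hand computation of the silhouette of a single sample (just a sketch; cluster::silhouette() does this for every point automatically):

d  <- as.matrix(dist(t(mat)))    # pairwise distances between samples
cl <- pamclu$clustering          # cluster assignment of each sample
i  <- 1                          # look at the first sample
a  <- mean(d[i, cl == cl[i] & seq_along(cl) != i])   # mean distance within its own cluster
b  <- min(sapply(setdiff(unique(cl), cl[i]),
                 function(j) mean(d[i, cl == j])))   # mean distance to the nearest other cluster
(b - a) / max(a, b)              # silhouette of sample i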

Here we compare the average silhouette width for several values of k; the best result is obtained with k = 4. But remember, these clusters have no meaning by themselves; you have to know your data and understand the problem you are facing.

# average silhouette width for k = 2..7
Ks=sapply(2:7,function(i)summary(silhouette(pam(t(mat),k=i)))$avg.width)
plot(2:7,Ks,xlab="k",ylab="av. silhouette",type="b",pch=19)