Distance Metrics

Clustering

Clustering, as the name implies, groups data points to find patterns. It is a type of unsupervised learning, in which patterns are discovered without labeled data.

Distance Metrics

Before generating clusters, we need to calculate the distance between observations. There are several distance metrics to choose from:

Euclidean Distance

\[d_{euc}(x,y) = \sqrt{\sum_{i=1}^n{(x_i - y_i)^2}}\]

Explanation of Terms

  • \(x\) first vector, with elements \(x_i\)
  • \(y\) second vector, with elements \(y_i\)
  • \(n\) number of elements in each vector
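
The formula above can be checked by hand in R. The sketch below uses two made-up vectors (the values are for illustration only) and compares a direct translation of the formula with base R's `dist()`, which computes distances between the rows of a matrix.

```r
# Two toy vectors (made-up values for illustration)
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Direct translation of the Euclidean distance formula
d_manual <- sqrt(sum((x - y)^2))

# Same calculation with base R's dist(), which compares matrix rows
d_builtin <- as.numeric(dist(rbind(x, y), method = "euclidean"))

print(d_manual)   # 5
print(d_builtin)  # 5
```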

Manhattan Distance

\[d_{man}(x,y) = \sum_{i=1}^n{|(x_i - y_i)|}\]

Explanation of Terms

  • \(x\) first vector, with elements \(x_i\)
  • \(y\) second vector, with elements \(y_i\)
  • \(n\) number of elements in each vector
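
As a quick check of the Manhattan formula, the sketch below sums the absolute differences of two made-up vectors and compares the result with base R's `dist()` using `method = "manhattan"`.

```r
# Two toy vectors (made-up values for illustration)
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Direct translation of the Manhattan distance formula
d_manual <- sum(abs(x - y))

# Same calculation with base R's dist()
d_builtin <- as.numeric(dist(rbind(x, y), method = "manhattan"))

print(d_manual)   # 7
print(d_builtin)  # 7
```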

Eisen Cosine Correlation Distance

\[d_{eis}(x,y) = 1 - \frac{|\sum_{i=1}^n{x_iy_i}|}{\sqrt{\sum_{i=1}^n {x_i^2}\sum_{i=1}^n {y_i^2}}}\]

Explanation of Terms

  • \(x\) first vector, with elements \(x_i\)
  • \(y\) second vector, with elements \(y_i\)
  • \(n\) number of elements in each vector
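
The sketch below implements the Eisen cosine formula as a small helper. The function name `eisen_dist` is hypothetical (it is not from a package) and the input vectors are made up. Because of the absolute value in the numerator, vectors pointing in exactly opposite directions get the same distance as vectors pointing in the same direction.

```r
# Hypothetical helper: direct translation of the Eisen cosine distance formula
eisen_dist <- function(x, y) {
  1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))
}

x <- c(1, 2, 3)

d_same     <- eisen_dist(x, c(2, 4, 6))      # same direction
d_opposite <- eisen_dist(x, -c(2, 4, 6))     # opposite direction
d_orthog   <- eisen_dist(c(1, 0), c(0, 1))   # perpendicular vectors

print(c(d_same, d_opposite, d_orthog))  # 0 0 1
```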

Pearson Correlation Distance

\[d_{pearson}(x,y) = 1 - \frac{\sum_{i=1}^n{(x_i - \mu_x)(y_i - \mu_y)}}{\sqrt{\sum_{i=1}^n{(x_i - \mu_x)^2} \sum_{i=1}^n{(y_i - \mu_y)^2}}} \]

Explanation of Terms

  • \(x\) first vector, with elements \(x_i\)
  • \(y\) second vector, with elements \(y_i\)
  • \(\mu_x\) mean of the elements of \(x\)
  • \(\mu_y\) mean of the elements of \(y\)
  • \(n\) number of elements in each vector
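
The Pearson correlation distance is one minus the Pearson correlation coefficient, so it can be checked against base R's `cor()`. The sketch below uses made-up vectors and a hypothetical helper named `pearson_dist` that follows the formula term by term.

```r
# Hypothetical helper: direct translation of the Pearson distance formula
pearson_dist <- function(x, y) {
  dx <- x - mean(x)  # deviations from the mean of x
  dy <- y - mean(y)  # deviations from the mean of y
  1 - sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Two toy vectors (made-up values for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

d_manual  <- pearson_dist(x, y)
d_builtin <- 1 - cor(x, y, method = "pearson")

print(c(d_manual, d_builtin))
```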

Spearman Correlation Distance

\[d_{spearman}(x,y) = 1 - \frac{\sum_{i=1}^n{(x'_i - \mu_{x'})(y'_i - \mu_{y'})}}{\sqrt{\sum_{i=1}^n{(x'_i - \mu_{x'})^2} \sum_{i=1}^n{(y'_i - \mu_{y'})^2}}}\]

Explanation of Terms

  • \(x'\) ranks of the elements of \(x\)
  • \(y'\) ranks of the elements of \(y\)
  • \(\mu_{x'}\) mean of the ranks \(x'\)
  • \(\mu_{y'}\) mean of the ranks \(y'\)
  • \(n\) number of elements in each vector
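
The Spearman distance is the Pearson distance applied to ranks rather than raw values. The sketch below illustrates this with made-up vectors in which `y` grows monotonically but not linearly with `x`: the Spearman distance is 0, while the Pearson distance is not.

```r
# Toy vectors (made-up values): y increases with x, but not linearly
x <- c(10, 20, 30, 40, 50)
y <- c(1, 8, 27, 64, 125)

# Spearman distance: Pearson correlation computed on the ranks
d_manual <- 1 - cor(rank(x), rank(y), method = "pearson")

# Same calculation using cor()'s built-in Spearman method
d_builtin <- 1 - cor(x, y, method = "spearman")

# Pearson distance on the raw values, for comparison
d_pearson <- 1 - cor(x, y, method = "pearson")

print(c(spearman = d_manual, pearson = d_pearson))
```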

Distance Metrics In R

Let's try creating a distance matrix!

# load the libraries
.libPaths(c("/cluster/tufts/hpc/tools/R/4.0.0"))
library(tidyverse)
library(factoextra)

# load our counts data
counts <- read.csv(
  file = "data/gbm_cptac_2021/data_mrna_seq_fpkm.txt",
  header = TRUE,
  sep = "\t")

# make the genes our rownames
rownames(counts) <- make.names(counts$Hugo_Symbol,unique = TRUE)

# remove the gene symbol column
counts <- counts %>%
  select(-c(Hugo_Symbol)) 

# log2 transform our data 
# transpose our data so that our patients are rows
counts <- t(log2(counts + 1))

# Change non-finite values (NA, NaN, Inf) to 0
counts[!is.finite(counts)] <- 0

# generate correlation distance matrix
# (named dist_mat so it does not mask base R's dist() function)
dist_mat <- get_dist(counts, method = "pearson")

# plot correlation distance matrix
fviz_dist(dist_mat) +
  theme(axis.text = element_text(size = 3)) +
  labs(
    title = "Pearson Correlation Distances Between Samples",
    fill = "Pearson Correlation Distance"
  )
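
To see what a correlation distance matrix contains without loading the full dataset, the sketch below builds one in base R from a small random matrix. The matrix `toy` and its sample and gene names are made up. It assumes that the correlation-based distance between samples is one minus the correlation between rows, which is how `get_dist()` is understood to handle `method = "pearson"`; treat that equivalence as an assumption rather than a guarantee.

```r
set.seed(1)

# Toy matrix (made-up values): 4 samples as rows, 6 genes as columns
toy <- matrix(
  rnorm(24),
  nrow = 4,
  dimnames = list(paste0("sample", 1:4), paste0("gene", 1:6))
)

# cor() correlates columns, so transpose to correlate samples (rows),
# then convert 1 - correlation into a "dist" object
toy_dist <- as.dist(1 - cor(t(toy), method = "pearson"))

# View as a full symmetric matrix; values range from 0 to 2
print(round(as.matrix(toy_dist), 2))
```

A distance near 0 means two samples are strongly positively correlated, 1 means uncorrelated, and values near 2 mean strongly negatively correlated.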
