Skip to content

Qualitative Variables

Qualitative variables can be thought of as categories: so variables like eye color, gender, and race. When assessing qualitative variables it is useful to consider proportions:

\[\frac{n_i}{N}\]

Explanation of Terms

  • \(n_i\) number in category of interest
  • \(N\) total number of observations

So let's calculate some proportions in R!

library(tidyverse)
meta <- read.table("./data/gbm_cptac_2021/data_clinical_patient.txt",
                   header = T,
                   sep="\t")

country_sum <- meta %>%
  count(COUNTRY_OF_ORIGIN, sort = TRUE) %>% 
  mutate(proportion = n / sum(n)) %>%
  mutate(COUNTRY_OF_ORIGIN = replace_na(COUNTRY_OF_ORIGIN,"NA"))

country_sum
  COUNTRY_OF_ORIGIN  n proportion
1             China 30 0.30303030
2     United States 21 0.21212121
3            Russia 19 0.19191919
4            Poland 18 0.18181818
5              <NA>  6 0.06060606
6          Bulgaria  2 0.02020202
7           Croatia  1 0.01010101
8            Mexico  1 0.01010101
9       Phillipines  1 0.01010101

Note

You'll see that we do have an NA value here and that it's proportion in our variable is counted too! Since the NA value in R has special properties we ensure it is a character and not an NA value using the replace_na() function.

Qualitative variables can be visualized using a bar plot:

ggplot(country_sum, aes(x=proportion,y=reorder(COUNTRY_OF_ORIGIN,+proportion))) + 
  geom_bar(fill="lightpink",stat = "identity")+
  theme_bw()+
  labs(
    x="Proportion",
    y="Country of Origin",
    title="Country of Origin Barplot"
  )

Tip

Here we ensure that we reorder our countries with the reorder() function as ggplot2 will not order our data for us.

References