Data Manipulation
Loading Libraries
To manipulate our data and ultimately plot it, we often use libraries of functions to reduce the amount of coding we need to do. For the purposes of this tutorial we will load the following:
library(tidyverse) # data manipulation/plotting
library(janitor) # data tabulation
library(reshape2) # reshape to plug into ggplot
library(ggfortify) # pca plot
library(ggplotify) # convert to ggplot
library(png) # use png images
library(grid) # modify png images
library(pheatmap) # create heatmaps
library(patchwork) # combine multiple plots
library(DESeq2) # run differential expression
Loading Data
To load our data we will be using functions from the readr
packages instead of the base functions provided through base R. The readr
collection of functions are advantageous as they are typically more user friendly and are faster at importing data. Let's load our sample and meta data:
# load sample data
meta <- read_delim(file="../data/meta_data.tsv",
delim="\t")
# load mRNA data
counts <- read_delim(file="../data/expression_data.tsv",
delim="\t")
Inspecting Data
We can examine the structure of our data with the str
function:
str(meta)
output
spc_tbl_ [113 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ patient_id : chr [1:113] "GSM3561843" "GSM3561844" "GSM3561845" "GSM3561846" ...
$ braak.score:ch1: chr [1:113] "III" "IV" "II" "V" ...
$ cell type:ch1 : chr [1:113] "myeloid" "endothelial" "neuron" "neuron" ...
$ diagnosis:ch1 : chr [1:113] "Control" "Control" "Control" "AD" ...
$ expired_age:ch1: chr [1:113] ">90" "88" "79" "84" ...
$ Sex:ch1 : chr [1:113] "Female" "Female" "Male" "Female" ...
- attr(*, "spec")=
.. cols(
.. patient_id = col_character(),
.. `braak.score:ch1` = col_character(),
.. `cell type:ch1` = col_character(),
.. `diagnosis:ch1` = col_character(),
.. `expired_age:ch1` = col_character(),
.. `Sex:ch1` = col_character()
.. )
- attr(*, "problems")=<externalptr>
Here we see the overall structure of our meta data. We have a data frame with 113 rows, 6 columns, the first few observations per column, and the data type of each column. Now if this is a bit much to look at we can also just look at the top 6 rows of our data:
head(counts[,1:5])
output
# A tibble: 6 × 5
SYMBOL GSM3561843 GSM3561844 GSM3561845 GSM3561846
<chr> <dbl> <dbl> <dbl> <dbl>
1 AKT3 1391 1918 472 773
2 NR2E3 2 50 9 20
3 NAALADL1 10 18 2 1
4 SIGLEC14 512 0 1 1
5 MIR708 4 0 0 0
6 NAV2-AS6 24 44 47 35
Convert A Column To Rownames
Sometimes when we import our data, we may not get it in the form we need it in. In that case we will need to do some data munging (AKA data cleaning). A common operation is take a column of unique identifiers and turn that column into the rownames for the data frame:
Column to Rownames
We will need to make the gene names our row names we can do this with the column_to_rownames
funtion:
# convert the SYMBOL column to rownames
counts <- counts %>%
column_to_rownames("SYMBOL")
# now let's inspect the new counts data frame
head(counts[,1:5])
output
GSM3561843 GSM3561844 GSM3561845 GSM3561846 GSM3561847
AKT3 1391 1918 472 773 140
NR2E3 2 50 9 20 9
NAALADL1 10 18 2 1 0
SIGLEC14 512 0 1 1 31
MIR708 4 0 0 0 0
NAV2-AS6 24 44 47 35 0
Let's do the same for patient id in our meta data. This way we can better map our counts data frame to our meta data data frame:
# make the patient_id column the rownames
meta <- meta %>%
column_to_rownames("patient_id")
# now let's inspect the new counts data frame
head(meta[,1:5])
output
braak.score:ch1 cell type:ch1 diagnosis:ch1 expired_age:ch1 Sex:ch1
GSM3561843 III myeloid Control >90 Female
GSM3561844 IV endothelial Control 88 Female
GSM3561845 II neuron Control 79 Male
GSM3561846 V neuron AD 84 Female
GSM3561847 II myeloid Control 73 <NA>
GSM3561848 V myeloid AD 84 Female
Changing Column Names
Data won't always come in with user friendly column names. We can change column names in a data frame with the rename
/rename_with
functions:
Renaming Column Names
In our meta data frame all our column names have extra characters on the end. We can change all these column names with the rename_with
function:
# rename columns that have the pattern :ch1
meta <- meta %>%
rename_with(~ gsub(":ch1","",.))
# we can check our data frame to see our changes
head(meta)
output
braak.score cell type diagnosis expired_age Sex
GSM3561843 III myeloid Control >90 Female
GSM3561844 IV endothelial Control 88 Female
GSM3561845 II neuron Control 79 Male
GSM3561846 V neuron AD 84 Female
GSM3561847 II myeloid Control 73 <NA>
GSM3561848 V myeloid AD 84 Female
If we were interested in changing the name of one column we could use the rename
function to do so:
# rename the cell type column
meta <- meta %>%
dplyr::rename("cell_type"="cell type")
# let's check the names of our data frame now!
head(meta)
output
braak.score cell_type diagnosis expired_age Sex
GSM3561843 III myeloid Control >90 Female
GSM3561844 IV endothelial Control 88 Female
GSM3561845 II neuron Control 79 Male
GSM3561846 V neuron AD 84 Female
GSM3561847 II myeloid Control 73 <NA>
GSM3561848 V myeloid AD 84 Female
Filtering Rows
Not all rows in your data will be worth keeping we can remove rows with the filter
function:
Filtering Data Frames
We will be assessing differentially expressed genes between Alzheimer's disease patients and control patients. However, we will be filtering out any non-neuronal tissue samples:
# only keep neuronal samples
meta_filt <- meta %>%
filter(cell_type=="neuron")
# we can check how many rows were filtered out by comparing the filtered
# data frame with the previous data frame
nrow(meta)
nrow(meta_filt)
output
[1] 113
[1] 42
Selecting Columns
we have just removed 71 rows (so 71 samples) from our meta data. To run DESeq2 we will need to remove those samples from our counts data frame as well. We can remove columns using the select
function:
Selecting Columns
# select only columns whose names are present in the rownames of the meta data
counts_filt <- counts %>%
select(rownames(meta_filt))
# let's check how many columns are present in the filtered and unfiltered data frames
ncol(counts)
ncol(counts_filt)
output
[1] 113
[1] 42
Add A Column
Right now our condition variable is Control
v. AD
. We may want to add a column that spells out that AD
represents Alzheimer's Disease. We can add columns with the mutate
function:
Adding Columns To A Data Frame
# let's create a column for Alzheimer's Disease Status using the ifelse function to insert a value if a test is TRUE
meta_filt <- meta_filt %>%
mutate(diagnosis = gsub("AD","Alzheimer's Disease",diagnosis))
# let's inspect this new column!
meta_filt$diagnosis[1:10]
output
[1] "Control" "Alzheimer's Disease" "Control" "Alzheimer's Disease"
[5] "Alzheimer's Disease" "Control" "Control" "Alzheimer's Disease"
[9] "Control" "Alzheimer's Disease"
Changing A Column
We can also use mutate to update an existing column as well let's update the disease status column to be a factor setting this as a factor will help with DESeq2 later on:
# factor the diagnosis column to set the `Control` value as the reference
meta_filt <- meta_filt %>%
mutate(diagnosis=factor(diagnosis,
levels = c(
"Control",
"Alzheimer's Disease"
)))
# let's inspect our updated column!
meta_filt$diagnosis[1:10]
output
[1] Control Alzheimer's Disease Control Alzheimer's Disease
[5] Alzheimer's Disease Control Control Alzheimer's Disease
[9] Control Alzheimer's Disease
Levels: Control Alzheimer's Disease
Merging Data Frames
Sometimes we need to take two data frames and merge them into one data frame. We can do this with the join
functions. Here we will do an inner join (meaning we merge two data frames an only keep rows that match by some column):
Merging Data Frames
# flip the counts data so patients are the columns
# make the rownames a patient id column
counts_flip <- counts_filt %>%
t() %>%
data.frame() %>%
rownames_to_column("patient_id")
# make the rownames a patient id column
meta_patient_to_col <- meta_filt %>%
rownames_to_column("patient_id")
# inner join the meta data and the counts data
merged <- meta_patient_to_col %>%
inner_join(
.,
counts_flip,
by="patient_id"
)%>%
`rownames<-`(.[["patient_id"]])
nrow(merged)
ncol(merged)
output
[1] 42
[1] 3006
Creating a Table For Publication
When getting reading for a publication we may want to include a summary of our data. The janitor
package contains many helpful functions for that purpose! Let's try to create a table of sex by disease status:
# often times you may want to tabulate your data to prepare a
# table 1 for your paper. An easy way of doing this is to
# use the tabyl/adorn functions from the janitor package
table_1 <- merged %>% # data frame to work with
tabyl(Sex, diagnosis) %>% # tabulate sex by diagnosis
adorn_totals("row") %>% # adorn totals by row
adorn_percentages("all") %>% # adorn percentages to all cells
adorn_pct_formatting(digits = 1) %>% # round the percentage to the first digit
adorn_ns # keep both count and percentage
# let's take a look at our table 1!
table_1
output
Sex Control Alzheimer's Disease
Female 14.3% (6) 16.7% (7)
Male 28.6% (12) 19.0% (8)
<NA> 7.1% (3) 14.3% (6)
Total 50.0% (21) 50.0% (21)