Background
Background
Sequencing data analysis typically focuses on either assessing DNA or RNA. As a reminder here is the interplay between DNA, RNA, and protein:
DNA Sequencing
- Fixed copy of a gene per cell
- Analysis goal: Variant calling and interpretation
RNA Sequencing
- Copy of a transcript per cell depends on gene expression
- Analysis goal: Differential expression and interpretation
Note
Here we are working with DNA sequencing
Next Generation Sequencing
Here we will analyze a DNA sequence using next generation sequencing data. Here are the steps to get that data:
- Library Preparation: DNA is fragmented and adapters are added to these fragments
- Cluster Amplification: This library is loaded onto a flow cell, where the adapters help hybridize the fragments to the flow cell. Each fragment is then amplified to form a clonal cluster
- Sequencing: Fluorescently labelled nucleotides are added to this flow cell and each time a base in the fragment bonds a light signal is emmitted telling the sequencer which base is which in the sequence.
- Alignment & Data Analysis: These sequenced fragments, or reads, can then be aligned to a reference sequence to determine differences.
Singe End v. Paired End Data
- single-end sequence each DNA fragement from one end only
- paired-end sequence each DNA fragement from both sides. Paired-end data is useful when sequencing highly repetitive sequences.
Variant Calling
Ploidy
-
When discussing variant calling it is worth mentioning an organism's ploidy. Ploidy is the number of copies of each chromosomes.
- Humans cells are diploid for autosomal chromosome and haploid for sex chromosomes
- Bacteria are haploid
- Viruses and Yeast can by haploid or diploid
-
Variant callers can use ploidy to improve specificity (avoid false positives) because there are expected variant frequencies, e.g. for a diploid:
- Homozygous
- both copies contain variant
-
fraction of the reads ~1
-
Heterozygous
- one copy of variant
- fraction of reads with variant ~0.5