Gene Quantification
Gene Quantification with featureCounts
Assigning which reads are expressed in which genes essentially gives each gene a "count" per sample - or how many reads actually map to that that gene. If a read maps to a gene by as little as one base pair it is counted towards that gene. However, genes can overlap with each other, meaning a read can map to more than one gene (multi-mapping). These reads are discarded as ambiguous reads given that we can't be sure of which gene they come from:
Gene Quantification
Let's create a script to convert our gff3 file to a gft file (necessary for featureCounts), remove any lines that don't have a gene or transcript id, run featureCounts on our alignment data, and perform qc with MultiQC:
featurecounts.sh
#!/bin/bash
# Step 6: Count features using featureCounts
# convert gff3 to gtf
gffread ./data/sequence.gff3 -T -o ./data/sequence.gtf
# ensure there are no empty gene ids or transcript ids
grep -v 'gene_id ""' ./data/sequence.gtf | grep -v 'transcript_id "gene' > ./data/sequence_fixed.gtf
# run featurecounts
featureCounts \
-a ./data/sequence_fixed.gtf \
-o featurecounts_output/featurecounts_results.txt \
alignment_output/*bam
# run qc on featurecounts run
mkdir ./qc_output/featurecounts_qc
multiqc -o ./qc_output/featurecounts_qc --title "featureCounts QC Report" ./featurecounts_output/*.summary
Explanation of Terms
gffread:
-T -o ./data/sequence.gtf
make a gtf file instead of a gff3 file
grep:
grep -v
match the opposite of this pattern
featureCounts:
-a ./data/sequence_fixed.gtf
path to annotation file-o featurecounts_output/featurecounts_results.txt
path to output filealignment_output/*bam
path to our alignment input data
multiqc:
-o ./qc_output/featurecounts_qc
output path--title "featureCounts QC Report"
report title./featurecounts_output/*.summary
file to use to make report
featureCounts Quality Control
Now let's examine our featureCounts QC:
featureCounts General Statistics: percent and number of sequences assigned
featureCounts General Statistics: percent and number of sequences assigned, unmapped, multi-mapped, unassigned due to no features present, and unassiged due to ambiguity in mapping
Here we note that very few reads actually mapped to genes due to multi-mapping, where one read matches multiple genes/features. Given that this is subsampled data mapped to only one chromosome, we aren't going to take any steps to fix this. However, in your own RNA-seq analysis, most reads should be assigned to features.