AlphaFold2 Pre-Processing
AlphaFold2 Pre-Processing
Let's talk a bit about how AlphaFold2 go from a protein FASTA file to a full structure prediction.
Searching for Similar Sequences
- This query sequence is compared to:
- UniRef90 database to find similar sequences
- PDB70 to find similar structures
- Sequences that are too similar to our query are filtered out so that we don’t just build a replicate based on that sequence
- These sequences are arranged as an MSA
Multiple Sequence Alignment (MSA)
- An MSA is an array of sequences
- These sequences are aligned with one another as to best match similar regions
- These sequences don’t always line up perfectly and as such we see:
- Conserved positions: where the letter does not change
- Coevolved positions: where the letter will change with another letter
- Specificity Determining positions: where the letter is consistently different
Why is an MSA Useful In Structure Prediction?
- The theory is that residues that coevolve are generally close to each other in the protein’s folded state
- So, by assessing what residues change together we get an idea of where they might be spatially!
MSA Embedding
- An MSA is still essentially an array of letters
- To be more computer friendly these letters are embedded as numbers using their positional information
- AlphaFold embeds these letter values as numeric ones and terms this the MSA representation
Embedding Example
- Take for example the sentence “I ate an apple and played the piano”
- This string is embedded by positional information.
- e.g. ate was the second word so there is a 1 in the second column at row “ate”
Pair Representation
- Similar Structures were also queried for using our protein sequence.
- These structure files (A.K.A Crystallographic Information Files (CIF)) contain 3D coordinates for a protein’s atoms in space
- These coordinates are used to initialize a pairwise distance matrix between residues that AlphaFold calls the pair representation