Tufts TTS Research Technology Tutorials - Beta

AlphaFold2 Pre-Processing


Let's talk a bit about how AlphaFold2 goes from a protein FASTA file to a full structure prediction.

Searching for Similar Sequences

The query sequence from the FASTA file is compared against:
- the UniRef90 database, to find similar sequences
- PDB70, to find similar structures
Sequences that are too similar to our query are filtered out so that we don’t just build a replicate based on that sequence.
The remaining sequences are arranged as an MSA.
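
The filtering idea can be sketched in a few lines of Python. This is only an illustration, not AlphaFold2's actual search pipeline: the sequences, the hit names, and the `MAX_IDENTITY` cutoff are all made up, and the sequences are assumed to be already aligned and of equal length.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions with the same letter (assumes equal-length, pre-aligned sequences)."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# Hypothetical query and database hits, for illustration only.
query = "MKTAYIAK"
hits = {
    "hit_1": "MKTAYIAK",  # identical to the query
    "hit_2": "MKSAYLAK",
    "hit_3": "MRSGYLGR",
}

# Hypothetical cutoff: drop hits that are more than 90% identical to the query.
MAX_IDENTITY = 90.0

kept = {
    name: seq
    for name, seq in hits.items()
    if percent_identity(query, seq) <= MAX_IDENTITY
}

print(sorted(kept))  # hit_1 is dropped because it matches the query exactly
```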

Multiple Sequence Alignment (MSA)

An MSA is an array of sequences.
These sequences are aligned with one another so as to best match similar regions.
They don’t always line up perfectly, and as such we see:
- Conserved positions: where the letter does not change
- Coevolved positions: where the letter changes together with a letter at another position
- Specificity-determining positions: where the letter is consistently different between groups of sequences
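
As a small illustration of the first category, the sketch below finds conserved positions in a toy MSA. The four sequences are made up, and the function name `conserved_columns` is just a label chosen for this example.

```python
# Toy MSA: four made-up, already-aligned sequences of equal length.
toy_msa = [
    "MKTAY",
    "MKSAY",
    "MRTAY",
    "MKTAF",
]

def conserved_columns(msa):
    """Return the 0-based column indices where every sequence has the same letter."""
    n_cols = len(msa[0])
    conserved = []
    for col in range(n_cols):
        letters = {seq[col] for seq in msa}  # distinct letters seen in this column
        if len(letters) == 1:
            conserved.append(col)
    return conserved

print(conserved_columns(toy_msa))  # [0, 3]
```

Here the first column (always M) and the fourth column (always A) are conserved, while the other columns vary in at least one sequence.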

Why is an MSA Useful In Structure Prediction?

The theory is that residues that coevolve are generally close to each other in the protein’s folded state.
So, by assessing which residues change together, we get an idea of where they might be spatially!
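
One classical way to put a number on "changing together" is the mutual information between two MSA columns. The sketch below is an illustration of that idea on made-up sequences; it is not how AlphaFold2 works internally, since AlphaFold2 learns these relationships from the MSA with a neural network.

```python
import math
from collections import Counter

# Toy MSA (made up): columns 1 and 2 always change together, column 3 does not.
toy_msa = [
    "AKDG",
    "AKDS",
    "AREG",
    "ARES",
]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an aligned MSA."""
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

print(mutual_information(toy_msa, 1, 2))  # 1.0: K pairs with D, R pairs with E
print(mutual_information(toy_msa, 1, 3))  # 0.0: column 3 varies independently
```

A high score between two columns suggests the residues at those positions covary, which under the theory above hints that they may sit near each other in the folded protein.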

MSA Embedding

An MSA is still essentially an array of letters.
To be more computer-friendly, these letters are embedded as numbers using their positional information.
AlphaFold2 terms this numeric embedding the MSA representation.

Embedding Example

Take, for example, the sentence “I ate an apple and played the piano”.
This string is embedded by positional information: each word gets a row and each position in the sentence gets a column.
For example, “ate” is the second word, so there is a 1 in the second column of the “ate” row.
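
The sentence example can be reproduced with a short Python sketch. The function name `positional_one_hot` is made up for this illustration; it simply builds one row per word with a 1 in the column matching that word's position.

```python
def positional_one_hot(sentence):
    """Return a list of (word, row) pairs; each row has a 1 at the word's position."""
    words = sentence.split()
    rows = []
    for idx, word in enumerate(words):
        row = [1 if pos == idx else 0 for pos in range(len(words))]
        rows.append((word, row))
    return rows

matrix = positional_one_hot("I ate an apple and played the piano")

for word, row in matrix:
    print(f"{word:>7} {row}")
```

The row for "ate" comes out as `[0, 1, 0, 0, 0, 0, 0, 0]`, matching the description above: a 1 in the second column and 0 everywhere else.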

Pair Representation

Similar structures were also searched for using our protein sequence.
These structure files (a.k.a. Crystallographic Information Files, CIF) contain 3D coordinates for a protein’s atoms in space.
These coordinates are used to initialize a pairwise distance matrix between residues, which AlphaFold2 calls the pair representation.
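
To make "pairwise distance matrix" concrete, here is a minimal sketch that builds one from 3D coordinates. The coordinates are hypothetical (one point per residue, in angstroms) rather than values read from a real CIF file, and the real AlphaFold2 feature pipeline is more involved than this.

```python
import math

# Hypothetical coordinates, one (x, y, z) point per residue, in angstroms.
coords = [
    (0.0, 0.0, 0.0),
    (3.0, 4.0, 0.0),
    (3.0, 4.0, 12.0),
]

def pairwise_distances(points):
    """Return a square matrix where entry [i][j] is the distance between points i and j."""
    return [[math.dist(a, b) for b in points] for a in points]

distances = pairwise_distances(coords)

for row in distances:
    print(row)
```

The matrix is symmetric with zeros on the diagonal: entry `[0][1]` is 5.0, entry `[1][2]` is 12.0, and entry `[0][2]` is 13.0 for these toy points.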