Introduction to Biostatistics
Biostatistics applies statistical methods to biological problems. Because this work requires data, we need to do some setup before starting the biostatistics tutorials.
Setup
For the following machine learning tutorials we will be using glioblastoma data from cBioPortal. Before getting started you will need:
Navigate To The Cluster
Once you have an account and are connected to the VPN/Tufts Network, navigate to the OnDemand Website and log in with your Tufts credentials. Once you are logged in, you'll notice a few navigation options:
Setting Up A Project Space
We are going to open an interactive app:
Click on Interactive Apps > RStudio Pax, and you will see a form for requesting compute resources to run RStudio on the Tufts HPC cluster. Fill out the form with the following entries:
- Number of hours: 3
- Number of cores: 1
- Amount of memory: 32GB
- R version: 4.0.0
- Reservation for class, training, workshop: Default
- Load Supporting Modules: curl/7.47.1 gcc/7.3.0 hdf5/1.10.4 boost/1.63.0-python3 libpng/1.6.37 java/1.8.0_60 libxml2/2.9.10 libiconv/1.16 fftw/3.3.2 gsl/2.6
We will now need to create the project that we will work out of. Click Launch and wait until your session is ready. Click Connect To RStudio Server, and a new window will pop up with RStudio. Now create a new project:
- Go to File > New Project
- New Directory
- New Project
- Create a name for your project (e.g. machine-learning)
- Create Project
In the terminal, start setting up your directories:
mkdir data
mkdir scripts
mkdir results
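If you end up re-running the setup, the three commands above will complain that the directories already exist. A minimal sketch of an equivalent single command, assuming you are in the project's top-level directory, is:

```shell
# -p skips directories that already exist instead of raising an error
mkdir -p data scripts results

# List the three directories to confirm they are in place
ls -d data scripts results
```

This is only a convenience; the separate mkdir commands produce the same layout.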
Now that we have our project set up, we need to download our data. We will download the data into the data folder and decompress it:
cd data
wget https://cbioportal-datahub.s3.amazonaws.com/gbm_cptac_2021.tar.gz
tar -xvf gbm_cptac_2021.tar.gz
cd ..
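Before moving on, it can help to confirm that the download succeeded. The following is a rough sketch, not part of the original setup: it assumes you are back in the project's top-level directory and that the archive kept the file name from the URL above. The variable names archive and status are just illustrative.

```shell
# Path to the downloaded archive, relative to the project directory (assumed)
archive="data/gbm_cptac_2021.tar.gz"

if [ -s "$archive" ]; then
    # -t lists the archive contents without extracting; head keeps the preview short
    tar -tzf "$archive" | head -n 10
    status="found"
else
    # -s fails when the file is missing or empty, which suggests the download did not complete
    echo "Archive not found or empty: $archive" >&2
    status="missing"
fi

echo "archive status: $status"
```

If the status is missing, re-running the wget command inside the data folder is a reasonable first step.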