Basic skills session - Princeton University

Ramina and Rebecca

April 10th, 2018


Acknowledgments

- NIA HRS "Genetics in Social Science" lab sessions run by Erin Ware
- RSF Workshop
- Lisa Schneper in Notterman Lab


Outline

- Navigating around the server and basic linux
- Using adroit and slurm
- plink
    - Data types
    - Syntax structure
    - A few examples:
        1. Convert .ped/.map to .bed/.bim/.fam
        2. GWAS
        3. PGS
- LDpred
- gcta for GREML


Two ways of interacting with server

1. Transferring files: common uses (often involves creating correctly formatted .txt files in R or another program):
    - You construct a phenotype file from survey data of your choice and want to include it as the DV for GREML, GWAS, or some other analysis
    - You have files containing covariates to control for in an analysis
    - You have a list of participant IDs that you want to include in the analysis (e.g., subset genetic df to European-American participants only)
2. Running analyses: run on server either due to data confidentiality, file size, or runtime


Transferring files

- Logging onto server:
    sftp [email protected]
- Two sets of commands for navigating around:
    1. Navigating around your local computer: attach "l" prefix to linux commands, e.g.:
        - lpwd: print present working directory
        - lcd: navigate to a specific directory
        - lls: list files in present directory
    2. Navigating around the server: same commands as above but without "l" prefix
- How to put files on/get files from the server:
    1. Navigate to your local directory, e.g.:
        lcd Dropbox/HRS/data/
    2. Navigate to the server directory you want to put the file in, e.g.:
        cd raj2/myfiles/
    3. Use "put" command:
        put phenotype_file.txt
    4. To download files (e.g., score output to then merge with survey data):
        get risk_scores.profile
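If you repeat the same transfers often, the steps above can be collected into a batch file and handed to sftp with its -b option. A minimal sketch, assuming the file names from this slide; the batch file name transfer_batch.txt and the username@server placeholder are hypothetical, and the actual sftp call is left commented out because it needs your own login (batch mode also needs non-interactive authentication such as an ssh key).

```shell
#!/bin/bash
# Write the sftp commands from the slide into a batch file (hypothetical name).
workdir=$(mktemp -d)
batch_file="$workdir/transfer_batch.txt"
cat > "$batch_file" <<'EOF'
lcd Dropbox/HRS/data/
cd raj2/myfiles/
put phenotype_file.txt
get risk_scores.profile
EOF

# Show what would be sent to the server.
cat "$batch_file"

# To run it, substitute your own login for the placeholder and uncomment:
# sftp -b "$batch_file" username@server
```

Keeping the commands in a file also gives you a record of exactly which files went where.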


Running analyses

- The sftp mode covered on the previous slide is really limited in terms of what you can do: for instance, you can't run a script or a program like plink while in that mode! So when you're doing basically anything other than transferring files, you log onto the server in the other mode:
    ssh [email protected]
- Useful linux commands (in addition to pwd, cd, ls from the previous slide):
    - "I need to make a new folder":
        mkdir
    - Copy file:
        cp
    - Delete file:
        rm
    - Move without copying:
        mv path/file destination/file
    - "There's a text file like a README that I want to preview":
        less file.txt
    - "There are two genetic files that correspond to different sample sizes and I want to get a quick read on the count of rows in each" (only works for plain-text file types such as .fam, not binary ones such as .bed):
        wc -l filename.fam
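To see these commands in action without touching real data, here is a small sketch that runs them on a made-up three-person .fam file inside a throwaway directory. All file and folder names are hypothetical.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Fake three-person .fam file (made-up contents, six columns per person).
printf '1 1 0 0 1 -9\n2 2 0 0 2 -9\n3 3 0 0 1 -9\n' > sample_a.fam

mkdir backups                           # "I need to make a new folder"
cp sample_a.fam backups/sample_a.fam    # copy: the original stays in place
mv sample_a.fam sample_renamed.fam      # move: the old name no longer exists
n_rows=$(wc -l < backups/sample_a.fam)  # row count = number of people
echo "rows in backup: $n_rows"
rm backups/sample_a.fam                 # delete the copy
ls -R                                   # list what is left
```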


More linux: loops

- "I want to construct a for loop, for instance, to iterate over chromosome files that share a similar structure". Example of a loop that iterates over .gen files corresponding to different chromosomes and subsets each (will go over flag structure when reviewing plink):

    for num in 16 10 3 12 20 15 8 14 6 2 18 4 17 5 7 1; do
        gtool -S \
        --g /home/geneticData/PsychChipRelease1_070516/imputed_all/PsychArraysQCfiltered3HWE_impute2_chr"$num"_chunkALL.br.gen.gz \
        --s /home/geneticData/PsychChipRelease1_070516/imputed_all/PsychArraysQCfiltered3HWE_impute2_chr1_chunkALL.sample2 \
        --og velders_snpsimput_chr"$num".gen \
        --os velders_snpsimput_chr"$num".sample \
        --inclusion velders_snplistnew.txt
    done

- Things to highlight compared to R scripting:
    - Same structure of using any name for the iterator (e.g., num could have been number, i, etc.)
    - Commands are separated by ";" (or by a line break); a trailing "\" continues one command onto the next line
    - do and done mark the start and end of the loop body
    - The part that's iterating is "$iterator" (here "$num") within the filename
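A useful habit before launching a long loop is a dry run that only prints what each iteration would operate on. A minimal sketch, assuming a hypothetical chr<N>.gen.gz naming pattern; echo stands in for the real command.

```shell
#!/bin/bash
# Dry run: loop over chromosomes 1-22 and print the file names each
# iteration would use. The naming pattern is hypothetical.
planned=$(
    for num in $(seq 1 22); do
        infile="chr${num}.gen.gz"
        outfile="subset_chr${num}.gen"
        echo "chr $num: $infile -> $outfile"
    done
)

echo "$planned"
```

Once the printed names look right, the echo line can be swapped for the real command.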


More linux: subsetting files efficiently

- Splitting data by row into discrete chunks (can be useful for running processes on large files), e.g., the code below splits the file into chunks of 10000 lines each:

    split -l 10000 name_of_file

- Selecting columns by position or name and writing them to a new file: awk family of commands
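Both ideas can be tried on a small made-up file. The sketch below splits a 26-line file into chunks of 10 lines and then uses awk to pull columns first by position and then by header name; the file and column names are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Fake file: one header line plus 25 rows, 3 columns (hypothetical names).
{
    echo "IID age score"
    for i in $(seq 1 25); do echo "id$i $((20 + i)) $((i * 2))"; done
} > toy.txt

# Split into chunks of 10 lines each; outputs are chunk_aa, chunk_ab, ...
split -l 10 toy.txt chunk_
n_chunks=$(ls chunk_* | wc -l)

# Select columns by position: keep columns 1 and 3.
awk '{print $1, $3}' toy.txt > by_position.txt

# Select a column by name: find "score" in the header, then print it.
awk 'NR == 1 {for (i = 1; i <= NF; i++) if ($i == "score") col = i}
     {print $1, $col}' toy.txt > by_name.txt

echo "chunks: $n_chunks"
head -n 3 by_name.txt
```

Note that split does not copy the header into each chunk, so only the first chunk carries the column names.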


Most useful linux (for me): screen command

- Lifesaver when you want to run analyses that take a while without sending / queuing these on adroit etc.
- What a screen is: sets up a separate terminal that you can attach to when you want to start running a process/analysis and detach from to allow analyses to run in the background
- General guidelines for interacting with a remote server apply:
    - Run the script locally on a subset of data or fake data to catch errors before estimating on the full sample
    - Make heavy use of print commands to track where the analysis has progressed to
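One way to follow the "print commands" advice in a bash script is a tiny helper that stamps each message with the time, so the screen (or a log file) shows how far things got. A minimal sketch; the three step names are made up and the real work is left as a placeholder comment.

```shell
#!/bin/bash
# Print a timestamped message so the output shows how far the analysis got.
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

# Hypothetical three-step analysis; the comment marks where real work would go.
steps_done=0
for step in "read data" "fit model" "write results"; do
    log "starting: $step"
    # ... the real command for this step would go here ...
    steps_done=$((steps_done + 1))
    log "finished: $step ($steps_done of 3)"
done
```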


Screen steps

1. Start a new screen. This opens up a window that's the "screen"; I'm calling it exportalleles to remind myself of which analysis it's running:
    screen -S exportalleles
2. Can then start the script running and close the screen while allowing the analysis to keep running (more formally, detach from the screen):
    ctrl + a, then d
3. Can then close your computer, eat, move on with your life, etc., and when you log back into the server, list screens:
    screen -ls
4. Can then resume a particular screen, using either the number or the name you've assigned (more formally, attach to the screen):
    screen -r exportalleles


Similar to adroit, can also construct a .sh file to run an entire script

- How to run (if in directory where script is stored):
    bash myscript.sh
- Example structure (make sure to add line at top): can create in any text editor [1]

[1] I like the SublimeText interface, which allows you to switch between languages and has autocomplete.
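The example from the slide is not reproduced in this transcript, so here is a minimal sketch of what such a file could look like, assuming the "line at top" refers to the usual #!/bin/bash interpreter line. The script contents are placeholders.

```shell
#!/bin/bash
# Write a small script file; its first line names the interpreter (bash).
workdir=$(mktemp -d)
cat > "$workdir/myscript.sh" <<'EOF'
#!/bin/bash
# Stop at the first failing command instead of carrying on with bad inputs.
set -e
echo "step 1: starting"
echo "step 2: done"
EOF

# Run it the way the slide shows.
script_output=$(bash "$workdir/myscript.sh")
echo "$script_output"
```

In a real script, the echo lines would be replaced by the plink or gtool calls you want to run in sequence.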


Adroit, Scratch and Slurm

1. ssh username@adroit4
   Will ask for your Princeton password (type and Enter)
2. You end up in your personal folder. You can keep some data and programs here. But you also have access to the "scratch" folder, which is larger
    - To access the scratch folder, use:
        cd scratch/network/username
3. From here you can install programs and run analyses.
    - For example:
        wget http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-x86_64.zip
        unzip plink-1.07-x86_64.zip


Example: Adroit script

Basically a bash script with some #SBATCH settings that control how many CPUs and how much RAM your program uses. Change the values after the equals sign to adjust how sbatch behaves. More options can be found at:
http://www.arc.ox.ac.uk/content/slurm-job-scheduler
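The script shown on the slide is not reproduced in this transcript. As a stand-in, here is a minimal sketch that writes a submission file with a few common #SBATCH settings; every value after the equals sign is a placeholder to adjust rather than a recommendation, and the echo line marks where the analysis commands would go.

```shell
#!/bin/bash
# Write a hypothetical Slurm submission script to a throwaway directory.
workdir=$(mktemp -d)
slurm_file="$workdir/plink_analysis.slurm"
cat > "$slurm_file" <<'EOF'
#!/bin/bash
#SBATCH --job-name=plink_analysis   # name shown in the queue
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks=1                  # number of tasks
#SBATCH --cpus-per-task=1           # CPU cores per task
#SBATCH --mem=4G                    # memory for the job
#SBATCH --time=01:00:00             # wall-time limit (HH:MM:SS)

# The analysis commands go below the #SBATCH lines, e.g. a plink call.
echo "job started on $(hostname)"
EOF

# Count the scheduler settings as a quick sanity check.
n_settings=$(grep -c '^#SBATCH' "$slurm_file")
echo "settings written: $n_settings"
```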


Submitting jobs

- Once written, save the bash script with the file extension .slurm:
    plink_analysis.slurm
- Then you can use the sbatch command to submit it to the queue:
    sbatch plink_analysis.slurm
- It will run as soon as it is your turn!


Common programs for interacting with genetic data

1. plink: used for data management, GWAS, risk score construction
2. gcta: used for GREML
3. gtool: used for imputation data
4. Many more (e.g., fastStructure for ancestry clustering, etc.)


plink


plink: common file formats

- plink is a program installed on lotka; call it via plink and then write the commands you want to run. The two most common file formats (can convert between them):
    1. .ped/.map files: .ped has rows as respondents and variants as columns (after the first 6 columns) (depicted below; PID = father's ID, MID = mother's ID); .map has rows as snps and gives you the order of columns for .ped and the chromosome positions

        FID  IID  PID  MID  Sex  Phen.  snp1 A allele  snp1 B allele  snp2 A allele  ...
        1    1    0    0    -9   -9     G              G              A              ...
        2    2    0    0    -9   -9     C              G              A              ...

    2. .bed/.bim/.fam: will usually convert .ped/.map to this set of files using the following command:

        plink --noweb \
        --file /home/opr/dbgap/HRS/HRS_phase123 \
        --make-bed \
        --out /home/opr/dbgap/HRS/HRS_phase123_newformat


plink: zooming out on syntax structure

1. Call the program and tell it not to interface with the web:
    plink --noweb
2. Tell it which file to operate the subsequent command on:
    --file /pathname/fileprefix
    - Notes:
        - use "--file" for .ped/.map files but "--bfile" for .bed/.bim/.fam files
        - only feed it the file prefix (e.g., if we feed it "HRS_phase123", it knows to look at both the .ped and the .map file for those)
        - Can either use an absolute pathname:
            /home/opr/raj2/HRS_analyses/geneticdata/
        - Or a relative pathname (e.g., if I'm already in geneticdata and the file is stored one level up in the directory):
            ../
3. Tell it which command to execute on that file:
    --make-bed
4. Tell it what to name the resulting file(s) (again this is the prefix, and it will append different suffixes to the prefix; in this case, .bed, .bim, and .fam):
    --out /home/opr/dbgap/HRS/HRS_phase123_newformat


plink: .bed/.bim/.fam

1. .fam: rows are respondents, columns are the same as the first 6 of .ped (family ID, individual ID, father ID, mother ID, sex, phenotype); small enough to read into R using read.table
2. .bim: rows are variants; the important columns for the purposes of score construction are those indicating the A allele (second to last column) and B allele (last column)
3. .bed: contains the genetic info; not viewable via head! (binary encoding)
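Because .fam and .bim are plain text, quick checks can also be done from the command line before reading anything into R. A minimal sketch on made-up toy files (all IDs and values are hypothetical): it counts people and variants and prints each SNP with its two allele columns.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Toy .fam: family ID, individual ID, father ID, mother ID, sex, phenotype.
printf 'f1 i1 0 0 1 -9\nf2 i2 0 0 2 -9\nf3 i3 0 0 2 -9\n' > "$workdir/toy.fam"
# Toy .bim: chromosome, SNP ID, genetic distance, bp position, allele 1, allele 2.
printf '1 rs1 0 1000 G A\n1 rs2 0 2000 C T\n' > "$workdir/toy.bim"

# One row per person in .fam, one row per variant in .bim.
n_people=$(awk 'END {print NR}' "$workdir/toy.fam")
n_snps=$(awk 'END {print NR}' "$workdir/toy.bim")
# SNP ID plus the last two columns (the two alleles) for each variant.
alleles=$(awk '{print $2, $(NF-1), $NF}' "$workdir/toy.bim")

echo "people: $n_people, snps: $n_snps"
echo "$alleles"
```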


Sidenote on .gen imputation data

- You'll use plink when data are directly genotyped (each participant has one genotype for a SNP, e.g., AA, AB, BB)
- IMPUTE2 and other imputation methods produce imputation data often stored in .gen format
- The defining characteristic is that each participant, rather than deterministically having one genotype for each SNP, has a length-3 vector of imputation probabilities for the three possible genotypes
- Basic structure of .gen (figure: red = participant 1; green = participant 2, etc.)
- Can work with it more automatically using gtool (e.g., convert to .bed/.bim/.fam for use in plink, where you assign each individual to their "best call" genotype and code Pr(genotype) < τ as missing) or more manually in R by reading in with read.table
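To make the "best call" idea concrete, here is a rough awk sketch on a two-SNP, two-person toy file. It assumes the common .gen layout of five leading columns (SNP ID, rsID, position, allele A, allele B) followed by three probabilities per person in the order AA, AB, BB, and it uses a hypothetical threshold τ = 0.9. This is an illustration of the logic, not what gtool does internally.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Toy .gen: 5 leading columns, then 3 probabilities (AA, AB, BB) per person.
cat > "$workdir/toy.gen" <<'EOF'
snp1 rs1 1000 A G 0.95 0.05 0 0.4 0.35 0.25
snp2 rs2 2000 C T 0 0.02 0.98 0.01 0.97 0.02
EOF

tau=0.9   # hypothetical threshold: calls with probability below this are missing
calls=$(awk -v tau="$tau" '{
    out = $2
    g[0] = $4 $4; g[1] = $4 $5; g[2] = $5 $5     # genotype labels AA, AB, BB
    for (i = 6; i + 2 <= NF; i += 3) {
        k = 0                                     # index of the largest probability
        for (j = 1; j <= 2; j++) if ($(i + j) + 0 > $(i + k) + 0) k = j
        out = out " " (($(i + k) + 0 < tau + 0) ? "NA" : g[k])
    }
    print out
}' "$workdir/toy.gen")

echo "$calls"
```

Each output row is the rsID followed by one call per person, with NA where no genotype reaches the threshold.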


plink: subsetting to include only certain respondents or certain snps

1. Include only certain respondents: "I want to subset to European Americans only and have a list of their IDs":

    plink --noweb \
    --bfile /home/opr/dbgap/HRS/HRS_phase123 \
    --keep EuropeanIDs.txt \
    --make-bed \
    --out /home/opr/raj2/HRS_phase123_EA

2. Include only certain snps (don't need to do before constructing PGS but can); note that the flag for a SNP list is --extract, not --keep:

    plink --noweb \
    --bfile /home/opr/dbgap/HRS/HRS_phase123 \
    --extract snps_to_use.txt \
    --make-bed \
    --out /home/opr/raj2/HRS_phase123_snpsubset
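The two list files have different shapes: plink's --keep expects a family ID and an individual ID on each line, while a SNP list for --extract has one SNP ID per line. A minimal sketch of building both with awk, assuming a hypothetical ancestry table (FID, IID, ancestry label, with a header) and a toy .bim; the "EA" label and the chromosome 1 filter are made up for illustration.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical ancestry table: header row, then FID, IID and an ancestry label.
printf 'FID IID ancestry\nf1 i1 EA\nf2 i2 AA\nf3 i3 EA\n' > ancestry.txt
# Toy .bim: chromosome, SNP ID, genetic distance, bp position, allele 1, allele 2.
printf '1 rs1 0 1000 G A\n2 rs2 0 2000 C T\n1 rs3 0 3000 A G\n' > toy.bim

# Person list: FID and IID of everyone labelled EA (header skipped).
awk 'NR > 1 && $3 == "EA" {print $1, $2}' ancestry.txt > EuropeanIDs.txt

# SNP list: one SNP ID per line, here every variant on chromosome 1.
awk '$1 == 1 {print $2}' toy.bim > snps_to_use.txt

cat EuropeanIDs.txt snps_to_use.txt
```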


plink: run a GWAS

    plink --noweb \
    --bfile /home/opr/dbgap/HRS/HRS_phase123 \
    --pheno myphenotype.txt \
    --linear \
    --covar /home/opr/raj2/mycovars.txt \
    --covar-name Sex,FamilyID \
    --out /home/opr/raj2/GWAS_withcovar
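When covariates are included, the --linear results file contains a row for each covariate as well as the additive SNP effect (TEST = ADD), so a first look at the output usually means filtering. A minimal sketch on a made-up results table, assuming the usual column order CHR SNP BP A1 TEST NMISS BETA STAT P and the conventional 5e-8 genome-wide threshold; the numbers are invented.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Made-up results in the layout assumed above.
cat > "$workdir/toy.assoc.linear" <<'EOF'
 CHR SNP BP A1 TEST NMISS BETA STAT P
 1 rs1 1000 A ADD 100 0.50 6.1 3e-9
 1 rs1 1000 A Sex 100 0.10 1.0 0.32
 1 rs2 2000 T ADD 100 0.01 0.2 0.84
 1 rs3 3000 C ADD 100 NA NA NA
EOF

# Keep additive SNP rows with a non-missing p-value below the threshold;
# print SNP, beta and p-value.
hits=$(awk -v thr=5e-8 'NR > 1 && $5 == "ADD" && $9 != "NA" && $9 + 0 < thr + 0 {
    print $2, $7, $9
}' "$workdir/toy.assoc.linear")

echo "$hits"
```

The explicit NA check matters because awk would otherwise read a missing p-value as 0.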


plink: steps in PGS score construction (basic; see Lauren and Erin's paper for many choices that go into this!)

- Properly format two input files for score construction:
    1. Reference alleles the consortium used: e.g., if a given SNP has alleles T and C, the consortium may use "T" as the reference / risk allele. .txt file
    2. Betas: again, .txt file format
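As a rough illustration of the two files, the sketch below builds them from a made-up summary statistics table. It assumes a hypothetical layout of SNP, A1 (the consortium's reference allele), A2, BETA, P with a header row; the reference-allele file gets two columns (SNP, allele) and the betas file three (SNP, allele, weight), which is the layout plink's --reference-allele and --score read by default. Real consortium files will have different column names and orders.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up summary statistics: SNP, reference allele, other allele, beta, p-value.
printf 'SNP A1 A2 BETA P\nrs1 T C 0.12 0.001\nrs2 G A -0.05 0.2\n' > sumstats.txt

# Reference-allele file: SNP ID and the allele the consortium used.
awk 'NR > 1 {print $1, $2}' sumstats.txt > internalizing_ref.txt

# Betas file: SNP ID, scored allele, and its weight.
awk 'NR > 1 {print $1, $2, $4}' sumstats.txt > internalizing_beta.txt

cat internalizing_ref.txt internalizing_beta.txt
```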


plink: steps in PGS score construction

- Align reference alleles to those used by consortium

    plink --noweb \
    --bfile /home/geneticData/PsychChipRelease1_070516/originalData/preimputedData_ffids_badsampleremoved \
    --reference-allele internalizing_ref.txt \
    --make-bed \
    --out internalizing_refallele_df

- (Maybe) flip the strand of alleles for ones not found in the initial setting of reference alleles

    plink --noweb \
    --bfile /home/geneticData/PsychChipRelease1_070516/originalData/preimputedData_ffids_badsampleremoved \
    --flip internalizing_allele2flip.txt \
    --make-bed \
    --out internalizing_flipallele_df

- Re-set reference alleles after flipping strand (repeat the first command on the flipped allele df)
- Construct score

    plink --noweb \
    --bfile internalizing_refallele_df_flipped \
    --score internalizing_beta.txt \
    --out EAGLE_internalizing_score


How do we create PCs to then residualize scores by?

- Some surveys release PCs as part of their genetic data
- To construct them yourself, can use the "PCA" command within plink, but my sense is most datasets use more complicated procedures (e.g., Fragile Families uses a method that involves projection onto 1000 Genomes and .vsf files)


What happens if you want to run a model not supported by plink (or gcta or some other program)?

- Can export raw alleles data and run in R/other software: long runtime due to huge file size (e.g., > 500,000 columns) but possible
- Example workflow:
    - Use plink to export raw allele data (--recodeA just tells it to use simpler additive coding: the count of minor alleles)

        plink --noweb \
        --file /home/opr/dbgap/HRS/HRS_phase123 \
        --recodeA \
        --out /home/opr/raj2/HRS_rawdf

    - Can chunk data in command line using subsetting commands used previously
    - Write scripts to read in chunks (should use data.table, fastlm, and other packages optimized for large df, and keep data in matrix form whenever possible; in my experience, dplyr takes forever)
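One wrinkle when chunking an exported file by rows is that plain split leaves the header only in the first chunk. A minimal sketch that splits by people and copies the header into every chunk, using a made-up 7-person file with two hypothetical SNP columns; the chunk size of 3 is only for illustration.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up export: header plus 7 people, 6 ID columns and 2 allele-count columns.
{
    echo "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_T"
    for i in $(seq 1 7); do
        echo "f$i i$i 0 0 1 -9 $(( i % 3 )) $(( (i + 1) % 3 ))"
    done
} > toy.raw

# Save the header, split the remaining rows into chunks of 3 people,
# then put the header back on top of every chunk.
header=$(head -n 1 toy.raw)
tail -n +2 toy.raw | split -l 3 - body_
for f in body_*; do
    { echo "$header"; cat "$f"; } > "chunk_${f#body_}.raw"
    rm "$f"
done

n_chunks=$(ls chunk_*.raw | wc -l)
echo "chunks written: $n_chunks"
```

Each chunk can then be read on its own with the same column names.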


LD score regression


LD score is fast and easy

- Trivial run-time and memory (~15s, ~1GB for h2).
- Automated data re-formatting and QC.
    - munge_sumstats.py included w/ ldsc.
    - No need for one-off perl scripts.
- Download pre-computed LD Scores.
    - broadinstitute.org/~bulik/eur_ldscores/
    - (European-only for now)


Example: estimating rg(BIP, SCZ)


gcta


Steps in running GREML

1. Use plink or gcta to construct a genetic relatedness matrix (GRM). Code for plink: [2]

    plink --noweb \
    --bfile /home/opr/raj2/HRS_phase123_newformat \
    --make-grm-bin \
    --maf 0.01 \
    --out /home/opr/raj2/HRS_GRM_EA

2. Create a tab-separated .txt file with the FID, IID, and your phenotype of interest, in that order (no header), as well as a text file with any covars to control for
3. Use gcta64 to run GREML

    gcta64 --reml \
    --grm HRS_GRM_EA \
    --pheno myphenotype.txt \
    --qcovar PCS.txt \
    --out GREML_results_myphenotype

[2] Depending on the version of gcta you have installed, either create the .bin version (newer versions) or .gz version (--make-grm-gz; older versions). Can also change the relatedness cutoff used to exclude individuals.
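For step 2, the two input files can be cut out of one merged table with awk. A minimal sketch, assuming a hypothetical space-separated table with a header and the columns FID, IID, pheno, PC1, PC2; it writes tab-separated, header-free files with the two ID columns first. The values are made up.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up merged table: IDs, one phenotype, two principal components.
printf 'FID IID pheno PC1 PC2\nf1 i1 2.5 0.01 -0.02\nf2 i2 3.1 -0.03 0.04\n' > merged.txt

# Phenotype file: FID, IID, phenotype; tab-separated, header dropped.
awk 'BEGIN {OFS = "\t"} NR > 1 {print $1, $2, $3}' merged.txt > myphenotype.txt

# Quantitative covariate file: FID, IID, then the PCs.
awk 'BEGIN {OFS = "\t"} NR > 1 {print $1, $2, $4, $5}' merged.txt > PCS.txt

cat myphenotype.txt PCS.txt
```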
