R GWAS Packages

R Packages for Genome-Wide Association Studies. Qunyuan Zhang, Division of Statistical Genomics. Statistical Genetics Forum, March 10, 2008.

Description: Statistics in genome-wide association studies, by Qunyuan Zhang

Transcript of R GWAS Packages

Page 1: R GWAS Packages

R Packages for Genome-Wide Association Studies

Qunyuan Zhang

Division of Statistical Genomics

Statistical Genetics Forum

March 10, 2008

Page 2: R GWAS Packages

What is R?

R is a free software environment for statistical computing and graphics.

Runs on a wide variety of UNIX platforms, Windows and MacOS (interactive or batch mode)

Free and open source, can be downloaded from cran.r-project.org

Wide range of packages (base & contributed), novel methods available

Concise grammar & good structure (function, data object, methods and class)

Help from manuals and email group

Can be slow and demanding of time and memory (this can be overcome by parallel computation and/or integration with C)

Popular: used by 70~80% of statisticians

Page 3: R GWAS Packages

R Task Views

http://cran.r-project.org/web/views/

Page 4: R GWAS Packages

Statistical Genetics Packages in R

http://cran.r-project.org/web/views/Genetics.html

Population Genetics: genetics (basic), Geneland (spatial structures of genetic data), rmetasim (population genetics simulations), hapsim (simulation), popgen (clustering SNP genotype data and SNP simulation), hierfstat (hierarchical F-statistics of genetic data), hwde (modeling genotypic disequilibria), Biodem (biodemographical analysis), kinship (pedigree analysis), adegenet (population structure), ape & apTreeshape (phylogenetic and evolution analyses), ouch (Ornstein-Uhlenbeck models), PHYLOGR (simulation and GLS model), stepwise (recombination breakpoints)

Linkage and Association: gap (both population and family data, sample size calculations, probability of familial disease aggregation, kinship calculation, linkage and association analyses, haplotype frequencies), tdthap (haplotype transmission/disequilibrium tests, TDT for haplotypes), powerpkg (power analyses for the affected sib pair and the TDT design), hapassoc (likelihood inference of trait associations with haplotypes in GLMs), haplo.ccs (haplotype and covariate relative risks in case-control data by weighted logistic regression), haplo.stats (haplotype analysis for unrelated subjects), ldDesign (experiment design for association and LD studies), LDheatmap (heatmap of pairwise LD), mapLD (LD and haplotype blocks), pbatR (R version of PBAT), GenABEL & SNPassoc for GWAS

QTL mapping for the data from experimental crosses: bqtl (inbred crosses and recombinant inbred lines), qtl (genome-wide scans), qtlDesign (designing QTL experiments & power computations), qtlbim (Bayesian Interval QTL Mapping)

Sequence & Array Data Processing: seqinr, BioConductor packages

Page 5: R GWAS Packages

GenABEL

Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. GenABEL: an R package for genome-wide association analysis. Bioinformatics. 2007, 23(10):1294-6.

GenABEL: genome-wide SNP association analysis

A package for genome-wide association analysis between quantitative or binary traits and single-nucleotide polymorphisms (SNPs).

Version: 1.3-5
Depends: R (≥ 2.4.0), methods, genetics, haplo.stats, qvalue, MASS
Date: 2008-02-17
Author: Yurii Aulchenko, with contributions from Maksim Struchalin, Stephan Ripke and Toby Johnson
Maintainer: Yurii Aulchenko <i.aoultchenko at erasmusmc.nl>
License: GPL (≥ 2)
In views: Genetics
CRAN checks: GenABEL results

Page 6: R GWAS Packages

GenABEL: Data Objects

gwaa.data-class

phdata: phenotypic data (data frame)

gtdata: genotypic data (snp.data-class)

load.gwaa.data(phenofile = "pheno.dat", genofile = "geno.raw")

nbytes: number of bytes used to store data on a SNP
nids: number of people
male: male code
idnames: ID names
nsnps: number of SNPs
snpnames: list of SNP names
chromosome: list of chromosomes corresponding to SNPs
coding: list of nucleotide coding for SNP names
strand: strands of the SNPs
map: list of SNPs' positions
gtps: genotypes (snp.mx-class)

snp.data()

convert.snp.text(): from text file (GenABEL default format)
convert.snp.ped(): from Linkage, Merlin, Mach, and similar files
convert.snp.mach(): from Mach format
convert.snp.tped(): from PLINK TPED format
convert.snp.illumina(): from Illumina/Affymetrix-like format

2-bit storage

Genotype code and its 2-bit pattern:
0 = 00
1 = 01
2 = 10
3 = 11

Saves 75% of storage
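The 2-bit idea can be sketched as follows. This is an illustration in Python, not GenABEL's own code: the function names `pack_genotypes` and `unpack_genotypes` are hypothetical, and the bit order within each byte (first genotype in the high bits) is an assumption. Four genotype codes fit in one byte, which is where the 75% saving relative to one byte per genotype comes from.

```python
def pack_genotypes(codes):
    """Pack genotype codes (each 0..3) four per byte, two bits per code."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("genotype code must be 0, 1, 2 or 3")
        shift = 6 - 2 * (i % 4)  # assumed layout: first genotype in the high bits
        packed[i // 4] |= code << shift
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11 for i in range(n)]


codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_genotypes(codes)  # 2 bytes instead of 8
```

Unpacking `packed` with `unpack_genotypes(packed, len(codes))` returns the original list, so the compression is lossless for codes in the range 0 to 3.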

Page 7: R GWAS Packages

GenABEL: Data Manipulation

snp.subset(): subset data by snp names or by QC criteria

add.phdata(): merge extra phenotypic data to the gwaa.data-class.

ztransform(): standard normalization of phenotypes

rntransform(): rank-normalization of phenotypes

npsubtreated(): non-parametric adjustment of phenotypes for medicated subjects
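A rank-based normalization of the kind rntransform() is described as performing can be sketched like this. The sketch is in Python and is not GenABEL's implementation: the function name `rank_normal_transform` is hypothetical, and the quantile offset `(rank - 0.5) / n` and the averaging of tied ranks are assumptions.

```python
from statistics import NormalDist


def rank_normal_transform(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank within the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Assumed offset: (rank - 0.5) / n keeps the quantile strictly inside (0, 1).
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


transformed = rank_normal_transform([3.1, 10.0, 2.0])
```

The transformed phenotype keeps the ordering of the original values but has an approximately normal shape regardless of how skewed the input was.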

Page 8: R GWAS Packages

GenABEL: QC & Summarization

summary.snp.data(): summary of SNP data (number of observed genotypes, call rate, allelic frequency, genotypic distribution, P-value of HWE test)
check.trait(): summary of phenotypic data and outlier check based on a specified p/FDR cut-off
check.marker(): SNP selection based on call rate, allele frequency and deviation from HWE
HWE.show(): showing HWE tables, Chi2 and exact HWE P-values
perid.summary(): call rate and heterozygosity per person

ibs(): matrix of average IBS for a group of people & a given set of SNPs
hom(): average homozygosity (inbreeding) for a set of people, across multiple markers
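As an illustration of the Chi2 HWE check mentioned above, here is a minimal Python sketch of the standard 1-d.f. chi-square test for Hardy-Weinberg equilibrium from genotype counts. It is not taken from GenABEL; the function name `hwe_chi_square` is hypothetical, and the exact test that HWE.show() also reports is not covered.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 d.f.) of observed genotype counts against HWE."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        raise ValueError("monomorphic marker: HWE test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = hwe_chi_square(40, 20, 40)  # strong heterozygote deficit
```

A marker with a very small HWE P-value, as in this example, would typically be flagged during marker QC.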

Page 9: R GWAS Packages

GenABEL: SNP Association Scans

scan.glm(): SNP association test using GLM in R library
scan.glm("y~x1+x2+…+CRSNP", family = gaussian(), data, snpsubset, idsubset)
scan.glm("y~x1+x2+…+CRSNP", family = binomial(), data, snpsubset, idsubset)
scan.glm.2D(): 2-SNP interaction scan

Fast Scan (call C language)

ccfast(): case-control association analysis by computing chi-square test from 2x2 (allelic) or 2x3 (genotypic) tables
emp.ccfast(): genome-wide significance (permutation) for ccfast() scan

qtscore(): association test (GLM) for a trait (quantitative or categorical)
emp.qtscore(): genome-wide significance (permutation) for qtscore() scan

mmscore(): score test for association between a trait and genetic polymorphism, in samples of related individuals (needs stratification variable, scores are computed within strata and then added up)

egscore(): association test, adjusted for possible stratification by principal components of genomic kinship matrix (SNP correlation matrix)
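The 2x2 allelic chi-square test and a permutation-based genome-wide P-value can be sketched as below. This is a Python illustration, not the C code behind ccfast() or emp.ccfast(): both function names are hypothetical, genotypes are assumed to be coded as the count of B alleles (0, 1, 2), and taking the maximum statistic across SNPs in each permutation is one common scheme that is assumed here rather than confirmed for GenABEL.

```python
import math
import random


def allelic_chi_square(case_counts, control_counts):
    """1-d.f. chi-square from the 2x2 allele table.

    Each argument is a sequence of genotype counts (n_AA, n_AB, n_BB).
    """
    a = 2 * case_counts[0] + case_counts[1]        # A alleles in cases
    b = 2 * case_counts[2] + case_counts[1]        # B alleles in cases
    c = 2 * control_counts[0] + control_counts[1]  # A alleles in controls
    d = 2 * control_counts[2] + control_counts[1]  # B alleles in controls
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # empty margin: no evidence either way
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def empirical_genomewide_p(genotypes, status, n_perm=200, seed=1):
    """Permutation P-value for the best SNP in a scan.

    genotypes: one list per SNP holding B-allele counts (0, 1, 2) per person.
    status: one 0 (control) or 1 (case) per person.
    """
    def max_chi2(labels):
        best = 0.0
        for snp in genotypes:
            counts = {0: [0, 0, 0], 1: [0, 0, 0]}
            for g, s in zip(snp, labels):
                counts[s][g] += 1
            best = max(best, allelic_chi_square(counts[1], counts[0])[0])
        return best

    rng = random.Random(seed)
    observed = max_chi2(status)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the genotype-phenotype link
        if max_chi2(labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


chi2, p_value = allelic_chi_square((30, 50, 20), (20, 50, 30))
```

Because every permutation rescans all SNPs, the cost grows as the number of permutations times the cost of one scan, which is why permutation runs take far longer than a single scan.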

Page 10: R GWAS Packages

GenABEL: Haplotype Association Scans

scan.haplo(): haplotype association test using GLM in R library

scan.haplo.2D(): 2-haplotype interaction scan

(haplo.stats package required)

Sliding window strategy

Posterior prob. of Haplotypes via EM algorithm

GLM-based score test for haplotype-trait association (Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. 2002. Score tests for association of traits with haplotypes when linkage phase is ambiguous. Am J Hum Genet 70: 425-434.)

Page 11: R GWAS Packages

GenABEL: GWAS results from scan.glm, scan.haplo, ccfast, qtscore, emp.ccfast, emp.qtscore

scan.gwaa-class

Names:
snpnames: list of names of SNPs tested
P1df: p-values of 1-d.f. (additive or allelic) test for association
P2df: p-values of 2-d.f. (genotypic) test for association
Pc1df: p-values from the 1-d.f. test for association between SNP and trait; the statistic is corrected for possible inflation
effB: effect of the B allele in allelic test
effAB: effect of the AB genotype in genotypic test
effBB: effect of the BB genotype in genotypic test
Map: list of map positions of the SNPs
Chromosome: list of chromosomes the SNPs belong to
Idnames: list of subjects used in analysis
Lambda: inflation factor estimate, as computed using lower portion (say, 90%) of the distribution, and standard error of the estimate
Formula: formula/function used to compute p-values
Family: family of the link function / nature of the test
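The relationship between Lambda and the corrected P-values (Pc1df) can be sketched as follows. Note the substitution: the slide describes an estimate based on the lower portion of the distribution, whereas this Python sketch uses the simpler and widely used median-based estimator, so it illustrates the idea rather than GenABEL's computation. The function name `genomic_control` and the floor of 1.0 on lambda are assumptions.

```python
import math
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 d.f. (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control(chi2_values):
    """Median-based inflation factor and inflation-corrected 1-d.f. P-values."""
    lam = median(chi2_values) / CHI2_1DF_MEDIAN
    lam = max(1.0, lam)  # assumed convention: never inflate the statistics
    corrected_p = [math.erfc(math.sqrt(x / lam / 2)) for x in chi2_values]
    return lam, corrected_p


# Toy statistics whose median is twice the expected median, so lambda is 2.
toy = [CHI2_1DF_MEDIAN * 2 * k for k in (0.5, 1.0, 4.0)]
lam, corrected_p = genomic_control(toy)
```

Dividing each statistic by lambda before computing the P-value is what makes the corrected P-values less significant than the uncorrected ones whenever lambda exceeds 1.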

Page 12: R GWAS Packages

GenABEL: Table & Graphic Functions

descriptives.marker(): table of marker info.
descriptives.trait(): table of trait info.
descriptives.scan(): table of scan results

plot.scan.gwaa(): plot of scan results
plot.check.marker(): plot of marker data (QC etc.)

Page 13: R GWAS Packages

GenABEL: Computer Efficiency

2000 subjects x 500K chip

Memory: ~3.2 G

Loading time: ~4 Min.

SNP summary: ~1 Min.

Call ccfast: ~0.5 Min.

Call qtscore: ~2 Min.

Total: < 10 Min.

Permutation test

N=10,000

73~120 hrs (3~5 days)

Intel Xeon 2.8GHz processor, SuSE Linux 9.2, R 2.4.1

Page 14: R GWAS Packages

SNPassoc

An R package to perform whole genome association studies, Juan R. González, et al. Bioinformatics, 2007, 23(5):654-655

SNPassoc: SNPs-based whole genome association studies

This package carries out most common analysis when performing whole genome association studies. These analyses include descriptive statistics and exploratory analysis of missing values, calculation of Hardy-Weinberg equilibrium, analysis of association based on generalized linear models (either for quantitative or binary traits), and analysis of multiple SNPs (haplotype and epistasis analysis). Permutation test and related tests (sum statistic and truncated product) are also implemented.

Version: 1.4-9
Depends: R (≥ 2.4.0), haplo.stats, survival, mvtnorm
Date: 2007-Oct-16
Author: Juan R González, Lluís Armengol, Elisabet Guinó, Xavier Solé, and Víctor Moreno
Maintainer: Juan R González <jrgonzalez at imim.es>
License: GPL version 2 or newer
URL: http://www.r-project.org and http://davinci.crg.es/estivill_lab/snpassoc
In views: Genetics
CRAN checks: SNPassoc results

Page 15: R GWAS Packages

SNPassoc: Data & Summary

setupSNP(data=snp-pheno.table, info=map.table,colSNPs=, sep = "/", ...)

summary(): allele frequencies, percentage of missing values, HWE test

Page 16: R GWAS Packages

SNPassoc: Association Tests

WGassociation(y~x1+x2, data=, model = (codominant, dominant, recessive, overdominant, log-additive or all), quantitative = , level = 0.95)
scanWGassociation(): only p values
association(): only for selected SNPs; can do stratified and GxE interaction analyses

Results
Summary: a summary table by genes/chromosomes
WGstats: detailed output (case-control numbers, percentages, odds ratios / mean differences, 95% confidence intervals, P-value for the likelihood ratio test of association, AIC, etc.)
Pvalues: a table of p-values for each genetic model for each SNP
Plot: p values in the -log scale, via plot.WGassociation()
Labels: returns the names of the SNPs analyzed
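The five genetic models differ only in how the three genotypes are coded before they enter the regression. The Python sketch below shows the usual codings; it is not SNPassoc code. The names `MODEL_CODING` and `recode` are hypothetical, and treating AA as the reference genotype with B as the variant allele is an assumption.

```python
# Covariate coding of each genotype under each model (AA assumed as reference).
MODEL_CODING = {
    # Two indicator variables (AB, BB): a 2-d.f. test.
    "codominant":   {"AA": (0, 0), "AB": (1, 0), "BB": (0, 1)},
    # One copy of B is enough for the effect.
    "dominant":     {"AA": 0, "AB": 1, "BB": 1},
    # Two copies of B are needed for the effect.
    "recessive":    {"AA": 0, "AB": 0, "BB": 1},
    # Heterozygotes differ from both homozygotes.
    "overdominant": {"AA": 0, "AB": 1, "BB": 0},
    # Each copy of B adds the same amount on the log scale.
    "log-additive": {"AA": 0, "AB": 1, "BB": 2},
}


def recode(genotypes, model):
    """Translate genotype labels into the covariate coding of one model."""
    coding = MODEL_CODING[model]
    return [coding[g] for g in genotypes]


sample = ["AA", "AB", "BB", "AB"]
dominant = recode(sample, "dominant")
additive = recode(sample, "log-additive")
```

Fitting the same GLM once per coding and comparing the resulting p-values or AIC values is the kind of per-model comparison that the p-value table described above makes possible.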

Page 17: R GWAS Packages

SNPassoc: Multiple-SNP Analysis

SNP–SNP Interaction

interactionPval(): epistasis analysis between all pairs of SNPs (and covariates).

Haplotype Analysis

haplo.glm(): using the R package haplo.stats: association analysis of haplotypes with a response via GLM

haplo.interaction(): interactions between haplotypes (and covariates)

Page 18: R GWAS Packages

SNPassoc: Computer Efficiency

1000 subjects X 3000 SNPs

5 min. import data

40 min. setupSNP()

30 min. scanWGassociation(): only p values (including permutation test)

Memory usage: 750 MB