SNPs



Page 1: SNPs

1. What is QuickSNP?

QuickSNP is a web server developed by researchers from the Mood Disorders Program of the Department of Psychiatry and Behavioral Sciences at the Johns Hopkins School of Medicine.

2. What does QuickSNP do?

QuickSNP is a publicly accessible resource for selecting variations (single nucleotide polymorphisms, or SNPs) in the human genome that can be used for association studies in the context of genetic disorders.

3. What are SNPs?

SNP (pronounced “snip”) stands for single nucleotide polymorphism. A single nucleotide polymorphism is defined as a single DNA base substitution that is observed with a frequency of at least 1% in a given population. Millions of SNPs have been cataloged in the human genome, including some that are responsible for disease as well as others that are normal variations.

4. How many SNPs are present in the human genome?

In NCBI dbSNP Build 126, ~12 million human SNPs are listed (http://www.ncbi.nlm.nih.gov/projects/SNP/).

5. What are tagSNPs?

tagSNPs are the subset of SNPs that represent the other SNPs in a given region on the basis of linkage disequilibrium (LD) structure. They contain most of the information that could be gained by genotyping the other surrounding SNPs.

6. Why is it useful to identify tagSNPs?

Empirical studies suggest that much of the human genome can be characterized as blocks of strong linkage disequilibrium. With strong correlation between markers, much of the common haplotype diversity can be represented by a small number of tagSNPs. Therefore, using tagSNPs in association studies allows for a substantial reduction in genotyping costs with only a minimal loss of power.

7. What information does QuickSNP require to report tagSNPs?

To select tagSNPs from the QuickSNP server, the user needs only to enter either a chromosomal position or gene name(s). The gene name(s) can be either pasted directly into the given text box or provided as a separate file with one gene name per line. Chromosomal positions corresponding to UCSC genome build version hg16, hg17 or hg18 can be used (the equivalent NCBI builds are 34, 35 and 36, respectively). For chromosomal positions corresponding to even older builds, the user can go to http://genome.ucsc.edu/cgi-bin/hgLiftOver to perform the conversion manually, and then enter the positions in QuickSNP.

8. What are the advantages of a gene-based approach to tagSNP selection?

The gene-based approach offers a lower cost alternative to choosing anonymous SNPs across a region that has been chosen for fine-mapping. It provides a reasonable way to focus limited resources on the stretches of DNA most likely to contain disease susceptibility variants: the genes and their flanking regions. tagSNPs are selected in and near genes, while intergenic regions are skipped over. Cost savings will, of course, be greater when this approach is employed in gene-poor regions, and lesser when it is employed in gene-rich regions.

9. What is meant by minor allele frequency and why is it important to consider for genotyping studies?

The minor allele frequency (MAF) refers to the frequency at which the less common allele of the SNP occurs in a given population. The lower the minor allele frequency threshold selected, the greater the number of tagSNPs that will be required to represent the variation in a given genomic region. SNPs with a minor allele frequency of 5% or greater were targeted by the HapMap project, so that figure is the default in QuickSNP.
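As a rough illustration of the definition above, the following Python sketch computes the MAF of one biallelic SNP from a list of sampled genotypes and checks it against the 5% default. The input format (two-letter genotype strings) and the function names are assumptions made for this example; they are not part of QuickSNP.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF of one biallelic SNP.

    `genotypes` is a list of two-character strings such as "AG", one per
    sampled individual (a hypothetical input format for this sketch).
    """
    allele_counts = Counter("".join(genotypes))
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    if len(allele_counts) == 1:
        return 0.0  # only one allele seen in this sample
    # For a biallelic SNP the minor allele is the less frequent of the two.
    return min(allele_counts.values()) / total


def passes_maf_threshold(genotypes, threshold=0.05):
    """True if the SNP meets the MAF cut-off (5% is the default named above)."""
    return minor_allele_frequency(genotypes) >= threshold


# 100 individuals = 200 chromosomes; G appears 18 + 2 * 2 = 22 times.
sample = ["AA"] * 80 + ["AG"] * 18 + ["GG"] * 2
print(f"{minor_allele_frequency(sample):.2f}")  # 0.11
print(passes_maf_threshold(sample))             # True
```

Because the value is estimated from a sample, the same SNP can fall on different sides of the threshold in different populations.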

10. What is the importance of r-squared for tagSNP selection?

The tagSNP selection is based on the extent of LD between the pairs of SNPs. This level of LD is measured in terms of an r-squared value and hence, varying values for r-squared will yield varying numbers of SNPs in the QuickSNP output. When r-squared=1 for two SNPs, they are said to be in perfect LD (and these two markers are redundant for genotyping), whereas an r-squared=0 indicates no LD. An r-squared of 0.8 is often used for tagSNP selection, as this value is felt to represent a compromise between completeness and efficiency of coverage of the common variation in the genome.
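To make the r-squared measure concrete, here is a minimal Python sketch that computes it for two biallelic SNPs from phased haplotypes, using the standard definition r² = D² / (pA(1-pA) pB(1-pB)) with D = pAB - pA·pB. It assumes the haplotypes are already phased and coded 0/1; tools working from unphased genotype data must first estimate haplotype frequencies, which is not shown here. The function name is illustrative only.

```python
def r_squared(haplotypes):
    """Pairwise r-squared between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) tuples,
    one per chromosome, with alleles coded 0 or 1 (assumed phased).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r-squared is undefined when a SNP is monomorphic")
    d = p_ab - p_a * p_b                             # LD coefficient D
    return d * d / denominator


perfect_ld = [(0, 0)] * 6 + [(1, 1)] * 4             # alleles always travel together
no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]             # all combinations equally common
print(f"{r_squared(perfect_ld):.2f}")  # 1.00
print(f"{r_squared(no_ld):.2f}")       # 0.00
```

With a threshold of 0.8, the first pair would be treated as redundant for genotyping, while the second pair would each need to be typed.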

11. Why should one consider flanking sequences around genes for tagSNP selection?

Disease-causing variation need not lie in the coding region of a gene. Potential regulatory regions in the 5’ or 3’ flanking regions of the gene, including the promoter, could harbor relevant variants.

12. What does the “financial calculator” option do?

It calculates the cost of performing a genotyping experiment given the number of samples and the number of SNPs included in a given study. QuickSNP automatically calculates an estimated cost for genotyping on the Illumina BeadArray (http://www.illumina.com/) and Applied Biosystems TaqMan (http://www.appliedbiosystems.com/) platforms. In addition, one can enter the cost of genotyping (per SNP) from a different source, and the server will calculate the cost of the entire genotyping experiment for the given region and number of samples.
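The arithmetic behind a user-supplied price can be sketched as follows. This assumes a flat price per SNP per sample, which is a simplification chosen for the example; the FAQ does not describe how QuickSNP models platform pricing, and the numbers below are invented.

```python
def genotyping_cost(num_snps, num_samples, cost_per_genotype):
    """Total cost assuming a flat price for each SNP typed in each sample."""
    if num_snps < 0 or num_samples < 0 or cost_per_genotype < 0:
        raise ValueError("inputs must be non-negative")
    return num_snps * num_samples * cost_per_genotype


# Hypothetical study: 48 tagSNPs typed in 500 samples at 0.10 per genotype.
total = genotyping_cost(num_snps=48, num_samples=500, cost_per_genotype=0.10)
print(f"{total:.2f}")  # 2400.00
```

Since the total scales linearly with the number of SNPs, reducing the SNP list to tagSNPs lowers the cost in direct proportion.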

13. What is meant by the option “force include coding non-synonymous SNPs”?

Coding non-synonymous SNPs are the variations in the gene that bring about a change in the amino acid sequence of the resulting protein. Because of their obvious functional implications, they are excellent candidates for conferring susceptibility to disease. Thus, one may want to include such SNPs in a study, irrespective of whether or not they are selected as tagSNPs. Therefore, we have provided the option in QuickSNP to “force include coding non-synonymous SNPs,” which will result in a list of SNPs comprising the tagSNPs as well as the coding non-synonymous SNPs in the user-specified region. We have a whole genome coding non-synonymous SNP database linked to the server, which will automatically generate these SNPs for the gene(s) or region specified. The user need not manually extract coding non-synonymous SNPs to use this option.
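For readers who want to see what "non-synonymous" means operationally, the sketch below translates a codon before and after a single-base change using the standard genetic code and reports whether the amino acid changes. It is an illustration only: QuickSNP uses its own precomputed database, and the function name and calling convention here are assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (third base varies fastest); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, position, alt_base):
    """Classify a single-base change within one codon.

    `position` is 0, 1 or 2 (index within the codon). A change to or from
    a stop codon is reported as non-synonymous in this simple sketch.
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"


print(classify_coding_snp("GAA", 2, "G"))  # synonymous (GAA and GAG both encode Glu)
print(classify_coding_snp("GAG", 1, "T"))  # non-synonymous (Glu to Val)
```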

14. What is meant by the “inter-SNP distance” option?

It has been observed that SNPs located too close to each other in the genome do not work well on the Illumina BeadArray platform. Thus, we give an option to reject SNPs that are separated by less than a user-specified distance.
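One simple way to enforce a minimum spacing is a greedy pass over SNPs sorted by position, keeping a SNP only if it is far enough from the last one kept. The FAQ does not say which rule QuickSNP actually applies, so the sketch below is an assumed strategy with made-up identifiers and coordinates.

```python
def filter_by_spacing(snps, min_distance):
    """Keep SNPs that lie at least `min_distance` bp from the previously kept SNP.

    `snps` is a list of (snp_id, position) tuples on a single chromosome.
    """
    kept = []
    for snp_id, position in sorted(snps, key=lambda snp: snp[1]):
        if not kept or position - kept[-1][1] >= min_distance:
            kept.append((snp_id, position))
    return kept


# Placeholder IDs and positions, for illustration only.
candidates = [("snpA", 1000), ("snpB", 1030), ("snpC", 1100), ("snpD", 1150)]
print(filter_by_spacing(candidates, min_distance=60))
# [('snpA', 1000), ('snpC', 1100)]
```

A greedy pass like this always favors the leftmost SNP in a cluster; a real tool might instead prefer whichever SNP tags more of its neighbors.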

15. What is the HapMap project?

The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings. Using the information in the HapMap, researchers will be able to find genes that affect health, disease, and individual responses to medications and environmental factors. The Project is a collaboration among scientists and funding agencies from Japan, the United Kingdom, Canada, China, Nigeria, and the United States. All of the information generated by the Project is released into the public domain. The HapMap project has identified common haplotypes in four populations—Utah residents of European ancestry, Yoruba in Nigeria, Han Chinese, and Japanese. It has also identified tagSNPs that uniquely identify these haplotypes.  See http://www.hapmap.org/ for details.

16. Why are there different populations?

Because of the history of the human species, most common haplotypes occur in all human populations. However, any given haplotype may be more common in one population and less common in another, and newer haplotypes may be found in just a single population. For the efficient selection of tagSNPs, it is required to identify haplotypes and compare their frequencies in multiple populations. Also, genetic data from more than one population will enhance the ability of researchers to study the genetic contributions to diseases that are more or less prevalent in different groups. For this reason, genetic data in the HapMap project was gathered from four populations with African, Asian, and European ancestry. Together, these DNA samples should enable HapMap researchers to identify most of the common haplotypes that exist in populations worldwide. The DNA samples have come from a total of 270 people. The Yoruba people of Ibadan, Nigeria (YRI) provided 30 sets of samples from two parents and an adult child (each such set is called a trio). In Japan, 45 unrelated individuals from the Tokyo area provided samples (JPT). In China, 45 unrelated individuals from Beijing provided samples (CHB). Thirty U.S. trios provided samples, which were collected in 1980 from U.S. residents with northern and western European ancestry (CEU).

17. What is the origin of haplotypes?

A very nice explanation can be found at the following web page: http://www.hapmap.org/originhaplotype.html


18. What are “include tags”?

“Include tags” allows one to supply a list of SNPs that must be included among the selected SNPs. The format is simply one SNP ID in each line in a text file, for example:

rs7176429
rs2117655
rs11638486
rs7163473
rs16941002
rs16974121

These SNPs will be chosen as tagSNPs rather than other eligible tag SNPs from the same LD bin. If there are no SNPs in the same LD bin, SNPs listed in the “include tags” will be selected as singletons in the resulting tagSNPs list. Please note that QuickSNP can handle at most 250 include/exclude SNPs at present.

19. What are “exclude tags”?

The opposite of “include tags”; these SNPs will never be selected as tagSNPs. If they occur in an LD bin with other potential tagSNPs, the others will be selected as tags. If they occur as singletons they will simply be ignored. They should be provided as one SNP in each line in a text file, as described above for “include tags” (see FAQ no. 18). Please note that QuickSNP can handle at most 250 include/exclude SNPs at present.
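The include/exclude behavior described in FAQs 18 and 19 can be sketched as below. The LD bins in the example are invented groupings of the SNP IDs shown earlier, the choice of the first eligible SNP as a bin's default tag is arbitrary, and a SNP appearing in both lists is not handled; none of this is taken from QuickSNP's actual implementation.

```python
MAX_TAG_LIST_SIZE = 250  # limit stated in the FAQ for include/exclude lists


def read_snp_list(text):
    """Parse a one-ID-per-line list, ignoring blank lines."""
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    if len(ids) > MAX_TAG_LIST_SIZE:
        raise ValueError(f"at most {MAX_TAG_LIST_SIZE} SNPs are allowed")
    return ids


def choose_tags(ld_bins, include=(), exclude=()):
    """Pick tags per LD bin, honoring include and exclude lists."""
    include = list(include)
    exclude = set(exclude)
    tags = []
    binned = set()
    for bin_snps in ld_bins:
        binned.update(bin_snps)
        forced = [snp for snp in bin_snps if snp in include]
        eligible = [snp for snp in bin_snps if snp not in exclude]
        if forced:
            tags.extend(forced)       # include tags win over other bin members
        elif eligible:
            tags.append(eligible[0])  # arbitrary default: first eligible SNP
        # a bin containing only excluded SNPs contributes nothing
    # include tags that fall in no bin are reported as singletons
    tags.extend(snp for snp in include if snp not in binned)
    return tags


bins = [["rs7176429", "rs2117655"], ["rs11638486", "rs7163473"], ["rs16941002"]]
include = read_snp_list("rs2117655\nrs16974121\n")
exclude = read_snp_list("rs11638486\nrs16941002\n")
print(choose_tags(bins, include, exclude))
# ['rs2117655', 'rs7163473', 'rs16974121']
```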

20. What is meant by occurrence in whole-genome chips & TaqMan assays?

Several commercial genotyping platforms are available that cover nearly the entire human genome. These include Illumina, Applied Biosystems, Affymetrix and others. These companies have developed assays or chips that include a large number of human SNPs. We provide an option to display how many (and which) SNPs in the user’s desired set of genes or genomic region are available on each of these platforms. Depending upon which platform has maximum coverage in the respective region of interest, the user may wish to choose the option to use those commercial genotyping resources.

21. How does graphic visualization work?

We also provide an option in QuickSNP to view the identified tagSNPs in the form of a graphical map, available through the UCSC genome browser. The user only has to click the link in the summary table to see that map, wherein tagSNPs can be viewed along with other sequence annotation features such as genes, transcripts, conserved regions and many others. Note that at present we provide this option for genomic region-based queries (whole regions as well as genes only), but not for gene name queries.

22. Why is it important to consider genome build for SNP selection?


With each released version of the human genomic sequence, the positions of genes, SNPs and other features change. Consequently, entering positions from the wrong build for tagSNP selection in QuickSNP can lead to discrepancies in the results. We therefore provide the option for users to indicate the genome build that they would like to employ. For example, if they are using genomic positions obtained from linkage studies, they should confirm which genome build was used in those studies and enter the correct information in QuickSNP. We accept genomic coordinates for NCBI genome build versions 34, 35 and 36, which correspond to UCSC genome assemblies July 2003 (hg16), May 2004 (hg17) and Mar 2006 (hg18), respectively. For coordinates from even older genome assemblies, one can visit http://genome.ucsc.edu/cgi-bin/hgLiftOver to perform the conversion manually and then enter the information in QuickSNP.
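The build correspondence stated above is easy to encode as a lookup, which can help catch an unsupported build before coordinates are submitted. The dictionary contents come from this FAQ; the function name and error handling are assumptions for the example.

```python
# NCBI build number -> (UCSC assembly name, UCSC release date), as listed in the FAQ.
SUPPORTED_BUILDS = {
    34: ("hg16", "July 2003"),
    35: ("hg17", "May 2004"),
    36: ("hg18", "Mar 2006"),
}


def ucsc_assembly(ncbi_build):
    """Return the UCSC assembly name for a supported NCBI build number."""
    try:
        return SUPPORTED_BUILDS[ncbi_build][0]
    except KeyError:
        raise ValueError(
            f"NCBI build {ncbi_build} is not supported; "
            "convert the coordinates with LiftOver first"
        ) from None


print(ucsc_assembly(35))  # hg17
```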

23. Are there other tools for SNP selection for association studies, and how is QuickSNP different from them?

There are other internet tools available for identification of SNPs for genotyping. These include SNPper (http://snpper.chip.org/), TAMAL (http://neoref.ils.unc.edu/tamal/index.jsp), SNPSelector (http://sky.bsd.uchicago.edu/SNPSelector.html), SNPHunter (http://www.hsph.harvard.edu/ppg/software.htm) and PupasView (http://pupasview.bioinfo.cipf.es/). QuickSNP offers many useful features when compared with these tools:

A gene-centric approach to tagSNP selection
Accepts multiple gene names as inputs
Allows for automatic coordinate conversion between different genome assemblies
Provides the option to include flanking sequence around genes
Provides the option to reject SNPs that are too closely separated (user-specified distance), since these are less likely to work in genotyping experiments
Calculates the financial cost for the genotyping studies
Automatically includes coding non-synonymous SNPs in the region, if specified by the user
For the ‘include tag’ and ‘exclude tag’ options, predetermines which SNPs are present in the HapMap database for the given population, and uses only those (for including or excluding). In other existing tools, the whole search aborts if any include/exclude tag is absent from the HapMap database.
Provides a link to a graphical map of tagSNPs
Reports allele and genotype frequencies for tagSNPs in different populations
Reports the number of SNPs that have available assays or are present on whole genome chips provided by commercial genotyping platforms
Provides a user friendly summary table, and downloadable result files

24. What are the various databases and programs used in QuickSNP?

QuickSNP uses the information content of the following databases:

Database         Content                                             Version
HapMap           Genotypes and allele frequency data                 HapMap data release 21a/phase II Jan07 on NCBI B35 assembly, dbSNP b125
Entrez Gene      Gene annotations                                    NCBI B35 assembly
Genome Browser   SNP annotations                                     May 2004 data release
ABI              Available TaqMan assays                             08_07_2006 release
Illumina         Hap550 & Hap650 whole genome SNP chip annotation    v1.0.0
Affymetrix       Whole genome (500k) SNP chip annotation             May 2006

In addition, the following programs have also been integrated in the server:

Program          Purpose                                             Version
Haploview        tagSNP identification using SNP genotypes           version 3.32
Liftover         Converts genome coordinates between assemblies      Generic

25. How large a region can QuickSNP handle?

Due to memory constraints, at present, QuickSNP can handle a region as big as 5 megabases in size. For searches using gene names, it can handle 40 genes at a time. We will continuously make efforts to improve upon the processing speed, and hence, we may be able to provide an option to use a larger region for tagSNP searches in the near future.

26. Whom do I contact if I have trouble using this server?

There are three levels of help available to users: (1) QuickHelp, which can be accessed by clicking on the [?] symbol next to each option and briefly explains the purpose of that option; (2) the frequently asked questions; and (3) contacting us directly by sending an email message to [email protected].

27. Can I perform multiple QuickSNP searches simultaneously from my computer?

Yes, you can, although it will make all the searches very slow. Thus, we recommend that you perform one search at a time.

28. How do I cite QuickSNP in publications?

The following article should be cited for the use of the QuickSNP web server tool, or data derived from it, in research that will be published in a journal or on the Internet.

QuickSNP: an automated web server for selection of tagSNPs. Grover D, Woodfield AS, Verma R, Zandi PP, Levinson DF, Potash JB. Nucleic Acids Research 2007, 35(Web Server issue):W115-W120.

In addition to the paper, please include a reference to the QuickSNP website in your manuscript: http://bioinformoodics.jhmi.edu/quickSNP.pl.

29. How is QuickSNP 1.1 different from the older version (QuickSNP 1.0)?

QuickSNP 1.0 utilized flat file databases as a data source. These have now been converted into a relational MySQL database in QuickSNP 1.1. As a result, the program has become 3 to 10 times faster (depending on the search criteria specified), and one can now search a 2 Mb region (or 20 genes) in less than a minute under default conditions.

The date of formal release of this newer version of QuickSNP is August 1, 2007.

Single Nucleotide Polymorphism

A single nucleotide polymorphism, or SNP, is the variation of a single nucleotide pair (A-T replaced with C-G, or vice versa) within a defined DNA snippet. This DNA snippet might be a gene (both in coding and non-coding regions of the gene), or a non-gene (intergenic) part of the chromosome. As such, a SNP creates alleles. Most SNPs have only two alleles, for example one with A-T and the other with C-G at the specified location.

Even SNPs within the coding region of a gene don't necessarily create differences in the amino acids coded, because of the degeneracy of the genetic code. A SNP in which the amino acid is not changed is called synonymous, while a SNP in which the amino acid is changed is called non-synonymous. SNPs not in protein-coding regions may still have consequences for gene splicing, transcription factor binding, or the sequence of non-coding RNA.

SNPs are important in the study of population genetics. The minor allele frequency is the proportion of chromosomes in the population that carry the less common variant. By definition, it therefore never exceeds 0.5 (though some people now express it as a percentage instead). Since populations differ, the minor allele frequency can only be stated for a defined population.

What is the significance of minor allele frequency? How are minor allele frequencies used in GWAS in the selection of SNPs? What does a minor allele frequency of zero signify, beyond saying that the population is homozygous for the dominant allele?


Loci are selected for a genotype assay such as a SNP chip because they are expected not to be uniform in the population. Most chips distinguish between two genotypes at a given locus; those are the two alleles. We can estimate the frequency of these alleles in the total population from their frequency in a sample population, such as the HapMap samples. One of these alleles will appear less frequently than the other; that is the minor allele. Typically GWAS are designed to exclude SNPs with a MAF < 5%, as it requires very strong statistical power to make meaningful statements about very rare alleles.

To a first approximation, you've correctly interpreted the result of a MAF of zero. In truth, since we're just estimating the MAF from a sample, there may well be people in the wider population who carry the other allele.
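To see why an observed MAF of zero is only an estimate, consider the chance of sampling no copies of the minor allele at all. If the true frequency is p and the 2n sampled chromosomes are treated as independent draws (an assumption that ignores relatedness and population structure), that chance is (1 - p)^(2n). A short Python sketch, with an illustrative function name:

```python
def prob_no_minor_allele(true_maf, num_individuals):
    """Probability that a sample contains zero copies of the minor allele.

    Treats the 2 * num_individuals chromosomes as independent draws.
    """
    if not 0.0 <= true_maf <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    return (1.0 - true_maf) ** (2 * num_individuals)


# 45 unrelated individuals, the size of one of the smaller HapMap panels.
print(f"{prob_no_minor_allele(0.01, 45):.2f}")  # 0.40
print(f"{prob_no_minor_allele(0.05, 45):.3f}")  # 0.010
```

Under these assumptions, an allele with a true frequency of 1% would be missed entirely in roughly two out of five samples of this size, whereas one at 5% would almost always be seen.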