

EBST ver 1.1 User Manual

2014.11.4

Shuichi Kitada¹, Reiichiro Nakamichi¹, Toshihide Kitakado¹ and Hirohisa Kishino²

¹ Graduate School of Marine Biosciences, Tokyo University of Marine Science and Technology, Minato, Tokyo 108-8477, Japan
² Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo, Tokyo 113-8657, Japan

EBFST ver 1.1 is software for estimating FST by an empirical Bayes procedure (Kitada et al. 2007). It runs under Microsoft Windows (64-bit, XP or later) with the R statistical computing environment (64-bit, ver 3.0.0 or later). R must therefore be installed on your computer before you run this software. The original R code, which includes AD Model Builder (ADMB) modules (http://admb-project.org/) for numerical maximization of the marginal likelihood function, was written by TK. RN and HK made the changes needed for ver 1.1.

Changes made from ver 1.0 are:
- Estimation of normalized pairwise FST added
- 64-bit memory extension to handle a large number of loci
- Output format fixed
- Performance improved

The software provides the maximum likelihood estimates of global FST and of the rate of gene flow over populations, and generates posterior distributions of pairwise FST between two populations. It calculates the posterior mean and standard deviation of pairwise FST, and the probabilities of pairwise FST being smaller than threshold values. It accepts text files in two data formats, the GENEPOP format (Raymond and Rousset 1995; Rousset 2008) and the frequency format, and can be applied to haplotype/genotype data derived from common genetic markers, including mitochondrial DNA (mtDNA), isozymes, microsatellites and single nucleotide polymorphisms (SNPs).


1 Downloading and installing EBST

To download EBST, click “Installer Package” on the EBST website. Extract the ZIP file and execute EBFST_V11.exe. This will create a work folder (default is C:\EBFST_V11), which contains a sample data folder. The data folder contains two sample data files: example.txt and example_genepop.txt. You may create a shortcut to the EBST icon and place it on your desktop.

2 Input data files

EBST accepts text files in two different formats: frequency and GENEPOP. Use the frequency format for mtDNA haplotype frequency data. For allozyme, microsatellite, and SNP genotype data, it is convenient to use a GENEPOP format file, which EBST automatically converts into the frequency format before performing calculations. Microsatellite data of the Japanese Spanish mackerel (Nakajima et al. 2014) can also be downloaded from the EBST website and from Dryad at http://datadryad.org/resource/doi:10.5061/dryad.b66rc.

2.1 Haplotype frequency data

Haplotype count data should be entered in the format shown below and saved as a text file. In this example, there are 15 geographical samples and 94 haplotypes obtained from the mtDNA D-loop region of Japanese Spanish mackerel.

#The number of subpopulations
15
#The number of loci
1
#Locus 1
0 0 0 3 0 0 0 0 0 0 0 0 5 0 0 1 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
… …
0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


2.2 Genotype data

An example of genotype data saved as a text file in GENEPOP format is shown below. This is compatible with GENEPOP 4.3 (Rousset 2014). The first line is used for information about your data. Locus names can be given one per line (USE NO SPACE after each locus name), or on the same line separated by commas. Pop is the sample indicator and is not case-sensitive (POP, Pop, pop, pOP, …). Each sample from a different geographical origin is declared by a line with a pop statement (USE NO SPACE after each pop statement).

"GenePop file, with 16 samples, 5 loci (1,654 individuals)"
Sni13
Sni21
Sni24
Sni26
Sni29
Pop
Osaka_Bay_2001, 119123 143145 141147 129145 148190
Osaka_Bay_2001, 119119 143145 135135 125129 148148
…
POP
Harimanada_2001, 119119 145145 139171 123133 148148
Harimanada_2001, 119121 145147 139145 127127 148154
…

OR

"GenePop file, with 16 samples, 5 loci (1,654 individuals)"
Sni13, Sni21, Sni24, Sni26, Sni29
PoP
Osaka_Bay_2001, 119123 143145 141147 129145 148190
Osaka_Bay_2001, 119119 143145 135135 125129 148148
…

Here, Osaka_Bay_2001 is an identifier for the individual. You can use any character in the identifier except a comma. The comma between the identifier and the list of genotypes is required. The first number, 119123, indicates that this individual is heterozygous for the 119 and 123 alleles at the first locus. Alleles are numbered from 01 to 99, or from 001 to 999 if needed. In 3-digit coding, homozygotes for the 90 allele are written 090090, not 9090 as in the 2-digit format. 2-digit and 3-digit coding of alleles can be intermixed (among loci, not within loci).
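The genotype codes described in this section can be decoded mechanically. The following Python sketch is an illustration only and is not part of EBST; the function name `parse_genepop_individual` is hypothetical. It splits one individual line at the required comma and cuts each 4- or 6-digit code into two alleles of equal width, treats 2- and 3-digit codes as haploid, and maps all-zero alleles to `None` as missing.

```python
def parse_genepop_individual(line):
    """Split one GENEPOP individual line into (identifier, genotypes).

    Hypothetical helper for illustration; EBST does its own conversion.
    """
    # The comma between the identifier and the genotypes is required.
    identifier, comma, genotype_text = line.partition(",")
    if not comma:
        raise ValueError("missing comma after the individual identifier")

    genotypes = []
    for code in genotype_text.split():
        if not code.isdigit():
            raise ValueError(f"non-numeric genotype code: {code!r}")
        if len(code) in (4, 6):        # diploid: two alleles of equal width
            half = len(code) // 2
            alleles = (code[:half], code[half:])
        elif len(code) in (2, 3):      # haploid: a single allele
            alleles = (code,)
        else:
            raise ValueError(f"unexpected genotype code: {code!r}")
        # Alleles written as 00 or 000 are missing.
        genotypes.append(tuple(None if set(a) == {"0"} else a for a in alleles))
    return identifier.strip(), genotypes


ident, genos = parse_genepop_individual(
    "Osaka_Bay_2001, 119123 143145 141147 129145 148190")
print(ident, genos[0])
```

Alleles are kept as strings so that the leading zero in a code such as 090 is preserved.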


Haploid and diploid data can be intermixed. 6-digit genotypes are recognized as 3-digit diploid genotypes; 4-digit genotypes are recognized as 2-digit diploid genotypes; 2- and 3-digit genotypes are recognized as haploid genotypes. The same coding should be used consistently within each locus. Missing alleles should be indicated with 00 (or 000 for 3-digit coding) and not with blanks. No information (coded 0000 or 000000) and partial information (only one allele is determined, e.g., 0200, 0010, 128000, 000162) are both accepted.

[CAUTION for SNP data] EBFST measures genetic differentiation between populations based on common genetic variants. When the mean frequency of the major allele over the populations is greater than 99% and the mean frequency of the minor allele is less than 1%, EBFST does not perform the calculation. Remove such loci from the input data file, as they might not reflect genetic differentiation between populations.

3 Running EBST

To run EBST, click the Windows Start button and select “EBST” from the Program menu. Alternatively, you can double-click the EBST icon on your desktop.

3.1 Reading the data

After starting the program, the dialog box shown in Fig. 1 will appear. If your data file is in the data folder in GENEPOP format, select the “GenePop Format File” button, click the “File Select” button and select the file to be analyzed (e.g., example_genepop.txt). If your data are already saved in frequency format, select the “frequency Format File” button and select the file (e.g., example.txt) (Fig. 2).
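Before selecting a data file, it can help to screen SNP loci against the 99% rule in the caution above. The sketch below is a rough illustration, not EBST's own check: the function names are hypothetical, and it assumes that "mean frequency over the populations" means the unweighted average of the per-population allele frequencies.

```python
def mean_major_allele_frequency(counts_by_population):
    """Return the largest mean allele frequency at one locus.

    counts_by_population: one dict per population mapping allele -> count.
    Assumption: the mean is an unweighted average over populations.
    """
    totals = [sum(pop.values()) for pop in counts_by_population]
    sampled = [(pop, n) for pop, n in zip(counts_by_population, totals) if n > 0]
    if not sampled:
        raise ValueError("no population has data at this locus")

    alleles = {a for pop, _ in sampled for a in pop}
    mean_freq = {
        a: sum(pop.get(a, 0) / n for pop, n in sampled) / len(sampled)
        for a in alleles
    }
    return max(mean_freq.values())


def should_remove_locus(counts_by_population, limit=0.99):
    """True when the major allele's mean frequency exceeds the limit."""
    return mean_major_allele_frequency(counts_by_population) > limit


# Hypothetical allele counts at two SNP loci in two populations
nearly_fixed = [{"A": 100}, {"A": 99, "G": 1}]
polymorphic = [{"A": 60, "G": 40}, {"A": 70, "G": 30}]
print(should_remove_locus(nearly_fixed), should_remove_locus(polymorphic))
```

For a biallelic SNP the minor allele's mean frequency is one minus the major allele's, so testing the major allele alone covers both conditions in the caution.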


Fig. 1: The EBST dialog box. Reading in GENEPOP format data. The user specifies the appropriate directory and the data file to be read

Fig. 2: Reading in frequency format data


3.2 Estimation of global FST and pairwise FST

To estimate global and pairwise FST values, select whether FST values are specific to each locus, common to loci, or both (Fig. 2). “Locus specific” is recommended. If “both” is selected, results will be output for both assumptions. Click “Calculate”, which will bring up the message box displayed in Fig. 3. Clicking “OK” will cause the R console window to open automatically. To start the calculations, press Ctrl-v, which will paste a non-displayed “null” R command at the R prompt (>) (Fig. 4). At the end of the calculations, the R prompt (>) will be redisplayed in the R console window (Fig. 5).

Fig. 3: R start-confirmation message box

Fig. 4: R console window (start screen). To begin calculations, press Ctrl-v to paste a “null” R command (not displayed) at the R prompt (>)


Fig. 5: R console window after completion of calculations


3.3 Displaying results

When calculations are finished, clicking “Result Log” on the EBST dialog box (Fig. 6) will display the results of the estimation. The global FST estimate and the rate of gene flow (i.e., the sum of the hyperparameters) are shown with standard errors (SE) in parentheses. Matrices of the posterior means of pairwise FST and their SDs are also shown for each locus. A result file named results.txt, which can be renamed, is output to the EBST_V11 folder. Also created in the EBST_V11 folder is a text file named mean.pairwiseFst.specific.txt, which contains 1,000 simulated pairwise FST values averaged over loci for all pairs of subpopulations. It is recommended that you save this file for later plotting of pairwise FST posterior distributions (see section 3.4).

Click the “Fst matrix” button to show the matrix of posterior mean pairwise FST estimates averaged over loci (simple mean) and the matrix of their standard deviations (SD). Normalized mean pairwise FST estimates and SDs are also shown (see Appendix). The normalized mean shrinks toward the simple mean. Therefore, we recommend first using the simple mean to describe the population structure.

Fig. 6: EBST dialog box after completion of calculations


3.4 Graphical presentation of FST posterior distributions

After calculations are complete, you can enter information in the “Posterior distribution” portion of the dialog box (Fig. 6). Highlight the pair(s) for which you wish to see pairwise FST posterior distributions and set the minimum (Min) and maximum (Max) values for the horizontal axis scale. Clicking “Display Graphs” will display the message box shown in Fig. 7. Click “OK”, and when the R prompt (>) appears in the R console window, press Ctrl-v (see Fig. 4). To save the resulting graph, activate the graphic window in R and use the “File” pull-down menu to save the file to a designated folder in a suitable format, such as jpeg (Fig. 8).

Because of graphic window space limitations, pairwise FST posterior distributions may not be displayed simultaneously for all pairs of samples. In such cases, you can plot the posterior distributions in R or another software package by using the data saved to the text file “mean.pairwiseFst.specific.txt” (see section 3.3).
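As a rough illustration of plotting outside EBST, the Python sketch below bins a set of simulated pairwise FST values into a histogram between user-chosen Min and Max values and prints it as text. The function name `histogram_counts` is hypothetical, and the values are randomly generated stand-ins: reading mean.pairwiseFst.specific.txt is left out because its column layout is not described in this manual.

```python
import random


def histogram_counts(draws, lo, hi, bins=20):
    """Count values in equal-width bins on [lo, hi]; values outside are ignored."""
    if hi <= lo or bins < 1:
        raise ValueError("need lo < hi and at least one bin")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in draws:
        if lo <= x <= hi:
            # A value equal to hi goes into the last bin.
            counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts


# Hypothetical stand-in for the 1,000 simulated values of one population pair
rng = random.Random(1)
draws = [rng.betavariate(2, 200) for _ in range(1000)]

lo, hi, bins = 0.0, 0.05, 10   # play the role of Min and Max in the dialog box
for k, n in enumerate(histogram_counts(draws, lo, hi, bins)):
    left_edge = lo + k * (hi - lo) / bins
    print(f"{left_edge:.3f} {'#' * (n // 10)}")
```

The counts can be passed to any plotting package; the text bars are only there to keep the sketch free of dependencies.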

Fig. 7: Message box instructing the user to paste the “null” R command into the R console window

Fig. 8: Screen snapshot showing graphed pairwise FST posterior distributions


3.5 Calculation of the probability of FST being smaller than threshold values

Threshold values of FST can be used for management purposes, such as the identification of management units. Given a threshold, EBST calculates the probability that pairwise FST is smaller than the threshold for all pairs of populations. After entering up to three threshold values under “Threshold” on the dialog box (Fig. 9), click “Calculate” under “Threshold” and press Ctrl-v at the R prompt (>) to obtain the probabilities for these threshold values for each population pair (Fig. 10).
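One natural way to obtain such a probability from simulated posterior values is to take the fraction of values that fall below the threshold. The Python sketch below illustrates this idea under that assumption; it is not taken from the EBST code, the function name `prob_below` is hypothetical, and the values are made up.

```python
def prob_below(draws, thresholds):
    """Fraction of simulated FST values smaller than each threshold.

    Assumption: the posterior probability is approximated by the
    proportion of simulated values below the threshold.
    """
    if not draws:
        raise ValueError("no simulated values supplied")
    n = len(draws)
    return {t: sum(1 for x in draws if x < t) / n for t in thresholds}


# Hypothetical simulated pairwise FST values for one population pair
draws = [0.001, 0.002, 0.006, 0.012]
probabilities = prob_below(draws, thresholds=(0.005, 0.01, 0.02))
for threshold, p in probabilities.items():
    print(f"P(FST < {threshold}) = {p:.2f}")
```

With the 1,000 simulated values per pair that EBST saves, the resolution of such an estimate would be 0.001.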

Fig. 9: EBST dialog box showing user-entered threshold values for calculation of FST probability


Fig. 10: R console window after calculation of the probabilities that FST is smaller than the specified threshold values

4 Closing EBST

When you are finished running EBST, click “Exit” on the EBST dialog box and close R.


References

Kitada S, Kitakado T, Kishino H (2007) Empirical Bayes inference of pairwise FST and its distribution in the genome. Genetics, 177, 861–873.

Nakajima K, Kitada S, Habara Y, Sano S, Yokoyama E, Sugaya T, Iwamoto A, Kishino H, Hamasaki K (2014) Genetic effects of marine stock enhancement: a case study based on the highly piscivorous Japanese Spanish mackerel. Canadian Journal of Fisheries and Aquatic Sciences, 71, 301–314.

Raymond M, Rousset F (1995) GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. Journal of Heredity, 86, 248–249.

Rousset F (2008) Genepop'007: a complete reimplementation of the Genepop software for Windows and Linux. Molecular Ecology Resources, 8, 103–106.

Rousset F (2014) Genepop 4.3 for Windows/Linux/Mac OS X. This documentation: July 8, 2014.


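Appendix

Posterior and normalized means of pairwise FST averaged over loci

Let $F_{\mathrm{ST},i,j}^{(l)}$ be the pairwise FST estimate (posterior mean of 1,000 FST estimates) between population i and population j at locus l, and let L be the number of loci. The mean of the pairwise FST averaged over loci is calculated as

$$F_{\mathrm{ST},i,j}^{(\bullet)} = \frac{1}{L} \sum_{l=1}^{L} F_{\mathrm{ST},i,j}^{(l)} .$$

The standard error is calculated as

$$SE\left(F_{\mathrm{ST},i,j}^{(\bullet)}\right) = \sqrt{\frac{1}{L^{2}} \sum_{l=1}^{L} \left(F_{\mathrm{ST},i,j}^{(l)} - F_{\mathrm{ST},i,j}^{(\bullet)}\right)^{2}} .$$

We define the relative pairwise FST at locus l as

$$F_{\mathrm{ST},i,j}^{(l,N)} = \frac{F_{\mathrm{ST},i,j}^{(l)}}{F_{\mathrm{ST},\bullet,\bullet}^{(l)}} .$$

Here,

$$F_{\mathrm{ST},\bullet,\bullet}^{(l)} = \frac{1}{{}_{S}C_{2}} \sum_{i=2}^{S} \sum_{j=1}^{i-1} F_{\mathrm{ST},i,j}^{(l)}$$

is the mean pairwise FST averaged over all pairs at locus l, and S is the number of populations sampled. The normalized mean of the pairwise FST averaged over loci is obtained as

$$F_{\mathrm{ST},i,j}^{(\bullet,N)} = F_{\mathrm{ST},\bullet,\bullet}^{(\bullet)} \times \frac{1}{L} \times \sum_{l=1}^{L} F_{\mathrm{ST},i,j}^{(l,N)} ,$$

where

$$F_{\mathrm{ST},\bullet,\bullet}^{(\bullet)} = \frac{1}{L} \sum_{l=1}^{L} F_{\mathrm{ST},\bullet,\bullet}^{(l)}$$

is the global mean over all pairs and loci.

The standard error is calculated as

$$SE\left(F_{\mathrm{ST},i,j}^{(\bullet,N)}\right) = F_{\mathrm{ST},\bullet,\bullet}^{(\bullet)} \times \sqrt{\frac{1}{L^{2}} \sum_{l=1}^{L} \left(F_{\mathrm{ST},i,j}^{(l,N)} - F_{\mathrm{ST},i,j}^{(\bullet,N)}\right)^{2}} .$$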
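The averaging steps of this appendix can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the EBST code: the function name and the input layout (one dict of pairwise values per locus) are hypothetical, the standard error is taken as the square root of the sum of squared deviations divided by L squared, and the standard error of the normalized mean is left out.

```python
import math


def average_over_loci(fst):
    """Simple mean, its SE, and normalized mean of pairwise FST over loci.

    fst: list with one dict per locus, mapping a population pair (i, j)
    to the posterior mean pairwise FST at that locus (hypothetical layout).
    """
    n_loci = len(fst)
    pairs = list(fst[0])

    # Simple mean over loci, and its standard error (assumed square-root form).
    simple = {p: sum(f[p] for f in fst) / n_loci for p in pairs}
    simple_se = {
        p: math.sqrt(sum((f[p] - simple[p]) ** 2 for f in fst)) / n_loci
        for p in pairs
    }

    # Mean over all pairs at each locus, then the global mean over loci.
    locus_mean = [sum(f[p] for p in pairs) / len(pairs) for f in fst]
    if any(m == 0 for m in locus_mean):
        raise ValueError("a locus has zero mean pairwise FST")
    global_mean = sum(locus_mean) / n_loci

    # Relative FST per locus, averaged over loci and rescaled by the global mean.
    normalized = {
        p: global_mean * sum(f[p] / m for f, m in zip(fst, locus_mean)) / n_loci
        for p in pairs
    }
    return simple, simple_se, normalized


# Hypothetical posterior means for three populations at two loci
example = [
    {(2, 1): 0.010, (3, 1): 0.030, (3, 2): 0.020},
    {(2, 1): 0.004, (3, 1): 0.002, (3, 2): 0.006},
]
simple, simple_se, normalized = average_over_loci(example)
for pair in simple:
    print(pair, round(simple[pair], 5), round(simple_se[pair], 5),
          round(normalized[pair], 5))
```

Dividing by each locus's own mean gives every locus equal weight in the normalized mean, whereas the simple mean is dominated by loci with large FST values.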