BIOINFORMATIC TRICKS FOR THE ANALYSIS...

Post on 31-Jul-2020

BIOINFORMATIC TRICKS FOR THE ANALYSIS OF ANCIENT DNA

TATIANA TATARINOVA

WHAT HAVE BEEN SEQUENCED?

KHAZARIA: WHERE AND WHEN?

• HOW NOW THE PROPHETIC OLEG PREPARES

• TO TAKE VENGEANCE ON THE UNREASONING KHAZARS*:

• THEIR VILLAGES AND FIELDS, FOR A VIOLENT RAID,

• HE HAS DOOMED TO SWORDS AND FIRES.

*Khazars: a nomadic people who once lived in the south of Russia.

KHAZARIAN PUZZLE

The Khazars were first mentioned by several Arabic historians in the 8th century AD, and last mentioned in the 13th century, as one of the peoples conquered by Batu Khan.

CLAIMS OF ASHKENAZI CONNECTION?

ARTHUR KOESTLER, THE THIRTEENTH TRIBE: THE KHAZAR EMPIRE AND ITS HERITAGE. HUTCHINSON OF LONDON, LONDON, 1976

Lev Gumilev, Discovery of Khazaria

No written sources from Khazaria other than three manuscripts in ancient Hebrew

One of the rulers was called Joseph

Jewish artefacts

Legends about Jewish practices

THE MISSING LINK OF JEWISH EUROPEAN ANCESTRY: CONTRASTING THE RHINELAND AND THE KHAZARIAN HYPOTHESES, BY ERAN ELHAIK, GENOME BIOLOGY AND EVOLUTION, 2013

SO, WHO IS MR. KHAZAR?

aDNA may provide answers to this historic riddle

Ingredient 1: high quality input

Stepped grave vs Niche grave

ANCIENT DNA (ADNA) IS THUS EXPECTED TO REVOLUTIONIZE EVOLUTIONARY GENETICS IN THE SAME MANNER THAT A SYSTEMATIC APPROACH TO THE ANALYSIS OF FOSSIL RECORDS REVOLUTIONIZED PALEONTOLOGY: IT IS A DIRECT WINDOW INTO THE PAST, A “TIME CAPSULE”.

RECENTLY DNA SAMPLES WERE OBTAINED FROM NEANDERTHAL, DENISOVA, MAMMOTH, PALEO-HORSE, ANCIENT SEEDS ETC.

Many of the questions we addressed in this paper:

Toward high-resolution population genomics using archaeological samples

Irina Morozova, Pavel Flegontov, Alexander Mikheyev, Hosseinali Asgharian, Petr Ponomarenko, Vladimir Klyuchnikov, GaneshPrasad ArunKumar, Sergey Bruskin, Egor Prokhortchouk, Yuriy Gankin, Evgeny Rogaev, Yuri Nikolsky, Ancha Baranova, Eran Elhaik, Tatiana V. Tatarinova, DNA Research 2016

Ingredient 2: high-quality sequencing

GENOTYPING VS SEQUENCING

SEQUENCING: Potentially, every position on a genome is studied. However, quality is variable and lower than for the SNP chip (0.1% error is achieved for 75% of bases). Some areas require read depths of 100 or more.

GENOTYPING: Large but limited number of high-quality calls. 1 million SNPs can be genotyped for $100. Error rate (wrong calls) <1% (reported by 23andMe).

Der Sarkissian et al. 2015

http://mammoth.psu.edu/hair.html

QUALITY ASSESSMENT

• NUMBER OF SNPS 124,780,238

• QUALITY (Q) RANGE: 3.01 TO 226.77

• MEAN Q 14.14, MEDIAN Q 7.80

• AVERAGE DEPTH OF COVERAGE 5

• COVERED 1-2% OF GENOME
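To put the Q values above in perspective, here is a minimal sketch that converts them into error probabilities. It assumes the reported Q is Phred-scaled (P = 10^(-Q/10)), which the slide does not state explicitly; the function name is illustrative.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Quality values quoted on the slide (ASSUMED here to be Phred-scaled).
slide_q = {"min": 3.01, "median": 7.80, "mean": 14.14, "max": 226.77}

error_prob = {name: phred_to_error_probability(q) for name, q in slide_q.items()}

for name, p in error_prob.items():
    print(f"{name:>6}: Q = {slide_q[name]:7.2f} -> P(error) = {p:.3g}")
```

Under that assumption, a median Q below 10 would mean that more than half of the calls carry an error probability above 10%, which is in line with the warning that individual SNPs cannot be trusted.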

Consider 300 Bronze Age genomes published in 2014-2015:

• Allentoft et al. 2015 (RISE*)

• Haak, Lazaridis et al. Nature 2015 (I0*)

• Gamba et al. Nature Communications 2014 (I1*)

• Mathieson et al. 2015 (I*)

SYSTEMATIC ERRORS

Samples I and RISE: both sequenced on HiSeq

RISE: whole genome; I: targeted

SO, WE DEAL WITH POOR QUALITY

• LOW COVERAGE

• POOR QUALITY

• INSUFFICIENT NUMBER OF SNPS PER INDIVIDUAL

• DIFFERENT GROUPS GET DIFFERENT RESULTS FROM THE SAME SAMPLES

• INDIVIDUAL SNPS CANNOT BE TRUSTED!

APPROACH: AGGREGATION

• ADMIXTURE

• GPS

• PATHWAYS

• USING PROBABILITY TO MODEL

• LOCALLY AGGREGATED ANCESTRY RLAI

To infer population structure from genotype data, it is necessary to first reduce the dimensionality of the dataset, because it encompasses thousands of SNPs.

From SNPs to Admixture

Thousands of SNPs

Component            HGDP00985  HGDP01094  HGDP00982
North East Asian     0.5253     0.04       0.0102
Mediterranean        0.0202     0.04       0.1531
South African        0          0          0.0306
South West Asian     0.2222     0.03       0.0714
Native American      0.0404     0.83       0.0408
Oceanian             0.0101     0          0
South East Asian     0.0101     0.01       0.0102
Northern European    0.1717     0.05       0.2041
Sub-Saharan African  0          0          0.4796

ADMIXTURE

Admixture proportions in geographically adjacent populations, such as Italian and Greeks, and populations sharing similar history, like British and Germans, are similar.

GPS ORIGIN PREDICTION

ΔGEO = α × ΔGEN + β
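The linear relation between genetic and geographic distance can be illustrated with an ordinary least-squares fit. This is only a sketch: the distance pairs below are made-up numbers, and `fit_line` and `predict_geo` are hypothetical helper names, not part of the published GPS software.

```python
def fit_line(x, y):
    """Ordinary least squares for y = alpha * x + beta."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta

# HYPOTHETICAL pairs of reference populations: genetic distance between
# their admixture vectors and geographic distance in km (invented values).
delta_gen = [0.05, 0.10, 0.20, 0.40]
delta_geo = [350.0, 600.0, 1100.0, 2100.0]

alpha, beta = fit_line(delta_gen, delta_geo)

def predict_geo(d_gen: float) -> float:
    """Translate a genetic distance into a geographic one with the fitted line."""
    return alpha * d_gen + beta

print(f"alpha = {alpha:.1f} km per unit of genetic distance, beta = {beta:.1f} km")
```

Once α and β are estimated from reference populations, a genetic distance measured for a new sample can be turned into an approximate geographic distance from its nearest reference.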

APPLICATION OF GPS TO ADNA (BRONZE AGE)

30 OUT OF 100 BRONZE AGE SAMPLES (ALLENTOFT ET AL. 2015) HAD OVER 500 ANCESTRY INFORMATIVE MARKERS.

WE APPLIED THE GPS ALGORITHM TO FIND THE CLOSEST MODERN POPULATION.
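Finding the closest modern population can be sketched as a nearest-neighbour search over admixture vectors. The example assumes Euclidean distance and uses invented three-component vectors with placeholder population names; the real analysis uses nine components and a curated reference panel.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two admixture vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_population(sample, references):
    """Return (name, distance) of the reference closest to the sample."""
    return min(
        ((name, euclidean(sample, vec)) for name, vec in references.items()),
        key=lambda pair: pair[1],
    )

# HYPOTHETICAL reference panel (three components only, invented values).
references = {
    "PopA": [0.7, 0.2, 0.1],
    "PopB": [0.1, 0.8, 0.1],
    "PopC": [0.2, 0.2, 0.6],
}
sample = [0.6, 0.3, 0.1]  # hypothetical ancient individual

name, distance = nearest_population(sample, references)
print(f"nearest reference: {name}, distance = {distance:.3f}")
```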

GPS accurately assigned:

• ~100% of all individuals to their continental regions

• 80% of all individuals to their country of origin

• 60% of all individuals to their inner-country region

PCA

[Figure: stacked bar chart, axis from 0% to 100%, showing admixture components (NorthEastAsian, Mediterranean, SouthAfrican, SouthWestAsian, NativeAmerican, Oceanian, SouthEastAsian, NorthernEuropean, SubsaharanAfrican) for ancient samples I1280, RISE568, I0550, I0246, RISE546, I0060, I0115, I0235, RISE240, RISE154, I0805, RISE562, I1530, I1281, I1303 and modern populations British, Tatars, Poland, Chuvashs, Bulgarians, Ashkenazi_Poland, Northern Caucasian, Nogais, Sephardic Jews]

SNPS PER PATHWAYS

Changes in biological pathways during 6,000 years of civilization in Europe, Chekalin et al., 2018, Molecular Biology and Evolution

• READMIX WAS DEVELOPED TO TREAT INDIVIDUALS OF MIXED ORIGIN; IT REPRESENTS AN INDIVIDUAL AS A LINEAR COMBINATION OF ADMIXTURE VECTORS OF REFERENCE POPULATIONS

• EXAMPLE: 30% BRITISH + 10% RUSSIAN + 60% CHINESE

• $P = a_1 r_{s_1} + a_2 r_{s_2} + \dots + a_p r_{s_p} + \text{error}$
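The linear-combination idea in the bullets above can be made concrete with a small sketch. The reference vectors are invented three-component values (the real ones have nine components), and `mix` and `residual` are illustrative names, not reAdmix functions.

```python
def mix(proportions, references):
    """Weighted sum of reference admixture vectors: sum_i a_i * r_{s_i}."""
    k = len(next(iter(references.values())))
    combined = [0.0] * k
    for population, weight in proportions.items():
        for j, value in enumerate(references[population]):
            combined[j] += weight * value
    return combined

def residual(observed, modelled):
    """Euclidean norm of the error term between observed and modelled vectors."""
    return sum((o - m) ** 2 for o, m in zip(observed, modelled)) ** 0.5

# HYPOTHETICAL admixture vectors for three reference populations.
references = {
    "British": [0.8, 0.2, 0.0],
    "Russian": [0.6, 0.3, 0.1],
    "Chinese": [0.0, 0.1, 0.9],
}
# The mixture named on the slide: 30% British + 10% Russian + 60% Chinese.
proportions = {"British": 0.3, "Russian": 0.1, "Chinese": 0.6}

modelled = mix(proportions, references)
observed = [0.32, 0.14, 0.54]  # hypothetical noisy measurement of the individual
print("modelled:", [round(x, 3) for x in modelled])
print("residual:", round(residual(observed, modelled), 4))
```

reAdmix works in the opposite direction: given only the observed vector, it searches for the populations and coefficients that keep this residual small.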

More complex cases?

reAdmix

HOW IT WORKS

• WE ASSUME RIGHT AWAY THAT THE GIVEN ANCIENT PROPORTIONS CONTAIN ERROR

• START WITH A GUESS POPULATION

• ADD/REMOVE POPULATIONS TO ACHIEVE OPTIMAL FIT

• CONDITIONAL OPTIMIZATION (SUCH AS “I KNOW THAT THERE WAS A JEWISH ANCESTOR SOMEWHERE IN MY PEDIGREE”)

READMIX APPROACH

Aim: to find the smallest subset of modern populations whose combined admixture components are similar to those of the individual within a small tolerance margin.

The algorithm consists of three phases:

1. Iteratively build the first candidate solution and improve it.

2. Generate the predefined number M of additional candidate solutions randomly and apply the Differential Evolution (DEEP).

3. Identify the populations that have stable membership in the solution across the set, that is, are part of the solution in at least 75% of cases.
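Phase 3 above, the stable-membership filter, can be sketched as a simple count over candidate solutions. The candidate sets are invented for illustration and `stable_populations` is a hypothetical helper, not the reAdmix implementation.

```python
from collections import Counter

def stable_populations(candidate_solutions, threshold=0.75):
    """Populations that occur in at least `threshold` of the candidate solutions."""
    counts = Counter(
        population
        for solution in candidate_solutions
        for population in set(solution)  # count each population once per solution
    )
    n = len(candidate_solutions)
    return sorted(p for p, c in counts.items() if c / n >= threshold)

# HYPOTHETICAL candidate solutions, e.g. produced by phases 1 and 2.
candidates = [
    ["Yakut", "Even"],
    ["Yakut", "Evenk"],
    ["Yakut", "Even"],
    ["Yakut", "Even", "Shor"],
]

stable = stable_populations(candidates)
print(stable)
```

Here "Yakut" appears in all four candidates and "Even" in three of four (exactly 75%), so both are retained, while "Evenk" and "Shor" are dropped.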

Let $R = \{r_i\}_{i=1..I}$ be the set of modern populations, where $r_i = (r_{i,1}, \dots, r_{i,K})$ and $K$ is the dimension ($K = 9$).

We seek two sets, $S = (s_1, \dots, s_p)$ and $A = (a_1, \dots, a_p)$, where $s_i$ are the indices of modern populations and $a_i$ are the coefficients of modern populations in the approximation of each test vector $T$:

$$T \approx \sum_{i=1}^{p} a_i \, r_{s_i}$$

SOHN ET AL. (2012) BENCHMARK

• 2 COMPONENTS

• 4 COMPONENTS

4-dim space: European, African, Native American and East Asian

Color coding: red = European, green = African, yellow = Native American, blue = East Asian, white = unassigned

reAdmix

RLAI (ROBUST INFERENCE OF LOCAL ANCESTRY)

RLAI METHOD

In every window find the most similar position

COMPARISON WITH OTHERS

• LAMP: PROBABILITY OF A SEGMENT TO BELONG TO A SPECIFIC POPULATION

• LAMP-ANC: MODIFICATION OF LAMP, SKIPPING ESTIMATION OF ANCESTRAL ALLELES, THEREFORE MORE RELIABLE

• RFMIX: TREATS ORIGIN AS A HIDDEN PARAMETER

COMPARISON

• RFMIX HAS THE HIGHEST ACCURACY FOR THE EUROPE-JAPAN AND EUROPE-AFRICA MIXES. TRIPLE MIXES SHOW A DROP IN QUALITY

RLAI

• RLAI SHOWS ACCURACY ABOVE 0.9 FOR ALL MIXES INCLUDING TRIPLE

RLAI ACCURACY AS A FUNCTION OF GENERATIONS

ZOOMING IN AND OUT

Unique ID: I0047
Part: Tooth
Archaeological culture: Central_LNBA
Reference: Haak, Lazaridis et al. Nature 2015
Date (2-sigma): 2111-1891 cal BCE
Min Age: 4037
Max Age: 3952
Location: Halberstadt-Sonntagsfeld
Country: Germany
Lat: 51.89
Lon: 11.04
Coverage: 1.655
SNPs: 836,247
Sex: F
#reads: 17,431,013
mtDNA haplogroup: V9
% endogenous: 0.449

CORDED WARE ANALYSIS

Das et al. 2016; Behar et al. 2013

Sample  Bone                          Gender  Age    Century   Location                  Race
67      Left humerus                  M       35-40  IX        Martynovsky district      mongoloid
166     Left femur                    F       25-30  VIII-IX   Martynovsky district      mongoloid
531     Right tibia and left ulna     M       35-40  VIII-IX   Dubovsky district         europoid
619     Left femur                    M       35-40  VII-VIII  Dubovsky district         mongoloid
656     Right tibia                   M       30-35  VII-VIII  Dubovsky district         europoid (?)
1251    Left humerus                  M       40     IX        Zimovnikovsky district    undefined
1564    Left tibia                    M       25-35  VIII-IX   Belokalitvinsky district  europoid (?)
1566    Right humerus                 M       35-40  VIII-IX   Belokalitvinsky district  undefined
1986    Right humerus and left tibia  M       35-45  VIII-X    Orlovsky district         europoid (?)

NINE 8TH-9TH CENTURY GENOMES OF KHAZARS

• DNA extraction conducted in two labs independently

• Sequencing performed by Dr. Mikheyev (OIST)

• Test all samples on MiSeq and the best samples on HiSeq

• 0.32-0.48 of human genome covered

• Average depth of coverage ~0.75X
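As a rough sanity check on how depth relates to the fraction of the genome covered, the sketch below uses the idealised Poisson model, in which the expected breadth at mean depth c is 1 - e^(-c). The model assumes reads fall uniformly at random, which ancient DNA libraries generally do not satisfy, so it gives an upper-bound intuition rather than a prediction for these samples.

```python
import math

def expected_breadth(mean_depth: float) -> float:
    """Expected fraction of positions covered at least once under a Poisson model."""
    return 1.0 - math.exp(-mean_depth)

def depth_for_breadth(breadth: float) -> float:
    """Mean depth at which the Poisson model predicts the given breadth."""
    return -math.log(1.0 - breadth)

mean_depth = 0.75  # average depth of coverage quoted on the slide
print(f"idealised breadth at {mean_depth}X: {expected_breadth(mean_depth):.3f}")
for breadth in (0.32, 0.48):  # covered fractions quoted on the slide
    print(f"breadth {breadth:.2f} corresponds to ~{depth_for_breadth(breadth):.2f}X")
```

The idealised breadth at 0.75X is somewhat higher than the 0.32-0.48 reported, which would be consistent with uneven read placement, although the slide itself does not discuss the cause.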

BIOINFORMATICS PROCEDURES

• USING MULTIPLE PIPELINES IN PARALLEL

• PALEOMIX (SCHUBERT ET AL., CHARACTERIZATION OF ANCIENT AND MODERN GENOMES BY SNP DETECTION AND PHYLOGENOMIC AND METAGENOMIC ANALYSIS USING PALEOMIX. NAT PROTOC. 2014)

• MAPDAMAGE, SCHMUTZI, ANGSD, FASTQC, CUTADAPT, FOLLOWED BY GATK

• PILEUPCALLER (HTTP://STEPHANSCHIFFELS.DE/SOFTWARE/)

MTDNA

Sample  Coverage  Haplogroup
67      30.69     D4e5
166     62.11     C4
531     5.43      X2e2
619     7.51      H1a3
656     30.86     C4a1
1251    71.07     H5b
1564    86.44     H13c1
1566    31.29     D4b1a1a
1986    38.52     C4a1c

Using BAM Analysis Kit

YDNA

• 619 - Q

• 1986 - R1A

• 1251 - R1A

• 656 - C3

NGSADMIX ANALYSIS

ANCESTRY INFORMATIVE MARKERS ADMIXTURE

Sample  Q20 DP2  Q30 DP2  Q20 DP3  Q30 DP3
1251    7057     6715     1347     1140
1566    7404     7115     1380     1158
1564    3448     3274     512      439
166     6538     6289     1113     927
1986    10049    9572     2273     1886
656     877      858      57       47
531     389      385      8        7
67      1166     1152     79       75
619     1041     1036     37       35
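The counts above shrink quickly as the thresholds tighten. A minimal sketch of that kind of filter follows, assuming that "Q20 DP2" means quality of at least 20 and read depth of at least 2 (the slide does not define the labels); the calls are invented.

```python
def count_passing(calls, min_qual, min_depth):
    """Count marker calls whose quality and depth both reach the thresholds."""
    return sum(1 for qual, depth in calls if qual >= min_qual and depth >= min_depth)

# HYPOTHETICAL (quality, depth) pairs for ancestry informative markers.
calls = [(35.0, 3), (22.0, 2), (31.0, 2), (18.0, 4), (40.0, 1)]

# Threshold combinations named in the table header.
thresholds = [(20, 2), (30, 2), (20, 3), (30, 3)]
counts = {(q, d): count_passing(calls, q, d) for q, d in thresholds}

for (q, d), n in counts.items():
    print(f"Q{q} DP{d}: {n} markers")
```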

GPS ANALYSIS, MODERN AND ANCIENT SAMPLES AS REFERENCE

GPS algorithm: Elhaik, Tatarinova et al (2014)

SAMPLE  NEAREST MODERN  DISTANCE  NEAREST ANCIENT     DISTANCE
1251    Tajik           0.18      Steppe MLBA         0.06
1564    Lebanese        0.16      Levant BA           0.19
1566    Yakut           0.04      Pazyryk IA (Altai)  0.27
166     Evenk           0.09      Pazyryk IA (Altai)  0.52
1986    Shor            0.10      Pazyryk IA (Altai)  0.14
531     Ishkasim        0.19      Early Sarmatian IA  0.17
619     Turkmen         0.26      Pazyryk IA (Altai)  0.40
656     Kazakh          0.29      Pazyryk IA (Altai)  0.41
67      Khanty          0.12      Pazyryk IA (Altai)  0.16

READMIX ANALYSIS, MODERN REFERENCE

Sample  Populations                                 Proportions
1251    Turkmen, Abkhazian, Belarusian, Yizu        0.494, 0.142, 0.238, 0.127
1564    Druze                                       0.649
1566    Yakut                                       1
166     Yakut, Even Sakha                           0.637, 0.339
1986    Yakut, Saami, Abhaz                         0.368, 0.428, 0.167
531     Yaghnobi (Tajikistan), Kets, Kurmi, Selkup  0.506, 0.053, 0.202, 0.239
619     Egypt, Yakut, Azeri, Yizu                   0.315, 0.183, 0.206, 0.297
656     Mongolian, Even Sakha, Egypt, Yizu          0.369, 0.252, 0.193, 0.187
67      Yakut                                       0.623

READMIX, ANCIENT REFERENCE

Sample  Populations                                                                  Proportions
1251    Steppe Eneolithic, Anatolia Neolithic, SE Iberia CA, Zevakino Chilikta IA    0.624, 0.149, 0.124, 0.103
1564    Peloponnese Neolithic                                                        0.595
1566    Pazyryk IA                                                                   1.000
166     Pazyryk IA                                                                   1.000
1986    Pazyryk IA                                                                   0.832
531     Beaker Central Europe, Armenia MLBA, Yamnaya Ukraine, Maros.SG               0.388, 0.217, 0.258, 0.138
619     Beaker Central Europe, Pazyryk IA, Anatolia Neolithic, Peloponnese Neolithic 0.489, 0.237, 0.175, 0.099
656     Beaker Central Europe, Armenia MLBA, Pazyryk IA                              0.443, 0.299, 0.193
67      Pazyryk IA                                                                   0.875

Ancestry and demography and descendants of Iron Age nomads of the Eurasian Steppe, Martina Unterlander et al, Nature Comm 2017

F3 OUTGROUP

[Figure: outgroup f3 panels for samples 166 and 1564]

OVERLAP WITH THE ASHKENAZI GENOME CONSORTIUM MARKERS

• SEQUENCING AN ASHKENAZI REFERENCE PANEL SUPPORTS POPULATION-TARGETED PERSONAL GENOMICS AND ILLUMINATES JEWISH AND EUROPEAN ORIGINS, SHAI CARMI, KEN Y. HUI,…, ITSIK PE’ER, NATURE COMMUNICATIONS VOLUME 5, ARTICLE NUMBER: 4835 (2014)

Sample  Total    Overlapping positions*  Same allele as in Ashkenazi database
1251    3.1E+09  247                     1
1564    3.1E+09  190                     0
1566    3.1E+09  301                     3
166     3.1E+09  282                     1
1986    3.1E+09  327                     5
531     3.1E+09  21                      0
619     3.1E+09  23                      0
656     3.1E+09  48                      0
67      3.1E+09  70                      0

* Out of 953 mutations in known Ashkenazi genes.
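The share of overlapping positions that carry the same allele as the Ashkenazi database can be computed directly from the counts above. The sketch only restates those numbers as fractions; it is not a significance test.

```python
# Counts copied from the table above, keyed by sample ID.
overlapping = {"1251": 247, "1564": 190, "1566": 301, "166": 282, "1986": 327,
               "531": 21, "619": 23, "656": 48, "67": 70}
same_allele = {"1251": 1, "1564": 0, "1566": 3, "166": 1, "1986": 5,
               "531": 0, "619": 0, "656": 0, "67": 0}

# Per-sample fraction of overlapping positions with the database allele.
fraction = {s: same_allele[s] / overlapping[s] for s in overlapping}

# Pooled fraction across all nine samples.
pooled = sum(same_allele.values()) / sum(overlapping.values())

for sample, value in fraction.items():
    print(f"{sample:>5}: {same_allele[sample]}/{overlapping[sample]} = {value:.4f}")
print(f"pooled: {pooled:.4f}")
```

Even the highest per-sample share (sample 1986) is under 2%.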

ASHKENAZIM AND KHAZARS

1. NO SIGNIFICANT ASHKENAZI GENETIC AFFINITY WAS DETECTED IN ANY OF THE SEQUENCED INDIVIDUALS

2. ALL OF THE STUDIED KHAZARS, EVEN THOSE WITH SIGNIFICANT CAUCASIAN ANCESTRY, HAD SIGNIFICANT ASIATIC NUCLEAR GENETIC CONTRIBUTIONS, WHICH ARE MISSING FROM PRESENT-DAY JEWISH POPULATIONS

3. WHILE LOCAL WOMEN WERE RECRUITED INTO ASHKENAZI COMMUNITIES, NONE OF THE IDENTIFIED MITOCHONDRIAL HAPLOTYPES ARE COMMON IN PRESENT-DAY ASHKENAZI JEWS

4. THE EUROPEAN GENETIC COMPONENTS OF THE KHAZARS DERIVE FROM THE CAUCASUS TRIBES THAT WERE UNDER CONTROL OF THE KHAGANATE, RATHER THAN FROM MORE DISTANT LEVANTINE POPULATIONS MORE CLOSELY RELATED TO ASHKENAZI AND SEPHARDIC JEWS. WHILE JEWS PROBABLY LIVED IN THE TERRITORY OF THE KHAZAR KHAGANATE ALONG WITH CHRISTIANS, MUSLIMS AND PAGANS, IT SEEMS UNLIKELY THAT THEY FORMED ITS RULING CLASSES, WHICH WERE DOMINATED BY STEPPE NOMADS FROM THE EAST, AND THUS THE KHAZARS WERE NOT LIKELY PROGENITORS OF THE ASHKENAZIM.

CONCLUSIONS

• USE MULTIPLE ANALYSIS METHODS FOR VALIDATION

• THERE WERE TWO GROUPS OF KHAZARS, EUROPEAN AND ASIAN; BOTH GROUPS WERE MIXED

• KHAZARS WERE PROBABLY NOT THE DIRECT ANCESTORS OF ASHKENAZI

• NEED MORE MONEY FOR MORE GENOMES

ACKNOWLEDGEMENTS

• OKINAWA: ALEXANDER MIKHEYEV

• ROSTOV: IGOR KORNIENKO, ELENA BATYEVA, VLADIMIR KLYUCHNIKOV

• PETERSBURG: YURI ORLOV, IVAN DMITRIEVSKY

• TOMSK: ALEXEI ZARUBIN

• MOSCOW: NIKITA MOSHKOV