Clustering and optimization in genetic data: the problem of Tag-SNPs selection
Paola Bertolazzi*, Serena D'Aguanno*, Giovanni Felici*, Paola Festa**
* Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", CNR
** Dipartimento di Matematica e Applicazioni "R.M. Caccioppoli", Università degli Studi di Napoli "Federico II"
Summary
• Biological background
  – DNA
  – Chromosomes
  – Haplotypes and Genotypes
  – SNPs
• Haplotype analysis
• Tag SNPs selection
  – Problem definition
  – State of the art
  – Reconstruction Function and Linkage disequilibrium
  – Clustering techniques
  – Set covering techniques
  – Computational results
  – Conclusions and future work
DNA Structure
Double helix (Watson-Crick) of two sequences of nucleotides: A, T, C, G
Base pairs (A-T, G-C) are complementary
A DNA sequence contains regions (e.g. genes, introns, exons) located at the same position of the sequence in each individual of a species
Chromosomes
An individual's genome is organized in chromosomes, i.e. large DNA macromolecules packaged in linear or circular shape
In polyploid organisms multiple copies of each chromosome exist
In diploid organisms (e.g. humans) there are two copies of each chromosome, packaged in linear shape
Each Chromosome includes hundreds of different genes
Four-arm structure during meiosis and mitosis
Haplotypes and genotypes
A single ‘copy’ of a chromosome is called a haplotype, while a description of the mixed data on the two ‘copies’ is called a genotype.
H1       = AATCGCCTTA (maternal chrom.)
H2       = ACACGTCTCA (paternal chrom.)
G(H1,H2) = A A/C T/A C G C/T C T T/C A
• For disease association studies, haplotype data is more valuable than genotype data
• Haplotype data is hard to collect
• Genotype data is easy to collect
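The relation between two haplotypes and their genotype can be sketched in a few lines of Python. This is only an illustration: the function name `genotype` and the `X/Y` notation for heterozygous sites are choices made here, not part of the original slides.

```python
def genotype(h1: str, h2: str) -> list[str]:
    """Combine two haplotypes into a genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Homozygous sites keep the single allele; heterozygous sites list both.
    return [a if a == b else f"{a}/{b}" for a, b in zip(h1, h2)]


h1 = "AATCGCCTTA"  # maternal chromosome
h2 = "ACACGTCTCA"  # paternal chromosome
print(" ".join(genotype(h1, h2)))  # A A/C T/A C G C/T C T T/C A
```

The genotype only records which alleles are present at each site. It does not record which chromosome carries which allele, which is why haplotypes cannot be read off directly from genotype data.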
SNPs
All humans are about 99.9% identical at the DNA level.
Diversity? Polymorphism.
A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
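As a small illustration of this definition, the sketch below scans a set of aligned sequences and reports the sites where at least two nucleotides each reach the frequency threshold. The function name `find_snps`, the parameter `min_freq` and the toy sequences are assumptions made for the example.

```python
from collections import Counter


def find_snps(sequences, min_freq=0.05):
    """Return the sites where at least two nucleotides each reach min_freq."""
    n = len(sequences)
    snps = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        # Nucleotides that are frequent enough at this site.
        common = [nt for nt, c in counts.items() if c / n >= min_freq]
        if len(common) >= 2:
            snps.append(pos)
    return snps


population = ["AATCG", "AATCG", "ACTCG", "ACTCA"]  # toy data, not from HapMap
print(find_snps(population))  # [1, 4]
```

A variant seen in only one sequence out of forty (2.5%) would not be reported with the default 5% threshold.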
[Figure: aligned DNA sequences from several individuals; the sequences are identical except at a few single-nucleotide sites (A/G, T/G, A/C, C/T), which are the SNPs]
Haplotype analysis 1/2
[Figure: the same individuals shown through their SNP values only]
Haplotype analysis* focuses on haplotypes and genotypes that are sequences of SNPs
*http://www.hapmap.org/
Haplotype analysis 2/2
To reduce prohibitively expensive haplotyping costs, a two-stage methodology has been proposed [1]
• Pilot Study
  – All SNPs of interest are genotyped in a small sample of the population
  – Common haplotypes are inferred using statistical methods
  – A set of tag SNPs is selected for the population study
• Population Study
  – Tag SNPs are genotyped in the remaining population
  – Statistical methods are used to infer haplotypes over the tag SNPs
  – Haplotypes over the tag SNPs are extrapolated to full haplotypes
• Two problems:
  – Find a set of tag SNPs of minimum cardinality
  – Find a reconstruction function
Tag SNPs Selection: methods and models
1. Methods that find a minimum set of clusters of SNPs in high correlation (e.g. linkage disequilibrium) with each other (the clusters are called blocks). SNP prediction should be easier within a block
2. Methods that, given the block structure (based on correlation or on proximity), find a minimum set of SNPs able to distinguish each pair of haplotypes in a block; or that assume the number of tag SNPs is given and find a set of tag SNPs which can reconstruct the haplotype of an unknown sample with high accuracy
Tag SNPs Selection: Problem definition
Problem Definition
• Given a population of N haplotypes over M SNPs, find a small set of SNPs (Tag SNPs) such that all the values of the other SNPs can be derived, with some reconstruction rule, from the values of the Tag SNPs.
Two aspects:
(1) Find a reconstruction function
(2) Find a set of minimum cardinality that can reconstruct the other SNPs using (1)
And also:
(3) Given (1) and (2), is there a proper way to identify blocks?
Tag SNPs Selection: Problem definition
The Approaches
• Use a reconstruction function based on SNPs similarity
Method 1
• Cluster the SNPs according to a proper metric
• Select the centroid of each cluster as a TAG SNP
Method 2
• Select a subset of SNPs that are able to differentiate each pair of haplotypes (Set Covering formulation)
• Both methods are consistent with the adopted reconstruction function
• The performance in reconstruction can be used to derive the blocks ex-post
The reconstruction function
The “Majority Vote”
1. Given
   – the set of TAG SNPs
   – a training set T of haplotypes of which we know the value of all the SNPs
   – a new haplotype H of which we know only the value of the TAG SNPs
2. Let S be the set of haplotypes in T that have the same values as H on the TAG SNPs
3. For each non-TAG SNP, determine its most frequent value in S and use it as the prediction of the value of this SNP in H
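The majority vote rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the string encoding of haplotypes and the choice to return `None` when S is empty are assumptions made here.

```python
from collections import Counter


def majority_vote(training, tags, new_tag_values):
    """Reconstruct a full haplotype from its TAG SNP values.

    training:        list of equal-length haplotype strings (the set T)
    tags:            list of TAG SNP positions
    new_tag_values:  dict {tag position: allele} for the new haplotype H
    """
    # Step 2: S = training haplotypes that agree with H on every TAG SNP.
    matches = [h for h in training
               if all(h[t] == new_tag_values[t] for t in tags)]
    result = []
    for pos in range(len(training[0])):
        if pos in new_tag_values:
            result.append(new_tag_values[pos])  # known value
        elif matches:
            # Step 3: most frequent value of this SNP in S.
            result.append(Counter(h[pos] for h in matches).most_common(1)[0][0])
        else:
            result.append(None)  # assumption: no prediction when S is empty
    return result


T = ["ACGT", "ACGA", "TGCA", "ACGT"]  # toy training set
print(majority_vote(T, tags=[0], new_tag_values={0: "A"}))  # ['A', 'C', 'G', 'T']
```

In the example, three training haplotypes share the value `A` on the single TAG SNP, and the last position is predicted as `T` because it appears in two of those three.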
The reconstruction function
• The majority vote rule is based on the assumption that the TAG SNPs characterize the haplotype almost completely
• Ideally, if two haplotypes are equal on the TAG SNPs, then they are also equal on the other SNPs
Method 1: SNPs Clustering
• Clustering : find groups of elements with high dissimilarity between groups and small dissimilarity within each group w.r.t. a chosen distance function
• Main Assumption: TAG SNPs are those that are very similar to many other SNPs in the Training Data
1. Cluster the SNPs in the haplotype space using Hamming Distance (HD) with the k-means algorithm, for a proper value of k
2. Select k TAG SNPs as those closest to the HD-centroid of each cluster
3. Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the Majority Rule
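A possible sketch of the clustering step is given below. It treats each SNP as a vector of its values across the training haplotypes and runs a k-means style loop with Hamming distance and modal centroids (often called k-modes). The function names, the iteration limit, the seeding and the tie-breaking rules are assumptions made for the example, not details taken from the slides.

```python
import random
from collections import Counter


def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def assign(columns, centroids):
    """Put every SNP column into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for j, col in enumerate(columns):
        nearest = min(range(len(centroids)), key=lambda c: hamming(col, centroids[c]))
        clusters[nearest].append(j)
    return clusters


def select_tags_by_clustering(haplotypes, k, iterations=20, seed=0):
    """Cluster the SNP columns and return one TAG SNP index per cluster."""
    rng = random.Random(seed)
    n = len(haplotypes)
    # One vector per SNP: its values across all haplotypes.
    columns = [tuple(h[j] for h in haplotypes) for j in range(len(haplotypes[0]))]
    # Random starting centroids, drawn among distinct columns.
    centroids = rng.sample(sorted(set(columns)), k)
    for _ in range(iterations):
        clusters = assign(columns, centroids)
        new_centroids = []
        for c, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[c])  # leave empty clusters as they are
                continue
            # Modal centroid: most frequent value for each haplotype.
            new_centroids.append(tuple(
                Counter(columns[j][i] for j in members).most_common(1)[0][0]
                for i in range(n)))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    clusters = assign(columns, centroids)
    # TAG SNP of each non-empty cluster: the member closest to the centroid.
    return sorted(min(members, key=lambda j: hamming(columns[j], centroids[c]))
                  for c, members in enumerate(clusters) if members)


# Toy data: SNPs 0-2 behave alike, SNPs 3-5 behave alike.
H = ["000000", "000111", "111000", "111111"]
print(select_tags_by_clustering(H, k=2))  # [0, 3]
```

With the toy data, one representative is chosen from each group of identical columns, so two TAG SNPs are enough to recover all six.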
Method 2: Set Covering Model
The “classical” model: find a minimal subset of TAG SNPs such that each pair of haplotypes in the training set differs in the value of at least one TAG SNP

$$a_{ij}^{k} = \begin{cases} 1 & \text{if haplotypes } i \text{ and } j \text{ differ on SNP } k \\ 0 & \text{otherwise} \end{cases}$$

$$\min \sum_{k} x_k \quad \text{s.t.} \quad \sum_{k} a_{ij}^{k}\, x_k \ge 1 \;\; \forall\, i \neq j, \qquad x_k \in \{0,1\}$$

1. Select the SNPs associated with x_k = 1 in the solution of the SC problem
2. Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the Majority Rule
The above problem cannot be solved optimally for realistic sizes
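To make the set covering model concrete, the sketch below builds the pairs of haplotypes separated by each SNP and then applies the standard greedy set-cover heuristic. This is a swapped-in technique for illustration only: it is not the exact integer program nor the authors' heuristic, and the function name `greedy_tag_cover` is hypothetical.

```python
from itertools import combinations


def greedy_tag_cover(haplotypes):
    """Greedily pick SNPs until every separable pair of haplotypes differs on a tag."""
    m = len(haplotypes[0])
    # pairs_by_snp[k] = set of haplotype pairs (i, j) that differ on SNP k.
    pairs_by_snp = [set() for _ in range(m)]
    for i, j in combinations(range(len(haplotypes)), 2):
        for k in range(m):
            if haplotypes[i][k] != haplotypes[j][k]:
                pairs_by_snp[k].add((i, j))
    # Pairs of identical haplotypes can never be separated, so they are left out.
    uncovered = set().union(*pairs_by_snp)
    tags = []
    while uncovered:
        # Pick the SNP that separates the most still-unseparated pairs.
        best = max(range(m), key=lambda k: len(pairs_by_snp[k] & uncovered))
        tags.append(best)
        uncovered -= pairs_by_snp[best]
    return sorted(tags)


H = ["000", "011", "101", "110"]  # toy haplotypes
print(greedy_tag_cover(H))  # [0, 1]
```

The number of pairs grows quadratically with the number of haplotypes, which is one reason why the exact model becomes hard on realistic data. The greedy choice gives a feasible cover but no guarantee of minimality.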
Variants of the Set Covering Model
• The SC problem has a number of constraints quadratic in the number of haplotypes
• We use variations of the SC model (SCV) that make it possible to control the number of TAGs and their quality in a more effective way
• We use an iterative heuristic based on reduced costs
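Here α is the level of differentiation between haplotypes and β the number of TAG SNPs.

Maximize the capacity to differentiate between haplotypes for a given number of TAG SNPs:

$$\max \alpha \quad \text{s.t.} \quad \sum_{k} a_{ij}^{k}\, x_k \ge \alpha \;\; \forall\, i \neq j, \qquad \sum_{k} x_k \le \beta, \qquad x_k \in \{0,1\}, \qquad \alpha \ge 0$$

Minimize the number of TAGs for a given level of differentiation between haplotypes:

$$\min \beta \quad \text{s.t.} \quad \sum_{k} a_{ij}^{k}\, x_k \ge \alpha \;\; \forall\, i \neq j, \qquad \sum_{k} x_k \le \beta, \qquad x_k \in \{0,1\}, \qquad \beta \ge 0$$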
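The "maximize differentiation for a given number of TAG SNPs" variant can be illustrated on a tiny instance by exhaustive search. The sketch assumes a budget of `beta` tags and measures, for each candidate set, the smallest number of tags on which any two haplotypes differ. It is only meant to show what the objective computes: enumeration is hopeless at realistic sizes, and the name `max_alpha_tags` is hypothetical.

```python
from itertools import combinations


def max_alpha_tags(haplotypes, beta):
    """Try every set of beta SNPs; return (best alpha, chosen SNP indices).

    alpha is the minimum, over all pairs of haplotypes, of the number of
    chosen SNPs on which the two haplotypes differ.
    """
    m = len(haplotypes[0])
    pairs = list(combinations(range(len(haplotypes)), 2))
    best_alpha, best_tags = -1, None
    for tags in combinations(range(m), beta):
        alpha = min(sum(haplotypes[i][k] != haplotypes[j][k] for k in tags)
                    for i, j in pairs)
        if alpha > best_alpha:
            best_alpha, best_tags = alpha, tags
    return best_alpha, list(best_tags)


H = ["0000", "0111", "1011", "1101"]  # toy haplotypes
print(max_alpha_tags(H, beta=2))  # (1, [0, 1])
print(max_alpha_tags(H, beta=3))  # (2, [0, 1, 2])
```

Using exactly `beta` tags rather than at most `beta` loses nothing here, because adding a tag can never reduce the number of differences between two haplotypes.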
Some Remarks
• A good estimate of the number of TAG SNPs to be used in the model can be found efficiently by measuring the quality of the clusters for different values of k
• The quality of the two methods (Clustering and Set Covering) can be compared directly using TAG SNP sets of the same size
SC is still not tractable if all SNPs are used (most literature uses the first 1000-1500 SNPs):
1. Start with the centroids of the clustering
2. Add columns with pricing until the LP is optimal
3. Add columns with a metric on SNPs until the objective function increases
4. Solve the IP
Computational results
International HapMap Project
Data on chromosome 21 of the human genome
YRI: Yoruba in Ibadan, Nigeria
JPT: Japanese in Tokyo, Japan
CHB: Han Chinese in Beijing, China
CEU: Utah residents with Northern and Western European ancestry

Population   # haplotypes   # SNPs
YRI          120            38,852
JPT+CHB      180            33,878
CEU          120            34,103
Experiments Setting
a) Limited to the first block of 1500 SNPs (as in related literature), or
b) Using all SNPs (about 40,000)
c) Used clustering with standard HD with modal centroids and random starting centroids
d) Used SCV with fixed β, using iterative heuristics based on reduced costs, solved with CPLEX
e) Reconstruction with majority rule
f) Quality of reconstruction: if the SNP value is coherent in more than 70% of the matching haplotypes (set S), then predict, else declare undetermined
g) 2/3 of haplotypes used for training, 1/3 for testing
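The evaluation protocol in points e) to g) can be sketched as below. The 0.7 threshold comes from the setting above; the function names and the way the percentages are normalized (over all non-TAG SNP values of the test haplotypes) are assumptions made for this example and may differ from how the reported figures were computed.

```python
from collections import Counter


def predict_site(matches, pos, threshold=0.7):
    """Predict one non-TAG SNP, or None when agreement in S is not above the threshold."""
    if not matches:
        return None
    value, count = Counter(h[pos] for h in matches).most_common(1)[0]
    return value if count / len(matches) > threshold else None


def evaluate(train, test, tags, threshold=0.7):
    """Return (% wrong, % undetermined) over all non-TAG SNP values in the test set."""
    wrong = undecided = total = 0
    non_tags = [p for p in range(len(train[0])) if p not in tags]
    for h in test:
        # S: training haplotypes that agree with h on every TAG SNP.
        matches = [t for t in train if all(t[k] == h[k] for k in tags)]
        for pos in non_tags:
            total += 1
            guess = predict_site(matches, pos, threshold)
            if guess is None:
                undecided += 1
            elif guess != h[pos]:
                wrong += 1
    return 100 * wrong / total, 100 * undecided / total


train = ["ACGT", "ACGA", "ACGT", "TGCA"]  # toy data
test = ["ACGT", "TGCT"]
print(evaluate(train, test, tags=[0]))
```

In this toy run, one of the six predicted values is wrong and one is left undetermined, because only two of the three matching haplotypes agree on it. In the experiments the split between `train` and `test` would be 2/3 and 1/3 of the haplotypes.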
Set Covering results, 1500 SNPs, 0.7 majority threshold
DATASET   beta   alpha   %error   %undecided   %correct columns   %wrong columns
CEU       9      1       20.8     19.33        14.01              39.5
CEU       20     4       20.24    33.29        13.31              25.33
YRI       13     1       21.71    17.11        16.54              40.34
YRI       17     8       18.75    13.33        21.98              28.66
JPT+CHB   9      0       16.18    23.47        17.5               39.57
JPT+CHB   20     4       27.47    10.2         18.58              21.55

Set Covering results, ALL SNPs, 0.7 majority threshold
DATASET   beta   alpha   %error   %undecided   %correct columns   %wrong columns
CEU       13     2       26.88    14.83        13.27              47.33
YRI       20     4       24.92    13.27        13.2               42.11
JPT+CHB   20     2       26.16    17.04        13.09              50.77
Clustering results, 1500 SNPs, 0.7 majority threshold
DATASET   beta   iterations   %error   %undecided   %correct columns   %wrong columns
CEU       20     12           17.5     14.16        14.52              26.41
YRI       17     11           18.53    11.3         21.82              28.58
JPT+CHB   20     11           17.47    10.76        18.58              21.55

Clustering results, ALL SNPs, 0.7 majority threshold
DATASET   beta   iterations   %error   %undecided   %correct columns   %wrong columns
CEU       20     9            28.49    15.46        11.76              47.98
YRI       17     5            25.63    15.61        12.64              45.04
JPT+CHB   20     10           26.1     17.66        12.7               50.89
Observations
Reconstruction error in the range of 20% of the SNPs, improving on previous results (where comparable)
1. The SCV method performs better than clustering, especially when all SNPs are used
2. Best results are obtained with approx. 30 TAG SNPs. Larger values do not reduce the reconstruction error and slow down the computation
3. First time so many SNPs are treated simultaneously
4. Completely correct SNPs are in the range 10-20%
With 30 TAGs we can correctly reconstruct 6000 SNPs…
Work in Progress
Use the proposed method to identify the blocks:
1. Use all SNPs on the Training Set
2. Apply SCV to select TAG SNPs
3. Apply the majority rule to the test set and select those SNPs that are predicted correctly all over the test set
4. Create one block with these SNPs, associate them to the TAG set, and remove these SNPs from the samples
5. Iterate until the sample contains only TAG SNPs or no improvement is obtained
… Preliminary results are encouraging
… Larger data sets are needed in order to test the method properly