Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014....
Transcript of Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014....
Efficient and Accurate Clustering for Large-Scale Genetic Mapping
V. Strnadová (Neeley) , Aydın Buluç , Jarrod Chapman , Joseph Gonzalez ,
John Gilbert , Stefanie Jegelka , Daniel Rokhsar , Leonid Oliker
* ++ § ¶
*,++ * *, ¶ §
§++ *,§, ¶ *
Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute
Motivation
• High-throughput sequencing methods have produced a flood of inexpensive genetic information
• Genetic maps are important to breeding studies but genetic mapping software is prohibitively slow on large data sets
The Genetic Mapping Problem
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
(missing data)
Data
The Genetic Mapping Problem
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
(missing data)
𝑚1
𝑚2
𝑚8𝑚9
𝑚15
𝑚3
𝑚4𝑚5
𝑚6
𝑚7𝑚10𝑚11𝑚12
𝑚13𝑚14
Linkage group 1
Linkage group 2
cluster
Data
The Genetic Mapping Problem
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
(missing data)
𝑚1
𝑚2
𝑚8𝑚9
𝑚15
𝑚3
𝑚4𝑚5
𝑚6
𝑚7𝑚10𝑚11𝑚12
𝑚13𝑚14
𝑚8
𝑚15
𝑚2
𝑚1
𝑚9
Linkage group 1
Linkage group 2
cluster
Linkage group 1 Linkage group 2
Data
𝑚11
𝑚6
𝑚13
𝑚3
𝑚12
𝑚10
𝑚4
𝑚7
𝑚5𝑚14
The Genetic Mapping Problem
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
(missing data)
𝒎𝟏
𝒎𝟐
𝒎𝟖𝒎𝟗
𝒎𝟏𝟓
𝒎𝟑
𝒎𝟒𝒎𝟓
𝒎𝟔
𝒎𝟕𝒎𝟏𝟎𝒎𝟏𝟏𝒎𝟏𝟐
𝒎𝟏𝟑𝒎𝟏𝟒
Linkage group 1
Linkage group 2
cluster
Data
𝑚8
𝑚15
𝑚2
𝑚1
𝑚9
Linkage group 1 Linkage group 2
𝑚11
𝑚6
𝑚13
𝑚3
𝑚12
𝑚10
𝑚4
𝑚7
𝑚5𝑚14
The Need for Large-Scale Clustering in Genetic Mapping
• Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers
• A major bottleneck is the linkage-group-finding phase
• Popular mapping tools all handle this phase the same way, with an 𝑂(𝑀2) clustering algorithm for 𝑀 markers
• Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers
• A major bottleneck is the linkage-group-finding phase
• Popular mapping tools all handle this phase the same way, with an 𝑂(𝑀2) clustering algorithm for 𝑀 markers
Our solution: A fast, scalable clustering algorithm tailored to genetic marker data
The Need for Large-Scale Clustering in Genetic Mapping
Standard Approach to Genetic Marker Clustering
cluster
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
Data
𝑚1
𝑚2
𝑚8𝑚9
𝑚15
𝑚3
𝑚4𝑚5
𝑚6
𝑚7𝑚10𝑚11𝑚12
𝑚13𝑚14
Linkage group 1
Linkage group 2
Standard Approach to Genetic Marker Clustering
(1)
𝑚1𝑚2
𝑚8𝑚9
𝑚15
𝑚3
𝑚4 𝑚5𝑚6
𝑚7𝑚10𝑚11
𝑚12
𝑚13𝑚14
Linkage group 1
Linkage group 2
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices
• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked
• (2) Cut all edges below a LOD threshold
• (3) The resulting connected components = linkage groups
Standard Approach to Genetic Marker Clustering
(1)
𝑚1𝑚2
𝑚8𝑚9
𝑚15
𝑚3
𝑚4 𝑚5𝑚6
𝑚7𝑚10𝑚11
𝑚12
𝑚13𝑚14
Linkage group 1
Linkage group 2
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices
• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked
• (2) Cut all edges below a LOD threshold
• (3) The resulting connected components = linkage groups
LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:
𝐿𝑂𝐷(𝑚𝑖 , 𝑚𝑗) = log10𝑃(𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)
𝑃(𝑛𝑜 𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)
Formally,
𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )
𝑁𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗
0.5𝑅𝑖𝑗+𝑁𝑅𝑖𝑗
Where:
𝑅𝑖𝑗 = number of recombinant offspring
𝑁𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅
𝑅+𝑁𝑅
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:
𝐿𝑂𝐷(𝑚𝑖 , 𝑚𝑗) = log10𝑃(𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)
𝑃(𝑛𝑜 𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)
Formally,
𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )
𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗
0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗
Where:
𝑅𝑖𝑗 = number of recombinant offspring
𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅𝑖𝑗
𝑅𝑖𝑗+ 𝑅𝑖𝑗
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:
𝑳𝑶𝑫(𝒎𝒊,𝒎𝒋) = 𝐥𝐨𝐠𝟏𝟎(𝟏 − 𝟏 𝟑)
𝟐( 𝟏 𝟑)𝟏
𝟎. 𝟓𝟑= 𝟎. 𝟎𝟕𝟒
Formally,
𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )
𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗
0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗
Where:
𝑅𝑖𝑗 = number of recombinant offspring
𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅𝑖𝑗
𝑅𝑖𝑗+ 𝑅𝑖𝑗
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
Standard Approach to Genetic Marker Clustering
(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices
• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked
(2) Cut all edges below a LOD threshold
(2)
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
𝑚1𝑚2
𝑚8𝑚9
𝑚15
Linkage group 2
𝑚3
𝑚4 𝑚5𝑚6
𝑚7𝑚10𝑚11
𝑚12
𝑚13𝑚14
Linkage group 1
Standard Approach to Genetic Marker Clustering
(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices
• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked
(2) Cut all edges below a LOD threshold
(3) The resulting connected components = linkage groups
(3)
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
𝑚1𝑚2
𝑚8𝑚9
𝑚15
Linkage group 2
𝑚3
𝑚4 𝑚5𝑚6
𝑚7𝑚10𝑚11
𝑚12
𝑚13𝑚14
Linkage group 1
Our Approach: The BubbleCluster Algorithm
Primary assumption: Clusters have a “linear structure”
Our Approach: The BubbleCluster Algorithm
Primary assumption: Clusters have a “linear structure”
Key idea: Maintain a set of representative or “sketch” points which reveal the cluster structure
LOD threshold
Input: set of markers, LOD threshold 𝜏, non-missing threshold 𝜂, low-quality threshold 𝑐, cluster size threshold 𝜎
Output: set of clusters C and set of representative points R
Phase I: Build initial set of clusters and set of representative points using high-quality markers (those with at least 𝜂 non-missing entries)
Phase II: Add low quality markers (less than 𝜂 non-missing entries) to intialset of clusters
Phase III: Attempt to merge small clusters with large
The BubbleCluster Algorithm: Overview
The BubbleCluster Algorithm
𝒎
Iteration i:find 𝑟𝑀𝐴𝑋 ∶= 𝑟𝑗 for which 𝐿𝑂𝐷(𝑚, 𝑟𝑗) is
maximal;set 𝐶𝑀𝐴𝑋 ≔ 𝐶𝐾 ∈ 𝐶 containing 𝑟𝑀𝐴𝑋
𝐿𝑂𝐷(𝑚, 𝑟𝑗)
𝑟𝑗𝐶1 𝐶2
The BubbleCluster Algorithm
If (𝐿𝑂𝐷 𝑚, 𝑟𝑀𝐴𝑋 < 𝑳𝑶𝑫_𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅)
𝒎𝐿𝑂𝐷(𝑚, 𝑟𝑀𝐴𝑋)
𝑟𝑀𝐴𝑋𝐶1 𝐶2
The BubbleCluster Algorithm
If (𝐿𝑂𝐷 𝑚, 𝑟𝑀𝐴𝑋 < 𝑳𝑶𝑫_𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅)𝐶 = 𝐶 ∪ {𝑚}
𝒎
𝑟𝑀𝐴𝑋𝐶1 𝐶2
𝐶3
The BubbleCluster Algorithm
Else If ( IS_INTERIOR 𝑟𝑀𝐴𝑋 )
𝒎
𝑟𝑀𝐴𝑋
𝐶1𝐶2
𝐿𝑂𝐷(𝑚, 𝑟𝑀𝐴𝑋)
The BubbleCluster Algorithm
Else If ( IS_INTERIOR 𝑟𝑀𝐴𝑋 )𝐶𝑀𝐴𝑋 = 𝐶𝑀𝐴𝑋 ∪ {𝑚}
𝒎
𝑟𝑀𝐴𝑋
𝐶1 = 𝐶𝑀𝐴𝑋𝐶2
Else If ( IS_EXTERIOR 𝑚, 𝑟𝑀𝐴𝑋 )
𝒎
𝐶2𝐶1 = 𝐶𝑀𝐴𝑋
𝑟𝑀𝐴𝑋
The BubbleCluster Algorithm
The BubbleCluster Algorithm
Else If ( IS_EXTERIOR 𝑚, 𝑟𝑀𝐴𝑋 )Add 𝑚 to representative points of 𝐶𝑀𝐴𝑋Add 𝑚 to 𝐶𝑀𝐴𝑋
𝒎
𝐶2𝐶1 = 𝐶𝑀𝐴𝑋
𝑟𝑀𝐴𝑋
Else // 𝑚 is interior to the outer point 𝑟𝑀𝐴𝑋
The BubbleCluster Algorithm
𝒎
𝐶2𝐶1 = 𝐶𝑀𝐴𝑋
𝑟𝑀𝐴𝑋
Else // 𝑚 is interior to the outer point 𝑟𝑀𝐴𝑋Add 𝑚 to 𝐶𝑀𝐴𝑋
The BubbleCluster Algorithm
𝒎
𝐶2𝐶1 = 𝐶𝑀𝐴𝑋
𝑟𝑀𝐴𝑋
The BubbleCluster Algorithm
𝒎
𝐶1 𝐶2
If 𝑚 has a LOD score above the threshold to two clusters,
𝐶𝑁𝐸𝑊 = 𝐶1 ∪ 𝐶2
If 𝑚 has a LOD score above the threshold to two clusters, Then merge the clusters and add m to the merged cluster
𝒎
The BubbleCluster Algorithm
End of Phase IStop when all markers with at least 𝜂 non-missing entries have been processed
Running time: 𝑶 |𝜢| 𝐥𝐨𝐠𝟐 𝚮 + |𝜢||𝑹|where: |𝛨| = size of high-quality marker set,
|𝑅| = size of representative point set
Phase II: add low-quality markers
Phase III: merge small clusters
BubbleCluster Parameters
LOD threshold
Non-missing threshold: determines how many markers are high-quality
Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃)𝑁𝑅𝜃𝑅
0.5𝑅+𝑁𝑅
Highest achievable LOD: lim𝑅→0log10
(1 − 𝜃)𝑁𝑅𝜃𝑅
0.5𝑅+𝑁𝑅= 𝑁𝑅 log10 2
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
𝜏1
𝜏2
LOD threshold
Non-missing threshold: determines how many markers are high-quality
Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )
𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗
0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗
𝜏1
𝜏2
BubbleCluster Parameters
LOD threshold
Non-missing threshold: determines how many markers are high-quality
Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )
𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗
0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗
Example LOD: log10(1 − 1 3)
2( 1 3)1
0.53= 0.074
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
𝜏1
𝜏2
BubbleCluster Parameters
LOD threshold
Non-missing threshold: determines how many markers are high-quality
Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )
𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗
0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗
Highest achievable LOD: lim𝑅𝑖𝑗→0log10
(1 − 𝜃𝑖𝑗) 𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗
0.5𝑅+ 𝑅𝑖𝑗
= 𝑅𝑖𝑗log10 2
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
𝜏1
𝜏2
BubbleCluster Parameters
Evaluation Metric: 𝐹-score
Given a golden standard clustering, the 𝐹-score measures the quality of another clustering by comparing it to the golden standard
Range: 0 – 1
The 𝐹-score is a harmonic mean of precision and recallAn 𝐹-score of 1 indicates perfect precision and perfect recall for every golden
standard cluster
Results: BubbleCluster on Real Data Sets
Dataset Size Time F-Score
Barley 64K 15 sec 0.9993
Switchgrass 113K 8.9 min 0.9745
Switchgrass 548K 1.9 hrs 0.9894
Wheat 1.58M 1.22 hrs N/A *
* Results under review at Genome Biology
Comparison of Clustering Algorithms for Simulated Data
ClusteringMethod
12.5K Markers 25K Markers
F-score Time F-score Time
JoinMap 0.99964 14 min 0.99982 46 min
MSTMap 0.99964 4.5 min 0.99982 20 min
PIC 0.47024 11 sec (+ 4min)
0.60782 44 sec (+ 16.5min)
BubbleCluster 0.99964 6 sec 0.99982 15 sec
Simulated data created with Nicholas Tinker’s Spaghetti Software.
4.4
9.8
21.0
47.2
96.6
237.8
6.7
14.0
31.7
86.4
198.7
475.1
1
2
4
8
16
32
64
128
256
512
1024
12.5 25 50 100 200 400
1E-5
1E-4
1E-3
1E-2
1E-1
1E+0
Ru
nti
me
(s)
Dataset Size (in thousands of markers)
Erro
r (1
-Fs
core
)
error, 65% missing
error, 35% missing
runtime, 65% missing
runtime, 35% missing
Scaling Results for Bubble Cluster on Simulated Data
Effect of the LOD threshold
LOD threshold 5 10 15 20 25 30
F-Score 0.6225 0.9999 0.9999 0.9999 0.9999 0.9999
Time (s) 48.6 67.0 70.9 82.0 106 170
Fixed missing entry threshold, increasing LOD threshold
200K markers, 300 individuals, 35% missing rate
𝜏1𝜏2 vs.
Effect of the missing data threshold
Fixed LOD threshold, increasing non-missing entry threshold
Non-missing threshold
132 166 172 179 186 192
F-Score 0.9999 0.9999 0.9992 0.9930 0.9610 0.8948
Time (s) 82.0 84.6 82.7 83.0 81.7 82.0
200K markers, 300 individuals, 35% missing rate
𝜏 𝜏𝜏vs.
ConclusionBy exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data
𝒎
𝐶1 𝐶2
ConclusionBy exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data
𝒎
𝐶1 𝐶2While remaining highly accurate, we outperform popular existing tools in both runtime and scalability
Clustering Method 12.5K Markers 25K Markers
F-score Time F-score Time
JoinMap 0.99964 14 min 0.99982 46 min
MSTMap 0.99964 4.5 min 0.99982 20 min
PIC 0.47024 11 sec (+ 4 min)
0.60782 44 sec (+ 16.5 min)
BubbleCluster 0.99964 6 sec 0.999982 15 sec
Future Work
• Use representative points as starting point for ordering phase
Future Work
• Use representative points as starting point for ordering phase
• Provide a more thorough theoretical analysis of achievable clustering as well as order accuracy given assumptions about error and missing data rates
• Develop efficient and accurate, large-scale genetic mapping software
Thank You
Code for BubbleCluster soon available at: www.ucsb.edu/~veronika
Backup Slides
Choosing LOD and non-missing thresholdGoal: minimize 𝑷(mistake) and maximize 𝑭- score
Let 𝑝 = 𝑃 𝐿𝑂𝐷 𝑚𝑖 , 𝑚𝑗 > 𝐿𝑂𝐷𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 𝑥𝑖 ∈ 𝐶𝑖 , 𝑥𝑗 ∈ 𝐶𝑗 , 𝑖 ≠ 𝑗)
Let 𝜏 = LOD threshold
By definition of the LOD score, 𝑝 =1
10𝜏
Let 𝑛𝑐𝑜𝑚𝑝 = number of LOD comparisons we make
Then, if we want to ensure that (1 − 𝑝)𝑛𝑐𝑜𝑚𝑝< 1 − 𝜀 then we need:
𝜏 > log10(1
1 − (1 − 𝜀) 1𝑛𝑐𝑜𝑚𝑝)
At the same time, we want to include a marker in the high-quality set only if we expect that it will achieve a LOD of 𝜏 or greater with another marker, requiring:
𝑛𝑛𝑚 > 𝜏(1 − 𝜇) log10 2
Where 𝜇 is the missing rate and 𝑛𝑛𝑚 is the number of non-missing entries in the marker
Evaluation Metric for Cluster Quality
F-score (range: 0 to 1)• Given a “golden standard clustering”, the F-score measures the
quality of a clustering as follows:
• The F-score between a golden standard cluster 𝑔 and a test cluster 𝑐is a harmonic mean of precision 𝑃 and recall 𝑅:
𝐹𝑠𝑐𝑜𝑟𝑒 𝑔, 𝑐 =2𝑃𝑅
𝑃 + 𝑅• The overall F-score between the golden standard clustering G and a
test clustering C is a weighted average of the F-scores for each golden standard cluster 𝑔:
𝑜𝑣𝑒𝑟𝑎𝑙𝑙_𝐹𝑠𝑐𝑜𝑟𝑒 𝐺, 𝐶 =1
𝑚
𝑔 ∋ 𝐺
𝑔 ∗ max𝑐 ∋𝐶𝐹𝑠𝑐𝑜𝑟𝑒(𝑔, 𝑐)
Local Linearity Assumption
Although the LOD score does not obey the triangle inequality, we assume that it does at close ranges and with enough data
𝑳𝑶𝑫(𝒎, 𝒓𝟏)𝑳𝑶𝑫(𝒎, 𝒓𝟐)
𝒎
𝒓𝟏𝒓𝟐𝑳𝑶𝑫(𝒓𝟐, 𝒓𝟏)
𝑳𝑶𝑫 𝒎, 𝒓𝟐 < 𝑳𝑶𝑫 𝒎, 𝒓𝟏&&
𝑳𝑶𝑫 𝒎, 𝒓𝟐 < 𝑳𝑶𝑫 𝒓𝟏, 𝒓𝟐&&
𝑳𝑶𝑫 𝒎, 𝒓𝟏 > 𝑳𝑶𝑫 𝒓𝟏, 𝒓𝟐
𝒎 is a new boundary point
The Effect of the LOD threshold
The standard approach to clustering can be viewed as single linkage clustering
Time complexity: 𝑂(𝑀2)
LOD 10
LOD 9
LOD 8
𝑚15𝑚2𝑚1 𝑚8 𝑚9
LOD 7
𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14
LOD 6
The Effect of the LOD threshold
A high LOD threshold ensures that the only edges that remain in the completely connected graph are between markers that are extremely likely to be genetically linked
LOD 10
LOD 9
LOD 8
𝑚15𝑚2𝑚1 𝑚8 𝑚9
LOD 7
𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14
LOD 6
The Effect of the LOD threshold
Example: LOD threshold = 10
LOD 10
LOD 9
LOD 8
𝑚15𝑚2𝑚1 𝑚8 𝑚9
LOD 7
𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14
LOD 6
𝑚1𝑚2
𝑚8𝑚9
𝑚15
Linkage group 2
𝑚3
𝑚4 𝑚5𝑚6
𝑚7𝑚10𝑚11
𝑚12
𝑚13𝑚14
Linkage group 1
The Effect of the LOD threshold
Example: LOD threshold = 8
LOD 10
LOD 9
LOD 8
𝑚15𝑚2𝑚1 𝑚8 𝑚9
LOD 7
𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14
LOD 6
𝑚1𝑚2
𝑚8𝑚9
𝑚15
Linkage group 2
𝑚3
𝑚4 𝑚5𝑚6
𝑚7𝑚10𝑚11
𝑚12
𝑚13𝑚14
Linkage group 1
The Effect of the LOD threshold
Example: LOD threshold = 7
LOD 10
LOD 9
LOD 8
𝑚15𝑚2𝑚1 𝑚8 𝑚9
LOD 7
𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14
LOD 6
𝑚1𝑚2
𝑚8𝑚9
𝑚15
Linkage group 2
𝑚3
𝑚4 𝑚5𝑚6
𝑚7𝑚10𝑚11
𝑚12
𝑚13𝑚14
Linkage group 1