Posted on 20-Dec-2015
Population Stratification with Limited Data
By Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou
The Problem
Given: samples from two hidden distributions P1 and P2, with unknown labels.
Each sample (individual) has k features with 0/1 values:
- Population P1: feature f is 1 w.p. p1(f)
- Population P2: feature f is 1 w.p. p2(f)
The feature probabilities are unknown.
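As a concrete illustration of this model, here is a minimal sampler (a sketch; the function names, the choice k = 100, n = 50, and the 0.2 per-feature gap are mine, not from the talk):

```python
import random

def sample_population(n, p):
    """Draw n individuals; each individual has feature f equal to 1 w.p. p[f]."""
    return [[1 if random.random() < pf else 0 for pf in p] for _ in range(n)]

k, n = 100, 50
p1 = [0.5] * k          # population P1: every feature is 1 w.p. 1/2
p2 = [0.7] * k          # population P2: each feature probability shifted by 0.2
samples = sample_population(n, p1) + sample_population(n, p2)
# The clustering task sees only `samples`; the labels (first n drawn from P1,
# last n from P2) stay hidden.
```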
The Problem
Given: 2n samples from two hidden distributions P1 and P2, with unknown labels.
Goal: classify each individual correctly for most inputs.
Applications
- Preprocessing step in statistical analysis: to analyze the factors that cause a complex disease such as cancer, first cluster the samples into populations, then apply the statistical analysis.
- Collaborative filtering: a feature can be "likes Star Wars or not"; cluster users into types using the features.
Our Results
Some separation between the distributions is needed!
Measures of separation (distance between the means):
- γ1 = (L1 distance between means) / k
- γ2 = (squared L2 distance between means) / k
Our results:
- An optimization function and a poly-time algorithm when γ1·k = Ω(√k log n)
- An optimization function when γ2·k = Ω(log n)
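The two normalized distances can be computed directly; a small sketch (function and variable names are mine):

```python
def separations(p1, p2):
    """Return (L1 distance / k, squared-L2 distance / k) between the two means."""
    k = len(p1)
    l1 = sum(abs(a - b) for a, b in zip(p1, p2))
    l2sq = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return l1 / k, l2sq / k

# Per-feature gap of 0.2 on every one of 4 features:
g1, g2 = separations([0.5] * 4, [0.7] * 4)   # g1 ≈ 0.2, g2 ≈ 0.04
```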
Our Results
This talk: the optimization function and a poly-time algorithm for γ1·k = Ω(√k log n).
Example:
- P1: for each feature f, p1(f) = ½
- P2: for each feature f, p2(f) = ½ + √(log n)/√k
This separation is information-theoretically optimal: there exist two distributions with this separation and constant overlap in probability mass.
Optimization Function
What measure should we optimize to get the correct clustering?
We need a robust measure which works for small separations.
A Robust Measure
Find the best balanced partition (S, S') such that
  Σ_f |N_f(S) − N_f(S')|
is maximum, where N_f(S) and N_f(S') are the numbers of individuals with feature f in S and S'.
Theorem: Optimizing this measure yields the correct partition w.h.p. if γ1·k = Ω(√k log n).
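The measure is easy to evaluate for any candidate partition; a minimal sketch (names are mine), where `side[i]` is 0 for individuals in S and 1 for individuals in S':

```python
def robust_measure(samples, side):
    """Compute the sum over features f of |N_f(S) - N_f(S')|."""
    k = len(samples[0])
    total = 0
    for f in range(k):
        n_s = sum(x[f] for x, s in zip(samples, side) if s == 0)   # N_f(S)
        n_sp = sum(x[f] for x, s in zip(samples, side) if s == 1)  # N_f(S')
        total += abs(n_s - n_sp)
    return total

# A partition that separates the two feature patterns scores higher
# than one that mixes them:
samples = [[1, 1], [1, 1], [0, 0], [0, 0]]
assert robust_measure(samples, [0, 0, 1, 1]) > robust_measure(samples, [0, 1, 0, 1])
```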
Proof Sketch
How does the optimal partition behave?
(I) The correct partition P:
  E[f(P)] = γ1·k·n + Θ(k√n)
  Pr[ |f(P) − E[f(P)]| > n√k ] ≤ 2^(−n)
(II) Any other fixed partition:
  E[f] = Θ(k√n)
  Pr[ |f − E[f]| > n√k ] ≤ 2^(−n)
The partition with the optimal value of f in (I) dominates all the partitions in (II) w.h.p. under the separation conditions.
An Algorithm
How can we find the partition which optimizes this measure?
Theorem: There exists an algorithm which finds the correct partition when γ1·k = Ω(√k log² n).
Running time: O(nk log² n).
An Algorithm
1. Divide the individuals into two sets, A and B.
2. Start with a random partition of A.
3. Iterate log n times:
   a. Classify B using the current partition of A and a proximity score.
   b. Do the same for A, using the current partition of B.
An Algorithm
Iterate: classify B using the current partition of A and a score, and vice versa.
The random initial partition has (½ + Ω(1/√n)) imbalance; each iteration produces a partition with more imbalance.
Classification Score
Our score: for each feature f,
- If N_f(S) > N_f(S'): add 1 to the score if f is present, else subtract 1.
- If N_f(S) < N_f(S'): add 1 to the score if f is absent, else subtract 1.
Classify: individuals above the median score go to S; individuals below the median score go to S'.
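Putting the score and the alternating iteration together, here is a minimal end-to-end sketch (my naming and simplifications; unlike the analyzed algorithm, this sketch reuses the same features in every round and breaks N_f(S) = N_f(S') ties arbitrarily):

```python
import random

def majority_sign(ref, side):
    """Per feature f: +1 if N_f(S) > N_f(S'), else -1, for the partition of ref."""
    k = len(ref[0])
    sign = []
    for f in range(k):
        n_s = sum(x[f] for x, s in zip(ref, side) if s == 0)   # N_f(S)
        n_sp = sum(x[f] for x, s in zip(ref, side) if s == 1)  # N_f(S')
        sign.append(1 if n_s > n_sp else -1)
    return sign

def classify(query, sign):
    """Score each individual; above-median scores go to S (0), the rest to S' (1)."""
    sc = [sum(g * (1 if x[f] else -1) for f, g in enumerate(sign)) for x in query]
    med = sorted(sc)[len(sc) // 2]
    return [0 if s >= med else 1 for s in sc]

def cluster(samples, rounds):
    """Split into A and B (samples assumed in mixed order), then alternate."""
    half = len(samples) // 2
    a, b = samples[:half], samples[half:]
    side_a = [random.randint(0, 1) for _ in a]   # random initial partition of A
    for _ in range(rounds):
        side_b = classify(b, majority_sign(a, side_a))  # classify B from A's partition
        side_a = classify(a, majority_sign(b, side_b))  # and vice versa
    return side_a + side_b
```

On well-separated inputs this converges quickly; for example, two deterministic feature patterns interleaved in `samples` are split apart (up to swapping the two labels) within a few rounds.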
Classification
Lemma: If the current partition has (½ + ε)-imbalance, the next iteration produces a partition with (½ + 2ε)-imbalance [for ε < c].
Lemma: If the current partition has (½ + c)-imbalance, the next iteration produces the correct partition under our separation conditions.
O(log n) rounds are needed to get the correct partition.
A fresh set of features is used in each round to get independence.
Proof Sketch
Lemma: If the current partition has (½ + ε)-imbalance, the next iteration produces a partition with (½ + 2ε)-imbalance [for ε < c].
The scores of Population 1 and Population 2 are each distributed approximately as X, Y ≈ Bin(k, ½), with a gap G between their means, where G = Ω(ε·γ2·k·√n).
Initially: G ≈ Ω(log n).
Proof Sketch (continued)
Lemma: If the current partition has (½ + ε)-imbalance, the next iteration produces a partition with (½ + 2ε)-imbalance [for ε < c].
With a gap G = Ω(ε·γ2·k·√n) between the two population score distributions:
  Pr[correct classification] = ½ + Ω(G/√k) ≥ ½ + 2ε
[from the separation conditions].
Proof Sketch
Lemma: If the current partition has (½ + c)-imbalance, the next iteration produces the correct partition under our separation conditions.
Here the gap is G = Ω(γ2·k·√n), so all but a 1/poly(n) fraction of the individuals are correctly classified.
Related Work
Learning mixtures of Gaussians [D99]: the best performance is by spectral algorithms [VW02, AM05, KSV05].
Our algorithm matches the bounds in [VW02] for two clusters, and it is not a spectral algorithm!
Open Questions
- How can our algorithm be extended to work for multiple clusters?
- What is the relationship between our algorithm and spectral algorithms? It matches the spectral algorithms of [M01] for two-way graph partitioning. Can our algorithm do better?
Thank You!