Bloom Filters, Minhashes, and Other Random Stuff
Brian Brubach, University of Maryland, College Park
StringBio 2018, University of Central Florida
What?
• Probabilistic
• Space-efficient
• Fast
• Not exact
Why?
• Data deluge / Big data / Massive data
• Millions or billions of sequences
• Human genome: 3 Gbp
• 1 giga base pairs (Gbp) = 1 billion characters
• Microbiome sample of 1.6 billion 100 bp reads generated in 10.8 days (Caporaso et al., 2012)
• Medium data, but on a laptop
• Lots of bioinformatics happens here
• Beyond scalability of BWT, FM-index, etc. (Berger, Daniels, and Yu, 2016)
Curse of Dimensionality
• Sequences are compared in high dimensional space
• Comparing $N$ sequences takes $N^2$ time
• Computing edit distance between two sequences of length $n$ takes $n^2$ time
• Allegedly
• ATGATCGAGGCTATGCGACCGATCGATCGATTCGTA
• ATGATGGAGGCTATGGGAACGATCGATCGACTCGTA
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATCATCGAGGCTATGCGACCGTTCGATCGATTCCTA
• GTGATCGTGGCTATGCGACCGATCGATCGATTCGTC
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATGATCCAGGCTATGCGACCGATCGATGCATTCGTA
Why Stay in High Dimensions?
• $4^{100}$ possible DNA strings of length 100
• $4^{15} \approx 1$ billion reads
k-mers of a Sequence
• All substrings of length $k$
• Canonical: lexicographically smallest among forward and reverse complement
  • Forget this for now
Sequence: ATCTGAGGTCAC
All 7-mers: ATCTGAG TCTGAGG CTGAGGT TGAGGTC GAGGTCA AGGTCAC
Reverse complement: GTGACCTCAGAT
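The definitions on this slide can be sketched in a few lines of Python. The function names (`kmers`, `reverse_complement`, `canonical`) are illustrative and not from the talk; the input is assumed to contain only the letters A, C, G, T.

```python
def kmers(seq, k):
    """Return all length-k substrings of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

seq = "ATCTGAGGTCAC"
print(kmers(seq, 7))
print(reverse_complement(seq))
print([canonical(x) for x in kmers(seq, 7)])
```

Using canonical k-mers means a sequence and its reverse complement produce the same k-mer set, which matters later when two genomes may be stored on opposite strands.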
Hash function
String → Hash Magic → Random integer in $\{1, \dots, m\}$
• Will assume idealized model of hashing for this talk
• Lots of research in this area
Bloom Filter Example Problem
• Store a large set of $N$ $k$-mers
• Query $k$-mers against it for exact matches
• Want speed and space-efficiency
• How can we address this with hashing?
  • Put $k$-mers in hash table
  • At least $2Nk$ bits for data plus table overhead
• What if we just store one bit at each hash for presence/absence?
  • Simple Bloom filter, potentially suboptimal
Bloom Filter
• Probabilistic data structure
• Fast and space-efficient
• False positives, but no false negatives
• Insert and contains, but no delete
• Due to Burton Howard Bloom in 1970
  • Gave example of automatic hyphenation
  • Identify the 10% of words that require special hyphenation rules
• $N$ items to store: $x_1, x_2, \dots, x_N$
• $m$-bit vector
• $d$ hash functions: $h_1, h_2, \dots, h_d$
• Insert($x$): set bits $h_1(x), h_2(x), \dots, h_d(x)$ to 1
• Contains($y$):
  • Yes if bits $h_1(y), h_2(y), \dots, h_d(y)$ are all 1
  • No if any are 0
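A minimal sketch of Insert and Contains as defined above. The class name and the way the $d$ hash functions are derived (SHA-256 with the hash index as a prefix) are assumptions made for illustration; the talk assumes idealized hashing and does not prescribe a construction. One byte per bit is used for readability, not space efficiency.

```python
import hashlib

class BloomFilter:
    """Illustrative m-bit Bloom filter with d hash functions."""

    def __init__(self, m, d):
        self.m = m
        self.d = d
        self.bits = bytearray(m)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Assumption: simulate d independent hashes by prefixing the index i.
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def contains(self, item):
        # "Yes" only if every probed bit is 1; any 0 means definitely absent.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, d=3)
bf.insert("ATCTGAG")
print(bf.contains("ATCTGAG"))  # inserted items are always reported present
print(bf.contains("GGGGGGG"))  # usually absent, but may be a false positive
```

There is no delete because clearing a bit could also remove evidence of other items that hash to the same position.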
Bloom Filter Example
• $m = 10$, $d = 3$, hash functions: $h_1, h_2, h_3$
• Empty filter: 0 0 0 0 0 0 0 0 0 0
• Insert($x_1$) sets bits $h_1(x_1), h_2(x_1), h_3(x_1)$: 0 1 1 0 0 0 1 0 0 0
• Insert($x_2$) sets bits $h_1(x_2), h_2(x_2), h_3(x_2)$: 0 1 1 0 0 0 1 1 0 1
• Contains($x_2$) checks bits $h_1(x_2), h_2(x_2), h_3(x_2)$
• Contains($y$) checks bits $h_1(y), h_2(y), h_3(y)$: False Positive!
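False Positive Probability
• Pr[one hash misses a bit] $= 1 - \frac{1}{m}$
• Pr[one insertion misses a bit] $= \left(1 - \frac{1}{m}\right)^{d}$
• Pr[all insertions miss a bit] $= \left(1 - \frac{1}{m}\right)^{dN}$
• Pr[a single bit flipped to 1] $= 1 - \left(1 - \frac{1}{m}\right)^{dN} \approx 1 - e^{-dN/m}$
• False positive probability (assuming independence) $\approx \left(1 - e^{-dN/m}\right)^{d}$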
Optimal Parameters
• False positive rate $p \approx \left(1 - e^{-dN/m}\right)^{d}$
• False positives minimized at $d = \frac{m}{N} \ln 2$
• Bits per item $\frac{m}{N} \approx -\frac{\log_2 p}{\ln 2} \approx -1.44 \log_2 p$
• Approximate: assuming asymptotic, independence, and integrality of $d$
  • $p = 0.01$ needs 9.59 bits per item
  • $p = 0.001$ needs 14.38 bits per item
• Number of hashes $d \approx -\log_2 p$
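The formulas on this slide can be checked numerically. The helper names below are hypothetical; the arithmetic follows the approximations stated above (bits per item $-\log_2 p / \ln 2$, hashes $d \approx -\log_2 p$).

```python
import math

def bloom_parameters(n_items, p):
    """Approximate optimal Bloom filter size for a target false positive rate p."""
    bits_per_item = -math.log2(p) / math.log(2)   # about -1.44 * log2(p)
    m = math.ceil(n_items * bits_per_item)        # total bits
    d = max(1, round(-math.log2(p)))              # number of hash functions
    return m, d, bits_per_item

def false_positive_rate(m, d, n_items):
    """p ~ (1 - e^{-dN/m})^d, assuming independence."""
    return (1 - math.exp(-d * n_items / m)) ** d

for p in (0.01, 0.001):
    m, d, bpi = bloom_parameters(1_000_000, p)
    print(p, round(bpi, 2), d, false_positive_rate(m, d, 1_000_000))
```

Because $d$ must be an integer, the achieved rate lands close to, but not exactly at, the target $p$.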
Properties
• Insert and check in $O(d)$ time
  • Independent of number of items inserted
• Fast and parallel to compute hashes
• Can do union and intersection with OR and AND of bit vectors
• Can estimate $N$ if unknown
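A sketch of the last two properties, using Python integers as bit vectors. The estimator $\hat{N} = -\frac{m}{d} \ln\left(1 - \frac{X}{m}\right)$, where $X$ is the number of set bits, is a standard one that follows from the bit-flip probability derived earlier; it is not given on the slide, and the hash construction and constants here are assumptions.

```python
import hashlib
import math

M, D = 4096, 3  # illustrative filter size and number of hashes

def positions(item):
    # Assumption: d hashes simulated by prefixing the hash index.
    for i in range(D):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def make_filter(items):
    """Build a Bloom filter as an integer whose binary digits are the bits."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits

def estimate_count(bits, m=M, d=D):
    """Estimate the number of inserted items from the number of set bits."""
    x = bin(bits).count("1")
    return -(m / d) * math.log(1 - x / m)  # undefined if every bit is set

a = {f"kmer{i}" for i in range(0, 150)}
b = {f"kmer{i}" for i in range(100, 250)}
fa, fb = make_filter(a), make_filter(b)

union = fa | fb          # identical to the filter built from a | b
intersection = fa & fb   # contains the filter of a & b, possibly with extra bits
print(estimate_count(fa), estimate_count(union))
```

The OR is exact, while the AND can only over-approximate the filter of the true intersection, so it may have a somewhat higher false positive rate.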
Endless Variations
• Deletions
• Counting
• Bloomier filters: storing values
• Cache optimizations
• Distance sensitive: is $x$ close to the set
$k$-mer Bloom Filter
• Can we do better if we know the items are $k$-mers from a genome?
• Observation: the “items” are overlapping substrings from a 4 letter alphabet
• After getting positive,
  • Check all 4 preceding $k$-mers and all 4 following $k$-mers
  • One must be in the set for a true positive
  • False positive next to another positive less likely
• Can reduce false positives or space
  • (Pellow, Filippova, and Kingsford, 2017)
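The neighbor check described above can be sketched as follows. This is a simplified illustration, not the implementation from the cited paper: it assumes the stored set is every $k$-mer of a single sequence longer than $k$, so each true $k$-mer has at least one neighbor in the set. All names, sizes, and the hash construction are hypothetical.

```python
import hashlib

M, D = 1 << 16, 3  # illustrative filter size and number of hashes

def positions(kmer):
    for i in range(D):
        digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def build(seq, k):
    """Insert every k-mer of seq into a plain Bloom filter."""
    bits = bytearray(M)
    for i in range(len(seq) - k + 1):
        for pos in positions(seq[i:i + k]):
            bits[pos] = 1
    return bits

def raw_contains(bits, kmer):
    return all(bits[pos] for pos in positions(kmer))

def neighbors(kmer):
    for c in "ACGT":
        yield c + kmer[:-1]   # the 4 possible preceding k-mers
        yield kmer[1:] + c    # the 4 possible following k-mers

def contains(bits, kmer):
    """Accept a positive only if some neighboring k-mer is also positive."""
    if not raw_contains(bits, kmer):
        return False
    return any(raw_contains(bits, nb) for nb in neighbors(kmer))

seq = "ATCTGAGGTCACGATTACA"
bits = build(seq, 7)
print(contains(bits, "TCTGAGG"), contains(bits, "CCCCCCC"))
```

A false positive now requires the query and at least one of its eight neighbors to collide, which is less likely than a single collision.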
Bio Applications
• Pan-genome storage
  • Bloom filter trie (Holley, Wittler, and Stoye, 2015)
• Short-read RNA-seq database
  • Split Sequence Bloom tree (Solomon and Kingsford, 2016)
• Succinct de Bruijn graphs
  • Probabilistic de Bruijn graph (Pell et al., 2011)
  • Exact version (Chikhi and Rizk, 2012)
  • Human genome: 3 Gbp, $k = 27$, 3.7 GB, 13.2 bits per vertex
Locality Sensitive Hashing (LSH)
• What do we typically want to avoid when hashing?
  • Collisions!
• Approximate nearest neighbors: towards removing the curse of dimensionality (Indyk and Motwani, 1998)
  • Idea: get similar elements to hash together
  • “Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; …”
Comparing Two Sequences
• Mash: fast genome and metagenome distance estimation using MinHash (Ondov et al., 2016)
• Let $A$ and $B$ be two DNA sequences to compare
• Construct $k$-mer sets $A$ and $B$
• Assume $|A| = |B|$ for now (not true)
• Compare the sets somehow
• Not faster yet, but we’ll get there…
Jaccard Index
• Similarity between sets $A$ and $B$
  • $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
• Correlated with Average Nucleotide Identity (ANI)
  • Empirical support, but debatable
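For reference, the exact Jaccard index of two $k$-mer sets can be computed directly, as in this small sketch (function names are illustrative; the two sequences are taken from the earlier Curse of Dimensionality slide). This is the quantity that the sketches later in the talk approximate without building the full sets.

```python
def kmer_set(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; defined here as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

s1 = "ATGATCGAGGCTATGCGACCGATCGATCGATTCGTA"
s2 = "ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA"
print(jaccard(kmer_set(s1, 7), kmer_set(s2, 7)))
```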
Jaccard Index: $\frac{|A \cap B|}{|A \cup B|}$
• What would you do if you were studying a population? Sample!
[Figure: Venn diagram of two overlapping sets. A: people who like peanut butter. B: people who like jelly.]
Sketch
• Small “fingerprint” of a data point (string)
Warm-up: Naïve Sketch
• Sample each string independently (don’t want to do $N^2$ sketches for comparing all pairs of $N$ strings)
• Small overlap
Minhashing / Bottom-$d$ Sketch
• On the resemblance and containment of documents (Broder, 1997)
  • For comparing documents
• Hash each $k$-mer in a sequence
• Sketch $S(A)$: smallest $d$ hash values in $A$
  • Or take min for each of $d$ different hash functions
• Use same hash function for $S(A)$ and $S(B)$
  • Lets us sketch each string, but “simulate” sketching the union $S(A \cup B)$
• Canonical $k$-mers, $A$ and $B$ could be reverse complements
Minhashing / Bottom-$d$ Sketch
• Sample smallest $d = 6$ hashes of $k$-mers in each set
[Figure: Venn diagram of sets A and B with sampled hash values. A only: 10, 3, 9. Shared: 2, 4, 7. B only: 1, 5, 8. Annotation: “This can’t happen”.]
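Comparing Sketches
• Jaccard estimate $j$
  • $\frac{|A \cap B|}{|A \cup B|} \approx \frac{|S(A \cup B) \cap S(A) \cap S(B)|}{|S(A \cup B)|}$
• Get $S(A \cup B)$ by merge sort operation in $O(d)$ time
  • Merge until $d$ unique hashes seen
  • Count number of matches $c = 3$
  • $j = \frac{c}{d}$
• Error of estimate is $\epsilon = \frac{1}{\sqrt{d}}$
$S(A)$: 2 3 4 7 9 10
$S(A \cup B)$: 1 2 3 4 5 7
$S(B)$: 1 2 4 5 7 8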
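The merge step can be sketched as below, using the example sketches from this slide as input. The function name is hypothetical; both inputs are assumed to be sorted lists of distinct hash values produced with the same hash function.

```python
def jaccard_from_sketches(sa, sb, d):
    """Estimate Jaccard from two sorted bottom-d sketches in O(d) time."""
    i = j = seen = matches = 0
    # Merge the two sorted lists until d unique hashes of the union are seen.
    while seen < d and i < len(sa) and j < len(sb):
        if sa[i] == sb[j]:
            matches += 1      # hash is in S(A), S(B), and S(A ∪ B)
            i += 1
            j += 1
        elif sa[i] < sb[j]:
            i += 1
        else:
            j += 1
        seen += 1
    # If one sketch ran out early, the rest of the other still belongs to the union.
    seen = min(d, seen + (len(sa) - i) + (len(sb) - j))
    return matches / seen if seen else 0.0

s_a = [2, 3, 4, 7, 9, 10]
s_b = [1, 2, 4, 5, 7, 8]
print(jaccard_from_sketches(s_a, s_b, d=6))  # c = 3 matches out of d = 6
```

With these inputs the merge visits 1, 2, 3, 4, 5, 7, which is the $S(A \cup B)$ shown on the slide, and the three matches are 2, 4, and 7.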
Building Bottom-$d$ Sketch
• Takes $O(n \log d)$ time
  • Traverse string, hashing $k$-mers
  • Keep sorted list of smallest $d$
  • Check each new hash against max in list
  • $O(\log d)$ time to insert if necessary
• Actually expected time $O(n + d \log d \log n)$
  • Because Pr[$i$th hash gets inserted in list] $= \frac{d}{i}$
  • So effectively linear
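A sketch of the construction above. It swaps the sorted list for a max-heap (via `heapq` with negated values), which gives the same $O(\log d)$ insertion and constant-time access to the current maximum. The hash function (truncated SHA-256) and all names are assumptions for illustration, not what Mash uses.

```python
import hashlib
import heapq

def hash_kmer(kmer):
    """Illustrative 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")

def bottom_d_sketch(seq, k, d):
    """Return the d smallest distinct k-mer hashes of seq, sorted ascending."""
    heap = []     # max-heap via negation: -heap[0] is the largest kept hash
    kept = set()  # hashes currently in the sketch, to skip repeated k-mers
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h in kept:
            continue
        if len(heap) < d:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash beats the current maximum: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    return sorted(kept)

print(bottom_d_sketch("ATCTGAGGTCACGATTACAGATTACA", k=5, d=4))
```

Most hashes fail the comparison against the current maximum, which is why the expected cost is close to linear in $n$.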
Minhash Parameters
• Probability some $k$-mer $x$ appears in a random genome of length $n$
  • $\Pr[x \in A] \approx 1 - \left(1 - |\Sigma|^{-k}\right)^{n}$
  • Alphabet size $|\Sigma| = 4$
• For $k = 16$, $n = 3$ Gbp:
  • Probability of a given 16-mer in a genome is $\approx 0.5$
  • $\approx 25\%$ of 16-mers expected to be shared between two random 3 Gbp genomes
• Too short $k$-mers can overestimate Jaccard, especially for distant genomes
• Very long could underestimate, but less of an issue
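The numbers on this slide can be reproduced with a short calculation. The function name is hypothetical; `log1p` and `expm1` are used only for numerical stability when $|\Sigma|^{-k}$ is tiny.

```python
import math

def prob_kmer_present(k, n, sigma=4):
    """Pr[x in A] ~ 1 - (1 - sigma^-k)^n for a random genome of length n."""
    return -math.expm1(n * math.log1p(-float(sigma) ** -k))

p = prob_kmer_present(k=16, n=3_000_000_000)
print(p)      # chance a given 16-mer appears in one random 3 Gbp genome
print(p * p)  # chance it appears in two independent random genomes
```

Squaring the single-genome probability gives the expected shared fraction because the two random genomes are treated as independent.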
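Minhash Parameters
• Value of $k$ to achieve a desired probability $q$ of seeing a given $k$-mer in sequence length $n$
  • $k \approx \log_{|\Sigma|} \frac{n(1 - q)}{q}$
• 5 Mbp genome, $q = 0.01$, $k \approx 14$
• 3 Gbp genome, $q = 0.01$, $k \approx 19$
• Mash default: $k = 21$ and $s = 1000$ (sketch size)
  • 8 kB per sketch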
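A quick check of the formula against the two examples above. The function name is illustrative; it returns the real-valued $k$, which the slide rounds to the nearest integer.

```python
import math

def kmer_length(n, q, sigma=4):
    """k ~ log_sigma(n * (1 - q) / q) for genome length n and target probability q."""
    return math.log(n * (1 - q) / q, sigma)

print(kmer_length(5_000_000, 0.01))      # 5 Mbp genome
print(kmer_length(3_000_000_000, 0.01))  # 3 Gbp genome
```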
Mash Distance
• Mash distance based on Jaccard estimate $j$
  • $D = -\frac{1}{k} \ln \frac{2j}{1 + j}$
• Based on Poisson error model
• Implicitly uses average size of the two sets, penalizing sets of different size
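The distance formula translates directly into code. The function name is hypothetical, and returning 1.0 when $j = 0$ (where the logarithm is undefined) is a convention assumed for this sketch.

```python
import math

def mash_distance(j, k):
    """D = -(1/k) * ln(2j / (1 + j)) for Jaccard estimate j and k-mer size k."""
    if j <= 0:
        return 1.0  # assumption: treat "no shared hashes" as maximal distance
    return -math.log(2 * j / (1 + j)) / k

print(mash_distance(0.5, 21))  # e.g. the j = 3/6 estimate with Mash's default k
print(mash_distance(1.0, 21))  # identical sketches
```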
Some Related Works
• Assembly overlaps
  • Assembling large genomes with single-molecule sequencing and locality-sensitive hashing (Berlin et al., 2015)
• Containment for different size sets
  • Improving MinHash Via the Containment Index with Applications to Metagenomic Analysis (Koslicki and Zabeti, 2017)
Implementation
• MurmurHash3
• Open Bloom Filter Library
• Mash
Other Random Stuff
Fruit Fly Brains
• Locality Sensitive Hashing (LSH)
  • A neural algorithm for a fundamental computing problem (Dasgupta, Stevens, and Navlakha, 2017)
• Bloom filters
  • (Dasgupta, Sheehan, Stevens, and Navlakha, upcoming)
  • Have 3 special properties
    • Continuous-valued novelty
    • Distance sensitivity
    • Time sensitivity
Thanks!