Lecture 8 - Stanford...
Lecture 8: HASHING!!!!!
Announcements
• HW3 due Friday!
• HW4 posted Friday!
Today: hashing
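[Figure: a hash table with n = 9 buckets. The items 13, 22, 43, 9, … are stored in linked lists hanging off the buckets; empty buckets point to NIL.]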
Outline
• Hash tables are another sort of data structure that allows fast INSERT/DELETE/SEARCH.
  • like self-balancing binary trees
• The difference is we can get better performance in expectation by using randomness.
  • Like QuickSort vs. MergeSort
• Hash families are the magic behind hash tables.
• Universal hash families are even more magic.
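Goal: Just like on Monday
• We are interested in putting nodes with keys into a data structure that supports fast INSERT/DELETE/SEARCH.
  • INSERT
  • DELETE
  • SEARCH
[Figure: nodes with keys such as 5, 4, and 2 are inserted into the data structure; a SEARCH for 2 returns the node with key “2” (“HERE IT IS”).]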
Today:
• Hash tables:
  • O(1) expected time INSERT/DELETE/SEARCH
  • Worse worst-case performance, but often great in practice.
On Monday:
• Self-balancing trees:
  • O(log(n)) deterministic INSERT/DELETE/SEARCH
#prettysweet
#evensweeterinpractice
e.g., Python’s dict, Java’s HashSet/HashMap, C++’s unordered_map
Hash tables are used for databases, caching, object representation, …
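One way to get O(1) time
• Say all keys are in the set {1,2,3,4,5,6,7,8,9}.
• INSERT, DELETE, SEARCH: each one touches only the array slot indexed by the key.
[Figure: an array with one slot for each key 1 through 9. INSERT 9, 6, 3, 5 puts each key in its own slot; DELETE 6 empties slot 6; SEARCH 3 looks in slot 3 and reports “3 is here.”]
This is called “direct addressing”.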
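A minimal sketch of direct addressing in Python, assuming the keys are the integers 1 through 9 as on this slide. The class name `DirectAddressTable` and its method names are made up for this illustration.

```python
class DirectAddressTable:
    """Direct addressing for keys drawn from {1, ..., max_key}."""

    def __init__(self, max_key):
        # One slot per possible key; index 0 is unused so that key k lives in slot k.
        self.slots = [None] * (max_key + 1)

    def insert(self, key, value=True):
        self.slots[key] = value      # O(1)

    def delete(self, key):
        self.slots[key] = None       # O(1)

    def search(self, key):
        return self.slots[key]       # O(1); None means "not here"


table = DirectAddressTable(max_key=9)
for key in [9, 6, 3, 5]:
    table.insert(key)
table.delete(6)
print(table.search(3))  # True
print(table.search(6))  # None
```

The array has one slot per possible key, which is exactly what stops working once the universe of keys gets huge.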
That should look familiar
• Kind of like BUCKETSORT from Lecture 6.
• Same problem: if the keys may come from a universe U = {1, 2, …, 10000000000}….
The solution then was…
• Put things in buckets based on one digit.
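[Figure: ten buckets labeled 0 through 9, indexed by least significant digit.]
INSERT: 21, 345, 13, 101, 50, 234, 1
[Figure: bucket 0 holds 50; bucket 1 holds 21, 101, 1; bucket 3 holds 13; bucket 4 holds 234; bucket 5 holds 345.]
Now SEARCH 21: it’s in this bucket somewhere… go through until we find it.
Problem…
INSERT: 22, 342, 12, 102, 52, 232, 2
[Figure: all seven items land in bucket 2.]
Now SEARCH 22…. this hasn’t made our lives easier…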
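The two INSERT sequences above can be replayed in a few lines of Python. This is only a sketch of the last-digit bucketing idea; the function name `last_digit_buckets` is made up for the illustration.

```python
def last_digit_buckets(keys):
    """Place each key in one of ten buckets according to its least significant digit."""
    buckets = [[] for _ in range(10)]
    for key in keys:
        buckets[key % 10].append(key)
    return buckets


spread_out = last_digit_buckets([21, 345, 13, 101, 50, 234, 1])
all_in_one = last_digit_buckets([22, 342, 12, 102, 52, 232, 2])

print(max(len(b) for b in spread_out))  # 3 (bucket 1 holds 21, 101, 1)
print(max(len(b) for b in all_in_one))  # 7 (everything lands in bucket 2)
```

With the second input, SEARCH has to scan a list containing every item, which is no better than an unsorted list.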
Hash tables
• That was an example of a hash table.
  • not a very good one, though.
• We will be more clever (and less deterministic) about our bucketing.
• This will result in fast (expected time) INSERT/DELETE/SEARCH.
But first! Terminology.
• We have a universe U, of size M.
  • M is really big.
• But only a few (say at most n for today’s lecture) elements of U are ever going to show up.
  • M is waaaayyyyyyy bigger than n.
• But we don’t know which ones will show up in advance.
[Figure: a blob labeled “Universe U”. All of the keys in the universe live in this blob. A few elements are special and will actually show up.]
Example: U is the set of all strings of at most 140 ASCII characters (roughly 128^140 of them). The only ones which I care about are those which appear as trending hashtags on Twitter. #hashhashtags
There are way fewer than 128^140 of these.
Examples aside, I’m going to draw elements like I always do, as blue boxes with integers in them…
The previous example with this terminology
• We have a universe U, of size M.
  • At most n of which will show up.
• M is waaaayyyyyy bigger than n.
• We will put items of U into n buckets.
• There is a hash function h: U → {1,…,n} which says what element goes in what bucket.
[Figure: the blob “Universe U” (all of the keys in the universe live in this blob) is mapped by h into n buckets labeled 1, 2, 3, …; here h(x) = least significant digit of x.]
For this lecture, I’m assuming that the number of things is the same as the number of buckets; both are n. This doesn’t have to be the case, although we do want:
#buckets = O(#things which show up)
This is a hash table (with chaining)
• Array of n buckets.
• Each bucket stores a linked list.
  • We can insert into a linked list in time O(1).
  • To find something in the linked list takes time O(length(list)).
• h: U → {1,…,n} can be any function:
  • but for concreteness let’s stick with h(x) = least significant digit of x.
For demonstration purposes only! This is a terrible hash function! Don’t use this!
INSERT: 13, 22, 43, 9, …
[Figure: n buckets (say n = 9). Bucket 2 holds 22, bucket 3 holds the chain 13, 43, and bucket 9 holds 9.]
SEARCH 43: Scan through all the elements in bucket h(43) = 3.
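A hedged sketch of a hash table with chaining. Two assumptions that are not on the slide: Python lists stand in for the linked lists, and the table uses ten buckets indexed 0 through 9 so that the least significant digit can be used directly as an index. The class name `ChainedHashTable` is made up for the illustration.

```python
class ChainedHashTable:
    """Hash table with chaining: an array of buckets, each holding a chain of keys."""

    def __init__(self, n, h):
        self.h = h                                  # hash function: key -> {0, ..., n-1}
        self.buckets = [[] for _ in range(n)]       # lists stand in for linked lists

    def insert(self, key):
        self.buckets[self.h(key)].append(key)       # O(1); does not check for duplicates

    def search(self, key):
        return key in self.buckets[self.h(key)]     # O(length of the chain)

    def delete(self, key):
        chain = self.buckets[self.h(key)]
        if key in chain:
            chain.remove(key)                       # O(length of the chain)


# Terrible hash function, for demonstration purposes only.
table = ChainedHashTable(n=10, h=lambda x: x % 10)
for key in [13, 22, 43, 9]:
    table.insert(key)

print(table.buckets[3])   # [13, 43]
print(table.search(43))   # True
table.delete(43)
print(table.search(43))   # False
```

SEARCH and DELETE only ever look at one bucket, so their cost is the length of that bucket's chain.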
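Aside: Hash tables with open addressing
• The previous slide is about hash tables with chaining.
• There’s also something called “open addressing”.
• You’ll see it on your homework :)
[Figure: with chaining (n = 9 buckets), 13 and 43 hang off the same bucket as a linked list; this is a “chain”. With open addressing (n = 9 buckets), 13 and 43 are stored in separate slots of the array itself.]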
\end{Aside}
Scanning through a bucket is a good idea as long as there are not too many elements in that bucket!
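The main question
• How do we pick that function so that this is a good idea?
1. We want there to be not many buckets (say, n).
  • This means we don’t use too much space.
2. We want the items to be pretty spread-out in the buckets.
  • This means it will be fast to SEARCH/INSERT/DELETE.
[Figure: two tables with n = 9 buckets. In one, the items 13, 22, 43, 9, … are spread out across the buckets; vs. the other, where 13, 43, 93, … pile up in a single chain while 21 sits elsewhere.]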
Worst-case analysis
• Design a function h: U → {1,…,n} so that:
  • No matter what input (fewer than n items of U) Darth Vader chooses, the buckets will be balanced.
  • Here, balanced means O(1) entries per bucket.
• If we had this, then we’d achieve our dream of O(1) INSERT/DELETE/SEARCH.
Take a minute to talk to the person next to you. Can you come up with such a function?
We really can’t beat Darth Vader here.
[Figure: h maps the universe U into n buckets; a highlighted region of U marks all the things that hash to the first bucket.]
• The universe U has M items.
• They get hashed into n buckets.
• At least one bucket receives at least M/n items.
• M is WAAYYYYY bigger than n, so M/n is bigger than n.
• Darth Vader chooses n of the items that landed in this very full bucket.
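Darth Vader's strategy can be sketched in Python. This assumes the adversary is allowed to evaluate the fixed function h on the universe; the function name `adversarial_keys` and the example universe are made up for the illustration.

```python
from collections import defaultdict


def adversarial_keys(h, universe, n):
    """Return n keys from `universe` that all land in the same bucket under h."""
    by_bucket = defaultdict(list)
    for x in universe:
        by_bucket[h(x)].append(x)
        if len(by_bucket[h(x)]) == n:
            return by_bucket[h(x)]
    # By pigeonhole this is never reached when h has n buckets
    # and the universe holds more than n * (n - 1) keys.
    return None


n = 10
bad_input = adversarial_keys(lambda x: x % 10, range(1, 1000), n)
print(bad_input)  # [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```

Any deterministic h loses this way: the adversary just reads off one overfull bucket.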
Solution: Randomness
The game
1. An adversary chooses any n items $u_1, u_2, \dots, u_n \in U$, and any sequence of INSERT/DELETE/SEARCH operations on those items.
   Example: the items 13, 22, 43, 92, 7 with the sequence INSERT 13, INSERT 22, INSERT 43, INSERT 92, INSERT 7, SEARCH 43, DELETE 92, SEARCH 7, INSERT 92.
2. You, the algorithm, choose a random hash function $h: U \to \{1, \dots, n\}$.
3. HASH IT OUT
[Figure: the items 13, 22, 43, 92, 7 are hashed into buckets 1, 2, 3, …, n.]
What does random mean here? Uniformly random? (Plucky the pedantic penguin)
Why should this help?
• Say that h is uniformly random.
• That means that h(1) is a uniformly random number between 1 and n.
• h(2) is also a uniformly random number between 1 and n, independent of h(1).
• h(3) is also a uniformly random number between 1 and n, independent of h(1), h(2).
• …
• h(M) is also a uniformly random number between 1 and n, independent of h(1), h(2), …, h(M-1).
[Figure: h maps the universe U into n buckets.]
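What do we want?
[Figure: items such as 14, 22, 92, 43, 8, 7, 32, 5, 15 and u_i are hashed into buckets 1, 2, 3, …, n; several of them share u_i’s bucket.]
It’s bad if lots of items land in u_i’s bucket.
So we want not that.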
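More precisely
[Figure: items hashed into buckets 1, 2, 3, …, n, with u_i’s bucket highlighted.]
• Suppose that for all u_i that the bad guy chose:
  • E[number of items in u_i’s bucket] ≤ 2.
• Then for each operation involving u_i:
  • E[time of operation] = O(1).
• By linearity of expectation,
$$
\begin{aligned}
E[\text{time to do a bunch of operations}] &= E\Big[\sum_{\text{operations}} \text{time of operation}\Big] \\
&= \sum_{\text{operations}} E[\text{time of operation}] \\
&= \sum_{\text{operations}} O(1) \\
&= O(\text{number of operations}),
\end{aligned}
$$
aka, O(1) per operation!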
So we want:
• For all i=1,…,n,
  E[number of items in u_i’s bucket] ≤ 2.
Aside: why not just:
• For all i=1,…,n:
  E[number of items in bucket i] ≤ 2?
Suppose:
[Figure: all of the items 14, 22, 92, 43, 8, … land together in bucket 1; this happens with probability 1/n.]
[Figure: all of the items land together in bucket 2; and this happens with probability 1/n, etc.]
Then E[number of items in bucket i] = 1 for all i.
But P{the buckets get big} = 1.
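The aside can be checked exactly for a small case. This sketch assumes n = 5 and the "bad" random hash function described above (pick one bucket uniformly at random and send every item there); the variable names are made up for the illustration.

```python
from fractions import Fraction

n = 5
# One outcome per choice of the single full bucket c; each has probability 1/n.
# outcomes[c][bucket] is the number of items in `bucket` under that outcome.
outcomes = [[n if bucket == c else 0 for bucket in range(n)] for c in range(n)]

# E[number of items in bucket i], for each fixed bucket i.
expected_per_bucket = [sum(Fraction(o[i], n) for o in outcomes) for i in range(n)]

# E[number of items in u_i's bucket]: u_i always sits in the one full bucket.
expected_in_items_bucket = sum(Fraction(max(o), n) for o in outcomes)

print(expected_per_bucket)       # every entry equals 1
print(expected_in_items_bucket)  # 5
```

Each fixed bucket looks fine on average, yet the bucket containing u_i always holds all n items, which is why the condition is stated in terms of u_i's bucket.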
So we want:
• For all i=1,…,n,
  E[number of items in u_i’s bucket] ≤ 2.
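Expected number of items in u_i’s bucket?
[Figure: h maps the universe U into n buckets; u_i and u_j land in the same bucket. COLLISION!]
$$
\begin{aligned}
E[\text{number of items in } u_i\text{'s bucket}] &= \sum_{j=1}^{n} P[h(u_i) = h(u_j)] \\
&= 1 + \sum_{j \neq i} P[h(u_i) = h(u_j)] \\
&= 1 + \sum_{j \neq i} \frac{1}{n} \\
&= 1 + \frac{n-1}{n} \leq 2.
\end{aligned}
$$
That’s what we wanted. (You will verify this on HW.)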
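As a sanity check (not a proof), a small simulation can estimate this expectation for a uniformly random h. The sketch assumes n = 10 items and n = 10 buckets; the function name `average_bucket_size`, the trial count, and the seed are arbitrary choices made for the illustration.

```python
import random


def average_bucket_size(n, trials, seed=0):
    """Estimate E[number of items in u_1's bucket] for a uniformly random h."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Draw h(u_1), ..., h(u_n) independently and uniformly from the n buckets.
        buckets = [rng.randrange(n) for _ in range(n)]
        # Count the items (including u_1 itself) that share u_1's bucket.
        total += buckets.count(buckets[0])
    return total / trials


n = 10
estimate = average_bucket_size(n, trials=20000)
exact = 1 + (n - 1) / n
print(estimate, exact)  # the estimate should be close to 1.9
```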
That’s great!
• For all i=1,…,n,
  • E[number of items in u_i’s bucket] ≤ 2
This implies (as we saw before):
For any sequence of L INSERT/DELETE/SEARCH operations on any n elements of U (aka, anything Darth Vader might pick in Step 1 of the game), the expected runtime (over the random choice of h) is O(L), aka, O(1) per operation.
The elephant in the room
[Figure: an elephant-shaped table that writes out one hash function value by value.]
h(1)=2, h(2)=7, h(3)=9, h(4)=1, h(5)=0, h(6)=7, h(7)=2, h(8)=3, h(9)=7, h(10)=3,
h(11)=4, h(12)=5, h(13)=7, h(14)=3, h(15)=2, h(16)=9, h(17)=3, h(18)=2, h(19)=1, h(20)=5,
…
h(4511)=3, h(4512)=7, h(4513)=2, h(4514)=6, h(4515)=3, h(4516)=1, h(4517)=0, h(4518)=0, h(4519)=3, h(4520)=1,
…
h(264511)=3, h(264512)=1, h(264513)=0, h(264514)=0, h(264515)=7, h(264516)=8, h(264517)=9, h(264518)=2, h(264519)=6, h(264520)=3,
…
Randomization is fine… but we need to be able to store our choice of h!
• Say that this elephant-shaped blob represents the set of all hash functions.
• How big is this set?
  • n^|U| = n^M = REALLY BIG.
• In order to write down an arbitrary element of a set of size A, we need log(A) bits.
• So we’d need about M log(n) bits to remember one of these hash functions. That’s enough to do direct addressing!!!!
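Another thought…
• Just remember h on the relevant values.
[Figure: “Algorithm now” sees the items 13, 22, 43, 92, 7 and records values such as h(13)=6, h(22)=3, h(92)=3; “Algorithm later” has to look the stored values up again, e.g. h(13)=6.]
But that’s what we wanted to begin with… (a data structure that can look things up by key).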
Solution
• Pick from a smaller set of functions.
[Figure: inside the blob of all hash functions, a small region H: a cleverly chosen subset of functions. We call such a subset a hash family.]
We need only log|H| bits to store an element of H.
How to pick the hash family?
• Let’s go back to that computation from earlier….
$$
\begin{aligned}
E[\text{number of things in bucket } h(u_i)] &= \sum_{j=1}^{n} P[h(u_i) = h(u_j)] \\
&= 1 + \sum_{j \neq i} P[h(u_i) = h(u_j)] \\
&\leq 1 + \sum_{j \neq i} \frac{1}{n} \\
&= 1 + \frac{n-1}{n} \leq 2.
\end{aligned}
$$
So the expected number of items in u_i’s bucket is O(1).
• All we needed was that P[h(u_i) = h(u_j)] ≤ 1/n for every j ≠ i.
Strategy
• Pick a small hash family H, so that when I choose h randomly from H,
  for all $u_i, u_j \in U$ with $u_i \neq u_j$,
  $$P_{h \in H}[h(u_i) = h(u_j)] \leq \frac{1}{n}.$$
• Then we still get O(1)-sized buckets in expectation.
• But now the space we need is log(|H|) bits.
  • Hopefully pretty small!
So the whole scheme will be
• Choose h randomly from a universal hash family H.
• We can store h in small space since H is so small.
• Probably these buckets will be pretty balanced.
[Figure: h maps u_i from the universe U into one of the n buckets.]
What is this universal hash family?
• Here’s one:
  • Pick a prime $p \geq M$.
  • Define
    $$f_{a,b}(x) = ax + b \mod p$$
    $$h_{a,b}(x) = f_{a,b}(x) \mod n$$
• Claim:
  $$H = \{ h_{a,b}(x) : a \in \{1, \dots, p-1\},\ b \in \{0, \dots, p-1\} \}$$
  is a universal hash family.
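Drawing a function from this family takes only two random numbers. The sketch below assumes the caller supplies a prime p at least as large as the universe; the function name `draw_hash` is made up, and bucket indices run from 0 to n-1 (Python style) rather than 1 to n.

```python
import random


def draw_hash(p, n, rng=random):
    """Draw h_{a,b} from H, assuming p is a prime at least the universe size."""
    a = rng.randrange(1, p)   # a in {1, ..., p-1}
    b = rng.randrange(0, p)   # b in {0, ..., p-1}

    def h(x):
        return ((a * x + b) % p) % n   # bucket index in {0, ..., n-1}

    return h, (a, b)


# Hypothetical parameters: universe {0, ..., 100}, prime p = 101, n = 10 buckets.
h, (a, b) = draw_hash(p=101, n=10, rng=random.Random(1))
print(a, b, [h(x) for x in range(5)])
```

Storing h means storing just the pair (a, b), along with p and n.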
Say what?
• Example: M = p = 5, n = 3
• To draw h from H:
  • Pick a random a in {1,…,4}, b in {0,…,4}. Say a = 2, b = 1.
• As per the definition:
  • $f_{2,1}(x) = 2x + 1 \mod 5$
  • $h_{2,1}(x) = f_{2,1}(x) \mod 3$
[Figure: the five elements of U, drawn as 0, 1, 2, 3, 4 on a circle, are first mapped by $f_{2,1}$ onto 0, 1, 2, 3, 4. This step just scrambles stuff up. No collisions here! The results are then reduced mod 3 into the three buckets. This step is the one where two different elements might collide.]
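The example can be worked through by hand or with a few lines of Python. This sketch assumes the universe is written as {0, 1, 2, 3, 4} and that buckets are indexed 0 through 2 (the residues mod 3).

```python
p, n = 5, 3
a, b = 2, 1


def f(x):
    return (a * x + b) % p   # scrambling step: a permutation of {0, ..., 4}


def h(x):
    return f(x) % n          # bucketing step: this is where collisions can happen


for x in range(p):
    print(x, f(x), h(x))
# 0 1 1
# 1 3 0
# 2 0 0
# 3 2 2
# 4 4 1
```

Here f sends the five elements to five distinct values, and the collisions (1 with 2, and 0 with 4) appear only after reducing mod 3.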
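Ignoring why this is a good idea… how big is H?
• We have p-1 choices for a, and p choices for b.
• So |H| = p(p-1) = O(M^2), as long as we pick the prime p = O(M).
  • This is much better than n^M!!!!
• Space needed to store h: O(log(M)) bits, compared with O(M log(n)) bits for an arbitrary function.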
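To get a feel for the gap, here is a rough calculation in Python. The universe size comes from the hashtag example; the number of buckets n = 10^6 and the assumption M ≤ p < 2M on the prime are choices made only for this illustration.

```python
M = 128 ** 140      # universe size from the hashtag example (2^980)
n = 10 ** 6         # hypothetical number of buckets


def bits_to_index(size):
    """Number of bits needed to name one element of a set with `size` elements."""
    return (size - 1).bit_length()


# An arbitrary function stores one bucket index for every element of U.
bits_arbitrary = M * bits_to_index(n)

# Assume the prime satisfies M <= p < 2M, so |H| = p(p-1) < 4 * M^2.
bits_family = bits_to_index(4 * M * M)

print(bits_arbitrary.bit_length())  # 985: the bit count is itself a 985-bit number
print(bits_family)                  # 1962
```

Under these assumptions a function from the family fits in under 2000 bits, while an arbitrary function needs astronomically many.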
Why does this work?
• This is actually a little complicated.
  • I’ll go over the argument now, because it’s a good example of how to reason about hash functions.
  • Fancy counting!
• BUT! Don’t worry if you don’t follow all the calculations right now.
  • You can always take a look back at the slides or lecture notes later.
• The important part is the structure of the argument.
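Why does this work?
• Want to show:
  • for all $u_i, u_j \in U$ with $u_i \neq u_j$, $P_{h \in H}[h(u_i) = h(u_j)] \leq \frac{1}{n}$
  • aka, the probability of any two elements colliding is small.
• Let’s just fix two elements and see an example.
  • Let’s consider $u_i = 0$, $u_j = 1$.
Convince yourself that it will be the same for any pair!
[Figure: the elements 0, 1, 2, 3, 4 of U are mapped by $f_{a,b}(x) = ax + b \mod p$ onto 0, 1, 2, 3, 4, and then mod 3 into the buckets.]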
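The probability that 0 and 1 collide is small
• Want to show:
  • $P_{h \in H}[h(0) = h(1)] \leq \frac{1}{n}$
• For any $y_0 \neq y_1 \in \{0,1,2,3,4\}$, how many a, b are there so that $f_{a,b}(0) = y_0$ and $f_{a,b}(1) = y_1$?
• Claim: it’s exactly one.
• Proof: solve the system of eqs. for a and b:
  $$a \cdot 0 + b = y_0 \mod p$$
  $$a \cdot 1 + b = y_1 \mod p$$
  The only solution is $b = y_0$ and $a = y_1 - y_0 \mod p$, and $a \neq 0$ because $y_0 \neq y_1$.
  e.g., $y_0 = 3$, $y_1 = 1$ gives $b = 3$, $a = 3$.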
• For any $y_0 \neq y_1 \in \{0,1,2,3,4\}$, exactly one pair a, b has $f_{a,b}(0) = y_0$ and $f_{a,b}(1) = y_1$.
• If 0 and 1 collide it’s because there’s some $y_0 \neq y_1$ so that:
  • $f_{a,b}(0) = y_0$ and $f_{a,b}(1) = y_1$.
  • $y_0 = y_1 \mod n$.
• The number of a, b so that 0, 1 collide under $h_{a,b}$ is at most the number of $y_0 \neq y_1$ so that $y_0 = y_1 \mod n$.
• How many is that?
  • We have p choices for $y_0$, then at most 1/n of the remaining p-1 are valid choices for $y_1$…
  • So at most $p \cdot \frac{p-1}{n}$.
• Since the # of (a, b) so that 0, 1 collide under $h_{a,b}$ is $\leq p \cdot \frac{p-1}{n}$, the probability (over a, b) that 0, 1 collide under $h_{a,b}$ is:
  $$P_{h \in H}[h(0) = h(1)] \leq \frac{p \cdot \frac{p-1}{n}}{|H|} = \frac{p \cdot \frac{p-1}{n}}{p(p-1)} = \frac{1}{n}.$$
The same argument goes for any pair
for all $u_i, u_j \in U$ with $u_i \neq u_j$,
$$P_{h \in H}[h(u_i) = h(u_j)] \leq \frac{1}{n}$$
That’s the definition of a universal hash family.
So this family H indeed does the trick.
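For tiny parameters the claim can also be checked by brute force. This sketch enumerates every (a, b) for the running example p = 5, n = 3; the function name `max_collision_count` is made up, and an exhaustive check like this only covers the parameters it is run on.

```python
from itertools import combinations


def max_collision_count(p, n):
    """Largest number of (a, b) pairs under which some fixed pair x != y collides."""
    worst = 0
    for x, y in combinations(range(p), 2):
        collisions = sum(
            1
            for a in range(1, p)
            for b in range(p)
            if ((a * x + b) % p) % n == ((a * y + b) % p) % n
        )
        worst = max(worst, collisions)
    return worst


p, n = 5, 3
family_size = p * (p - 1)
worst = max_collision_count(p, n)
print(worst, family_size)        # 4 20
print(worst * n <= family_size)  # True, i.e. collision probability <= 1/n
```

For p = 5 and n = 3 every pair collides under exactly 4 of the 20 functions, so the collision probability is 1/5, comfortably below 1/3.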
So the whole scheme will be
• Choose h randomly from H.
• We can store h in space O(log(M)).
• The expected time to do any L operations on these n elements is O(L).
[Figure: h maps u_i from the universe U of size M into one of the n buckets.]
Recap
Want O(1) INSERT/DELETE/SEARCH
• We are interested in putting nodes with keys into a data structure that supports fast INSERT/DELETE/SEARCH.
  • INSERT
  • DELETE
  • SEARCH
[Figure: nodes with keys go into the data structure; SEARCH returns the requested node (“HERE IT IS”).]
We studied this game
1. An adversary chooses any n items $u_1, u_2, \dots, u_n \in U$, and any sequence of L INSERT/DELETE/SEARCH operations on those items.
   Example: the items 13, 22, 43, 92, 7 with the sequence INSERT 13, INSERT 22, INSERT 43, INSERT 92, INSERT 7, SEARCH 43, DELETE 92, SEARCH 7, INSERT 92.
2. You, the algorithm, choose a random hash function $h: U \to \{1, \dots, n\}$.
3. HASH IT OUT
[Figure: the items 13, 22, 43, 92, 7 are hashed into buckets 1, 2, 3, …, n.]
Uniformly random h was good
• If we choose h uniformly at random, for all $u_i, u_j \in U$ with $u_i \neq u_j$,
  $$P_{h \in H}[h(u_i) = h(u_j)] \leq \frac{1}{n}$$
• That was enough to ensure that, in expectation, a bucket isn’t too full.
A bit more formally:
For any sequence of L INSERT/DELETE/SEARCH operations on any n elements of U, the expected runtime (over the random choice of h) is O(L), aka, O(1) per operation.
Uniformly random h was bad
• If we actually want to implement this, we have to store the hash function h!
• That takes a lot of space!
  • We may as well have just initialized a bucket for every single item in U.
• Instead, we chose a function randomly from a smaller set.
We needed a smaller set that still has this property
• If we choose h uniformly at random from the set H, for all $u_i, u_j \in U$ with $u_i \neq u_j$,
  $$P_{h \in H}[h(u_i) = h(u_j)] \leq \frac{1}{n}$$
This was all we needed to make sure that the buckets were balanced in expectation!
• We call any set with that property a universal hash family.
• We were able to come up with a really small one!
Conclusion:
• We can build a hash table that supports INSERT/DELETE/SEARCH in O(1) expected time,
  • if we know that only n items are ever going to show up, where n is waaaayyyyyy less than the size M of the universe.
• The space to implement this hash table is O(n log(M)) bits.
  • M is waaayyyyyy bigger than n, but log(M) probably isn’t.
Next Week
• Graph algorithms!