GIAN Course - Big Social Data Analysis

Introduction to Information Retrieval
Course Introduction

Course staff
Erik Cambria
Haiyun Peng

NTU-India Connect programme
http://global.ntu.edu.sg/gmp/ic

Textbook
- Introduction to Information Retrieval
  - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
  - Cambridge University Press, 2008
  - eBook, lecture slides, etc., available from http://informationretrieval.org

Reference books
- Modern Information Retrieval. Ricardo Baeza-Yates & Berthier Ribeiro-Neto. Addison Wesley, 1999
- Mining the Web: Discovering Knowledge from Hypertext Data. Soumen Chakrabarti. Elsevier Morgan Kaufmann, 2002
- A Practical Guide to Sentiment Analysis. Erik Cambria, Dipankar Das, Sivaji Bandyopadhyay, Antonio Feraco. Springer Verlag, 2017
- Text Mining: Predictive Methods for Analyzing Unstructured Information. Sholom Weiss, Nitin Indurkhya, Tong Zhang, & Fred Damerau. Springer Verlag, 2005

Course details
- Pre-requisites
  - Computer science
    - Data structures
    - Algorithms
  - Mathematics
    - Linear algebra
    - Probability
- Related subjects
  - Artificial Intelligence
  - Natural Language Processing
  - Data Mining
  - Machine Learning

Big data: Blessing or curse?
Information is the main treasure of humankind.
Without efficient management, such a treasure becomes useless: the more we have, the less we can use.

Making sense of data
http://youtu.be/kb7RL6bLmHE

Information need
- IR is the task of automatically satisfying the user's information need
  - Understand the information need
  - Satisfy the information need
- But the needs are very different: IR has many faces

Types of information need
- Find a document or documents
- What for?
  - Answer a question → Question Answering
  - Save time on reading → Text Summarization
  - Find what the document is about → Topic Modeling
  - Check authenticity → Plagiarism Detection
  - Find what's new → Trend Discovery
  - Mine opinions → Sentiment Analysis
  - and more…

Sentiment analysis
- Sentiment analysis has raised growing interest both within the scientific community, leading to many exciting open challenges, as well as in the business world, due to the remarkable benefits to be had from financial forecasting

Microtext normalization
- Before NLP techniques can be applied, informal text (e.g., c u l8r), acronyms (e.g., LOL), and emoticons (e.g., :>) need to be translated into plain English
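
As a rough illustration of the normalization step, the sketch below maps informal tokens to plain English through a lookup table. The lexicon entries and the function name are hypothetical, chosen only for this example; a practical system would need a much larger lexicon and context-aware disambiguation.

```python
# Hypothetical lookup table for illustration; real systems rely on far larger lexicons.
MICROTEXT_LEXICON = {
    "c": "see",
    "u": "you",
    "l8r": "later",
    "lol": "laughing out loud",
}


def normalize_microtext(text: str) -> str:
    """Replace every token found in the lexicon with its plain-English form."""
    return " ".join(MICROTEXT_LEXICON.get(token.lower(), token) for token in text.split())


print(normalize_microtext("c u l8r"))
```

A plain dictionary lookup cannot tell when "c" is a genuine letter rather than shorthand for "see", which is one reason the task is harder than it looks.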

Sentence boundary disambiguation
- The group included J. M. Freeman Jr. and T. Boone Pickens.
- The group included J. M. Freeman Jr. T. Boone Pickens had left.
- He stopped to see Dr. Lawson.
- He stopped at Meadows Dr. Lawson was still open.
- It was due Friday by 5 p.m. Saturday would be too late.
- She has an appointment at 5 p.m. Saturday to get her car fixed.

Text chunking
- Text chunking, also referred to as shallow parsing, is a task that follows POS tagging and that adds more structure to the sentence
- The result is a grouping of the words in "chunks"

Lemmatization
eat_burger
eats_a_burger
ate_burger
eating_burger
eat_burgers
eat_the_burger

Concept extraction
- the camera has [long focus time]
- the camera takes a [long time] to [focus]
- the [focusing] of the camera takes [long time]
- the [focus time] of the camera is very [long]
=> long_focus_time

Named-entity recognition
- Named-entity recognition is key for improving anaphora resolution and, hence, for detecting aspects or opinion targets in reviews

Anaphora resolution
- Anaphora is the use of an expression the interpretation of which depends upon another one
- It is commonly resolved by gender and number agreement

Subjectivity detection
- Subjectivity detection is a binary classification task that consists in classifying text into either objective (neutral) or subjective (i.e., positive or negative)

Personality recognition
- Personality recognition is an important step towards user-dependent sentiment analysis (user profiling) and it is useful for sarcasm detection
http://datanami.com/2017/09/21/deep-learning-reveals-new-insights-people

Aspect extraction
I love iPhoneX's touchscreen but the battery life is so short
=> touchscreen, battery

Polarity detection
- Early works treated polarity detection as a binary classification problem (pos vs neg)
- Recent works calculate polarity intensity as a float ∈ [-1, +1]

Retrieval vs. extraction
- Information retrieval is about retrieving relevant data based on a query – you specify what information you need and it is returned in a human-understandable form
  - (e.g., find relevant opinions)
- Information extraction is about structuring unstructured information – given some sources, all of the (relevant) information is structured in a form that will be easy for processing
  - (e.g., classify opinions as positive or negative)

Information retrieval
- Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
  - E-mail search
  - Searching your laptop
  - Corporate knowledge bases
  - Legal information retrieval
  - Web search

Web: connecting people
- The potential for knowledge sharing today is unmatched in history
- Never before have so many knowledgeable people been connected

To be or not to be connected?
- Being connected is good but being disconnected for the past millions of years was the main reason behind our cultural diversity

Connected but alone
http://ted.com/talks/sherry_turkle_alone_together

Socialnomics
http://youtu.be/PWa8L43kELQ

Being connected
- Leonardo's discoveries and inventions in science, art, engineering, and aesthetics were based only on his perception of the world

The Web is very young
- Less than 30 years have elapsed since the invention of the Web
- We are still just 'playing' with it as we are yet to discover how to fully make use of it

A Web of knowledge
- In 1910, Otlet first envisioned a "city of knowledge" that would serve as a central repository for the world's information, but text was not digital yet

The Machine is Us/ing Us
http://youtu.be/NLlGopyXT_g

The Web as a laboratory
- The Web today not only represents an unlimited data store but also a multi-disciplinary laboratory environment for world-scale experiments

The five eras of the Web
- The Web is evolving towards a shared social experience, in which consumers will rely on their peers as they make online decisions and will shape future products

Big social data analysis
- Between the dawn of the Internet and year 2003, there were five exabytes of information on the Web
- Now, we create five exabytes every two days

Drowning in data?
http://straitstimes.com/singapore/world-faces-data-storage-crunch-ahead

Storages are not forever
E Cambria, A Chattopadhyay, E Linn, B Mandal, B White. Storages are not forever. Cognitive Computation 9(5), 646-658 (2017)

Collected intelligence
- Information today is extremely portable and processable
- However, this collected intelligence is far from being addressed as collective intelligence

Is it people's fault?
- Online contents are mostly meant for human consumption
- Why should web developers and bloggers care about making their content machine processable?

Evolution of NLP
- NLP technologies evolved from the era of punch cards (7 mins per sentence) to the era of Google and its like (less than a second per sentence)

NLP emergency
- In a Web where UGC has hit critical mass, NLP is becoming key for aggregating information although systems are still limited by what they can 'see'

More than we see
- Language is somewhere in between perception and understanding
- A translucent material, so that the world bears the tint and focus of what we express through it

Understanding language
- Natural language understanding requires high-level symbolic capabilities that most NLP technologies currently do not possess

The hardest problem?
We can understand almost anything, but we can't understand how we understand.
  Albert Einstein
We understand human mental processes only slightly better than a fish understands swimming.
  John McCarthy
How the mind works is still a mystery. We understand the hardware, but we don't have a clue about the operating system.
  James Watson

Veneer of intelligence
http://youtu.be/SrTfzHXQdkc

Illusion of understanding
- Assessing the intelligence of AI systems is like a dog chasing its own tail
- We are the ones interpreting the results of the AI systems we build

AI meets natural stupidity
- A key failure of AI is the persistence in seeking the best way to solve a problem, which leads to the creation of expert (not intelligent) systems

Commonsense blindness
- The definition of today's AI is a machine that can make a perfect chess move while the room is on fire

Machine learning
- Unlike symbolic AI, machine learning (sub-symbolic AI) does not need hand-crafted rules (top-down) as it is mostly data-driven (bottom-up)

Neural networks
- Artificial neural networks were actually invented in the 1940s so what is the reason for so much excitement around them now?

Learning what again?
- In most cases, we are simply teaching machines word co-occurrence frequencies
- It's like teaching someone snowboarding by only showing them videos

Machine learning issues
- Dependency: it requires (a lot of) training data and is domain-dependent
- Consistency: different training or tweaking leads to different results
- Transparency: the reasoning process is uninterpretable

Dependency
- A machine learning algorithm trained on dataset A will not work well on dataset B, especially if A and B are about different domains

Consistency
- Pushed by the Publish-or-Perish principle, some researchers often "stir their pile" to improve algorithm accuracy by a few percent

Transparency
- Most machine learning techniques are black-box algorithms: they classify data based on learnt features we do not know much about

The dark side of AI
- Black-box algorithms
  - Deep learning
  - Opaque reasoning
- Machiavellian approach
  - Brute force

Do like the Ancient One
http://dmnews.com/article/738754

Anti-Copernican AI revolution
Top-down (theory-driven) approach
Bottom-up (data-driven) approach

Keyword spotting
- Although the most naïve approach, the accessibility and economy of keyword spotting make it one of the most popular

Lexical affinity
- Lexical affinity assigns arbitrary words a probable "affinity" to a particular class, e.g., "accident" has a 75% probability of indicating a negative affect

Statistical methods
- By feeding a ML algorithm a large training corpus, statistical methods not only learn the valence of affect words, but also that of other arbitrary keywords

Concept-level analysis
- By relying on ontologies or semantic networks, concept-level approaches step away from blindly using affect keywords and word co-occurrence frequencies

Conceptualization
- Concepts are immaterial entities that only exist in the mind of the speaker
- To be communicated, they must be represented in terms of some concrete artifact

Understand is simplify
- Conceptual primitives allow machines to better grasp the meaning of concepts that are often opaque due to the richness and ambiguity of natural language
http://tinyurl.com/top-down-bottom-up-nlp

A 'pipe' is not a pipe
- You can know the name of all the different kinds of 'pipe', but you know nothing about a pipe until you comprehend its purpose and method of usage

Drawbacks of the bag-of-words model
long
big
cold

Drawbacks of the bag-of-words model
smile => sad_smile
damn => damn_good
pretty => pretty_ugly

IR system
- Input: query string and document corpus
- Output: ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …)
- A set of documents: assume it is a static collection for the moment
- Goal: retrieve documents with information that is relevant to the user's information need

IR system: Example
- User task: get rid of mice in a politically correct way
  - Misconception?
- Info need: info about removing mice without killing them
  - Misformulation?
- Query: how trap mice alive
- Search: the search engine runs the query against the collection
- Results, followed by query refinement

IE system
Goal: extract information from the retrieved documents to help the user complete a task

What this course covers
- Understanding of IR & IE systems
  - Better use of IR & IE services
  - Improvement of existing systems
- Design and development for new domains
- Innovations in IR & IE

What this course does NOT cover
- Multimodal retrieval
  - Image and video
  - Audio
- Advanced NLP topics
  - Parsing
  - Ontologies
  - Anaphora resolution
  - Named entity recognition

IR & IE business
- So many tech titans are doing IR & IE
- What can we do more/better?

IR & IE business
- Google, Yahoo, Bing know how to build a search engine
  - Do we know?
  - Did they tell us?
- They never will (fully)

Motivation #1: Acquire know-how
- To know about the methods Google, Bing, Yahoo hide
- To customize such algorithms to other businesses
- To do things that these companies cannot do

Representing documents
- We will see 3 kinds of document representation
  - One hot encoding
    - Boolean: Yes vs No
  - Tf-idf weighting scheme
    - similar to one hot encoding but also encodes how important each term is
  - Word embeddings
    - most popular method and often works better than the rest

One hot encoding
- Vector space modeling
  - S1 = {Information Retrieval is an excellent topic to study}
  - Vocabulary = {Information, knowledge, Retrieval, student, class, google, index, is, an, the, excellent, understanding, topic, grasp, course, people, web, to, study}
- One hot encoding
  - N-dimensional vector where each coordinate represents one word in the vocabulary
  - X = {x1, x2, ……, xn}
  - xi = 1 if word i is present in the document, otherwise xi = 0
  - S1 = {1,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1}
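
The encoding rule can be checked with a few lines of Python. This is a minimal sketch, assuming lowercasing and whitespace tokenization (neither is specified on the slide); the function name is hypothetical. The vector has one entry per vocabulary word, so 19 entries for the vocabulary listed here.

```python
VOCABULARY = [
    "information", "knowledge", "retrieval", "student", "class", "google",
    "index", "is", "an", "the", "excellent", "understanding", "topic",
    "grasp", "course", "people", "web", "to", "study",
]


def one_hot_document(text, vocabulary):
    """Return a 0/1 vector with one coordinate per vocabulary word."""
    present = set(text.lower().split())  # assumed tokenization: lowercase + split
    return [1 if word in present else 0 for word in vocabulary]


s1 = "Information Retrieval is an excellent topic to study"
vector = one_hot_document(s1, VOCABULARY)
print(len(vector), vector)
```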

Tf-idf weighting scheme
- Tf-idf
  - tf = term frequency
  - df = document frequency
  - idf = inverse document frequency
- Replace one hot encoding with tf-idf weight
  - tf = tf(t, D) = log[freq(t, D)] + 1
  - idf = idf(t) = log(N/n)
  - N = number of documents
  - n = number of documents that contain term t
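
A small worked sketch of the weighting, using the standard form idf(t) = log(N/n), where N is the number of documents and n is the number of documents containing t. The toy corpus is hypothetical, and treating tf as 0 for an absent term is an assumption the slide leaves implicit.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Log-scaled term frequency: log(freq) + 1, or 0 if the term is absent."""
    freq = Counter(doc_tokens)[term]
    return math.log(freq) + 1 if freq > 0 else 0.0


def idf(term, corpus):
    """Inverse document frequency: log(N / n)."""
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n) if n > 0 else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical three-document corpus, already tokenized.
corpus = [
    "information retrieval is an excellent topic".split(),
    "retrieval of information from the web".split(),
    "the web is young".split(),
]
print(tf_idf("excellent", corpus[0], corpus))  # rare term: higher weight
print(tf_idf("retrieval", corpus[0], corpus))  # term in two documents: lower weight
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the document vector.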

Word embeddings
- The embedding of a word is a d-dimensional vector which encodes semantic information
- CBOW aims to predict a word w given its context words (goal is to calculate the embedding of the word w)
- You can build your own
  - Randomly initialize d-dimensional word embeddings for the words
  - Form a complex neural network
  - Input of the network is the words with random word embeddings (goal is to tune these embeddings through training)
  - Design a loss function
  - Backpropagate the error to the input layer to train the network

Word embeddings
- Excellent if you try to design your own word embeddings method
  - There are helper functions and libraries available in Python
  - The basis is backpropagation
- Otherwise use Gensim, a popular word2vec tool which uses the CBOW model by default

Word2vec in IR
- Forming document vectors from word vectors
- Finding similar words to a given word
- Clustering
- Analytical inference
  - vector('king') - vector('man') + vector('woman') ≈ vector('queen')
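
Two of the uses listed above can be sketched without any library: averaging word vectors into a document vector, and answering an analogy by nearest cosine neighbour. The 3-dimensional embeddings below are invented for illustration; trained word2vec vectors typically have a few hundred dimensions, and averaging is only one of several ways to form a document vector.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
EMBEDDINGS = {
    "king": [0.8, 0.7, 0.1],
    "queen": [0.8, 0.1, 0.7],
    "man": [0.2, 0.9, 0.1],
    "woman": [0.2, 0.3, 0.7],
    "apple": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def document_vector(tokens):
    """Average the embeddings of the known tokens, coordinate by coordinate."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


print(analogy("king", "man", "woman"))
print(document_vector(["king", "queen"]))
```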

Word2vec = universal solution?
- Word2vec captures some semantics but it is still word co-occurrence based
- We need semantics, e.g., knowledge bases, semantic networks, ontologies, etc.
  - WordNet
  - ConceptNet
  - SenticNet
  - NELL
  - YAGO
  - Probase

Introduction to Information Retrieval
Lecture 1: Boolean Retrieval

Structured (db) vs. unstructured (txt) data
- Structured data tends to refer to information in "tables"

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith

Unstructured data
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia
- Why is that not the answer?
  - Slow (for large corpora)
  - Other operations (e.g., find the word Romans near countrymen) not feasible
  - Ranked retrieval (best documents to return)

Term-document incidence

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia

Incidence vectors
- So we have a 0/1 vector for each term
- To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND
- 110100 AND 110111 AND 101111 = 100100
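
The bitwise computation can be reproduced directly with Python integers. This is a minimal sketch: the three rows are copied from the term-document incidence matrix, with the leftmost bit standing for the first play, and the helper names are hypothetical.

```python
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# Bit i (counted from the left) is 1 when the term occurs in PLAYS[i].
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

ALL_PLAYS = (1 << len(PLAYS)) - 1  # 0b111111, keeps the complement to 6 bits


def brutus_and_caesar_not_calpurnia():
    not_calpurnia = ~INCIDENCE["Calpurnia"] & ALL_PLAYS  # 0b101111
    return INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia


def plays_for(bits):
    """Translate a 6-bit answer vector back into play titles."""
    n = len(PLAYS)
    return [play for i, play in enumerate(PLAYS) if (bits >> (n - 1 - i)) & 1]


answer = brutus_and_caesar_not_calpurnia()
print(format(answer, "06b"), plays_for(answer))
```

The mask is needed because Python integers are unbounded, so a bare `~` would produce a negative number instead of a 6-bit complement.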

Answers to query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me

Bigger collections
- Consider N = 1 million documents, each with about 1000 words
- Avg 6 bytes/word including spaces/punctuation
  - 6GB of data in the documents
- Say there are M = 500K distinct terms among these

Build the matrix
- 500K x 1M matrix has half-a-trillion 0s and 1s
- But it has no more than one billion 1s (Why?)
  - matrix is extremely sparse
- What's a better representation?
  - We only record the 1s
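
The sparsity bound follows from the collection sizes: a document of about 1000 words can contain at most 1000 distinct terms, so each column of the matrix holds at most 1000 ones. A quick arithmetic check, using only the figures given on these slides:

```python
num_docs = 1_000_000      # N
words_per_doc = 1_000     # about 1000 words per document
num_terms = 500_000       # M distinct terms

matrix_cells = num_terms * num_docs   # every (term, document) pair
max_ones = num_docs * words_per_doc   # at most one 1 per word occurrence
density = max_ones / matrix_cells     # upper bound on the fraction of 1s

print(matrix_cells)  # half a trillion cells
print(max_ones)      # at most one billion 1s
print(density)       # at most 0.2% of the cells are 1
```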

Inverted index
- For each term t, we must store a list of all documents that contain t
  - Identify each by a docID, a document serial number
- Can we use fixed-size arrays for this?

Brutus     → 2 4 11 31 45 173
Caesar     → 2 4 5 6 16 57
Calpurnia  → 31 54 101

What happens if the word Caesar is added to document 14?

Inverted index
- We need variable-size postings lists
  - On disk, a continuous run of postings is normal and best
  - In memory, can use linked lists or variable length arrays
    - Some tradeoffs in size/ease of insertion

Dictionary   Postings (each docID is a posting)
Brutus     → 2 4 11 31 45 173
Caesar     → 2 4 5 6 16 57
Calpurnia  → 31 54 101

Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
  → Tokenizer
Token stream: Friends Romans Countrymen
  → Linguistic modules
Modified tokens: friend roman countryman
  → Indexer
Inverted index:
  friend     → 2 4
  roman      → 1 2
  countryman → 13 16

Initial stages of text processing
- Tokenization
  - Cut character sequence into word tokens
    - Deal with "John's", a state-of-the-art solution
- Normalization
  - Map text and query term to same form
    - You want U.S.A. and USA to match
- Stemming
  - We may wish different forms of a root to match
    - authorize, authorization
- Stopwords
  - We may omit very common words (or not)
    - the, a, to, of

Indexer steps: Token sequence
- Sequence of (Modified token, Document ID) pairs

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: Sort
- Sort by terms
  - And then docID
Core indexing step

Indexer steps: Dictionary & postings
- Multiple term entries in a single document are merged
- Split into Dictionary and Postings
- Doc. frequency (DF) information is added
Why frequency? Will discuss later
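
The indexer steps (token sequence, sort, merge into dictionary and postings with document frequency) can be sketched on the two example documents. This is an illustrative simplification: the regular-expression tokenizer and plain lowercasing stand in for the tokenizer and linguistic modules, and the function names are hypothetical.

```python
import re
from collections import defaultdict

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Crude stand-in for tokenizer + linguistic modules: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(token, doc_id) for doc_id, text in docs.items() for token in tokenize(text)]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge duplicate entries within a document into one posting.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    # Dictionary entry: term -> (document frequency, postings list).
    return {term: (len(ids), ids) for term, ids in postings.items()}


index = build_index(DOCS)
print(index["caesar"])
print(index["killed"])
```

Because the pairs are sorted before merging, every postings list comes out ordered by docID, which the linear-time merge used for query processing relies on.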

What is stored in the index?
- Dictionary: terms and counts
- Pointers from each term to its postings
- Postings: lists of docIDs
IR system implementation:
- How do we index efficiently?
- How much storage do we need?

The index we just built
- How do we process a query?
  - Later: what kinds of queries can we process?

Review: Process with incidence matrix
- So we have a 0/1 vector for each term
- To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND
- 110100 AND 110111 AND 101111 = 100100

Query processing: AND
- Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the Dictionary
    - Retrieve its postings
  - Locate Caesar in the Dictionary
    - Retrieve its postings
  - "Merge" the two postings

The merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries
If the list lengths are x and y, the merge takes O(x+y) operations
Crucial: postings sorted by docID
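
The merge can be written as a two-pointer walk over the sorted lists. A minimal sketch, using two sample postings lists; the function name is an illustrative choice.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer behind the smaller docID
        else:
            j += 1
    return answer


# Sample postings lists, sorted by docID.
brutus = [2, 4, 11, 31, 45, 173]
caesar = [2, 4, 5, 6, 16, 57]
print(intersect(brutus, caesar))
```

Each iteration advances at least one pointer, which gives the O(x+y) bound; the argument only holds because both lists are sorted by docID.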

Boolean queries: Exact match
- The Boolean retrieval model lets the user ask a query that is a Boolean expression:
  - Boolean queries use AND, OR and NOT to join query terms
  - Views each document as a set of words
  - Is precise: document matches condition or not
- Perhaps the simplest model to build an IR system on
- Primary commercial retrieval tool for 3 decades
- Many search systems you still use are Boolean:
  - Email, library catalog, Mac OS X Spotlight

Example: WestLaw (http://www.westlaw.com/)
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992; federated search added 2010)
- Tens of terabytes of data; 700,000 users
- Majority of users still use Boolean queries
- e.g., What is the statute of limitations in cases involving the federal tort claims act?
  - LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  - !: variant endings, space: disjunction
  - /3 = within 3 words, /S = in same sentence

Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
- Can we always merge in "linear" time?
  - Linear in what?
- Can we do better?

Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of n terms
- For each of the n terms, get its postings, then AND them together

Brutus     → 2 4 11 31 45 173
Caesar     → 2 4 5 6 16 57
Calpurnia  → 31 54 101

Query: Brutus AND Calpurnia AND Caesar

Query optimization example
- Process in order of increasing freq:
  - start with smallest set, then keep cutting further
  - This is why we kept document freq in dictionary

Brutus     → 2 4 11 31 45 173
Caesar     → 2 4 5 6 16 57
Calpurnia  → 31 54 101

Thus, execute the query as (Calpurnia AND Brutus) AND Caesar
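
The heuristic amounts to sorting the terms by document frequency before intersecting. The sketch below is illustrative only: the docIDs are hypothetical sample postings, the length of each list serves as its document frequency, and the function names are not from the slides.

```python
def intersect(p1, p2):
    """Linear-time intersection of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def and_query(postings_by_term):
    """AND all terms together, starting with the rarest term."""
    ordered = sorted(postings_by_term.items(), key=lambda item: len(item[1]))
    order = [term for term, _ in ordered]
    result = ordered[0][1]
    for _, postings in ordered[1:]:
        result = intersect(result, postings)  # intermediate result can only shrink
    return order, result


# Hypothetical sample postings (docIDs chosen for illustration).
POSTINGS = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}
order, result = and_query(POSTINGS)
print(order, result)
```

Starting from the shortest list keeps every intermediate result no longer than that list, so later merges touch fewer entries.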

More general optimization
- e.g., (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
- Get doc freqs for all terms
- Estimate the size of each OR by the sum of its doc freqs (conservative)
- Process in increasing order of OR sizes

Exercise
- Recommend a query processing order for
  (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
- Which two terms should we process first?

Term          Freq
eyes          213312
kaleidoscope  87009
marmalade     107913
skies         271658
tangerine     46653
trees         316812
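
One way to work through the exercise is to apply the conservative estimate mechanically: bound the size of each OR by the sum of its document frequencies, then sort. The sketch below uses the frequencies from the table; the variable and function names are illustrative.

```python
DOC_FREQ = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# Each tuple is one OR group; the groups are ANDed together.
QUERY = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]


def estimated_size(or_group):
    """Upper bound on the size of an OR: the sum of its terms' doc freqs."""
    return sum(DOC_FREQ[term] for term in or_group)


processing_order = sorted(QUERY, key=estimated_size)
for group in processing_order:
    print(group, estimated_size(group))
```

The estimate is an upper bound because documents containing both terms of a group are counted twice.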

Phrase queries
- Want to be able to answer queries such as "stanford university" – as a phrase
- Thus the sentence "I went to university at Stanford" is not a match
  - The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
  - Many more queries are implicit phrase queries
- For this, it no longer suffices to store only <term : docs> entries

BoW vs. BoC
cloud_computing => cloud
pain_killer => pain, killer

Solution 1: Biword indexes
- Index every consecutive pair of terms in the text as a phrase
- For example the text "Friends, Romans, Countrymen" would generate the biwords
  - friends romans
  - romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now immediate
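
A biword index can be sketched by pairing consecutive tokens and treating each pair as a dictionary term. In this illustration the tokenizer (lowercase, letters only) and the second document are assumptions added for the example.

```python
import re
from collections import defaultdict


def biwords(text):
    """Return every consecutive pair of tokens as a single 'biword' term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the sorted list of docIDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for biword in dict.fromkeys(biwords(docs[doc_id])):  # de-duplicate per doc
            index[biword].append(doc_id)
    return dict(index)


DOCS = {
    1: "Friends, Romans, Countrymen",
    2: "Romans, countrymen, lend me your ears",  # hypothetical second document
}
biword_index = build_biword_index(DOCS)
print(biwords(DOCS[1]))
print(biword_index["romans countrymen"])
```

A two-word phrase query then becomes a single dictionary lookup, at the cost of a much larger dictionary.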

Longer phrase queries
- stanford university palo alto can be broken into the Boolean query on biwords:
  stanford university AND university palo AND palo alto
- How to know which one is a significant biword?
- Index blowup due to bigger dictionary
  - Infeasible for more than biwords, big even for them

Solution 2: Positional indexes
- In the postings, store, for each term, the position(s) in which tokens of it appear:
<term, number of docs containing term;
 doc1: position1, position2 … ;
 doc2: position1, position2 … ;
 etc.>

Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not
- Merge their doc:position lists to enumerate all positions with "to be or not to be"
  - to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  - be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
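
The positional merge can be illustrated on the excerpt shown for "to" and "be". The sketch below only uses the entries visible on the slide (the lists are elided there), represents the index as nested dictionaries, and uses set lookups rather than a linear merge for brevity; all of these are choices made for the example.

```python
# Excerpt of a positional index: term -> {docID: [positions]}.
POSITIONAL_INDEX = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}


def phrase_matches(index, terms):
    """Return {docID: [start positions]} where the terms occur consecutively."""
    common_docs = set(index[terms[0]])
    for term in terms[1:]:
        common_docs &= set(index[term])  # a match needs every term in the document
    matches = {}
    for doc_id in sorted(common_docs):
        later = [set(index[term][doc_id]) for term in terms[1:]]
        starts = [
            pos for pos in index[terms[0]][doc_id]
            if all(pos + k + 1 in later[k] for k in range(len(later)))
        ]
        if starts:
            matches[doc_id] = starts
    return matches


print(phrase_matches(POSITIONAL_INDEX, ["to", "be"]))
```

Only document 4 contains both terms in this excerpt, and "be" follows "to" there at four positions.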

Positional index size
- A positional index expands postings storage substantially
  - Even though indices can be compressed
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system
- In the case of a static collection, we have to do it only once anyway
Rules of thumb
- A positional index is 2–4 times as large as a non-positional index
- Positional index size is 35–50% of the volume of the original text
- Caveat: all of this holds for "English-like" languages

Combination schemes
- These two approaches (biword index, positional index) can be profitably combined
  - For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
  - Even more so for phrases like "The Who"
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  - A typical web query mixture was executed in ¼ of the time of using just a positional index
  - It required 26% more space than having a positional index alone

Skip pointers
- Walk through the two postings simultaneously, in time linear in the total number of postings entries

  Brutus:       2 4 8 41 48 64 128
  Caesar:       1 2 3 8 11 17 21 31
  Intersection: 2 8

- If the list lengths are m and n, the merge takes O(m+n) operations
- Can we do better? Yes (if the index isn't changing too fast)

Skip pointers

  Upper list: 2 4 8 41 48 64 128     (skip pointers to 41 and 128)
  Lower list: 1 2 3 8 11 17 21 31    (skip pointers to 11 and 31)

- Suppose we've stepped through the lists until we process 8 on each list. We match it and advance.
- We then have 41 on the upper list and 11 on the lower. 11 is smaller.
- But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

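A minimal Python sketch of intersection with skip pointers follows. It makes one assumption not stated on the slides: skip pointers are simulated on plain lists by letting every `skip`-th entry jump `skip` entries ahead, with `skip` set to the integer square root of the list length (a common heuristic). The skip placement therefore differs slightly from the figure.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, using evenly spaced simulated skip pointers."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))
```

Because a skip is taken only when its target is still less than or equal to the value on the other list, no matching docID can be jumped over.
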
Compression
- Use less disk space
  - Saves a little money
- Keep more stuff in memory
  - Increases speed
- Increase speed of data transfer from disk to memory
  - [read compressed data | decompress] is faster than [read uncompressed data]
  - Premise: Decompression algorithms are fast

Compression
- Dictionary
  - Make it small enough to keep in main memory
  - Make it so small that you can keep some postings lists in main memory too
- Postings files
  - Reduce disk space needed
  - Decrease time needed to read postings lists from disk
  - Large search engines keep a significant part of the postings in memory
    - Compression lets you keep more in memory

Index parameters vs. what we index

                 dictionary           non-positional index       positional index
                 (word types/terms)   (non-positional postings)  (positional postings)
                 Size (K)             Size (K)                   Size (K)
  Unfiltered     484                  109,971                    197,879
  No numbers     474                  100,680                    179,158
  Case folding   392                   96,969                    179,158
  30 stopwords   391                   83,390                    121,858
  150 stopwords  391                   67,002                     94,517
  Stemming       322                   63,812                     94,517

Lossless vs. lossy compression
- Lossless compression: All information is preserved
  - What we mostly do in IR
- Lossy compression: Discard some information
- Several of the preprocessing steps can be viewed as lossy compression: case folding, stopwords, stemming, number elimination
- Prune postings entries that are unlikely to turn up in the top k list for any query
  - Almost no loss of quality for the top k list

Vocabulary vs. collection size
- Heaps' law: M = kT^b
- M is the size of the vocabulary, T is the number of tokens in the collection
- Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
  - It is the simplest possible relationship between the two in log-log space
  - An empirical finding ("empirical law")

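To make the formula concrete, here is a small Python sketch. The constants k = 50 and b = 0.5 are hypothetical values picked from inside the typical ranges quoted above, not measurements from any collection.

```python
import math

def heaps_vocabulary(num_tokens, k, b):
    """Vocabulary size predicted by Heaps' law, M = k * T**b."""
    return k * num_tokens ** b

# Assumed constants for illustration only.
K_ASSUMED, B_ASSUMED = 50, 0.5

for tokens in (10_000, 1_000_000, 100_000_000):
    print(tokens, round(heaps_vocabulary(tokens, K_ASSUMED, B_ASSUMED)))

# In log-log space the law is a straight line: log M = log k + b * log T.
m1 = heaps_vocabulary(10_000, K_ASSUMED, B_ASSUMED)
m2 = heaps_vocabulary(1_000_000, K_ASSUMED, B_ASSUMED)
slope = (math.log(m2) - math.log(m1)) / (math.log(1_000_000) - math.log(10_000))
print(slope)
```

With b = 0.5, multiplying the number of tokens by 100 multiplies the predicted vocabulary by only 10.
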
Vocabulary vs. collection size
- x-axis: text size
- y-axis: number of distinct vocabulary elements present in the text

Zipf's law
- Heaps' law gives the vocabulary size in collections
- We also study the relative frequencies of terms
- In natural language, there are a few very frequent terms and very many very rare terms
- Zipf's law: The i-th most frequent term has frequency proportional to 1/i
- cf_i ∝ 1/i = K/i where K is a normalizing constant
- cf_i is collection frequency: the number of occurrences of the term t_i in the collection

Zipf consequences
- Zipf's law states that given a natural language corpus, the frequency of any word is inversely proportional to its rank in the frequency table.
- Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Zipf consequences
- For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million)
- True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852)
- Only 135 vocabulary items are needed to account for half the Brown Corpus

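The Brown Corpus counts quoted above can be compared with what Zipf's law would predict. The sketch below assumes the normalizing constant K equals the count of the top-ranked word; the function name `zipf_prediction` is made up for the example.

```python
# Observed counts quoted on the slide.
brown_counts = {"the": 69_971, "of": 36_411, "and": 28_852}

def zipf_prediction(top_count, rank):
    """Frequency predicted for a given rank when cf_i = K / i and K = cf_1."""
    return top_count / rank

for rank, (word, observed) in enumerate(brown_counts.items(), start=1):
    predicted = zipf_prediction(brown_counts["the"], rank)
    print(f"{rank} {word:>4} observed={observed:>6} predicted={predicted:>9.1f}")
```

The fit is close for rank 2 and looser for rank 3, which is typical: Zipf's law is an empirical approximation, not an exact rule.
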
Why compress the dictionary?
- Search begins with the dictionary
- We want to keep it in memory
- Memory footprint competition with other applications
  - Memory footprint: amount of memory a program uses
- Embedded/mobile devices may have very little memory
- Even if the dictionary isn't in memory, we want it to be small for a fast search startup time

Dictionary storage - first cut
- Array of fixed-width entries
  - ~400,000 terms; 28 bytes/term = 11.2 MB

  Terms      Freq.     Postings ptr.
  a          656,265
  aachen     65
  ...        ...
  zulu       221
  20 bytes   4 bytes   4 bytes

  (The Terms column is the dictionary search structure.)

Fixed-width terms are wasteful
- Most of the bytes in the Term column are wasted: we allot 20 bytes for 1-letter terms
  - And we still can't handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons
- Ave. dictionary word in English: ~8 characters
  - How do we use ~8 characters per dictionary term?

Dictionary as a string
- Store dictionary as a (long) string of characters:
  - Pointer to next word shows end of current word
  - Hope to save up to 60% of dictionary space

  ....systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo....

  Freq.   Postings ptr.   Term ptr.
  33
  29
  44
  126

- Total string length = 400K x 8B = 3.2MB
- Pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits ≈ 3 bytes

Blocking
- Store pointers to every kth term string
- Need to store term lengths (1 extra byte)

  ....7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo....

  Freq.   Postings ptr.   Term ptr.
  33
  29
  44
  126

- With blocks of k = 4 terms: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.

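The sizes behind the three dictionary layouts can be checked with a few lines of arithmetic. The sketch below only uses figures stated on the slides (400,000 terms, 20-byte fixed-width terms, 4-byte frequency and postings pointer, 3-byte term pointer, about 8 characters per term, blocks of 4); the variable names are made up for the example.

```python
NUM_TERMS = 400_000
FREQ_BYTES = 4
POSTINGS_PTR_BYTES = 4

# First cut: 20 bytes reserved for every term.
fixed_width = NUM_TERMS * (20 + FREQ_BYTES + POSTINGS_PTR_BYTES)

# Dictionary as a string: 3-byte term pointer plus ~8 characters per term.
TERM_PTR_BYTES = 3
AVG_TERM_CHARS = 8
as_string = NUM_TERMS * (FREQ_BYTES + POSTINGS_PTR_BYTES + TERM_PTR_BYTES + AVG_TERM_CHARS)

# Blocking with k = 4: each block drops 3 term pointers but adds 4 length bytes.
K = 4
saved_per_block = (K - 1) * TERM_PTR_BYTES - K * 1
blocked = as_string - (NUM_TERMS // K) * saved_per_block

for name, size in [("fixed width", fixed_width),
                   ("string + pointers", as_string),
                   ("blocking, k=4", blocked)]:
    print(f"{name:>18}: {size / 1_000_000:.1f} MB")
```
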
Front coding
- Sorted words commonly have a long common prefix: store differences only (see wildcard queries in next lecture)

  8automata8automate9automatic10automation
  → 8automat*a1◊e2◊ic3◊ion

  - "8automat*" encodes automat
  - The numbers 1, 2, 3 give the extra length beyond automat

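A small Python sketch of the encoding shown above. It assumes the whole block shares one common prefix and follows the slide's notation (`*` ends the prefix, `◊` separates the suffix entries); the function name `front_encode` is hypothetical, and decoding is left out.

```python
import os

def front_encode(terms):
    """Front-code a sorted block of terms that share a common prefix."""
    # os.path.commonprefix compares strings character by character.
    prefix = os.path.commonprefix(terms)
    first = terms[0]
    # First entry: full length of the first term, the shared prefix, '*', then the rest.
    parts = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    # Later entries: length of the extra suffix, '◊', then the suffix itself.
    for term in terms[1:]:
        suffix = term[len(prefix):]
        parts.append(f"{len(suffix)}◊{suffix}")
    return "".join(parts)

block = ["automata", "automate", "automatic", "automation"]
print(front_encode(block))
```
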
RCV1 dictionary compression summary

  Technique                                          Size in MB
  Fixed width                                        11.2
  Dictionary as string with pointers to every term    7.6
  Blocking                                            7.1
  Blocking + front coding                             5.9

Postings compression
- We store the list of docs containing a term in increasing order of docID
  - computer: 33,47,154,159,202 ...
- Consequence: it suffices to store gaps
  - 33,14,107,5,43 ...
- Hope: most gaps can be encoded/stored with far fewer than 20 bits

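Gap encoding is easy to sketch in Python. The function names `to_gaps` and `from_gaps` are made up for this illustration; the docID list is the "computer" example from the slide.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Keep the first docID, then store each docID as the difference from its predecessor."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Recover the docIDs as running sums of the gaps."""
    return list(accumulate(gaps))

computer = [33, 47, 154, 159, 202]
gaps = to_gaps(computer)
print(gaps)
print(from_gaps(gaps))
```

The gaps of a frequent term are small numbers, which is what makes a variable-length encoding of them pay off.
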
Unstructured vs. semi-structured data
- Unstructured data
  - Typically refers to free text
  - Allows
    - Keyword queries including operators
    - More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
- Semi-structured data
  - In fact almost no data is "unstructured"
  - E.g., this slide has distinctly identified zones such as the Title and Bullets
  - Facilitates "semi-structured" search, e.g., Title contains data AND Bullets contain search

Takeaways
- Term-document incidence
- Inverted index
- Boolean query
  - Merging
  - Query optimization
- Phrase queries
- Optimization
  - Skip pointers
  - Compression

Introduction to Information Retrieval

Lecture 2: Tolerant Retrieval

GIAN Course
Big Social Data Analysis

Parsing a document
- What format is it in?
  - pdf/word/excel/html?
- What language is it in?
- What character set is in use?
  - e.g., CP1252, UTF-8
- Each of these is a classification problem, which we will study later in the course
- But these tasks are often done heuristically ...

Complications: Format/language
- Documents being indexed can include docs from many different languages
  - A single index may contain terms of several languages
- Sometimes a document or its components can contain multiple languages/formats
  - e.g., French email with a German pdf attachment
- There are commercial and open source libraries that can handle a lot of this stuff

Complications: What is a document?
- We return from our query "documents" but there are often interesting questions of grain size, e.g.,
  - What is a unit document?
  - A file?
  - An email? An email with 5 attachments?
  - A group of files (e.g., PPT or LaTeX as HTML pages)

Tokenization
- Input: "Friends, Romans and Countrymen"
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- A token is an instance of a sequence of characters
- Each such token is now a candidate for an index entry, after further processing

Tokenization
- Issues in tokenization:
  - Finland's capital → Finland AND s? Finlands? Finland's?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?

Numbers
- 3/20/91    Mar. 12, 1991    20/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
  - Older IR systems may not index numbers
    - But often very useful: think about things like looking up error codes/stacktraces on the Web
    - (One answer is using n-grams: Lecture 3)
  - Will often index "meta-data" separately
    - Creation date, format, etc.

Tokenization: Language issues
- French: e.g. L'ensemble → one token or two?
  - L ? L' ? Le ?
  - Want l'ensemble to match with un ensemble
    - Until at least 2003, it didn't on Google: Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - "life insurance company employee"
  - German retrieval systems benefit greatly from a compound splitter module
    - Can give a 15% performance boost for German

Tokenization: Language issues
- Chinese and Japanese have no spaces between words:
  - [example sentence not recoverable from the source]
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats
  - [example sentence mixing Katakana, Hiragana, Kanji and Romaji; only "500" and "$500K(6,000 ...)" survive in the source]
- End-user can express query entirely in hiragana!

Tokenization: Language issues
- Arabic (or Hebrew) is written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures
  - [Arabic example not recoverable from the source; reading direction: ← → ← → ← start]
  - "Algeria achieved its independence in 1962 after 132 years of French occupation"
- With Unicode, the surface presentation is complex, but the stored form is straightforward

Stopwords
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques mean the space for including stopwords in a system is very small
  - Good query optimization techniques mean you pay little at query time for including stop words
  - You need them for:
    - Phrase queries: "King of Denmark"
    - Various song titles, etc.: "Let it be", "To be or not to be"
    - "Relational" queries: "flights to London"

Normalization
- We need to "normalize" words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory

Normalization: Other languages
- Accents: e.g., French résumé vs. resume
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How do your users like to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen

Normalization: Other languages
- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so are intertwined with language detection
  - "Morgen will ich in MIT ...": is this German "mit"?
- Crucial: Need to "normalize" indexed text as well as query terms identically

Case folding
- Reduce all letters to lowercase
  - exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use lowercase regardless of 'correct' capitalization...

Thesauri and Soundex
- Do we handle synonyms and different spellings?
  - E.g., by hand-constructed equivalence classes
    - car = automobile    color = colour
  - We can rewrite to form equivalence-class terms
    - When the document contains automobile, index it under car as well (and vice-versa)
  - Or we can expand a query
    - When the query contains automobile, look under car as well
- What about spelling mistakes?
  - One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics, e.g.,
    c u l8r => /siː/juː/ˈleɪtəʳ/ => see you later

Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing "proper" reduction to dictionary headword form

Stemming
- Different word forms
  - e.g., automate(s), automatic, automation
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping
  - language dependent
  - e.g., automates, automatic, automation all reduced to automat
- Before stemming: for example compressed and compression are both accepted as equivalent to compress.
- After stemming: for exampl compress and compress ar both accept as equival to compress

Lemmatization pros

  Word            Lemmatization   Stemming
  democrat        democrat        democrat
  democrats       democrat        democrat
  democratic      democratic      democrat
  democratize     democratize     democrat
  democratized    democratize     democrat
  democratizing   democratize     democrat
  democratise     democratize     democrat

- Preserves POS tags, e.g., democrat (Noun) vs. democratic (Adj) vs. democratize (Verb)

Lemmatization pros
- Improves concept extraction through text normalization, e.g., noun/verb inflection elimination
  - eats_burger, ate_burger, eating_burger, eaten_burger, eat_burgers, eat_the_burger, eating_of_burgers => eat_burger

Lemmatization cons
- Lemmatization may lead to misunderstanding
  - cross (from crossing)
  - get_better (from get_the_better_of)
  - window (from Windows)
  - kick_ball (from kick_balls)

Porter's algorithm
- Most popular algorithm for stemming English
  - Results suggest it's at least as good as other stemming options
- Example rules
  - sses → ss (caresses → caress)
  - ies → i (ponies → poni)
  - (m>1) ement → (empty) (replacement → replac; cement → cement)
- Conventions + 5 phases of reductions
  - phases applied sequentially
  - each phase consists of a set of commands
  - sample convention: Of the rules in a compound command, select the one that applies to the longest suffix

Exercise
- What is the purpose of including the following rule?
  - ss → ss
- [Hint] s → nothing
- e.g., boss → boss

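The "longest suffix wins" convention can be tried out on the four plural rules mentioned above (sses → ss, ies → i, ss → ss, s → nothing). This is only a sketch of that one rule group, not the full Porter stemmer, and the names `PLURAL_RULES` and `strip_plural` are made up for the example.

```python
# One compound command: (suffix, replacement) pairs.
PLURAL_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural(word):
    """Apply the rule whose suffix is the longest one matching the word."""
    matching = [(suffix, repl) for suffix, repl in PLURAL_RULES if word.endswith(suffix)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl

for word in ["caresses", "ponies", "boss", "cats", "automat"]:
    print(word, "->", strip_plural(word))
```

Running it with and without the `("ss", "ss")` entry is a quick way to explore the exercise.
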
Wild-card queries
- mon*: find all docs containing any word beginning with "mon"
- Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
- *mon: find words ending in "mon": harder
  - Maintain an additional B-tree for terms backwards
  - Can retrieve all words in range: nom ≤ w < non
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

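The range lookups above can be imitated with a sorted list and binary search standing in for the B-tree. The toy lexicon and the function names are assumptions for illustration; the sketch also assumes a non-empty prefix or suffix.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < successor(prefix)."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # Successor of the prefix: bump its last character, e.g. "mon" -> "moo".
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Hypothetical toy lexicon.
lexicon = sorted(["demon", "lemon", "money", "monk", "month", "moon", "salmon"])
# Second structure: the same terms written backwards, then sorted.
reversed_lexicon = sorted(term[::-1] for term in lexicon)

def suffix_matches(suffix):
    """Terms ending in `suffix`, found as a prefix range over the reversed lexicon."""
    hits = prefix_range(reversed_lexicon, suffix[::-1])
    return sorted(term[::-1] for term in hits)

print(prefix_range(lexicon, "mon"))   # mon*
print(suffix_matches("mon"))          # *mon
```
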
Wild-card queries
- How can we handle * in the middle of a query term?
  - co*tion
- Solution 1: Look up co* in a B-tree and find all terms ending with "tion"
- Solution 2: We could look up co* AND *tion in a B-tree and intersect the two term sets
- Both are expensive
- Solution 3: transform wild-card queries so that the * occurs at the end
  - Permuterm Index

Permuterm index
- For term hello, index under:
  - hello$, ello$h, llo$he, lo$hel, o$hell, $hello where $ is a special symbol
- Queries:
  - X     lookup on X$
  - X*    lookup on $X*
  - *X    lookup on X$*
  - *X*   lookup on X*
  - X*Y   lookup on Y$X*
- Example: Query = hel*o, so X = hel, Y = o; lookup o$hel*

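A compact Python sketch of a permuterm index, again with a sorted list standing in for the B-tree. It handles queries with exactly one `*` (the X*, *X and X*Y cases); the toy lexicon and all function names are assumptions made for the example.

```python
import bisect

def rotations(term):
    """All rotations of term + '$', e.g. hello -> hello$, ello$h, ..., $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(lexicon):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in lexicon for rot in rotations(term))

def rotate_query(query):
    """Rotate X*Y into the prefix Y$X, so that the * ends up at the end."""
    x, y = query.split("*")  # assumes exactly one '*'
    return y + "$" + x

def wildcard_lookup(permuterm, query):
    """Return the terms having a rotation that starts with the rotated query."""
    prefix = rotate_query(query)
    keys = [rot for rot, _ in permuterm]
    pos = bisect.bisect_left(keys, prefix)
    found = set()
    while pos < len(permuterm) and permuterm[pos][0].startswith(prefix):
        found.add(permuterm[pos][1])
        pos += 1
    return sorted(found)

# Hypothetical toy lexicon.
permuterm = build_permuterm(["hello", "help", "halo", "hero"])
print(rotate_query("hel*o"))
print(wildcard_lookup(permuterm, "hel*o"))
print(wildcard_lookup(permuterm, "he*o"))
```

Each term of length n contributes n + 1 rotations, which is where the growth in lexicon size comes from.
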
Permuterm query processing
- Rotate query wild-card to the right
- Now use B-tree lookup as before
- Permuterm problem: ≈ quadruples lexicon size (empirical observation for English)

Processing wild-card queries
- As before, we must execute a Boolean query for each enumerated, filtered term
- Wild-cards can result in expensive query execution (very large disjunctions...)
  - pyth* AND prog*
- If you encourage "laziness" people will respond!
  - Search box hint: "Type your search terms, use * if you need to. E.g., Alex* will match Alexander."

Spell correction
- Two principal uses
  - Correcting documents being indexed
  - Correcting user queries to retrieve "right" answers
- Two main flavors:
  - Isolated word
    - Check each word on its own for misspelling
    - Will not catch typos resulting in correctly spelled words
    - e.g., from → form or sue → use
  - Context-sensitive
    - Look at surrounding words,
    - e.g., I flew form Heathrow to Narita

Spell correction
- Based on syntax
  - Related to intrinsic properties of words
- Based on probabilities
  - Related to the probability of words appearing in a specific context (neighboring words)
- Based on preference
  - Related to clicks/choices of users

Document correction
- Especially needed for OCR and ASR
  - Correction algorithms: e.g., "rn" vs. "m" or "write" vs. "right"
  - Can use domain-specific knowledge
    - E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing)
- But also: web pages and even printed material have typos
- Goal: the dictionary contains fewer misspellings
- But often we don't change the documents but aim to fix the query-document mapping

Query misspellings
- We can either
  - Retrieve documents indexed by the correct spelling, OR
  - Return several suggested alternative queries with the correct spelling
    - Did you mean ... ?

Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come
- Two basic choices for this
  - A standard lexicon such as
    - Webster's English Dictionary
    - An "industry-specific" lexicon, hand-maintained
  - The lexicon of the indexed corpus
    - E.g., all words on the web
    - All names, acronyms etc. (including the misspellings)

Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- What's "closest"?
- We'll study several alternatives
  - Edit distance (Levenshtein distance)
  - Weighted edit distance
  - n-gram overlap

Edit distance
- Given two strings S1 and S2, the minimum number of operations to convert one to the other
- Operations are typically character-level
  - Insert, Delete, Replace (e.g., distance = 1)
  - Transposition (e.g., distance = 2)
- E.g., the edit distance from dof to dog is 1
  - From cat to act is 2 (just 1 with transpose)
  - From cat to dog is 3
- Generally found by dynamic programming
  - See http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/

Levenshtein distance
- lev_{cats,fast}(4,4) = min of
  - lev_{cats,fast}(4,3) + 1 (insertion)
  - lev_{cats,fast}(3,4) + 1 (deletion)
  - lev_{cats,fast}(3,3) + 1 (replacement)

Levenshtein distance

           f  a  s  t
        0  1  2  3  4
     c  1  1  2  3  4
     a  2  2  1  2  3
     t  3  3  2  2  2
     s  4  4  3  2  3

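The dynamic-programming table for "cats" and "fast" can be reproduced with a short Python function. This is a plain Levenshtein sketch (insert, delete, replace, each with cost 1, no transposition); the function name and the choice to return the whole table are made for the illustration.

```python
def levenshtein_table(s1, s2):
    """Return the full DP table; table[i][j] is the distance between s1[:i] and s2[:j]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i          # delete all i characters of s1
    for j in range(cols):
        table[0][j] = j          # insert all j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # replacement (or match)
            )
    return table

def edit_distance(s1, s2):
    return levenshtein_table(s1, s2)[-1][-1]

for row in levenshtein_table("cats", "fast"):
    print(row)
print(edit_distance("dof", "dog"), edit_distance("cat", "act"), edit_distance("cat", "dog"))
```
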
Exercise: Levenshtein distance
- Compute the edit distance between "NTU" and "NUS" by filling out the following table for the Levenshtein distance algorithm.

           N  T  U
        0
     N
     U
     S

Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
  - OCR errors: e.g. O–D (rather than O–I)
  - Keyboard errors: e.g. O–I (rather than O–D)
  - This may be formulated as a probability model
- Requires weight matrix as input
- Modify dynamic programming to handle weights

Edit distance to all dictionary terms?
- Given a (misspelled) query, do we compute its edit distance to every dictionary term?
  - Expensive and slow
  - Alternative?
- How do we cut the set of candidate dictionary terms?
- One possibility is to use n-gram overlap for this
- This can also be used by itself for spelling correction

n-gram overlap
- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams
- Threshold by number of matching n-grams
  - Variants: weight by keyboard layout, etc.

Example with trigrams
- Suppose the text is november
  - Trigrams are nov, ove, vem, emb, mbe, ber
- The query is december
  - Trigrams are dec, ece, cem, emb, mbe, ber
- So 3 trigrams overlap (of 6 in each term)
- How can we turn this into a normalized measure of overlap?

One option: Jaccard coefficient (J)
- A commonly-used measure of overlap
- Let X and Y be two sets; then J is |X ∩ Y| / |X ∪ Y|
- Equals 1 when X and Y have the same elements and zero when they are disjoint
- X and Y don't have to be of the same size
- Always assigns a number between 0 and 1
  - Now threshold to decide if you have a match
  - E.g., if J > 0.8, declare a match

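The november/december example can be worked through in Python. The helper names `ngrams` and `jaccard` are made up for this sketch, and returning 0.0 for two empty sets is an arbitrary convention chosen to avoid a division by zero.

```python
def ngrams(term, n=3):
    """Set of character n-grams of a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(x, y):
    """|X intersect Y| / |X union Y| for two sets."""
    union = x | y
    if not union:
        return 0.0  # convention chosen for this sketch
    return len(x & y) / len(union)

text_grams = ngrams("november")
query_grams = ngrams("december")
print(sorted(text_grams & query_grams))
print(jaccard(text_grams, query_grams))
```

Three shared trigrams out of nine distinct ones gives J = 1/3, well below a threshold such as 0.8.
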
Calculating J and dJ
- Given A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes
  - M11: number of attributes where A and B both have a value of 1
  - M01: number of attributes where the attribute of A is 0 and the attribute of B is 1
  - M10: number of attributes where the attribute of A is 1 and the attribute of B is 0
  - M00: number of attributes where A and B both have a value of 0
- Jaccard coefficient: J = M11 / (M01 + M10 + M11)
- Jaccard distance: dJ = 1 - J = (M01 + M10) / (M01 + M10 + M11)

Context-sensitive spell correction
- Text: I flew from Heathrow to Narita
- Consider the phrase query "flew form Heathrow"
- We'd like to respond
  Did you mean "flew from Heathrow"?
  because the query probably didn't match any document and the alternative is "statistically more plausible"

Context-sensitive correction
- Need surrounding context to catch this
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Now try all possible resulting phrases with one word "fixed" at a time
  - flew from heathrow
  - fled form heathrow
  - flea form heathrow
- Hit-based spelling correction: Suggest the alternative that has lots of hits

Context-sensitive correction
- Generalization (through lemmatization)
  - FLY -> from
    - fly from
    - flying from
    - flies from
    - flown from
    - flew from
  - See also fly_to, fly_through, fly_away, etc.
- But "fly" can also be a NOUN or an ADJ...
  - Need for POS tags!

Another approach
- Break phrase query into a conjunction of biwords
  - E.g. "flew form", "form Heathrow"
- Look for biwords that need only one term corrected
  - "X form", "form Heathrow"
  - "flew Y", "Y Heathrow"
  - "flew form", "form Z"
- Enumerate phrase matches and ... rank them!

General issues in spell correction
- We enumerate multiple alternatives for "Did you mean?"
- Need to figure out which to present to the user
- Use heuristics
  - The alternative hitting most docs
  - Query log analysis
- Spell-correction is computationally expensive
  - Avoid running routinely on every query?
  - Run only on queries that matched few docs

Soundex
- Class of heuristics to expand a query into phonetic equivalents
  - Language specific, mainly for names
  - E.g., chebyshev (English) → tchebycheff (French)
- Invented for the U.S. census ... in 1918

Soundex
- Turn every token to be indexed into a 4-character reduced form
  - E.g., Herman becomes H655
- Do the same with query terms
- Build and search an index on the reduced forms
  - (when the query calls for a soundex match)
- http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm

Soundex: typical algorithm
1. Retain the first letter of the word
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   - B, F, P, V → 1
   - C, G, J, K, Q, S, X, Z → 2
   - D, T → 3
   - L → 4
   - M, N → 5
   - R → 6

Soundex: typical algorithm
4. Repeatedly remove one out of each pair of consecutive identical digits
5. Remove all zeros from the resulting string
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>

E.g., Herman becomes H655
Will hermann generate the same code?

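The six steps can be sketched directly in Python. This follows the simplified algorithm on the slides (it does not compare the first letter's own code with the next digit, as some Soundex variants do), assumes a non-empty alphabetic word, and uses made-up names (`SOUNDEX_CODES`, `soundex`).

```python
# Steps 2 and 3: letter -> digit table.
SOUNDEX_CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word):
    """Reduce a word to <uppercase letter><digit><digit><digit>."""
    word = word.upper()
    first = word[0]                                   # step 1
    digits = [SOUNDEX_CODES[ch] for ch in word[1:] if ch in SOUNDEX_CODES]
    collapsed = []                                    # step 4: collapse repeated digits
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    kept = [d for d in collapsed if d != "0"]         # step 5: drop zeros
    return (first + "".join(kept) + "000")[:4]        # step 6: pad and truncate

for name in ["Herman", "hermann"]:
    print(name, soundex(name))
```
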
Beyond Soundex
- Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
- How useful is soundex?
  - Not very, for information retrieval
  - Okay for "high recall" tasks (e.g., Interpol), though biased to names of certain nationalities
- Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR

IPA normalization
- Pronunciation is more consistent than orthography
  => Phonetic-based approach to normalization
- International Phonetic Alphabet (IPA)

Do not reinvent the wheel!
- Stemmers, lemmatizers, and spell correctors are widely and freely available on the Web in all major programming languages
- For Python, I recommend:
  - NLTK Lemmatizer
    - http://nltk.org
  - Peter Norvig's spell corrector
    - http://norvig.com/spell-correct.html