GIAN Course - Big Social Data Analysis

Introduction to Information Retrieval

Course Introduction

Course staff
Erik Cambria
[email protected]
Haiyun Peng
[email protected]

Sentic Team
http://sentic.net

NTU-India Connect programme
http://global.ntu.edu.sg/gmp/ic

SenticNet
http://business.sentic.net

Textbook
• Introduction to Information Retrieval
  - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
  - Cambridge University Press, 2008
  - eBook, lecture slides, etc., available from http://informationretrieval.org

Reference books
• Modern Information Retrieval. Ricardo Baeza-Yates & Berthier Ribeiro-Neto. Addison Wesley, 1999
• Mining the Web: Discovering Knowledge from Hypertext Data. Soumen Chakrabarti. Elsevier Morgan Kaufmann, 2002
• A Practical Guide to Sentiment Analysis. Erik Cambria, Dipankar Das, Sivaji Bandyopadhyay, Antonio Feraco. Springer Verlag, 2017
• Text Mining: Predictive Methods for Analyzing Unstructured Information. Sholom Weiss, Nitin Indurkhya, Tong Zhang & Fred Damerau. Springer Verlag, 2005

Course details
• Pre-requisites
  - Computer science: data structures, algorithms
  - Mathematics: linear algebra, probability
• Related subjects
  - Artificial Intelligence
  - Natural Language Processing
  - Data Mining
  - Machine Learning

Course schedule

Big data: Blessing or curse?
Information is the main treasure of humankind.
Without efficient management, such a treasure becomes useless: the more we have, the less we can use.

Making sense of data
http://youtu.be/kb7RL6bLmHE

Information need
• IR is the task of automatically satisfying the user's information need
  - Understand the information need
  - Satisfy the information need
• But the needs are very different: IR has many faces

Types of information need
• Find a document or documents
• What for?
  - Answer a question → Question Answering
  - Save time on reading → Text Summarization
  - Find what the document is about → Topic Modeling
  - Check authenticity → Plagiarism Detection
  - Find what's new → Trend Discovery
  - Mine opinions → Sentiment Analysis
• and more…

Sentiment analysis
• Sentiment analysis has raised growing interest both within the scientific community, leading to many exciting open challenges, and in the business world, due to the remarkable benefits to be had from financial forecasting

A big suitcase

Syntactics layer

Microtext normalization
• Before NLP techniques can be applied, informal text (e.g., c u l8r), acronyms (e.g., LOL), and emoticons (e.g., :>) need to be translated into plain English
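A minimal sketch of dictionary-based microtext normalization, assuming a tiny hand-made lookup table. The table entries and the `normalize` helper are hypothetical illustrations, not a lexicon or method taken from the course; real systems typically need much larger resources and context-sensitive rules.

```python
# Hypothetical lookup table: the entries are illustrative, not a real lexicon.
MICROTEXT = {
    "c": "see",
    "u": "you",
    "l8r": "later",
    "lol": "laughing out loud",
    ":>": "happy",
}


def normalize(text: str) -> str:
    """Replace every whitespace-separated token that appears in the table."""
    tokens = text.split()
    return " ".join(MICROTEXT.get(tok.lower(), tok) for tok in tokens)


print(normalize("c u l8r"))  # see you later
```

A plain token lookup like this cannot tell the letter "c" from the abbreviation "c", which is one reason normalization is usually treated as a disambiguation problem rather than a substitution table.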

Sentence boundary disambiguation
The group included J. M. Freeman Jr. and T. Boone Pickens.
The group included J. M. Freeman Jr. T. Boone Pickens had left.

He stopped to see Dr. Lawson.
He stopped at Meadows Dr. Lawson was still open.

It was due Friday by 5 p.m. Saturday would be too late.
She has an appointment at 5 p.m. Saturday to get her car fixed.

POS tagging

Text chunking
• Text chunking, also referred to as shallow parsing, is a task that follows POS tagging and adds more structure to the sentence
• The result is a grouping of the words into "chunks"

Lemmatization
eat_burger, eats_a_burger, ate_burger, eating_burger, eat_burgers, eat_the_burger

Semantics layer

Word sense disambiguation

Concept extraction
the camera has [long focus time]
the camera takes a [long time] to [focus]
the [focusing] of the camera takes [long time]
the [focus time] of the camera is very [long]
=> long_focus_time

Named-entity recognition
• Named-entity recognition is key for improving anaphora resolution and, hence, for detecting aspects or opinion targets in reviews

Anaphora resolution
• Anaphora is the use of an expression whose interpretation depends upon another one
• It is commonly resolved by gender and number agreement

Subjectivity detection
• Subjectivity detection is a binary classification task that consists of classifying text as either objective (neutral) or subjective (i.e., positive or negative)

Pragmatics layer

Personality recognition
• Personality recognition is an important step towards user-dependent sentiment analysis (user profiling) and it is useful for sarcasm detection
http://datanami.com/2017/09/21/deep-learning-reveals-new-insights-people

Sarcasm detection

Metaphor understanding

Aspect extraction
I love iPhoneX's touchscreen but the battery life is so short
=> aspects: touchscreen, battery

Polarity detection
• Early works treated polarity detection as a binary classification problem (pos vs neg)
• Recent works calculate polarity intensity as a float ∈ [-1, +1]

Retrieval vs. extraction
• Information retrieval is about retrieving relevant data based on a query: you specify what information you need and it is returned in a human-understandable form
  - (e.g., find relevant opinions)
• Information extraction is about structuring unstructured information: given some sources, all of the (relevant) information is structured in a form that will be easy to process
  - (e.g., classify opinions as positive or negative)

Information retrieval
• Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
  - E-mail search
  - Searching your laptop
  - Corporate knowledge bases
  - Legal information retrieval
  - Web search

Web: connecting people
• The potential for knowledge sharing today is unmatched in history
• Never before have so many knowledgeable people been connected

To be or not to be connected?
• Being connected is good but being disconnected for the past millions of years was the main reason behind our cultural diversity

Being disconnected

Connected but alone
http://ted.com/talks/sherry_turkle_alone_together

Socialnomics
http://youtu.be/PWa8L43kELQ

Being connected
• Leonardo's discoveries and inventions in science, art, engineering, and aesthetics were based only on his perception of the world

The Web is very young
• Less than 30 years have elapsed since the invention of the Web
• We are still just 'playing' with it as we are yet to discover how to fully make use of it

A Web of knowledge
• In 1910, Otlet first envisioned a "city of knowledge" that would serve as a central repository for the world's information, but text was not digital yet

The Machine is Us/ing Us
http://youtu.be/NLlGopyXT_g

The Web as a laboratory
• The Web today represents not only an unlimited data store but also a multi-disciplinary laboratory environment for world-scale experiments

The five eras of the Web
• The Web is evolving towards a shared social experience, in which consumers will rely on their peers as they make online decisions and will shape future products

Power to the people

Big social data analysis
• Between the dawn of the Internet and the year 2003, there were five exabytes of information on the Web
• Now, we create five exabytes every two days

Drowning in data?
http://straitstimes.com/singapore/world-faces-data-storage-crunch-ahead

Storages are not forever
E. Cambria, A. Chattopadhyay, E. Linn, B. Mandal, B. White. Storages are not forever. Cognitive Computation 9(5), 646-658 (2017)

It's not just about size

Social data shift

Collected intelligence
• Information today is extremely portable and processable
• However, this collected intelligence is far from being addressed as collective intelligence

The Web 3.0 dream

Is it people's fault?
• Online contents are mostly meant for human consumption
• Why should web developers and bloggers care about making their content machine processable?

Is it technology's fault?

Evolution of NLP
• NLP technologies evolved from the era of punch cards (7 mins per sentence) to the era of Google and its like (less than a second per sentence)

NLP emergency
• In a Web where user-generated content (UGC) has hit critical mass, NLP is becoming key for aggregating information, although systems are still limited by what they can 'see'

More than we see
• Language is somewhere in between perception and understanding
• A translucent material, so that the world bears the tint and focus of what we express through it

Understanding language
• Natural language understanding requires high-level symbolic capabilities that most NLP technologies currently do not possess

The hardest problem?
We can understand almost anything, but we can't understand how we understand. (Albert Einstein)
We understand human mental processes only slightly better than a fish understands swimming. (John McCarthy)
How the mind works is still a mystery. We understand the hardware, but we don't have a clue about the operating system. (James Watson)

Veneer of intelligence
http://youtu.be/SrTfzHXQdkc

Illusion of understanding
• Assessing the intelligence of AI systems is like a dog chasing its own tail
• We are the ones interpreting the results of the AI systems we build

AI meets natural stupidity
• A key failure of AI is the persistence in seeking the best way to solve a problem, which leads to the creation of expert (not intelligent) systems

Commonsense blindness
• The definition of today's AI is a machine that can make a perfect chess move while the room is on fire

AI 1.0 vs AI 2.0

Machine learning
• Unlike symbolic AI, machine learning (sub-symbolic AI) does not need hand-crafted rules (top-down) as it is mostly data-driven (bottom-up)

Neural networks
• Artificial neural networks were actually invented in the 1940s, so what is the reason for so much excitement around them now?

Deep learning

Mad rush for data scientists

Learning what again?
• In most cases, we are simply teaching machines word co-occurrence frequencies
• It's like teaching someone snowboarding by only showing them videos

Are we fooling ourselves?

AI winters

Is AI a bubble to burst?

Statistics are no panacea

Machine learning issues
• Dependency: it requires (a lot of) training data and is domain-dependent
• Consistency: different training or tweaking leads to different results
• Transparency: the reasoning process is uninterpretable

Dependency
• A machine learning algorithm trained on dataset A will not work well on dataset B, especially if A and B are about different domains

Consistency
• Pushed by the Publish-or-Perish principle, some researchers often "stir their pile" to improve algorithm accuracy by a few percent

Transparency
• Most machine learning techniques are black-box algorithms: they classify data based on learnt features we do not know much about

The dark side of AI
• Black-box algorithms
  - Deep learning
  - Opaque reasoning
• Machiavellian approach
  - Brute force

Black magic

White magic

Do like the Ancient One
http://dmnews.com/article/738754

Anti-Copernican AI revolution
• Top-down (theory-driven) approach
• Bottom-up (data-driven) approach

Example: SenticNet 5

Jumping NLP curves

Jumping curves

Bike sharing example

Keyword spotting
• Although the most naïve approach, the accessibility and economy of keyword spotting make it one of the most popular

Lexical affinity
• Lexical affinity assigns arbitrary words a probable "affinity" to a particular class: "accident" has a 75% probability of indicating a negative affect
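The difference between the two approaches above can be sketched in a few lines. This is only an illustration under stated assumptions: both lexicons are hypothetical, and the only value taken from the slides is the 0.75 negative affinity of "accident".

```python
# Hypothetical lexicons; only the 0.75 value for "accident" comes from the slides.
NEGATIVE_KEYWORDS = {"sad", "angry", "terrible"}
NEGATIVE_AFFINITY = {"accident": 0.75, "terrible": 0.9, "delay": 0.6}


def keyword_spotting(text: str) -> bool:
    """Flag text as negative only if it contains an obvious affect keyword."""
    return any(tok in NEGATIVE_KEYWORDS for tok in text.lower().split())


def lexical_affinity(text: str) -> float:
    """Return the highest negative-affinity probability of any word in the text."""
    scores = (NEGATIVE_AFFINITY.get(tok, 0.0) for tok in text.lower().split())
    return max(scores, default=0.0)


sentence = "he was hurt in an accident"
print(keyword_spotting(sentence))  # False: no explicit affect word
print(lexical_affinity(sentence))  # 0.75: "accident" leans negative
```

Keyword spotting misses the sentence because it contains no explicit affect word, while lexical affinity still assigns it a negative score through "accident".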

Statistical methods
• By feeding an ML algorithm a large training corpus, statistical methods learn not only the valence of affect words, but also that of other arbitrary keywords

Concept-level analysis
• By relying on ontologies or semantic networks, concept-level approaches step away from blindly using affect keywords and word co-occurrence frequencies

Conceptualization
• Concepts are immaterial entities that only exist in the mind of the speaker
• To be communicated, they must be represented in terms of some concrete artifact

Understand is simplify
• Conceptual primitives allow machines to better grasp the meaning of concepts that are often opaque due to the richness and ambiguity of natural language
http://tinyurl.com/top-down-bottom-up-nlp

A 'pipe' is not a pipe
• You can know the name of all the different kinds of 'pipe', but you know nothing about a pipe until you comprehend its purpose and method of usage

Drawbacks of the bag-of-words model
long, big, cold

Drawbacks of the bag-of-words model
smile  => sad_smile
damn   => damn_good
pretty => pretty_ugly

IR system
Document corpus + Query String → IR System → Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, …)

A set of documents: assume it is a static collection for the moment
Goal: Retrieve documents with information that is relevant to the user's information need

IR system: Example
User task: Get rid of mice in a politically correct way
  ↓ Misconception?
Info need: Info about removing mice without killing them
  ↓ Misformulation?
Query: how trap mice alive
  ↓ Search
Search engine (over the Collection) → Results → Query refinement (back to the Query)

IE system
Goal: Extract information from the retrieved documents to help the user complete a task

IE system: Example

What this course covers
• Understanding of IR & IE systems
  - Better usages of IR & IE services
  - Improvement of existing systems
• Design and development for new domains
• Innovations in IR & IE

What this course does NOT cover
• Multimodal retrieval
  - Image and video
  - Audio
• Advanced NLP topics
  - Parsing
  - Ontologies
  - Anaphora resolution
  - Named entity recognition

IR & IE business
• So many tech titans are doing IR & IE
• What can we do more/better?

Motivation for this course?

IR & IE business
• Google, Yahoo, Bing know how to build a search engine
  - Do we know?
  - Did they tell us?
• They never will (fully)

Motivation #1: Acquire know-how
• To know about the methods Google, Bing, Yahoo hide
• To customize such algorithms to other businesses
• To do things that these companies cannot do

Motivation #2: Build something new

Motivation #3: Make India better

Representing documents
• We will see 3 kinds of document representation
  - One hot encoding
    - Boolean: Yes vs No
  - Tf-idf weighting scheme
    - similar to one hot encoding but encodes contextual information
  - Word embeddings
    - most popular method and works better than the rest

One hot encoding
• Vector space modeling
  - S1: {Information Retrieval is an excellent topic to study}
  - Vocabulary: {Information, knowledge, Retrieval, student, class, google, index, is, an, the, excellent, understanding, topic, grasp, course, people, web, to, study}
• One hot encoding
  - $n$-dimensional vector where each coordinate represents one word in the vocabulary
  - $X = \{x_1, x_2, \dots, x_n\}$
  - $x_i = 1$ if word $i$ is present in the document, otherwise $x_i = 0$
  - With the 19-word vocabulary above, S1 = {1,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1}
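The encoding of S1 can be checked with a short sketch. It assumes lowercased, whitespace-split tokens; the `encode` helper is a hypothetical name introduced for illustration.

```python
# Vocabulary from the slide, lowercased so that matching is case-insensitive.
VOCABULARY = [
    "information", "knowledge", "retrieval", "student", "class", "google",
    "index", "is", "an", "the", "excellent", "understanding", "topic",
    "grasp", "course", "people", "web", "to", "study",
]


def encode(sentence: str, vocabulary: list[str]) -> list[int]:
    """Return a 0/1 vector with one coordinate per vocabulary word."""
    words = set(sentence.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


s1 = "Information Retrieval is an excellent topic to study"
print(encode(s1, VOCABULARY))
# [1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1]
```

The vector has one coordinate per vocabulary entry, so its length always equals the vocabulary size regardless of the sentence length.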

Tf-idf weighting scheme
• Tf-idf
  - tf = term frequency
  - df = document frequency
  - idf = inverse document frequency
• Replace one hot encoding with tf-idf weight
  - $\mathrm{tf}(t, D) = 1 + \log \mathrm{freq}(t, D)$
  - $\mathrm{idf}(t) = \log(N / n)$
    - $N$ = number of documents
    - $n$ = number of documents that contain term $t$
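A minimal sketch of the weighting scheme, under a few assumptions not stated on the slide: natural logarithms, a weight of 0 for terms absent from the document, and the usual product tf × idf as the final weight. The three-document corpus is made up for illustration.

```python
import math


def tf(term, doc_tokens):
    """Log-scaled term frequency: 1 + log(freq), or 0 if the term is absent."""
    freq = doc_tokens.count(term)
    return 1 + math.log(freq) if freq > 0 else 0.0


def idf(term, corpus):
    """Inverse document frequency: log(N / n), with n = documents containing the term."""
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n) if n > 0 else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus of three tokenized documents.
corpus = [
    "information retrieval is an excellent topic".split(),
    "retrieval of information from the web".split(),
    "the web is big".split(),
]

print(tf_idf("information", corpus[0], corpus))  # 1 * log(3/2)
print(tf_idf("excellent", corpus[0], corpus))    # 1 * log(3/1), rarer term, higher weight
```

A term that occurs in every document gets idf = log(1) = 0, which is how the scheme discounts words that carry no discriminating power.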

Word embeddings
• The embedding of a word is a d-dimensional vector which encodes semantic information
• CBOW aims to predict a word w given its context words (the goal is to calculate the embedding of the word w)
• You can build your own
  - Randomly initialize d-dimensional embeddings for the words
  - Form a complex neural network
  - The input of the network is the words with their random embeddings (the goal is to tune these embeddings through training)
  - Design a loss function
  - Backpropagate the error to the input layer to train the network

Word embeddings
• Excellent if you try to design your own word embedding method
  - There are helper functions and libraries available in Python
  - The basis is backpropagation
• Otherwise use Gensim, a popular word2vec tool whose default training algorithm is CBOW

Word2vec in IR
• Forming document vectors from word vectors
• Finding similar words to a given word
• Clustering
• Analytical inference
  - vector('king') - vector('man') + vector('woman') = vector('queen')
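The analogy above can be illustrated with hand-made vectors. This is a toy sketch: the 2-dimensional embeddings below are hypothetical values chosen so the arithmetic is easy to follow, whereas real word2vec vectors are learned from a corpus and only satisfy the relation approximately.

```python
import math

# Hypothetical 2-d embeddings (axis 0: royalty, axis 1: gender), not learned vectors.
VECTORS = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Word closest to vector(a) - vector(b) + vector(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


print(analogy("king", "man", "woman"))  # queen
```

The same cosine function also covers the "finding similar words" use case: ranking all vocabulary words by similarity to one given vector.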

Word2vec = universal solution?
• Word2vec captures some semantics but it is still word co-occurrence based
• We need semantics, e.g., knowledge bases, semantic networks, ontologies, etc.
  - WordNet
  - ConceptNet
  - SenticNet
  - NELL
  - YAGO
  - Probase

Introduction to Information Retrieval

Lecture 1: Boolean Retrieval

Structured (db) vs. unstructured (txt) data
• Structured data tends to refer to information in "tables"

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith

Unstructured data
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia
• Why is that not the answer?
  - Slow (for large corpora)
  - Other operations (e.g., find the word Romans near countrymen) not feasible
  - Ranked retrieval (best documents to return)

Term-document incidence

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                   1              0           0        0         1
Brutus               1                   1              0           1        0         0
Caesar               1                   1              0           1        1         1
Calpurnia            0                   1              0           0        0         0
Cleopatra            1                   0              0           0        0         0
mercy                1                   0              1           1        1         1
worser               1                   0              1           1        1         0

1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia

Incidence vectors
• So we have a 0/1 vector for each term
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND
  - 110100 AND 110111 AND 101111 = 100100
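The bitwise computation above can be reproduced directly with integers used as bit vectors. This is a small sketch; the `matching_plays` helper is a hypothetical name, and the leftmost bit is assumed to correspond to the first play in the table.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the term-document incidence matrix as 6-bit integers.
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

ALL_DOCS = (1 << len(PLAYS)) - 1  # 0b111111, used to keep the complement to 6 bits


def matching_plays(bits):
    """Translate a result bit vector back into play titles (leftmost bit = first play)."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


not_calpurnia = ~INCIDENCE["Calpurnia"] & ALL_DOCS  # 0b101111
result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia

print(format(result, "06b"))   # 100100
print(matching_plays(result))  # ['Antony and Cleopatra', 'Hamlet']
```

Masking with `ALL_DOCS` matters because Python integers are unbounded, so a bare `~` would produce a negative number rather than a 6-bit complement.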

Answers to query
• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger collections
• Consider N = 1 million documents, each with about 1000 words
• Avg 6 bytes/word including spaces/punctuation
  - 6GB of data in the documents
• Say there are M = 500K distinct terms among these

Build the matrix
• 500K x 1M matrix has half-a-trillion 0s and 1s
• But it has no more than one billion 1s
  - Why? Each of the 1M documents has about 1000 words, so it contributes at most 1000 ones
  - matrix is extremely sparse
• What's a better representation?
  - We only record the 1s
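The figures above follow from simple arithmetic, spelled out here as a quick check. The variable names are introduced for illustration only; the numbers are the ones given on the slides.

```python
N_DOCS = 1_000_000      # N: documents in the collection
WORDS_PER_DOC = 1_000   # average document length in words
BYTES_PER_WORD = 6      # including spaces and punctuation
N_TERMS = 500_000       # M: distinct terms

collection_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD  # raw text size
matrix_cells = N_TERMS * N_DOCS                             # cells in the incidence matrix
max_ones = N_DOCS * WORDS_PER_DOC                           # upper bound on non-zero cells
density = max_ones / matrix_cells

print(collection_bytes)  # 6000000000 bytes, i.e. 6 GB
print(matrix_cells)      # 500000000000 cells, half a trillion
print(max_ones)          # 1000000000 ones at most
print(density)           # 0.002, so at least 99.8% of the cells are 0
```

Storing only the non-zero entries therefore needs at most one entry per word occurrence, which is the motivation for the inverted index.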

Inverted index
• For each term t, we must store a list of all documents that contain t
  - Identify each by a docID, a document serial number
• Can we use fixed-size arrays for this?

Brutus    → 2 4 11 31 45 173
Caesar    → 2 4 5 6 16 57
Calpurnia → 31 54 101

What happens if the word Caesar is added to document 14?

Inverted index
• We need variable-size postings lists
  - On disk, a continuous run of postings is normal and best
  - In memory, can use linked lists or variable length arrays
    - Some tradeoffs in size/ease of insertion

Dictionary   Postings (each docID entry is a posting)
Brutus     → 2 4 11 31 45 173
Caesar     → 2 4 5 6 16 57
Calpurnia  → 31 54 101

Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends Romans Countrymen
  ↓ Linguistic modules
Modified tokens: friend roman countryman
  ↓ Indexer
Inverted index:
  friend     → 2 4
  roman      → 1 2
  countryman → 13 16

Initial stages of text processing
• Tokenization
  - Cut character sequence into word tokens
  - Deal with "John's", a state-of-the-art solution
• Normalization
  - Map text and query term to same form
  - You want U.S.A. and USA to match
• Stemming
  - We may wish different forms of a root to match
  - authorize, authorization
• Stopwords
  - We may omit very common words (or not)
  - the, a, to, of

Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: Sort
• Sort by terms
  - And then docID
Core indexing step

Indexer steps: Dictionary & postings
• Multiple term entries in a single document are merged
• Split into Dictionary and Postings
• Doc. frequency (DF) information is added
Why frequency? Will discuss later
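The three indexer steps can be sketched for the two example documents. This is a minimal in-memory illustration, assuming a deliberately crude tokenizer (lowercasing and keeping alphabetic runs) with no stemming or stopword removal; `tokenize` and `build_index` are hypothetical helper names.

```python
import re

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Crude normalization: lowercase and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Step 2: sort by term, then by docID (the core indexing step).
    pairs.sort()
    # Step 3: merge duplicate entries and split into dictionary and postings.
    postings = {}
    for term, doc_id in pairs:
        plist = postings.setdefault(term, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
    dictionary = {term: len(plist) for term, plist in postings.items()}  # term -> doc. frequency
    return dictionary, postings


dictionary, postings = build_index(DOCS)
print(postings["brutus"], dictionary["brutus"])  # [1, 2] 2
print(postings["killed"], dictionary["killed"])  # [1] 1
```

Because the pairs are sorted before merging, every postings list comes out ordered by docID, which the linear-time merge used at query time relies on.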

What is stored in the index?
• Dictionary: terms and counts
• Pointers from each term to its postings
• Postings: lists of docIDs

IR system implementation:
• How do we index efficiently?
• How much storage do we need?

The index we just built
• How do we process a query?
  - Later: what kinds of queries can we process?

Review: Process with incidence matrix
• So we have a 0/1 vector for each term
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND
  - 110100 AND 110111 AND 101111 = 100100

Query processing: AND
• Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the Dictionary
    - Retrieve its postings
  - Locate Caesar in the Dictionary
    - Retrieve its postings
  - "Merge" the two postings

The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries

If the list lengths are x and y, the merge takes O(x+y) operations
Crucial: postings sorted by docID

Intersecting two postings lists
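A sketch of the linear-time merge for an AND query. It assumes both postings lists are sorted by docID, as required above; the two example lists are illustrative values in the style of the Brutus and Caesar postings shown earlier.

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer behind the smaller docID
        else:
            j += 1
    return answer


# Illustrative postings lists (assumed values).
brutus = [2, 4, 11, 31, 45, 173]
caesar = [2, 4, 5, 6, 16, 57]
print(intersect(brutus, caesar))  # [2, 4]
```

Each loop iteration advances at least one pointer, so the number of comparisons is bounded by the total number of postings, which is where the O(x+y) bound comes from.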

Boolean!queries:!Exact!match!

!  The!Boolean!retrieval!model!is!being!able!to!ask!a!

query!that!is!a!Boolean!expression:!

!  Boolean!Queries!are!queries!using!AND,%OR!and!NOT!to!join!query!terms!

!  Views!each!document!as!a!set!of!words!

!  Is!precise:!document!matches!condi*on!or!not!

!  Perhaps!the!simplest!model!to!build!an!IR!system!on!

!  Primary!commercial!retrieval!tool!for!3!decades!

!  Many!search!systems!you!s*ll!use!are!Boolean:!

!  Email,!library!catalog,!Mac!OS!X!Spotlight!

%%%Lecture%%1%

Example: WestLaw   http://www.westlaw.com/

• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992; federated search added 2010)
• Tens of terabytes of data; 700,000 users
• Majority of users still use Boolean queries
• e.g., What is the statute of limitations in cases involving the federal tort claims act?
  • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  • !: variant endings, space: disjunction
  • /3 = within 3 words, /S = in same sentence

Merging

What about an arbitrary Boolean formula?

(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

• Can we always merge in "linear" time?
  • Linear in what?
• Can we do better?

Query optimization

• What is the best order for query processing?
• Consider a query that is an AND of n terms
• For each of the n terms, get its postings, then AND them together

Query: Brutus AND Calpurnia AND Caesar

[Figure: postings lists for Brutus, Calpurnia and Caesar]

Query optimization example

• Process in order of increasing freq:
  • start with smallest set, then keep cutting further

This is why we kept document freq in dictionary

Thus, execute the query as (Calpurnia AND Brutus) AND Caesar

[Figure: postings lists for Brutus, Calpurnia and Caesar]

More general optimization

• e.g., (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
• Get doc freqs for all terms
• Estimate the size of each OR by the sum of its doc freqs (conservative)
• Process in increasing order of OR sizes

Exercise

• Recommend a query processing order for
  (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
• Which two terms should we process first?

Term           Freq
eyes           213312
kaleidoscope    87009
marmalade      107913
skies          271658
tangerine       46653
trees          316812
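One way to check an answer to the exercise is to apply the "sum of doc freqs" estimate mechanically. The sketch below uses the frequencies from the table. The helper name `estimated_size` is hypothetical, and the estimate is only the conservative upper bound the slides describe, not the true size of each OR.

```python
doc_freq = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# Each tuple is one OR group; the groups are ANDed together.
query = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]


def estimated_size(group):
    """Upper bound on |t1 OR t2|: the sum of the document frequencies."""
    return sum(doc_freq[term] for term in group)


# Process the OR groups in increasing order of estimated size.
order = sorted(query, key=estimated_size)
for group in order:
    print(group, estimated_size(group))
```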

Phrase queries

• Want to be able to answer queries such as "stanford university" – as a phrase
• Thus the sentence "I went to university at Stanford" is not a match
  • The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
  • Many more queries are implicit phrase queries
• For this, it no longer suffices to store only <term : docs> entries

Jumping to the Semantics Curve

BoW vs. BoC

cloud_computing  =>  cloud
pain_killer      =>  pain, killer

Solution 1: Biword indexes

• Index every consecutive pair of terms in the text as a phrase
• For example the text "Friends, Romans, Countrymen" would generate the biwords
  • friends romans
  • romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query-processing is now immediate

Longer phrase queries

• stanford university palo alto can be broken into the Boolean query on biwords:
  stanford university AND university palo AND palo alto

How to know which one is a significant biword?

• Index blowup due to bigger dictionary
  • Infeasible for more than biwords, big even for them
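A minimal sketch of how biwords could be generated, both for indexing a text and for rewriting a longer phrase query as an AND of biwords. The tokenizer (lowercase, alphanumeric runs only) is an assumption made for illustration, not the course's tokenizer.

```python
import re


def tokens(text):
    """Very rough tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def biwords(text):
    """Every consecutive pair of tokens, each pair treated as one dictionary term."""
    toks = tokens(text)
    return [f"{a} {b}" for a, b in zip(toks, toks[1:])]


print(biwords("Friends, Romans, Countrymen"))
# A longer phrase query becomes a conjunction of its biwords.
print(" AND ".join(biwords("stanford university palo alto")))
```

A document can match all three biwords without containing the full four-word phrase, so the conjunction may return false positives.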

Solution 2: Positional indexes

• In the postings, store, for each term, the position(s) in which tokens of it appear:

<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>

Processing a phrase query

• Extract inverted index entries for each distinct term: to, be, or, not
• Merge their doc:position lists to enumerate all positions with "to be or not to be"
  • to:
    • 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  • be:
    • 1:17,19; 4:17,191,291,430,434; 5:14,19,101; …
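The positional merge can be sketched for the two-term phrase "to be" using the postings shown above. This is a simplified illustration: it uses set lookups rather than the linear two-pointer walk over position lists, and the function name `phrase_pair` is hypothetical.

```python
# Positional postings from the slide: docID -> sorted positions.
to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}


def phrase_pair(first, second):
    """Return {docID: [positions]} where `second` occurs right after `first`."""
    hits = {}
    # Only documents that contain both terms can contain the phrase.
    for doc in sorted(first.keys() & second.keys()):
        following = set(second[doc])
        starts = [pos for pos in first[doc] if pos + 1 in following]
        if starts:
            hits[doc] = starts
    return hits


print(phrase_pair(to, be))
```

Only document 4 contains both terms, and "to" is followed immediately by "be" at four positions there. A proximity query such as /3 would replace the `pos + 1` test with a window check.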

Positional index size

• A positional index expands postings storage substantially
  • Even though indices can be compressed
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system
• In the case of a static collection, we have to do it only once anyway

Rules of thumb

• A positional index is 2–4 times as large as a non-positional index
• Positional index size is 35–50% of the volume of the original text
• Caveat: all of this holds for "English-like" languages

Combination schemes

• These two approaches (biword index, positional index) can be profitably combined
  • For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
    • Even more so for phrases like "The Who"
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  • A typical web query mixture was executed in ¼ of the time of using just a positional index
  • It required 26% more space than having a positional index alone

Skip pointers

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus:       2 4 8 41 48 64 128
Caesar:       1 2 3 8 11 17 21 31
Intersection: 2 8

If the list lengths are m and n, the merge takes O(m+n) operations

Can we do better? Yes (if the index isn't changing too fast)

Skip pointers

Upper list: 2 4 8 41 48 64 128      (skip pointers to 41 and 128)
Lower list: 1 2 3 8 11 17 21 31     (skip pointers to 11 and 31)

Suppose we've stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 on the upper list and 11 on the lower; 11 is smaller. But the skip successor of 11 on the lower list is 31, which is still smaller than 41, so we can skip ahead past the intervening postings.
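A minimal sketch of intersection with skip pointers. It assumes evenly spaced skips of length roughly the square root of the list length, placed at every index that is a multiple of that length. This is a common heuristic and not necessarily the layout drawn on the slide. A skip is followed only when its target does not overshoot the docID on the other list.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skip pointers when safe."""
    skip1 = max(1, math.isqrt(len(p1)))  # assumed skip length for p1
    skip2 = max(1, math.isqrt(len(p2)))  # assumed skip length for p2
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip pointer lives at index i only if i is a multiple of skip1.
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1  # the skipped postings are all smaller than p2[j]
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer


upper = [2, 4, 8, 41, 48, 64, 128]
lower = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(upper, lower))
```

More skips mean shorter jumps that succeed more often but cost more comparisons and pointer storage. Fewer skips mean longer jumps that rarely succeed.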

Compression

• Use less disk space
  • Saves a little money
• Keep more stuff in memory
  • Increases speed
• Increase speed of data transfer from disk to memory
  • [read compressed data | decompress] is faster than [read uncompressed data]
  • Premise: Decompression algorithms are fast

Compression

• Dictionary
  • Make it small enough to keep in main memory
  • Make it so small that you can keep some postings lists in main memory too
• Postings files
  • Reduce disk space needed
  • Decrease time needed to read postings lists from disk
  • Large search engines keep a significant part of the postings in memory
    • Compression lets you keep more in memory

Index parameters vs. what we index

Size of:         word types (terms)   non-positional postings   positional postings
                 dictionary           non-positional index      positional index
                 Size (K)             Size (K)                  Size (K)
Unfiltered       484                  109,971                   197,879
No numbers       474                  100,680                   179,158
Case folding     392                   96,969                   179,158
30 stopwords     391                   83,390                   121,858
150 stopwords    391                   67,002                    94,517
stemming         322                   63,812                    94,517

Lossless vs. lossy compression

• Lossless compression: All information is preserved
  • What we mostly do in IR
• Lossy compression: Discard some information
• Several of the preprocessing steps can be viewed as lossy compression: case folding, stopwords, stemming, number elimination
• Prune postings entries that are unlikely to turn up in the top k list for any query
  • Almost no loss of quality for top k list

Vocabulary vs. collection size

• Heaps' law: M = kT^b
• M is the size of the vocabulary, T is the number of tokens in the collection
• Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
• In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
  • It is the simplest possible relationship between the two in log-log space
  • An empirical finding ("empirical law")
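To get a feel for the numbers, Heaps' law can be evaluated directly. The sketch below uses the typical parameter range quoted above. The collection size of one million tokens is an arbitrary illustration, not a figure from the course.

```python
def heaps_vocabulary(num_tokens, k, b=0.5):
    """Heaps' law estimate M = k * T**b of the vocabulary size."""
    return k * num_tokens ** b


T = 1_000_000  # hypothetical collection size in tokens
low = heaps_vocabulary(T, k=30)
high = heaps_vocabulary(T, k=100)
print(round(low), round(high))
```

With b = 0.5, quadrupling the number of tokens only doubles the predicted vocabulary, which is why the dictionary stays much smaller than the postings.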

Vocabulary vs. collection size

• x-axis: text size
• y-axis: number of distinct vocabulary elements present in the text

Zipf's law

• Heaps' law gives the vocabulary size in collections
• We also study the relative frequencies of terms
• In natural language, there are a few very frequent terms and very many very rare terms
• Zipf's law: The ith most frequent term has frequency proportional to 1/i
• cf_i ∝ 1/i = K/i where K is a normalizing constant
• cf_i is collection frequency: the number of occurrences of the term t_i in the collection

Zipf's law

Zipf consequences

• Zipf's law states that given a natural language corpus, the frequency of any word is inversely proportional to its rank in the frequency table.
• Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Zipf consequences

• For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million)
• True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852)
• Only 135 vocabulary items are needed to account for half the Brown Corpus
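The Brown Corpus counts quoted above can be compared against the K/i prediction. This is a small sketch under one assumption: K is fixed to the count of the top-ranked word, so the rank-1 prediction is exact by construction.

```python
# Observed counts from the slide, listed in rank order.
observed = {"the": 69971, "of": 36411, "and": 28852}

K = observed["the"]  # assumption: calibrate K on the rank-1 word
predicted = {word: K / rank for rank, word in enumerate(observed, start=1)}

for word, count in observed.items():
    ratio = count / predicted[word]
    print(f"{word}: observed {count}, predicted {predicted[word]:.0f}, ratio {ratio:.2f}")
```

The prediction for "of" is within a few percent of the observed count, while "and" occurs noticeably more often than K/3. Zipf's law is an empirical approximation, not an exact rule.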

Why compress the dictionary?

• Search begins with the dictionary
• We want to keep it in memory
• Memory footprint competition with other applications
  • Memory footprint: amount of memory a program uses
• Embedded/mobile devices may have very little memory
• Even if the dictionary isn't in memory, we want it to be small for a fast search startup time

Dictionary storage - first cut

• Array of fixed-width entries
  • ~400,000 terms; 28 bytes/term = 11.2 MB

Terms      Freq.      Postings ptr.
a          656,265
aachen     65
….         ….
zulu       221
20 bytes   4 bytes each

Dictionary search structure

Fixed-width terms are wasteful

• Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms
  • And we still can't handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons
• Ave. dictionary word in English: ~8 characters
  • How do we use ~8 characters per dictionary term?

Dictionary as a string

• Store dictionary as a (long) string of characters:
  • Pointer to next word shows end of current word
  • Hope to save up to 60% of dictionary space

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.   Postings ptr.   Term ptr.
33
29
44
126

Total string length = 400K x 8B = 3.2MB
Pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits → 3 bytes

Blocking

• Store pointers to every kth term string
• Need to store term lengths (1 extra byte)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Freq.   Postings ptr.   Term ptr.
33
29
44
126

Save 9 bytes on 3 pointers.
Lose 4 bytes on term lengths.
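The dictionary sizes on the last few slides follow from simple arithmetic, reproduced below as a sketch. The block size k = 4 is an assumption inferred from "save 9 bytes on 3 pointers, lose 4 bytes on term lengths". One MB is taken as 10^6 bytes, and the constant names are illustrative.

```python
TERMS = 400_000
MB = 1_000_000

# First cut: 20-byte term field + 4-byte frequency + 4-byte postings pointer.
fixed_width = TERMS * (20 + 4 + 4)

# Dictionary as a string: 4-byte frequency, 4-byte postings pointer,
# 3-byte term pointer, plus ~8 characters per term in the shared string.
as_string = TERMS * (4 + 4 + 3 + 8)

# Blocking with k = 4 (assumed): per block, drop 3 term pointers of 3 bytes
# each and add 4 one-byte term lengths.
K_BLOCK = 4
saved_per_block = (K_BLOCK - 1) * 3 - K_BLOCK * 1
blocked = as_string - (TERMS // K_BLOCK) * saved_per_block

print(fixed_width / MB, as_string / MB, blocked / MB)
```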

Front coding

• Sorted words commonly have long common prefix – store differences only (see wildcard queries in next lecture)

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

• "8automat*" encodes automat
• The numbers before ◊ give the extra length beyond automat
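A minimal sketch that reproduces the front-coded string above for one block of sorted terms. The function name is hypothetical and the output layout simply mimics the slide: length of the first term, the shared prefix, `*`, the first suffix, then `length◊suffix` for each remaining term. A real implementation would store lengths as bytes rather than decimal digits.

```python
import os


def front_code_block(block):
    """Front-code one block of sorted terms against their common prefix."""
    # os.path.commonprefix compares strings character by character.
    prefix = os.path.commonprefix(block)
    first, rest = block[0], block[1:]
    out = f"{len(first)}{prefix}*{first[len(prefix):]}"
    for term in rest:
        suffix = term[len(prefix):]
        out += f"{len(suffix)}◊{suffix}"
    return out


block = ["automata", "automate", "automatic", "automation"]
print(front_code_block(block))
```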

RCV1 dictionary compression summary

Technique                                           Size in MB
Fixed width                                         11.2
Dictionary as string with pointers to every term     7.6
Blocking                                             7.1
Blocking + front coding                              5.9

Postings compression

• We store the list of docs containing a term in increasing order of docID
  • computer: 33,47,154,159,202 …
• Consequence: it suffices to store gaps.
  • 33,14,107,5,43 …
• Hope: most gaps can be encoded/stored with far fewer than 20 bits
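Gap encoding and its inverse can be sketched in a few lines. The function names are illustrative. The first entry is kept as an absolute docID and every later entry is the difference from its predecessor.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace each docID after the first by its gap to the previous docID."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Recover the docIDs as running sums of the gaps."""
    return list(accumulate(gaps))


computer = [33, 47, 154, 159, 202]
gaps = to_gaps(computer)
print(gaps, max(g.bit_length() for g in gaps))
```

For this list the largest gap needs 7 bits, well under the 20 bits assumed per docID. The saving only materialises when paired with a variable-length code for the gaps.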

Three postings entries

Unstructured vs. semi-structured data

Unstructured data:
• Typically refers to free text
• Allows
  • Keyword queries including operators
  • More sophisticated "concept" queries, e.g.,
    • find all web pages dealing with drug abuse

Semi-structured data:
• In fact almost no data is "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates "semi-structured" search, e.g.,
  • Title contains data AND Bullets contain search

Takeaways

• Term-document incidence
• Inverted index
• Boolean query
• Merging
• Query optimization
• Phrase queries
• Optimization
• Skip pointers
• Compression

Introduction to Information Retrieval

Lecture 2: Tolerant Retrieval

GIAN Course
Big Social Data Analysis

Parsing a document

• What format is it in?
  • pdf/word/excel/html?
• What language is it in?
• What character set is in use?
  • e.g., CP1252, UTF-8

Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …

Complications: Format/language

• Documents being indexed can include docs from many different languages
  • A single index may contain terms of several languages
• Sometimes a document or its components can contain multiple languages/formats
  • e.g., French email with a German pdf attachment
• There are commercial and open source libraries that can handle a lot of this stuff

Complications: What is a document?

• We return from our query "documents" but there are often interesting questions of grain size, e.g.,
• What is a unit document?
  • A file?
  • An email? An email with 5 attachments?
  • A group of files (e.g., PPT or LaTeX as HTML pages)

Tokenization

• Input: "Friends, Romans and Countrymen"
• Output: Tokens
  • Friends
  • Romans
  • Countrymen
• A token is an instance of a sequence of characters
• Each such token is now a candidate for an index entry, after further processing

Tokenization

• Issues in tokenization:
  • Finland's capital → Finland AND s? Finlands? Finland's?
  • Hewlett-Packard → Hewlett and Packard as two tokens?
    • state-of-the-art: break up hyphenated sequence.
    • co-education
    • lowercase, lower-case, lower case?
    • It can be effective to get the user to put in possible hyphens
  • San Francisco: one token or two?
    • How do you decide it is one token?

Numbers

• 3/20/91     Mar. 12, 1991     20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
  • Often have embedded spaces
  • Older IR systems may not index numbers
    • But often very useful: think about things like looking up error codes/stacktraces on the Web
    • (One answer is using n-grams: Lecture 3)
  • Will often index "meta-data" separately
    • Creation date, format, etc.

Tokenization: Language issues

• French: e.g. L'ensemble → one token or two?
  • L ? L' ? Le ?
  • Want l'ensemble to match with un ensemble
    • Until at least 2003, it didn't on Google: Internationalization!
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • "life insurance company employee"
  • German retrieval systems benefit greatly from a compound splitter module
    • Can give a 15% performance boost for German

Tokenization: Language issues

• Chinese and Japanese have no spaces between words:
  • [Chinese example sentence]
  • Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats
  • [Japanese example sentence mixing Katakana, Hiragana, Kanji and Romaji, containing "500" and "$500K (6,000 …)"]

End-user can express query entirely in hiragana!

Tokenization: Language issues

• Arabic (or Hebrew) is written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
  • [Arabic example sentence]   ← → ← →   ← start
  • "Algeria achieved its independence in 1962 after 132 years of French occupation"
• With Unicode, the surface presentation is complex, but the stored form is straightforward

Stopwords

• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  • They have little semantic content: the, a, and, to, be
  • There are a lot of them: ~30% of postings for top 30 words
• But the trend is away from doing this:
  • Good compression techniques mean the space for including stopwords in a system is very small
  • Good query optimization techniques mean you pay little at query time for including stop words
  • You need them for:
    • Phrase queries: "King of Denmark"
    • Various song titles, etc.: "Let it be", "To be or not to be"
    • "Relational" queries: "flights to London"

Normalization

• We need to "normalize" words in indexed text as well as query words into the same form
  • We want to match U.S.A. and USA
• Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
  • deleting periods to form a term
    • U.S.A., USA → USA
  • deleting hyphens to form a term
    • anti-discriminatory, antidiscriminatory → antidiscriminatory
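The two equivalence-classing rules above are easy to state in code. This is a minimal sketch with a hypothetical `normalize` helper. It deliberately does nothing beyond deleting periods and hyphens, and the same function must be applied to both indexed tokens and query tokens.

```python
def normalize(token):
    """Map a token to its term by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


for token in ("U.S.A.", "USA", "anti-discriminatory", "antidiscriminatory"):
    print(token, "->", normalize(token))
```

Rules this blunt also merge tokens that should stay distinct. For instance, "B-52" becomes "B52", which may or may not be desirable.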

Normalization: Other languages

• Accents: e.g., French résumé vs. resume
• Umlauts: e.g., German: Tuebingen vs. Tübingen
  • Should be equivalent
• Most important criterion:
  • How do your users like to write their queries for these words?
• Even in languages that standardly have accents, users often may not type them
  • Often best to normalize to a de-accented term
    • Tuebingen, Tübingen, Tubingen → Tubingen

Normalization: Other languages

• Normalization of things like date forms
  • 7月30日 vs. 7/30
  • Japanese use of kana vs. Chinese characters
• Tokenization and normalization may depend on the language and so are intertwined with language detection
  • Morgen will ich in MIT … Is this German "mit"?
• Crucial: Need to "normalize" indexed text as well as query terms identically

Case folding

• Reduce all letters to lowercase
  • exception: upper case in mid-sentence?
    • e.g., General Motors
    • Fed vs. fed
    • SAIL vs. sail
  • Often best to lower case everything, since users will use lowercase regardless of 'correct' capitalization…

Thesauri and Soundex

• Do we handle synonyms and different spellings?
  • E.g., by hand-constructed equivalence classes
    • car = automobile     color = colour
  • We can rewrite to form equivalence-class terms
    • When the document contains automobile, index it under car as well (and vice-versa)
  • Or we can expand a query
    • When the query contains automobile, look under car as well
• What about spelling mistakes?
  • One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics, e.g.,
    c u l8r => /siː/juː/ˈleɪtəʳ/ => see you later

Lemmatization

• Reduce inflectional/variant forms to base form
• E.g.,
  • am, are, is → be
  • car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing "proper" reduction to dictionary headword form

Stemming

• Different wordforms
  • e.g., automate(s), automatic, automation
• Reduce terms to their "roots" before indexing
• "Stemming" suggests crude affix chopping
  • language dependent
  • e.g., automates, automatic, automation all reduced to automat

Before: for example compressed and compression are both accepted as equivalent to compress.
After:  for exampl compress and compress ar both accept as equival to compress

Lemmatization pros

Word            Lemmatization   Stemming
democrat        democrat        democrat
democrats       democrat        democrat
democratic      democratic      democrat
democratize     democratize     democrat
democratized    democratize     democrat
democratizing   democratize     democrat
democratise     democratize     democrat

• Preserves POS tags, e.g., democrat (Noun) vs. democratic (Adj) vs. democratize (Verb)

Lemmatization pros

• Improves concept extraction through text normalization, e.g., noun/verb inflection elimination

eats_burger, ate_burger, eating_burger, eat_burgers, eat_the_burger, eaten_burger, eating_of_burgers  =>  eat_burger

Lemmatization cons

• Lemmatization may lead to misunderstanding

cross       =>  crossing
get_better  =>  get_the_better_of
window      =>  Windows
kick_ball   =>  kick_balls

Porter's algorithm

• Most popular algorithm for stemming English
  • Results suggest it's at least as good as other stemming options
• Example rules
  • sses → ss              (caresses → caress)
  • ies → i                (ponies → poni)
  • (m>1) ement → nothing  (replacement → replac; cement → cement)
• Conventions + 5 phases of reductions
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: Of the rules in a compound command, select the one that applies to the longest suffix

Exercise

• What is the purpose of including the following rule?
  • ss → ss
• [Hint] s → nothing
  • e.g., boss → boss
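The longest-suffix convention, and the role of the seemingly useless rule ss → ss, can be seen in a small sketch of one compound command. It contains only the four plural rules mentioned on these slides. It is not the full Porter stemmer, and the name `strip_plural` is illustrative.

```python
# One compound command: (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def strip_plural(word):
    """Apply the single rule whose suffix is the longest one matching `word`."""
    matching = [(suffix, repl) for suffix, repl in RULES if word.endswith(suffix)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl


for word in ("caresses", "ponies", "boss", "cats"):
    print(word, "->", strip_plural(word))
```

Without the ss → ss entry, the only rule matching "boss" would be s → nothing, which would produce "bos". The longer no-op rule wins and protects the word.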

Wild-card queries

• mon*: find all docs containing any word beginning "mon"
• Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
• *mon: find words ending in "mon": harder
  • Maintain an additional B-tree for terms backwards
    Can retrieve all words in range: nom ≤ w < non

Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Wild-card queries

• How can we handle * in the middle of query term?
  • co*tion
• Solution 1: Look up co* in a B-tree and find all terms ending with "tion"
• Solution 2: We could look up co* AND *tion in a B-tree and intersect the two term sets
  • Both are expensive
• Solution 3: transform wild-card queries so that the * occurs at the end
  • Permuterm Index

Permuterm index

• For term hello, index under:
  • hello$, ello$h, llo$he, lo$hel, o$hell, $hello
    where $ is a special symbol
• Queries:
  • X      lookup on X$
  • X*     lookup on $X*
  • *X     lookup on X$*
  • *X*    lookup on X*
  • X*Y    lookup on Y$X*

Query = hel*o: X=hel, Y=o, so lookup o$hel*
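A minimal sketch of a permuterm index for queries with exactly one `*`. A sorted Python list with binary search stands in for the B-tree. The toy lexicon and all function names are assumptions made for illustration.

```python
from bisect import bisect_left


def rotations(term):
    """All rotations of term + '$', e.g. hello -> hello$, ello$h, ..., $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


def build_permuterm(lexicon):
    """Sorted (rotation, original term) pairs; stands in for a B-tree."""
    return sorted((rot, term) for term in lexicon for rot in rotations(term))


def rotate_query(query):
    """Rotate a single-* query so the * is at the end; return the prefix to look up."""
    s = query + "$"
    star = s.index("*")
    return s[star + 1:] + s[:star]


def lookup(index, query):
    prefix = rotate_query(query)
    start = bisect_left(index, (prefix, ""))
    matches = set()
    for rot, term in index[start:]:
        if not rot.startswith(prefix):
            break  # sorted order: no later rotation can match
        matches.add(term)
    return sorted(matches)


index = build_permuterm(["hello", "help", "halo", "hero"])
print(rotate_query("hel*o"), lookup(index, "hel*o"))
```

Every term of length n contributes n + 1 rotations, which is where the lexicon blow-up comes from.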

Permuterm query processing

• Rotate query wild-card to the right
• Now use B-tree lookup as before
• Permuterm problem: ≈ quadruples lexicon size (empirical observation for English)

Processing wild-card queries

• As before, we must execute a Boolean query for each enumerated, filtered term
• Wild-cards can result in expensive query execution (very large disjunctions…)
  • pyth* AND prog*
• If you encourage "laziness" people will respond!
  • Search: Type your search terms, use * if you need to. E.g., Alex* will match Alexander.

Spell correction

• Two principal uses
  • Correcting documents being indexed
  • Correcting user queries to retrieve "right" answers
• Two main flavors:
  • Isolated word
    • Check each word on its own for misspelling
    • Will not catch typos resulting in correctly spelled words
      • e.g., from → form or sue → use
  • Context-sensitive
    • Look at surrounding words,
    • e.g., I flew form Heathrow to Narita

Spell correction

• Based on syntax
  • Related to intrinsic property of words
• Based on probabilities
  • Related to probabilities of words to appear in a specific context (neighboring words)
• Based on preference
  • Related to clicks/choices of users

Document correction

• Especially needed for OCR and ASR
  • Correction algorithms: e.g., "rn" vs. "m" or "write" vs. "right"
  • Can use domain-specific knowledge
    • E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing)
• But also: web pages and even printed material have typos
• Goal: the dictionary contains fewer misspellings
• But often we don't change the documents but aim to fix the query-document mapping

Query misspellings

• We can either
  • Retrieve documents indexed by the correct spelling, OR
  • Return several suggested alternative queries with the correct spelling
    • Did you mean … ?

Isolated word correction

• Fundamental premise – there is a lexicon from which the correct spellings come
• Two basic choices for this
  • A standard lexicon such as
    • Webster's English Dictionary
    • An "industry-specific" lexicon – hand-maintained
  • The lexicon of the indexed corpus
    • E.g., all words on the web
    • All names, acronyms etc. (including the misspellings)

Isolated word correction

• Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
• What's "closest"?
• We'll study several alternatives
  • Edit distance (Levenshtein distance)
  • Weighted edit distance
  • n-gram overlap

Edit distance

• Given two strings S1 and S2, the minimum number of operations to convert one to the other
• Operations are typically character-level
  • Insert, Delete, Replace (e.g., distance = 1)
  • Transposition (e.g., distance = 2)
• E.g., the edit distance from dof to dog is 1
  • From cat to act is 2 (just 1 with transpose)
  • From cat to dog is 3
• Generally found by dynamic programming
  • See http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/

Levenshtein distance

• lev_{cats,fast}(4,4) = min of
  • lev_{cats,fast}(4,3) + 1 (insertion)
  • lev_{cats,fast}(3,4) + 1 (deletion)
  • lev_{cats,fast}(3,3) + 1 (replacement)

Levenshtein distance

         f   a   s   t
     0   1   2   3   4
c    1   1   2   3   4
a    2   2   1   2   3
t    3   3   2   2   2
s    4   4   3   2   3
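The table above can be regenerated with the standard dynamic program. This is a minimal sketch with unit costs for insert, delete and replace and no transposition. The function returns the whole table so the bottom-right cell is the distance.

```python
def levenshtein_table(s1, s2):
    """Return the DP table d where d[i][j] = edit distance(s1[:i], s2[:j])."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j] + 1,         # deletion
                d[i - 1][j - 1] + cost,  # replacement (free if characters match)
            )
    return d


def edit_distance(s1, s2):
    return levenshtein_table(s1, s2)[-1][-1]


for row in levenshtein_table("cats", "fast"):
    print(row)
```

The table has (|S1| + 1) × (|S2| + 1) cells and each is filled in constant time. This cost is what makes computing the distance against every dictionary term expensive.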

Exercise: Levenshtein distance

• Compute the edit distance between "NTU" and "NUS" by filling out the following table for the Levenshtein distance algorithm.

         N   T   U
     0
N
U
S

Weighted edit distance

• As above, but the weight of an operation depends on the character(s) involved
  • OCR errors: e.g. O – D (more likely) or O – I
  • Keyboard errors: e.g. O – D or O – I (more likely)
  • This may be formulated as a probability model
• Requires weight matrix as input
• Modify dynamic programming to handle weights

Edit distance to all dictionary terms?

• Given a (misspelled) query – do we compute its edit distance to every dictionary term?
  • Expensive and slow
  • Alternative?
• How do we cut the set of candidate dictionary terms?
• One possibility is to use n-gram overlap for this
• This can also be used by itself for spelling correction

n-gram overlap

• Enumerate all the n-grams in the query string as well as in the lexicon
• Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams
• Threshold by number of matching n-grams
  • Variants – weight by keyboard layout, etc.

Example with trigrams

• Suppose the text is november
  • Trigrams are nov, ove, vem, emb, mbe, ber
• The query is december
  • Trigrams are dec, ece, cem, emb, mbe, ber
• So 3 trigrams overlap (of 6 in each term)
• How can we turn this into a normalized measure of overlap?

One option – Jaccard coefficient (J)

• A commonly-used measure of overlap
• Let X and Y be two sets; then J is
    J = |X ∩ Y| / |X ∪ Y|
• Equals 1 when X and Y have the same elements and zero when they are disjoint
• X and Y don't have to be of the same size
• Always assigns a number between 0 and 1
  • Now threshold to decide if you have a match
  • E.g., if J > 0.8, declare a match
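Applying the Jaccard coefficient to the trigram example gives a concrete number. This is a minimal sketch. The helper names are illustrative, and returning 0.0 for two empty sets is an arbitrary choice made here.

```python
def ngrams(term, n=3):
    """Set of character n-grams of a term (no padding symbols)."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|; defined here as 0.0 when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


nov = ngrams("november")
dec = ngrams("december")
print(sorted(nov & dec), jaccard(nov, dec))
```

Three shared trigrams out of nine distinct ones give J = 1/3, well below a threshold such as 0.8. The pair november/december would not be declared a match.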

GIAN%Course%,%Big%Social%Data%Analysis% !! !!

Calculating J and dJ

• Given A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes
  • M11: number of attributes where A and B both have a value of 1
  • M01: number of attributes where the attribute of A is 0 and the attribute of B is 1
  • M10: number of attributes where the attribute of A is 1 and the attribute of B is 0
  • M00: number of attributes where A and B both have a value of 0
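With these counts, the standard definitions are J = M11 / (M01 + M10 + M11) and the Jaccard distance dJ = 1 - J = (M01 + M10) / (M01 + M10 + M11); M00 does not enter either formula. A minimal sketch follows; the function name, the example vectors, and the treatment of the all-zero case are assumptions made for illustration.

```python
def jaccard_binary(a, b):
    """Return (J, dJ) for two equal-length sequences of 0/1 attributes."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of attributes")
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    denom = m01 + m10 + m11  # M00 is deliberately ignored
    # Assumed convention: if neither vector has any 1s, treat them as identical.
    j = m11 / denom if denom else 1.0
    return j, 1.0 - j

A = [1, 1, 0, 0, 1, 0]
B = [1, 0, 0, 1, 1, 0]
print(jaccard_binary(A, B))  # (0.5, 0.5): M11 = 2, M10 = 1, M01 = 1
```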


Context-sensitive spell correction

• Text: I flew from Heathrow to Narita
• Consider the phrase query "flew form Heathrow"
• We'd like to respond
    Did you mean "flew from Heathrow"?
  because the query probably didn't match any document and the alternative is "statistically more plausible"


Context-sensitive correction

• Need surrounding context to catch this
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Now try all possible resulting phrases with one word "fixed" at a time
  • flew from heathrow
  • fled form heathrow
  • flea form heathrow
• Hit-based spelling correction: suggest the alternative that has lots of hits
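A toy sketch of the hit-based idea: replace one query term at a time and keep the phrase that occurs in the most documents. The three documents and the `alternatives` table are invented for the example; in practice the alternatives would come from a weighted edit-distance lookup and the hit counts from the index.

```python
def one_word_variants(terms, alternatives):
    """Yield phrases that differ from the query in exactly one term."""
    for i, term in enumerate(terms):
        for alt in alternatives.get(term, ()):
            yield tuple(terms[:i]) + (alt,) + tuple(terms[i + 1:])

def count_hits(phrase, documents):
    """Count documents containing `phrase` as a contiguous token sequence."""
    n = len(phrase)
    hits = 0
    for doc in documents:
        tokens = tuple(doc.lower().split())
        if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            hits += 1
    return hits

# Hypothetical collection and nearby dictionary terms.
documents = [
    "I flew from Heathrow to Narita",
    "She flew from Heathrow last week",
    "The form was signed at Heathrow",
]
alternatives = {"flew": ["fled", "flea"], "form": ["from", "fore"], "heathrow": []}

query = ["flew", "form", "heathrow"]
variants = list(one_word_variants(query, alternatives))
best = max(variants, key=lambda p: count_hits(p, documents))

print(" ".join(best), count_hits(best, documents))  # flew from heathrow 2
```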


Context-sensitive correction

• Generalization (through lemmatization)
• FLY -> from
  • fly from
  • flying from
  • flies from
  • flown from
  • flew from
• See also fly_to, fly_through, fly_away, etc.
• But "fly" can also be a NOUN or an ADJ...
  • Need for POS tags


Another approach

• Break phrase query into a conjunction of biwords
  • E.g. "flew form", "form Heathrow"
• Look for biwords that need only one term corrected
  • "X form", "form Heathrow"
  • "flew Y", "Y Heathrow"
  • "flew form", "form Z"
• Enumerate phrase matches and ... rank them
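The biword patterns listed above can be generated mechanically. In this sketch the function names and the "?" placeholder (standing in for X, Y, and Z) are illustrative assumptions; matching the patterns against a biword index and ranking the results is left out.

```python
def biwords(terms):
    """Return the consecutive word pairs of a phrase."""
    return list(zip(terms, terms[1:]))

def one_slot_patterns(terms, slot="?"):
    """For each position, open that term for correction and list the biwords."""
    patterns = []
    for i in range(len(terms)):
        relaxed = terms[:i] + [slot] + terms[i + 1:]
        patterns.append(biwords(relaxed))
    return patterns

query = ["flew", "form", "Heathrow"]
for pattern in one_slot_patterns(query):
    print(pattern)
# [('?', 'form'), ('form', 'Heathrow')]
# [('flew', '?'), ('?', 'Heathrow')]
# [('flew', 'form'), ('form', '?')]
```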


General issues in spell correction

• We enumerate multiple alternatives for "Did you mean?"
• Need to figure out which to present to the user
  • Use heuristics
  • The alternative hitting most docs
  • Query log analysis
• Spell correction is computationally expensive
  • Avoid running routinely on every query?
  • Run only on queries that matched few docs


Soundex

• Class of heuristics to expand a query into phonetic equivalents
  • Language specific, mainly for names
  • E.g., chebyshev (English) → tchebycheff (French)
• Invented for the U.S. census ... in 1918


Soundex

• Turn every token to be indexed into a 4-character reduced form
  • E.g., Herman becomes H655
• Do the same with query terms
• Build and search an index on the reduced forms
  • (when the query calls for a soundex match)

• http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm


Soundex: typical algorithm

1. Retain the first letter of the word
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'
3. Change letters to digits as follows:
   • B, F, P, V → 1
   • C, G, J, K, Q, S, X, Z → 2
   • D, T → 3
   • L → 4
   • M, N → 5
   • R → 6


Soundex: typical algorithm

4. Repeatedly remove one out of each pair of consecutive identical digits
5. Remove all zeros from the resulting string
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>

E.g., Herman becomes H655

Will hermann generate the same code?
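A sketch of the six steps as listed on these slides, which can also be used to answer the question about hermann. Two readings are assumed here: step 4 is taken to mean collapsing each run of identical adjacent digits into one, and only the letters after the first are encoded. Database implementations of Soundex may treat H, W, and the first letter differently, so their output can differ from this variant.

```python
# Letter-to-digit table from steps 2 and 3.
CODES = {
    **dict.fromkeys("AEIOUHWY", "0"),
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(word):
    """Return the 4-character reduced form of `word` (slide variant)."""
    letters = [c for c in word.upper() if c in CODES]
    if not letters:
        return ""
    first = letters[0]                           # step 1
    digits = [CODES[c] for c in letters[1:]]     # steps 2 and 3
    collapsed = []
    for d in digits:                             # step 4
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    kept = [d for d in collapsed if d != "0"]    # step 5
    return (first + "".join(kept) + "000")[:4]   # step 6

print(soundex("Herman"))   # H655
print(soundex("hermann"))  # H655
```

Under these assumptions the doubled "n" in hermann collapses to a single 5 in step 4, so both spellings map to H655.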


Beyond Soundex

• Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
• How useful is soundex?
  • Not very, for information retrieval
  • Okay for "high recall" tasks (e.g., Interpol), though biased to names of certain nationalities
• Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR


IPA normalization

• Pronunciation more consistent than orthography
  → Phonetic-based approach to normalization
• International Phonetic Alphabet (IPA)


Do not reinvent the wheel!

• Stemmers, lemmatizers, and spell correctors are widely and freely available on the Web in all major programming languages
• For Python, I recommend:
  • NLTK Lemmatizer
    • http://nltk.org
  • Peter Norvig's spell corrector
    • http://norvig.com/spell-correct.html
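To give a feel for what such a spell corrector does, here is a condensed sketch in the spirit of the edit-and-rank approach: generate every string one edit away and keep the most frequent known word. This is not Norvig's code; the `WORD_COUNTS` table is a made-up stand-in for counts estimated from a large corpus, and only single edits are considered.

```python
import string
from collections import Counter

# Hypothetical word frequencies, for illustration only.
WORD_COUNTS = Counter({"from": 120, "form": 40, "flew": 15, "fled": 5, "heathrow": 8})

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return `word` if known, else the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    known = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(known, key=WORD_COUNTS.__getitem__) if known else word

print(correct("frmo"))     # from
print(correct("heathrw"))  # heathrow
print(correct("form"))     # form
```

The last call shows the limit of isolated-word correction: "form" is a valid word, so catching it in "flew form Heathrow" needs the surrounding context.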


Takeaways

• Tokenization
• Linguistic analysis
  • Normalization
  • Stemming
  • Lemmatization
• Query
  • Wild-card queries
  • Spell correction
  • Soundex
