Big Data, NLP and Industry Applicationsmausam/courses/csl772/autumn2014/lectures/l… · Big Data,...

© 2013 IBM Corporation Big Data, NLP and Industry Applications L Venkata Subramaniam IBM Research India


Page 2

Content

§  Introduction
– What is Big Data?
– MapReduce
– Noise in text and ways of handling it

§  Application in Commerce
– Processing internal and external data

Page 3

Big Data – Big Analytics

Traditional / Non-traditional data sources

Analytics delivery

Powerful Analytics

Algo Trading

Telco churn prediction

Smart Grid

Cyber Security

Government / Law enforcement

ICU Monitoring

Environment Monitoring

Velocity: insights in microseconds

Volume: terabytes per second, petabytes per day

Variety: all kinds of data, all kinds of analytics

Veracity: data in doubt (inconsistent, ambiguous)

Think Big

Page 4

What is Hadoop?

An open-source, batch-oriented (offline), data- and I/O-intensive general-purpose framework for creating distributed applications that process huge amounts of data.

HUGE

- A few thousand machines

- Petabytes of data

- Processing thousands of jobs each week

Page 5

What’s so special about Hadoop?

Storage
•  Distributed
•  Reliable
•  Commodity gear

MapReduce
•  Parallel programming
•  Fault tolerant

Scalable: new nodes can be added on the fly

Affordable: massively parallel computing on commodity servers

Flexible: Hadoop is schema-less and can absorb any type of data

Fault tolerant: through the MapReduce software framework

Page 6

Map-Reduce

§  Compute Avg of B for each distinct value of A

      A    B    C
R1    1   10   12
R2    2   20   34
R3    1   10   22
R4    1   30   56
R5    3   40   17
R6    2   10   49
R7    1   20   44

MAP 1 (rows R1–R3) emits: (1, 10) (2, 20) (1, 10)

MAP 2 (rows R4–R7) emits: (1, 30) (3, 40) (2, 10) (1, 20)

Reducer 1 receives (1, [10, 10, 30, 20]) and outputs (1, 17.5)

Reducer 2 receives (2, [20, 10]) and (3, [40]) and outputs (2, 15) (3, 40)

Task: Map (break task into small parts), then Reduce (many results to a single result set)
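
The example on this slide can be simulated in a few lines of plain Python. This is only a single-process sketch of the map, group-by-key and reduce steps, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative assumptions, and the split into two mappers follows the pairs shown on the slide.

```python
from collections import defaultdict

# Rows from the slide as (A, B, C).
ROWS = [
    (1, 10, 12), (2, 20, 34), (1, 10, 22),               # R1-R3, handled by MAP 1
    (1, 30, 56), (3, 40, 17), (2, 10, 49), (1, 20, 44),  # R4-R7, handled by MAP 2
]

def map_phase(rows):
    """Emit one (A, B) pair per row; column C is not needed for this query."""
    return [(a, b) for a, b, _c in rows]

def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Average the grouped B values for each distinct value of A."""
    return {key: sum(values) / len(values) for key, values in groups.items()}

def average_b_per_a(rows, split_at=3):
    # Two simulated mappers; their outputs are concatenated before grouping.
    emitted = map_phase(rows[:split_at]) + map_phase(rows[split_at:])
    return reduce_phase(shuffle(emitted))

result = average_b_per_a(ROWS)
print(result)
```

In a real cluster the grouping step is done by the framework, and keys are partitioned across reducers (here key 1 would go to Reducer 1, keys 2 and 3 to Reducer 2).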

Page 7

What is Noise

Page 8

Big Data, Fast Data, Noisy Data

30% world population on the internet and increasing fast

There are more social networking accounts than people in the world

Social Networking overtakes Search: Facebook becomes the most visited website ahead of Google

I’ll see ya tomo

RIP Jackson

I’m lookie out 4 a car 2 burn rubber on the streets of LA

What should I buy?? A mini laptop with Windows OR a Apple MacBook!??!

Social Media Communication is meant for Friends

Noisy, Informal, Implicit and Contextual Conversations

Big Data: “More video content was uploaded onto YouTube in the past two months than all the new content ABC, CBS and NBC have been airing 24/7 since 1948.”

Type of Text                         WER
SMS (texting)                        50%
Tweets                               35%
ASR                                  30%
Web queries                          15%
OCR                                  5%
Newswire text (WSJ, Reuters, NYT)    0.005%

Up to 10,000 times more noisy

55 million Tweets per day

Lead Generation, Disaster Tracking

Large Dimensional, uncertain, unverified

Page 9

SMS

§  0 there – there
§  1 aint – are not
§  2 no – no
§  3 doubt – doubt
§  4 there – there
§  5 hon – honey
§  6 im – I am
§  7 gonna – going
§  8 be – be
§  9 takin – taking
§  10 it – it
§  11 4 – for
§  12 life – life
§  13 u – you
§  14 wont – wont
§  15 b – be
§  16 rida – rid of
§  17 me – me
§  18 lol – laugh out loud
§  19 Ray – (NAME)

Texting Language: Over 50% of the words are written in non-standard ways

Spontaneous Language: Use of slang, ungrammatical constructions, no punctuation, no case information

Mixing of Languages: Many SMS contain text in a mix of two or more languages

Type of Noise             %
Deletion of characters    48%
Phonetic substitution     33%
Abbreviations             5%
Dialectical usage         4%
Deletion of words         1.2%

In a sample of 101 SMSes, 52% of the words were non-standard (Contractor et al., 2010)

Page 10

What is Noisy Text?

§ Any kind of difference in the surface form of an electronic text from the intended, correct or original text (Knoblock et al., 2007)

§ Noise can be at the lexical level {b4, before, befour}
•  Resulting in substitution, insertion, deletion, transposition, run-on, and split.

§ Noise can be at the morphological, syntactic or discourse level {I can hear u, I can hear you, I can here you}
•  Resulting in substitution, insertion, deletion, transposition of words and the introduction of out-of-vocabulary words.

Page 11

Classifying Noise

Lexical Errors (Subramaniam et al., 2009)

§ Missing characters {before > bef}

§ Extra characters {raster > raaster}

§ Phonetic substitution {before > b4, late > l8}

§ Abbreviations {laugh out loud > lol, United Nations > UN}

Syntactical Errors (Kukich, 1992; Foster et al., 2007)

§ Missing Word {What are the subjects? > What the subjects?}

§ Extra word {Was that in the summer? > Was that in the summer it?}

§ Real word spelling errors {She could not comprehend. > She could no comprehend.}

§ Agreement {She steered Melissa round a corner. > She steered Melissa round a corners.}

§ Dialectical usage {I’m going to be there > I’m gonna be there}

Page 12

Techniques for Automatically Detecting Lexical Errors (Kukich, 1992)

§ Efficient methods to detect strings that do not appear in a given word list, dictionary or lexicon
– Nonword error detection

§ Two approaches
– N-gram
•  Look up each n-gram in an input string in a precompiled table to ascertain either its existence or its frequency. Nonexistent or infrequent n-grams (shj, iqn) are identified as possible misspellings.
•  Good for identifying errors made by OCR devices
•  But unusual or foreign-language valid words will be flagged, and plausible-looking mistakes will be accepted as valid
– Dictionary based
•  Does the input string appear in a dictionary? If not, the string is flagged as a misspelled word.
•  But nearly two-thirds of the words in a dictionary did not appear in an eight-million-word corpus of New York Times text and, conversely, two-thirds of the words in the text were not in the dictionary (1986 study)
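
The two detection approaches above can be sketched as follows. This is a toy illustration under stated assumptions: the tiny `LEXICON` is made up for the example, character trigrams stand in for the precompiled n-gram table, and only n-gram existence (not frequency) is checked.

```python
def char_ngrams(word, n=3):
    """All character n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_ngram_table(lexicon, n=3):
    """Precompile the set of n-grams seen in known words."""
    table = set()
    for word in lexicon:
        table.update(char_ngrams(word.lower(), n))
    return table

def suspicious_ngrams(word, table, n=3):
    """N-gram approach: return the n-grams of `word` that never occur in the table."""
    return [g for g in char_ngrams(word.lower(), n) if g not in table]

def is_nonword(word, lexicon):
    """Dictionary approach: flag any string that is not in the lexicon."""
    return word.lower() not in lexicon

# Hypothetical toy lexicon; a real system would use a large word list.
LEXICON = {"before", "form", "from", "late", "raster", "summer", "subjects"}
TABLE = build_ngram_table(LEXICON)

print(suspicious_ngrams("befshj", TABLE))   # contains the unseen trigram "shj"
print(suspicious_ngrams("beform", TABLE))   # plausible-looking nonword, no unseen trigram
print(is_nonword("beform", LEXICON))        # the dictionary check still flags it
print(is_nonword("form", LEXICON))          # a real-word error (form for from) passes both
```

The example shows the weakness noted on the slide: "beform" is built entirely from trigrams of valid words, so the n-gram check accepts it, while neither check can catch "form" typed in place of "from".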

Page 13

Techniques for Automatically Detecting Incorrect Grammar (Syntax) (Foster et al., 2007)

§ Efficient methods to detect word sequences that do not form a grammatical sentence

§ Three approaches
– N-gram
•  Classifies a sentence as ungrammatical if it contains an unusual part-of-speech sequence
– Precision grammar
•  Classifies a sentence using a parser and a broad-coverage hand-written grammar
– Probabilistic parsing
•  Finds sentences with parsing errors

Page 14

Spelling Correction (Kukich, 1992)

§  Isolated Word Correction
– Minimum edit distance techniques
– Similarity key techniques
– Probabilistic techniques
– N-gram-based techniques
– Rule-based techniques
– Will not catch typos resulting in correctly spelled words {form, from}
– Estimates put real-word errors at 30% of all word errors

§ Context-Dependent Word Correction
– Parsing
– Language models
– Can errors be ignored and a meaningful interpretation still be made? {I am coming with you, I comes with you}
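
As an illustration of the first isolated-word technique, here is a minimal sketch of minimum edit distance (plain Levenshtein distance with unit costs, no transposition operation) and a nearest-word corrector. The small `LEXICON` and the tie-breaking rule in `correct` are assumptions made for the example.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute or match
            ))
        previous = current
    return previous[-1]

def correct(word, lexicon):
    """Return the lexicon entry closest to `word` (ties broken alphabetically)."""
    return min(lexicon, key=lambda candidate: (edit_distance(word, candidate), candidate))

# Hypothetical toy lexicon.
LEXICON = ["before", "form", "from", "tomorrow"]

print(edit_distance("tomorow", "tomorrow"))
print(correct("tomorow", LEXICON))
print(correct("befour", LEXICON))
print(correct("form", LEXICON))
```

The last call returns "form" unchanged: a typo that yields a correctly spelled word is at distance zero from the lexicon, which is why context-dependent correction is needed for real-word errors.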

Page 15

Tomorrow never dies!!!

§  2moro (9)
§  tomoz (25)
§  tomoro (12)
§  tomrw (5)
§  tom (2)
§  tomra (2)
§  tomorrow (24)
§  tomora (4)
§  tomm (1)
§  tomo (3)
§  tomorow (3)
§  2mro (2)
§  morrow (1)
§  tomor (2)
§  tmorro (1)
§  moro (1)

dis is n eg 4 txtin lang

This is an example for Texting language

§ Extreme corruption of words and sentences

§ Models for SMS language are lacking

SMS Text Normalization

Page 16

Finding Canonical Sets (Acharyya, 2009)

§ Learn mappings

§ How can we do it in an unsupervised way ?

§ Find some invariant, that does not change in spite of corruptions

§ Buckets of context seem invariant!
– <..Back Bucket..> sceam <..Front Bucket..>
– sceam : sms(2) new(5) recharge(4) tel-provider(2) about(3)
– <..Back Bucket..> scheme <..Front Bucket..>
– scheme : sms(4) new(2) activate(3) tel-provider(2) about(1) recharge(1)

costmer, castumar, kustamar, coustomber → customer
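
One way to make the "buckets of context" idea concrete is to compare the context-word counts of two surface forms. The sketch below uses cosine similarity on the counts listed for "sceam" and "scheme"; cosine similarity is an assumption chosen for illustration and is not necessarily the measure used in the cited work, and the `UNRELATED` bucket is invented for contrast.

```python
import math

def cosine(bucket_a, bucket_b):
    """Cosine similarity between two bags of context-word counts."""
    shared = bucket_a.keys() & bucket_b.keys()
    dot = sum(bucket_a[w] * bucket_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in bucket_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bucket_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Context counts from the slide.
SCEAM = {"sms": 2, "new": 5, "recharge": 4, "tel-provider": 2, "about": 3}
SCHEME = {"sms": 4, "new": 2, "activate": 3, "tel-provider": 2, "about": 1, "recharge": 1}
# Hypothetical bucket for an unrelated word.
UNRELATED = {"weather": 3, "rain": 2, "today": 4}

print(round(cosine(SCEAM, SCHEME), 3))
print(cosine(SCEAM, UNRELATED))
```

Variants whose buckets are similar can then be grouped into one canonical set without any labelled training data.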

Page 17

To whom does it matter?

§  Research community :-)

§  Business community: new tools, new capabilities, new infrastructure, new business models, etc.

Financial Services..

Page 18

Customer 360

§  Understand customer needs: target individual customers better by understanding their spend patterns, spend locations, intent, sentiment, life events and propensity

§  Combine internal and external data for a 360 view of the customer
§  For each customer, determine their product propensity
§  Predict where, when and what spend is likely for a customer
§  Roll out offers based on customer propensities

•  By combining internal and external data from enterprise, social and mobile, determine where, when and what her next spend will be

Food

books

Entertainment

For each customer determine his/her behavior or signature in terms of movement patterns, hangouts, intent and spends

Arti Mehra

Page 19

Entity (people, products, events) Insights

The problem: What are the key product interests of person A?
Solution: Over time, learn about the person’s product interests from her social media postings

The problem: What is the location and trajectory of person B?
Solution: Gives the current location and locations in the past

The problem: What life events happened in person A’s life in the past x months?
Solution: List significant events like marriage, birth of a child, relocation, etc.

The problem: What are the events of interest happening in a given location?
Solution: Lists the top events in a given geography

The problem: What is the sentiment on a given product?
Solution: Gives the sentiment on a product

Key Sustained Value Factor:

Understand customers wants and needs better

IBM Confidential

Analyze social data in the context of enterprise data to build entity and event profiles and establish linkages between them for online and offline analysis

What does 360 context mean?

§  Builds an entity’s complete profile by aggregating data about the entity from social and enterprise data sources. Here an entity refers to people, products, brands and events.

360 Context

intent to purchase for

customers

Application Domains

Social Data

Enterprise Databases

User Domains

propensities/ sentiment/intent •  event Detection •  entity Linkages •  sentiment

core customer view/transactions •  event Profiles •  entity Profiles

Smarter Commerce

Smarter Cities

real-time public safety events

Page 20

Enterprise and Social Media Integration

Personal Attributes
•  Identifiers: name, address, age, gender, occupation…
•  Interests: beaches, cuisines, historical architects, etc.
•  Life cycle status: marital, parental

Relationships
•  Personal relationships: family, friends and roommates…
•  Business relationships: co-workers and work/interest network…

Travel Interests
•  Personal preferences of class, travel interests, social influence, cuisines, hotels
•  Travel history

Social Media

Facebook Twitter

Life Events
•  Life-changing events: relocation, having a baby, getting married, getting divorced, buying a house…

Profile: Name, Address, Designation, Contact Number, Membership Status

Transaction History: Purchase Date, Time, Store, Item, Amount

Timely Insights
•  Intent to travel to various destinations, airline preference, travel class, business
•  Last travel

CRM: Request Id, Complaint, sentiments

360 View Creation, Querying, Analysis and Reporting

Internal Profile

@Ram Sarin: I need a smart phone that’s smarter than my old phone…suggestions?

Store: Pit Stop, Delhi Airport Amount: Rs 1,650 Date: 17:35, 3rd June 2014

Assemble an entity view of the customer by aggregating data from thousands of different documents spread across internal and external, structured and unstructured, text and non-text sources

Movement Patterns
•  Hangouts (weekend, weekday, evening, etc.)
•  Spatio-temporal signatures

Name: Ram Sarin Shukla Address: Bhogal, New Delhi Corporate: IBM

Page 21

Social Media Analytics

Spatio-Temporal Hangout Analysis

360 View from Multiple Documents Spend Analytics

Page 22

Influencing Online Search using Social Analytics


Signed Customer Guest

Enter Search Term

Customer-Specific Search Results

Trending Products

Social Influenced Search:
•  Customers open the commerce site
•  Customers are shown a list of trending products based on the social feed
•  Once they are logged in, they see more specific products and brands associated with themselves and their network

Benefits:
•  Allows customers to see what is hot not just within a merchant’s site, but in a more social context

Page 23

Offer Generation using Social Analytics


Customer

Trigger: Life event, Current Spend, Spend Behaviour

Customer-Specific Offers

Social Influenced Offers:
•  Live offers are ranked using customer propensity based on life events, spend behaviour and hangout analysis
•  Customers are sent the offers with the highest likelihood of acceptance
•  Right-time offers, such as when the customer is shopping in the mall

Benefits:
•  Allows customers to be presented with precisely relevant offers based on specific needs

Demo: http://malhar.irl.in.ibm.com:8080/MDM_MovieAnalysis/

Page 24

Data Analytics Pipeline

Data Ingest and Prep

Extract Buzz, Intent, Sentiment

Entity Analytics: Profile Resolution

Real time analytics. Pre-defined views and charts

Dashboard

Stream Computing and Analytics

BigInsights System and Analytics

Online flow: Data-in-motion analysis

Offline flow: Data-at-rest analysis

Pre-defined Workbooks and Dashboards

Social Media Data + Enterprise Data

Extract Buzz, Intent, Sentiment and Consumer Profiles

Entity Analytics and Integration

Comprehensive Customer Profiles

Social Media

Enterprise

CRM

Tx

Spatio-Temporal Analytics and Integration

Extract Spatial Location & Temporal Presence

Page 25

Advanced Text Analytics and Entity Resolution

§  Input
– For each user, his/her social media data collected over time that represents his/her online content in terms of text message postings, product mentions, check-ins

§  Output
– Social media profile of the user
•  Sentiment on different products, likes, dislikes
•  Intent to purchase, purchases and product mentions
•  Location details in terms of where certain activities were carried out
•  Extracted life events
– Matching of social media profile to enterprise profiles

§  Novelty
– Works on the BigInsights Map-Reduce framework with high scalability
– Identification and categorization of key concepts from noisy social media data
– Entity resolution across enterprise customer data with social media profiles using sparsely populated attributes. Uses name and context clues such as customer clues, product purchases, and different granularities of location clues gathered from social media

Page 26

Example Analysis: Extraction from Twitter messages

Remove spam

Extract intent, interests, life events and micro-segmentation attributes

I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR

 @silliesraghu good!!! U shouldnt! Think about the important stuff, like ur birthday ;) btw happy birthday Raghu ;)

@rakonturdelhi im moving to delhi in 3 months. i look foward to the new lifestyle

I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!

Monetizable Intent

Relocation

Location

Name, Birth Day

Subtle Spam, Advertising

Sarcasm, Wishful Thinking

While accounting for less relevant messages

I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes

http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile

@purplepleather Gotta do more research my Versace term paper 2day. Before I die, I want a versace purple diamond tiara. Im just sayin>lol

had so much fun today! I want to buy a million dollar house with a wrap around porch ... ... wading river on the long island sound, ha i wish!
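
To give a feel for the extraction step, here is a deliberately naive rule-based sketch using regular expressions on the example messages above. The three patterns and the `extract` helper are assumptions made for illustration; the slides do not describe the actual extractors, which would need far more than keyword rules.

```python
import re

# Hypothetical surface patterns for three of the labelled categories.
CHECKIN = re.compile(r"^I'm at (?P<venue>.+?) \((?P<address>[^)]*)\)")
RELOCATION = re.compile(r"\bmoving to (?P<city>\w+)", re.IGNORECASE)
WANT = re.compile(r"\bwant an? (?P<item>\w+)", re.IGNORECASE)

def extract(tweet):
    """Return whatever facts the naive patterns find in one message."""
    facts = {}
    match = CHECKIN.search(tweet)
    if match:
        facts["location"] = (match.group("venue"), match.group("address"))
    match = RELOCATION.search(tweet)
    if match:
        facts["relocation"] = match.group("city")
    match = WANT.search(tweet)
    if match:
        facts["intent"] = match.group("item")
    return facts

checkin = "I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR"
moving = "@rakonturdelhi im moving to delhi in 3 months. i look foward to the new lifestyle"
buying = "I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!"
wishful = "Before I die, I want a versace purple diamond tiara. Im just sayin>lol"

for tweet in (checkin, moving, buying, wishful):
    print(extract(tweet))
```

The last message also triggers the purchase-intent pattern, which is exactly the sarcasm and wishful-thinking problem the slide warns about: surface patterns alone cannot separate monetizable intent from less relevant messages.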

Page 27

Prior Business Transactions

Customer ready to buy a DSLR camera today, possibly at a nearby mall

Buying a DSLR today !

Thrza gr8 deal on ZX-550 @ the mall

Go for the best, DP-2000

Michael’s online friends offer lots of advice

Wifey’s birthday tomorrow, looking for a killer dslr

Sarcasm, Wishful Thinking

Maybe I should buy her that purple roadster, while I’m at it. ;-) lol

Text Analytics used to extract intent from Social Media Married, Male, Spouse Birthdate, Gift Type, Intent to Purchase, Timeframe

Intent to Purchase, Gift Type?

Potential Locations and

Activity

In NYC area this w/e, any good malls nearby?

Region & City Location, Timeframe, Intent to Shop

§  Resultant fact base contains billions of facts, and is incrementally updated
§  Fact segmentation or clustering is rapid enough to drive a business decision

More data: Customer intent extracted from social media provides context

Entity Extraction, Fact Discovery, Intent & Sentiment

Social Data

450M+ tweets/day

Millions of tweets yield one company-specific fact

Influencers

Buying DSLR today!

Intent

Page 28

Example Analysis: Linking Social Profiles with Customer Database

§  Identify candidate matches based on name and location
§  Advanced text analysis provides strong supporting evidence that is leveraged by the matching algorithm:
–  Explicit customer interaction through social media customer support
–  Mention of company product use
§  Customer identity confirmed through additional disambiguation techniques
–  E.g., uniqueness of customer information (name, products owned) in specific geography

Name: Tarun Rana Screen name: @tarunrana Extracted Location: Delhi

Messages

This stupid <bankname> app never works for me !!!!

@BankSupport well how long is my online deposit gonna take it has been like three days … i might as well have gone to the atm

Customer DB
Cust.#     12345
Name       Tarun Rana
Address    DX 40, Kendriya Vihar
City       Gurgaon
State      Haryana

All names and identifiers are fictitious

Semantic Name Variations
Tarun Rana vs. Rana, Tarun Kumar
K. Singh vs. Karan (Happy) Singh

Geo Proximity
Gurgaon, Haryana vs. New Delhi
UP vs. Delhi

Job Role Disambiguation
“Software sales manager at IBM…” vs. “Managing SPSS Sales for Canada…”
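
The name-variation cases above can be handled by a simple token comparison. The following is a minimal sketch, assuming that token order, punctuation and parenthesised nicknames should be ignored and that a single letter may stand for an initial; the function names are hypothetical and this is not the matching algorithm used in the system described.

```python
import re

def name_tokens(name):
    """Lower-case alphabetic tokens; 'Last, First' order and punctuation are ignored."""
    return re.findall(r"[a-z]+", name.lower())

def tokens_match(a, b):
    """Tokens match if equal, or if one is a single-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)

def names_compatible(name_a, name_b):
    """True if every token of the shorter name matches a distinct token of the longer."""
    shorter, longer = sorted((name_tokens(name_a), name_tokens(name_b)), key=len)
    remaining = list(longer)
    for token in shorter:
        hit = next((c for c in remaining if tokens_match(token, c)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    return bool(shorter)

print(names_compatible("Tarun Rana", "Rana, Tarun Kumar"))
print(names_compatible("K. Singh", "Karan (Happy) Singh"))
print(names_compatible("Tarun Rana", "Karan Singh"))
```

A compatible name only produces a candidate match; as the slide notes, location and supporting evidence from the messages are still needed to confirm the customer's identity.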

Page 29

Geo-located documents

Textual clues

Example Analysis: Inferring Location from Multiple Clues

Social Media Profile
Name: Tarun Rana
Screen name: @tarunrana
Location: Delhi
Description: just a Nor-Cal gal trying to fall in love with Florida

Messages

I'm at Barista, SDA (New Delhi) http://4sq.com/SZ3yjj

I'm at S.o.G (New Delhi) http://4sq.com/UCweM5

Gotta love wc football #wc #brazil http://instagr.am/p/QOHPqabdYt/

I'm at Eats American Grill (New Delhi) http://4sq.com/O2a1Jm

Check out my blog about #food in #ChandniChowk http://www.citybythebay.com

Who's watching the #brazil tonight? (from 27.27989014,-82.34825406)

Fusion libraries:
•  Confidence: place mentions vs. geo-codes
•  Analysis of location time-series

Check-ins

Metadata

Fusion libraries:
•  Confidence: metadata vs. content

Disambiguation, fusion of partial information

Temporary location

Permanent location
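
A rough sketch of how partial location clues might be fused is given below. Everything in it is an assumption for illustration: the observations are taken to be already resolved to a place by a geocoder, the per-source confidence weights are invented, and "temporary location" is simplified to the most recent observation when it differs from the best-supported place.

```python
from collections import Counter

# (source, resolved place) in time order; the resolution step itself is assumed.
OBSERVATIONS = [
    ("profile", "Delhi"),
    ("checkin", "Delhi"),
    ("checkin", "Delhi"),
    ("checkin", "Delhi"),
    ("hashtag", "Delhi"),
    ("geocode", "Florida"),
]

# Hypothetical confidence per clue type.
WEIGHTS = {"profile": 1.0, "checkin": 2.0, "hashtag": 0.5, "geocode": 3.0}

def fuse(observations, weights):
    """Return (permanent, temporary, scores) from weighted votes over clues."""
    scores = Counter()
    for source, place in observations:
        scores[place] += weights.get(source, 1.0)
    permanent = max(scores, key=scores.get)
    latest = observations[-1][1]
    temporary = latest if latest != permanent else None
    return permanent, temporary, dict(scores)

permanent, temporary, scores = fuse(OBSERVATIONS, WEIGHTS)
print(permanent, temporary, scores)
```

With these made-up weights the repeated check-ins outweigh the single geo-coded message, so the geo-code is treated as a temporary location rather than the permanent one.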

All names and identifiers are fictitious

Page 30

Advanced spatio-temporal analytics

§  Input
– For each user, his/her spatio-temporal data collected over time that represents his/her behavior or signature in terms of movement, hangouts, spend patterns (from card swipes, mobile location data)

§  Output
– Given a specific time in future, the possible location(s) at which the user could be present along with possible spends
– Given specific location(s), the possible time in future at which the user could visit the location(s) and likely spends at the location(s)
– Given a specific location and/or time, the likely spend

§  Novelty
– Spend Prediction using Periodicity from User’s Behavior: We use the concept of periodicity to predict the location and related spend of a user. E.g., “When you are in the mall, what are you likely to purchase?”
– Spatio-Temporal Signature: We use the concepts of time-series mining for predicting the probability of future temporal occurrence of a user at a given set of spatial locations. E.g., “When will user U next be in the vicinity of Red Fort?”
– Offer Propensity: We statistically model the customer to predict his/her spend behaviour based on location, time, past spends, etc.

Page 31

The Concept of a ST- Card Swipe Signature

Card owner’s swipes, by day (location and time interval):

Day 1: Loc 2 8:30-8:40 | Loc 3 8:40-9:00 | Loc 4 9:00-18:00 | Loc 5 18:00-18:30 | Loc 1 20:30-8:30
Day 2: Loc 2 8:30-8:40 | Loc 3 8:40-9:00 | Loc 4 9:00-18:00 | Loc 3 18:00-18:20 | Loc 2 18:20-18:30 | Loc 1 18:30-8:30
Day 3: Loc 5 8:30-9:00 | Loc 4 9:00-18:00 | Loc 5 18:00-18:05 | Loc 1 18:05-8:30
Day 4: Loc 2 8:30-8:40 | Loc 3 8:40-9:00 | Loc 4 9:00-18:00 | Loc 5 18:00-18:30 | Loc 1 18:30-8:30
Day 5: Loc 5 8:30-9:00 | Loc 4 9:00-18:00 | Loc 3 18:00-18:20 | Loc 2 18:20-18:30 | Loc 1 18:30-8:30
Day 6: Loc 6 12:30-12:40 | Loc 7 12:40-14:00 | Loc 6 14:00-14:10 | Loc 1 18:30-12:30
Day 7: Loc 6 12:30-12:40 | Loc 7 12:40-20:00 | Loc 6 20:00-20:10 | Loc 1 14:20-12:30
…

Locations: Loc 1, 2, 3, 4, 5, 6, 7

MetaData (location type, merchant type, amount, etc.)

ST-Signature

Page 32

Example Analysis: Predicting Next Purchase Location/Time

Signature Generator

Spatial Prediction Query: When will the user swipe his card at location X again?

Temporal Prediction Query: Where will the user swipe his card next?

Spatio-Temporal Prediction

Predicted Temporal Values

Predicted Spatial Locations
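
The two query types can be illustrated with simple frequency counting over a signature. This sketch uses the daytime visits of Days 1 to 5 from the card swipe signature slide and answers each query with the most frequent observation; it is a stand-in for illustration only, since the slides refer to periodicity and time-series mining rather than plain counting. The function names are hypothetical.

```python
from collections import Counter

# Daytime visits for Days 1-5: (location, start, end) as "H:MM" strings.
SIGNATURE = {
    1: [("Loc 2", "8:30", "8:40"), ("Loc 3", "8:40", "9:00"),
        ("Loc 4", "9:00", "18:00"), ("Loc 5", "18:00", "18:30")],
    2: [("Loc 2", "8:30", "8:40"), ("Loc 3", "8:40", "9:00"),
        ("Loc 4", "9:00", "18:00"), ("Loc 3", "18:00", "18:20"),
        ("Loc 2", "18:20", "18:30")],
    3: [("Loc 5", "8:30", "9:00"), ("Loc 4", "9:00", "18:00"),
        ("Loc 5", "18:00", "18:05")],
    4: [("Loc 2", "8:30", "8:40"), ("Loc 3", "8:40", "9:00"),
        ("Loc 4", "9:00", "18:00"), ("Loc 5", "18:00", "18:30")],
    5: [("Loc 5", "8:30", "9:00"), ("Loc 4", "9:00", "18:00"),
        ("Loc 3", "18:00", "18:20"), ("Loc 2", "18:20", "18:30")],
}

def minutes(hhmm):
    """Convert 'H:MM' to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)

def likely_location(signature, at):
    """Where is the user most often found at clock time `at`?"""
    t = minutes(at)
    votes = Counter(
        loc
        for visits in signature.values()
        for loc, start, end in visits
        if minutes(start) <= t < minutes(end)
    )
    return votes.most_common(1)[0][0] if votes else None

def likely_arrival(signature, location):
    """At what time does the user most often arrive at `location`?"""
    votes = Counter(
        start
        for visits in signature.values()
        for loc, start, _end in visits
        if loc == location
    )
    return votes.most_common(1)[0][0] if votes else None

print(likely_location(SIGNATURE, "8:35"))
print(likely_location(SIGNATURE, "12:00"))
print(likely_arrival(SIGNATURE, "Loc 5"))
```

`likely_arrival` corresponds to the "when will the user be at location X again" query and `likely_location` to the "where will the user be" query.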

Page 33

Example Analysis: Identifying Similar Users For Recommendations

Signature Generator

Signature Generator

Similarity Framework

Pruning

Similarity Computation

Similarity Visualization
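
The pruning and similarity computation stages might look like the sketch below. It assumes each user's signature is reduced to the set of locations visited, prunes user pairs that share no location via an inverted index, and scores the remaining pairs with Jaccard similarity; the user data, the choice of Jaccard and the function names are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical users, each reduced to the set of locations in the signature.
USERS = {
    "u1": {"Loc 1", "Loc 2", "Loc 3", "Loc 4", "Loc 5"},
    "u2": {"Loc 1", "Loc 4", "Loc 5", "Loc 6"},
    "u3": {"Loc 8", "Loc 9"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_pairs(users):
    """Pruning: keep only user pairs that share at least one location."""
    index = defaultdict(set)
    for user, locations in users.items():
        for location in locations:
            index[location].add(user)
    pairs = set()
    for members in index.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def similar_users(users):
    """Similarity computation on the surviving pairs only."""
    return {pair: jaccard(users[pair[0]], users[pair[1]])
            for pair in candidate_pairs(users)}

similarities = similar_users(USERS)
print(similarities)
```

Pruning matters at scale because comparing every pair of users grows quadratically, while most pairs have nothing in common.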

Page 34

Example Analysis: Categorization of Spending Patterns from Spatio-temporal Behavior

Signature Generator

Customer DB

Spending Patterns

Permanent Address

Current Address

Vacation Location

Social Media

Furnishings

Clothing & Apparel

Entertainment

Page 35

Advanced Temporal Analytics

▪  Input
– For each user, his/her spending patterns (card swipes, online payments, etc.) over time

▪  Output
– For different times in future, different product recommendations based on spending patterns

▪  Novelty
–  Captures contextual parameters for a better prediction
–  Allows personalized recommendations of a temporal nature
–  Makes use of domain ontologies to generalize the product purchases, as required for a better regression model
–  Allows predictions for non-periodically bought products
–  Identifies interesting package deals (tie-in sales)

Page 36

Harry Potter 1

Harry Potter 2

Reebok Shoes

iPhone 3 iPhone 4

Pantene Shampoo

Dove Conditioner

Pantene Shampoo

Dove Conditioner

Head & Shoulders Shampoo

Dove Conditioner

Books → Fiction, Medium price, Latest release, India

Electronics, High Price, Latest release, US

Books → Fiction, Medium price, Latest release, India

Electronics, High Price, Latest release, US

Hair Care product, low price, available for long, India

?

TODAY 1 MONTH later

?

Candidate: Harry Potter 3

Pepsodent Toothbrush

Pepsodent toothpaste

Colgate toothpaste

Pepsodent toothpaste

Oral B toothbrush

Shoes, iPhone 5

Reebok Shoes

Example Analysis: Predicting Next Purchase, Location, Time

Offline Component

Modeling purchase patterns

User History

Parameter Extractor

For every customer, for every product, extract quantity, location, high/low-end status, release date (latest/old), and so on

SystemML models can be used here. Regress for each product as well as for the general category of each product.

Regression Analysis

Online Component

Current Product Catalog

Recommender

Alerts, Reminders and updates

For predicting the recommendations, use the categorical periodicity and other parameters for best matches
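
As a simplified illustration of category-level periodicity, the sketch below predicts the next purchase date in a category as the last purchase date plus the mean gap between past purchases. This mean-gap rule is a stand-in chosen for brevity, not the regression models mentioned above, and the purchase history is invented.

```python
from datetime import date, timedelta

def predict_next_purchase(purchase_dates):
    """Last purchase date plus the mean gap in days; None if fewer than two purchases."""
    dates = sorted(purchase_dates)
    if len(dates) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return dates[-1] + timedelta(days=round(sum(gaps) / len(gaps)))

# Hypothetical purchase history, grouped by product category.
HISTORY = {
    "hair care": [date(2014, 1, 5), date(2014, 2, 4), date(2014, 3, 6)],
    "books/fiction": [date(2014, 1, 20)],
}

for category, dates in HISTORY.items():
    print(category, predict_next_purchase(dates))
```

Working at the category level is what lets a purchase of one shampoo brand predict the timing of the next hair care purchase, even if the brand changes; a single purchase, as in the books example, gives no periodicity to work with.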

The temporal product catalog to be used

Prediction Timestamp

Page 37

Conclusion

§ Aggregate information from social media, enterprise and transaction data – a lot of information in a timely manner from everywhere about customers

– Social Media is used extensively by people to express intent, sentiment, product preferences, life events, etc.

– Spatio-Temporal information can be extracted from multiple sources like social media check-ins, mobility data, etc.

– 360 Customer view enables determining when, where and what the next spend will be

Page 38

Thank You :-)