Tova Milo on Crowd Mining


Transcript of Tova Milo on Crowd Mining

Page 1: Tova Milo on Crowd Mining

Crowd Mining

Tova Milo

Page 2: Tova Milo on Crowd Mining

Data Everywhere

The amount and diversity of data being generated and collected is exploding:

Web pages, sensor data, satellite pictures, DNA sequences, …

Page 3: Tova Milo on Crowd Mining

From Data to Knowledge

Buried in this flood of data are the keys to:
• New economic opportunities
• Discoveries in medicine, science and the humanities
• Improving productivity & efficiency

However, raw data alone is not sufficient! We can only make sense of our world by turning this data into knowledge and insight.

Page 4: Tova Milo on Crowd Mining

The research frontier

• Knowledge representation
• Knowledge collection, transformation, integration, sharing
• Knowledge discovery

We will focus today on human knowledge

Think of humanity and its collective mind expanding…


Page 5: Tova Milo on Crowd Mining

Background – Crowd (Data) sourcing

The engagement of crowds of Web users for data procurement.

Page 6: Tova Milo on Crowd Mining

General goals of crowdsourcing

• Main goal – "outsourcing" a task (data collection & processing) to a crowd of users

• Kinds of tasks:
  – Tasks that can be performed by a computer, but inefficiently
  – Tasks that can be performed by a computer, but inaccurately
  – Tasks that can't be performed by a computer at all


So, are we done?


Page 7: Tova Milo on Crowd Mining

MoDaS (Mob Data Sourcing)

• Crowdsourcing has tons of potential, yet limited triumph…
  – Difficulty of managing huge volumes of data, and users of questionable quality and reliability

• Every single initiative had to battle, almost from scratch, the same non-trivial challenges

• The ad hoc solutions, even when successful, are application-specific and rarely sharable

⇒ Solid scientific foundations for Web-scale data sourcing, with a focus on knowledge management using the crowd


Page 8: Tova Milo on Crowd Mining

Challenges – (very) brief overview

• What questions to ask? [SIGMOD13, VLDB13, ICDT14, SIGMOD14]

• How to define & determine correctness of answers? [ICDE11, WWW12]

• Who to ask? How many people? How to best use the resources? [ICDE12, VLDB13, ICDT13, ICDE13]

Related areas: Data Mining, Data Cleaning, Probabilistic Data, Optimizations and Incremental Computation.

Page 9: Tova Milo on Crowd Mining

A simple example – crowd data sourcing (Qurk)

The goal: find the names of all the women in the people table.

The people table:

Picture   | name
[photo]   | Lucy
[photo]   | Don
[photo]   | Ken
…         | …

SELECT name
FROM people p
WHERE isFemale(p)

isFemale(%name, %photo)
Question: "Is %name a female?", %photo
Answers: "Yes" / "No"
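A minimal sketch of how such a crowd-powered predicate might be wired up, in Python. Everything here is illustrative: ask_crowd is a hypothetical stand-in for the crowdsourcing platform call, the photo file names are invented, and the majority vote is one plausible answer-aggregation policy, not necessarily Qurk's.

import random

def ask_crowd(question, photo, n=3):
    # Hypothetical stand-in for posting a question to a crowd
    # platform and collecting n yes/no answers; simulated here.
    return [random.choice(["Yes", "No"]) for _ in range(n)]

def is_female(name, photo):
    # Crowd-backed predicate: majority vote over the crowd's answers.
    answers = ask_crowd(f"Is {name} a female?", photo)
    return answers.count("Yes") > len(answers) / 2

# SELECT name FROM people p WHERE isFemale(p)
people = [("Lucy", "lucy.jpg"), ("Don", "don.jpg"), ("Ken", "ken.jpg")]
women = [name for name, photo in people if is_female(name, photo)]
print(women)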


Page 10: Tova Milo on Crowd Mining

A simple example – crowd data sourcing


Page 11: Tova Milo on Crowd Mining

Crowd Mining: Crowdsourcing in an open world

• Human knowledge forms an open world

• Assume we want to find out what is interesting and important in some domain area

Folk medicine, people’s habits, …

• What questions to ask?


Page 12: Tova Milo on Crowd Mining

Back to classic databases...

• Significant data patterns are identified using data mining techniques.

• A useful type of pattern: association rules

– E.g., stomach ache ⇒ chamomile

• Queries are dynamically constructed in the learning process

• Is it possible to mine the crowd?


Page 13: Tova Milo on Crowd Mining

Turning to the crowd

Let us model the history of every user as a personal database:

• Every case = a transaction consisting of items

• Not recorded anywhere – a hidden DB
  – It is hard for people to recall many details about many transactions!
  – But… they can often provide summaries, in the form of personal rules:
    "To treat a sore throat I often use garlic"

Example transactions:

Treated a sore throat with garlic and oregano leaves…

Treated a sore throat and low fever with garlic and ginger …

Treated a heartburn with water, baking soda and lemon…

Treated nausea with ginger, the patient experienced sleepiness……


Page 14: Tova Milo on Crowd Mining

Two types of questions

• Free recollection (mostly simple, prominent patterns) – open questions
  "Tell me about an illness and how you treat it"
  → "I typically treat nausea with ginger infusion"

• Concrete questions (may be more complex) – closed questions
  "When a patient has both headaches and fever, how often do you use a willow tree bark infusion?"

We use the two types in an interleaved manner.

Page 15: Tova Milo on Crowd Mining

Contributions (at a very high level)

• A formal model for crowd mining: the allowed questions, the interpretation of answers, and personal rules and their overall significance

• A framework of the generic components required for mining the crowd

• Significance and error estimations [and how these change if we ask more questions…]

• Crowd-mining algorithms

• Implementation & benchmarks, on both synthetic & real data

Page 16: Tova Milo on Crowd Mining

The model: User support and confidence

• A set of users U
• Each user u ∈ U has a (hidden!) transaction database Du
• Each rule X ⇒ Y is associated with:
  – user support: supp_u(X ⇒ Y) = |{t ∈ Du : X ∪ Y ⊆ t}| / |Du|
  – user confidence: conf_u(X ⇒ Y) = |{t ∈ Du : X ∪ Y ⊆ t}| / |{t ∈ Du : X ⊆ t}|
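A minimal sketch of these two definitions in Python; the personal database Du is invented, echoing the folk-medicine examples above.

# Each transaction is a set of items (symptoms and treatments).
Du = [
    {"sore_throat", "garlic", "oregano_leaves"},
    {"sore_throat", "low_fever", "garlic", "ginger"},
    {"heartburn", "water", "baking_soda", "lemon"},
    {"nausea", "ginger"},
]

def user_support_confidence(X, Y, Du):
    # supp_u: fraction of transactions containing X ∪ Y;
    # conf_u: fraction of the X-transactions that also contain Y.
    both = sum(1 for t in Du if X | Y <= t)
    lhs = sum(1 for t in Du if X <= t)
    support = both / len(Du)
    confidence = both / lhs if lhs else 0.0
    return support, confidence

print(user_support_confidence({"sore_throat"}, {"garlic"}, Du))
# -> (0.5, 1.0): garlic is used in both sore-throat cases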


Page 17: Tova Milo on Crowd Mining

Model for closed and open questions

• Closed questions: about a specific rule X ⇒ Y
  – Answer: the (approximate) user support and confidence

• Open questions: ? ⇒ ?
  – Answer: an arbitrary rule with its user support and confidence
    "I typically have a headache once a week. In 90% of the times, coffee helps."

Page 18: Tova Milo on Crowd Mining

Significant rules

• Significant rules: rules whose mean user support and mean user confidence, averaged over all users u ∈ U, are above specified thresholds Θs and Θc, respectively

• Goal: identify the significant rules while asking the crowd the smallest possible number of questions

Page 19: Tova Milo on Crowd Mining

Framework components

• A generic framework for crowd mining
• One particular implementation choice for each black box

Page 20: Tova Milo on Crowd Mining

Estimating the mean distribution

• Treating the current answers as a random sample from a hidden distribution, we can approximate the distribution of the hidden mean:
  – μ – the sample average
  – Σ – the sample covariance
  – K – the number of collected samples
  The hidden mean is then approximately normal, with mean μ and covariance Σ/K (see the sketch below).

• In a similar manner we estimate the hidden distribution itself.
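A minimal sketch of this estimate in Python; the (support, confidence) answer values are invented.

import numpy as np

# Answers collected so far for one rule, treated as samples
# from the hidden per-user distribution.
answers = np.array([[0.6, 0.9], [0.4, 0.8], [0.5, 1.0], [0.7, 0.7]])

K = len(answers)                       # number of collected samples
mu = answers.mean(axis=0)              # sample average
sigma = np.cov(answers, rowvar=False)  # sample covariance

# By the central limit theorem, the hidden mean is approximately
# normal with mean mu and covariance sigma / K.
mean_cov = sigma / K
print(mu, mean_cov)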

Page 21: Tova Milo on Crowd Mining

Rule significance and error probability

• Define Mr as the probability mass above both thresholds for rule r

• r is significant if Mr is greater than 0.5

• The error probability is the remaining mass

• Estimate how the error will change if another question is asked, and choose the rule with the largest expected error reduction (see the sketch below)
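A minimal sketch in Python, continuing the normal approximation above. The thresholds and distribution parameters are invented, and the Monte Carlo estimate of Mr stands in for whatever integration method the actual system uses.

import numpy as np

theta_s, theta_c = 0.5, 0.8  # assumed support/confidence thresholds

def mass_above_thresholds(mu, cov, n=100_000, seed=0):
    # Monte Carlo estimate of Mr: the probability that the hidden
    # mean (support, confidence) lies above both thresholds.
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, cov, size=n)
    return float(np.mean((samples[:, 0] >= theta_s) &
                         (samples[:, 1] >= theta_c)))

M_r = mass_above_thresholds(np.array([0.55, 0.85]),
                            np.array([[0.01, 0.0], [0.0, 0.01]]))
significant = M_r > 0.5
error_prob = min(M_r, 1 - M_r)  # the remaining probability mass
print(M_r, significant, error_prob)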

Page 22: Tova Milo on Crowd Mining

Completing the picture (first attempt…)

• Which rules should be considered?
  Similarly to classic data mining (e.g., Apriori): start with small rules, then expand to rules similar to significant rules.

• Should we ask an open or a closed question?
  Similarly to sequential sampling: use some fixed ratio of open/closed questions to balance the tradeoff between precision and recall (see the sketch below).
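A minimal sketch of one such question-selection step in Python; OPEN_RATIO, the candidate rules, and their expected error reductions are invented placeholders.

import random

OPEN_RATIO = 0.3  # assumed fixed ratio of open questions

def next_question(candidates, error_reduction):
    # Interleave open and closed questions at a fixed ratio; among
    # closed questions, ask about the rule whose answer is expected
    # to reduce the error probability the most.
    if not candidates or random.random() < OPEN_RATIO:
        return ("open", None)  # e.g., "Tell me about an illness..."
    best = max(candidates, key=lambda r: error_reduction[r])
    return ("closed", best)

candidates = ["sore throat => garlic", "nausea => ginger"]
error_reduction = {"sore throat => garlic": 0.12,
                   "nausea => ginger": 0.05}
print(next_question(candidates, error_reduction))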


Page 23: Tova Milo on Crowd Mining

Semantic knowledge can save work

Given a taxonomy of is-a relationships among items (e.g., espresso is a coffee):

frequent({headache, espresso}) ⇒ frequent({headache, coffee})

Advantages:
• Allows inference on itemset frequencies (see the sketch below)
• Allows avoiding semantically equivalent itemsets:
  {espresso}, {espresso, coffee}, {espresso, beverage}, …
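A minimal sketch of this inference in Python; the toy is-a taxonomy is invented.

from itertools import product

# Each item maps to its parent class in the taxonomy.
PARENTS = {"espresso": "coffee", "coffee": "beverage"}

def ancestors(item):
    # The item together with all of its generalizations.
    while item is not None:
        yield item
        item = PARENTS.get(item)

def implied_frequent(itemset):
    # If an itemset is frequent, replacing any item by an ancestor
    # yields another frequent itemset, e.g.
    # frequent({headache, espresso}) => frequent({headache, coffee}).
    options = [list(ancestors(i)) for i in itemset]
    return {frozenset(c) for c in product(*options)}

print(implied_frequent({"headache", "espresso"}))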


Page 24: Tova Milo on Crowd Mining

Completing the picture (second attempt…)

How to measure the efficiency of crowd mining algorithms?

• Two distinct cost factors:
  – Crowd complexity: the number of crowd queries used by the algorithm
  – Computational complexity: the complexity of computing the crowd queries and processing the answers
  [A crowd complexity lower bound is a trivial computational complexity lower bound]

• There is a tradeoff between the complexity measures:
  – Naïve question selection → more crowd questions

Page 25: Tova Milo on Crowd Mining

Complexity boundaries

• Notation:
  – |Ψ| – the taxonomy size
  – |I(Ψ)| – the number of itemsets (modulo equivalences)
  – |S(Ψ)| – the number of possible solutions
  – Maximal Frequent Itemsets (MFI), Minimal Infrequent Itemsets (MII)

log |S(Ψ)| ≤ crowd complexity ≤ |I(Ψ)|

Page 26: Tova Milo on Crowd Mining

Now, back to the bigger picture…

The user's question in natural language:

"I'm looking for activities to do in a child-friendly attraction in New York, and a good restaurant nearby"

Some of the answers:

"You can go bike riding in Central Park and eat at Maoz Vegetarian. Tip: rent bikes at the boathouse"

"You can visit the Bronx Zoo and eat at Pine Restaurant. Tips: order antipasti at Pine. Skip dessert and go for ice cream across the street"

Page 27: Tova Milo on Crowd Mining

Pros and Cons of Existing Solutions

• Web search returns valuable data
  – Requires further reading and filtering
  – Not all restaurants are appropriate after a sweaty activity
  – Can only retrieve data from existing records

• A forum is more likely to produce answers that match the precise information need
  – The number of obtained answers is typically small
  – Still requires reading, aggregating, identifying consensus…

• Our new, alternative approach: crowd mining!

Page 28: Tova Milo on Crowd Mining

Additional examples

A dietician may wish to study the culinary preferences in some population, focusing on food dishes that are rich in fiber.

A medical researcher may wish to study the usage of some ingredients in self-treatments of bodily symptoms, which may be related to a particular disease.

To answer these questions, one has to combine:
• General, ontological knowledge
  – E.g., the geographical locations of NYC attractions
• And personal, perhaps unrecorded, knowledge about people's habits and preferences
  – E.g., which are the most popular combinations of attractions and restaurants matching Ann's query

Page 29: Tova Milo on Crowd Mining

ID Transaction Details

1 <Football> doAt <Central_Park>

2 <Biking> doAt <Central_Park>. <BaseBall> doAt <Central_Park>. <Rent_Bikes> doAt <Boathouse>

3 <Falafel> eatAt <Maoz Veg.>

4 <Antipasti> eatAt <Pine>

5 <Visit> doAt <Bronx_Zoo>.<Antipasti> eatAt <Pine>

Page 30: Tova Milo on Crowd Mining

Crowd Mining Query language (based on SPARQL and DMQL)

• A SPARQL-like WHERE clause is evaluated against the ontology
  – $ indicates a variable
  – * indicates a path of length 0 or more

• The SATISFYING clause specifies the fact-sets to be mined from the crowd
  – MORE indicates that other frequently co-occurring facts can also be returned
  – SUPPORT indicates the significance threshold

SELECT FACT-SETS
WHERE {
  $w subclassOf* Attraction.
  $x instanceOf $w.
  $x inside NYC.
  $x hasLabel "child friendly".
  $y subclassOf* Activity.
  $z instanceOf Restaurant.
  $z nearBy $x.
}
SATISFYING {
  $y+ doAt $x.
  [] eatAt $z.
  MORE
}
WITH SUPPORT = 0.4

"Find popular combinations of an activity in a child-friendly attraction in NYC, and a good restaurant nearby (and any other relevant advice)."

Page 31: Tova Milo on Crowd Mining

(Open and Closed) Questions to the crowd

Closed questions:

"How often do you eat at Maoz Vegetarian?"
  { ( [] eatAt <Maoz_Vegeterian> ) }  →  Supp = 0.1

"How often do you swim in Central Park?"
  { ( <Swim> doAt <Central_Park> ) }  →  Supp = 0.3

Open questions:

"What do you do in Central Park?"
  { ( $y doAt <Central_Park> ) }  →  $y = Football, supp = 0.6

"What else do you do when biking in Central Park?"
  { ( <Biking> doAt <Central_Park> ), ( $y doAt <Central_Park> ) }  →  $y = Football, supp = 0.6

Page 32: Tova Milo on Crowd Mining

Example of extracted association rules

• < Ball Games, doAt, Central_Park > (Supp: 0.4, Conf: 1)
  ⇒ < [ ], eatAt, Maoz_Vegeterian > (Supp: 0.2, Conf: 0.2)

• < Visit, doAt, Bronx_Zoo > (Supp: 0.2, Conf: 1)
  ⇒ < [ ], eatAt, Pine_Restaurant > (Supp: 0.2, Conf: 0.2)

Page 33: Tova Milo on Crowd Mining

Can we trust the crowd?

Common solution: ask multiple times.

We may get different answers:
• Legitimate diversity
• Wrong answers/lies

Page 34: Tova Milo on Crowd Mining

Can we trust the crowd?

Page 35: Tova Milo on Crowd Mining

Things are non-trivial…

• Different experts for different areas

• "Difficult" questions vs. "simple" questions

• Data is added and updated all the time

• Optimal use of resources (both machine and human)

Solutions are based on:
• Statistical/mathematical models
• Declarative specifications
• Provenance

Page 36: Tova Milo on Crowd Mining

Summary

The crowd is an incredible resource!

Many challenges:
• (Very) interactive computation
• A huge amount of data
• Varying quality and trust

"Computers are useless, they can only give you answers" – Pablo Picasso

But, as it seems, they can also ask us questions!

Page 37: Tova Milo on Crowd Mining

Thanks


Antoine Amerilli, Yael Amsterdamer, Moria Ben Ner, Rubi Boim, Susan Davidson, Ohad Greenshpan, Benoit Gross, Yael Grossman, Ana Itin, Ezra Levin, Ilia Lotosh, Slava Novgordov, Neoklis Polyzotis, Sudeepa Roy, Pierre Senellart, Amit Somech, Wang-Chiew Tan…
