UMAP 2013 - Link, Like, Follow, Friend: The Social Element in User Modeling and Adaptation

Delft University of Technology Link, Like, Follow, Friend: The Social Element in User Modeling and Adaptation UMAP, Rome, June, 2013 Geert-Jan Houben Web Information Systems, TU Delft


Delft University of Technology

Link, Like, Follow, Friend: The Social Element in User Modeling and Adaptation UMAP, Rome, June, 2013

Geert-Jan Houben Web Information Systems, TU Delft

2

Social Web & UMAP

3

Social Web & UMAP

We observe, reflect, speculate, and raise discussion about evolutions and opportunities for UMAP to make a difference.

Triggered by the social element in UMAP and other conferences, and by our own experience in the field.

4

Social Web in UMAP: a number of mentions of ‘social’, and a small number of ‘social web’ in the papers.

New U (in UMAP), new users

And we see more. We see how the Social Web mirrors people, mirrors users.

What we learn on the Social Web, we learn (more) about users and for user modeling and adaptation.

5

UMAP in the new Web world

What we learn at the Social Web allows us to reconsider UMAP in the Web.

It brings new opportunities for us as researchers.

Perhaps it brings new needs.

Surely, these are opportunities that we can position within our UMAP research agenda and UMAP application portfolio.

6

SWUMAP: 1 + 1 = 3

Experience shows the value of combining:
•  Understanding & Creating
•  UM & AP
•  Machines & Humans
to arrive at a body of knowledge for turning insights about users and usage into added value in society and economy.

7

UMAP systems are Web systems

Lessons tell us to reconsider our system concept. On the Web systems are ‘in vivo’: open and dynamic.

•  Users & data are no longer ‘inside the system’.
•  Users & data change and move (more) quickly.

This impacts understanding and creating of systems. This also impacts the systems’ architecture. With the (Social) Web as our laboratory, this also impacts our research discipline.

8

APPLICATION

HUMANS FOR AUGMENTATION

USERS DOMAIN

DOMAIN Augmented with Web Semantics

USERS Augmented with Web Semantics

REAL DOMAIN

REAL USERS

9

10

Inspiring domain

11

Domain: Incidents and emergencies

In the literature we see a fair amount of attention for the domain of incidents and emergencies. Our own experience from several years is situated in that domain. It has given us a good feeling for what is needed and how UMAP research can be part of a bigger effort to solve real-world problems.

12

Domain: Incidents and emergencies

In the literature, most attention is directed towards understanding and detecting. Sometimes we see further objectives in responding, creating situational awareness (especially in massive incidents), and prevention. Twitter is the source most used in these studies.

13

2011 Tohoku earthquake: 1200 tweets/min

14

2011 Pukkelpop storm 570 tweets/min

15

Twitter

With Twitter, we have a whole new reflection of what is happening in the world. A whole new source of digital data that reflects the (real) world. We need to understand that reflection to understand the world and help the world. Two challenges: 1.  Understand the world, and 2.  Understand its reflection in the Social Web.

16

CrowdSense BV

http://twitcident.org http://tno.nl/twitcident http://twitcident.com

Twitcident spin-off collaboration

Our real-world lab

17

400+ million tweets per day
•  Netherlands ranks #1 in Twitter penetration

Twitter users publish about “anything”
•  Work/private life
•  Interesting events
•  Etc.

Twitter tells us a lot about the world. And its users can be seen to act as social sensors and citizen journalists.

Monitoring Twitter

18

Train accident: train driver had a stroke

19

Train hits a block First eyewitness

20

1 min. later Eyewitness

21

15 min. later Eyewitnesses

22

1:17 hour later News media

23

30 min. later Entertainment

wrong photo

24

A new source of knowledge

This is an example of the speed and the nature of the knowledge that Twitter provides about what really happened. It also shows what we need to know and understand to use and interpret this knowledge effectively.

25

1.  Early warning
•  Twitter users publish early signals that might indicate an increased risk or potential incident.

2.  Crisis management
•  (Eye-witness) Twitter users disseminate information about incidents, which can support operational emergency services.

3.  Post evaluation
•  Analyzing incident data in retrospect to measure the effectiveness of emergency services.

Twitcident goals

26

•  Emergency services: law enforcement, fire fighters, governments
•  Big event organizers: festival security companies
•  Utility organizations: public transport, energy supply, other vital infrastructures

Stakeholders

27

Pukkelpop 2011 Storm incident with casualties

28

80.000 tweets in 4 hours

29

570 tweets per min.

30

Could we see this impact coming?

Semantics 25 minutes before incident

1.  Weather: storm, cloud-burst, wind, …
2.  Locations: Brussel, Gent, Hasselt, …
3.  Intensity: heavy, crazy, massive, …
4.  Impact: hail balls, falling trees, …
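As a rough illustration of how such semantic categories could be counted in a stream of tweets, here is a minimal Python sketch. The lexicon, the tokenization, and the sample tweets are assumptions made for this example; they are not Twitcident's actual vocabulary or pipeline.

```python
from collections import Counter

# Hypothetical lexicon, loosely modelled on the four categories on the slide.
LEXICON = {
    "weather": {"storm", "cloud-burst", "wind"},
    "location": {"brussel", "gent", "hasselt"},
    "intensity": {"heavy", "crazy", "massive"},
    "impact": {"hail", "trees"},
}


def categorize(tweet: str) -> set:
    """Return the semantic categories whose terms occur in the tweet."""
    tokens = {tok.strip(".,!?#").lower() for tok in tweet.split()}
    return {cat for cat, terms in LEXICON.items() if tokens & terms}


def category_counts(tweets: list) -> Counter:
    """Count, per category, how many tweets mention at least one of its terms."""
    counts = Counter()
    for tweet in tweets:
        counts.update(categorize(tweet))
    return counts


# Invented sample tweets, for illustration only.
sample = [
    "Heavy storm heading to Hasselt",
    "Crazy wind here in Gent!",
    "Falling trees near the stage",
]
counts = category_counts(sample)
```

Counting per time window (for example per minute) instead of over the whole list would give the kind of signal that can be inspected before a peak.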

Impact storm Why is there a peak?


31

Damage reports from incident site

32

Real-time intelligence by photos

33

Example festival disaster

The research into this example created a lot of knowledge about what is possible and what is desired. It was also a good example to follow and approach new use cases to build more general understanding and theory.

34

Dutch national rail infrastructure company

Example public infrastructure

35

Social Weather Map

36

Twitcident processes 100k tweets/day

The social weather map provides ProRail with a timely and accurate overview of citizen observations, in addition to other sources of knowledge.

Value

37

Big Events: New Year’s Eve, Serious Request, Elections, Lowlands, Summer Carnaval, Fantasy Island, Queen’s Day

38

Crowd Control Room

39

Social media monitoring was done with 1-3 security officers

Violence, riots, fires, fireworks, crowds, ..

40

Not only monitoring

The previous examples are not only about monitoring Twitter to know what is happening out there.

41

She was about to turn 16…

42

So she invited some friends …

43

Who invited their friends…

44

She pulled back her invitation, but…

45


46

200.000 invited, 40.000 “going”

47

Atmosphere turned hostile

48

Teenagers vs. police

49

Alcohol & violence


50

Massive damage at local stores


51

Officials cannot ignore social media

52

Mayor of Haren resigns after Haren-debacle

53

Recommendations by Cohen

•  Clear communication strategy
•  Planning & organizing in advance
•  Social media monitoring
•  Clear intervention policy

54

Why social media monitoring?

Content tells what to expect

55

Finding the needle

56

Recommendation from experience

Let us go and find the needle that tells us what appears to be happening out there. But let us also think about how to support the action to make the world out there a better one.

57

Meaningful and actionable

Twitcident has taught us that information obtained from Twitter needs to be meaningful and actionable.

58

“Polling meaningful information”
“Sifting thousands of tweets during hurricane Irene”
“Getting situational awareness”
“Finding the eyes on the ground”

“Finding actionable information”
“Providing timely reaction”

“Volunteers are great”
“But we need hybrid approaches to monitor social media”

Patrick Meier

Today’s challenges

59

Hybrid approach

Twitcident has also shown us how these problems call for a hybrid approach with humans in the loop who handle and interpret the knowledge derived from the Social Web. Big Data is available from the Social Web, but Small Interpretations are needed to get it right!

60

Human interpretation inside

The nature of these problems means that solutions are not fully automatic.

They involve users of systems who help with interpretation and decision making.

This is a special kind of user that we (as UMAP) can consider, one that is fast growing and in urgent need of support.

61

Take home from experience

Learn from concrete cases:

•  Case-based experimental approaches bring the specific understanding and experience necessary for general understanding and theory.

•  Cases can have great value for stakeholders.

It is all about correct and actionable interpretation:

•  Make information meaningful and actionable in the context.

•  Employ hybrid, human-enhanced approaches for the context.

62

APPLICATION

HUMANS FOR AUGMENTATION

USERS DOMAIN

DOMAIN Augmented with Web Semantics

USERS Augmented with Web Semantics

REAL DOMAIN

REAL USERS

63

64

Technology for sense-making

65

Challenge: Making sense of Twitter

Inspired by different applications and domains, researchers have given attention to underlying technology for making sense of Twitter. ‘Finding the needle’ as the research challenge.

66

Technology for making sense

The sense-making usually relies on application- and domain-specific knowledge, and researchers investigate how to do it effectively. Semantics and interactivity prove to be important ingredients. In fact, it turns out that sense-making, i.e. finding the needle, is a combination of many things that need to come together.

67

Technology for making sense

68

Semantics for filtering and search

69

Semantics for filtering and search

In [HT2012] we considered what is needed as first steps in processing tweets, before we can ‘analyze’ them.

70

1.  (Automatic) Filtering: Given an incident, how can one automatically identify those tweets that are relevant to the incident?

2.  Search & Analytics: How can one improve search and analytical capabilities so that users can explore information in the streams of tweets?

Twitter streams

Challenges

Filtering

topic

Search & Analytics

information need

71

Dataset

Twitter corpus (TREC Microblog Track 2011):
•  16 million tweets (Jan. 24th – Feb. 8th, 2011)
•  4,766,901 tweets classified as English
•  6.2 million entity-extractions

News (same time period):
•  62 RSS news feeds
•  13,959 news articles
•  357,559 entity-extractions

72

Filtering evaluation

[Bar chart: MAP, P@10, P@30, and Recall for three strategies: Semantic Filtering, Semantic Filtering with News Contextualization, and Baseline: Keyword Filtering]

Semantic strategies outperform the keyword-based filtering regarding all metrics.
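To make the contrast between the two kinds of filtering concrete, here is a minimal Python sketch. It assumes that entity extraction has already been done by some external service and that each tweet carries a set of entity identifiers; the tweets, topic words, and entity names are invented for illustration and do not come from the evaluation.

```python
def keyword_filter(tweets, topic_words):
    """Baseline: keep tweets that contain at least one topic keyword."""
    kept = []
    for tweet in tweets:
        tokens = {tok.strip(".,!?:;").lower() for tok in tweet["text"].split()}
        if tokens & topic_words:
            kept.append(tweet)
    return kept


def semantic_filter(tweets, topic_entities):
    """Semantic strategy (sketch): keep tweets whose extracted entities
    overlap with the entities extracted from the topic description."""
    return [t for t in tweets if t["entities"] & topic_entities]


# Hypothetical data; the "entities" field stands in for an entity extractor.
tweets = [
    {"id": 1, "text": "Julian Assange got arrested",
     "entities": {"Julian_Assange"}},
    {"id": 2, "text": "WikiLeaks founder detained in London",
     "entities": {"WikiLeaks", "London"}},
    {"id": 3, "text": "My bike got stolen, arrest the thief",
     "entities": set()},
]

by_keyword = [t["id"] for t in keyword_filter(tweets, {"assange", "arrest"})]
by_entity = [t["id"] for t in semantic_filter(tweets, {"Julian_Assange", "WikiLeaks"})]
```

In this toy case the keyword baseline misses tweet 2 and lets the unrelated tweet 3 through, while the entity-based filter keeps tweets 1 and 2.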

73

Filtering evaluation

The semantic strategy is more robust and achieves higher precisions for complex topics.

[Two line charts of Precision@30 and Recall: one against the number of entities extracted from the initial topic description (1 to 4), one against the number of words in the initial topic description (1 to 5)]
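For readers less familiar with the metrics named in these charts, the following Python sketch shows the standard definitions of precision at a cutoff k and of recall. The ranked list and the relevance judgments are invented for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall(retrieved, relevant):
    """Fraction of all relevant items that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical ranking of tweet ids and hypothetical relevance judgments.
ranked = ["t1", "t2", "t3", "t4"]
relevant = {"t1", "t3", "t9"}

p_at_2 = precision_at_k(ranked, relevant, 2)
rec = recall(ranked, relevant)
```

Precision@30, as used on the slide, is the same computation with k = 30.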

74

Faceted search evaluation

[Bar chart comparing three strategies: frequency-based faceted search, hashtag-based faceted search, and hashtag-based keyword search]

with semantic enrichment without semantic enrichment

The semantic faceted search strategy improves the search performance by 34.8% and 22.4%.

75

Faceted search evaluation

Strategies with semantic enrichment outperform those without in predicting appropriate facet-values.

[Bar chart comparing Personalized, Time-sensitive, Frequency, and hashtag-based strategies on MRR and on two rank-cutoff metrics (at 5 and at 10)]

with semantic enrichment without semantic enrichment

76

Lessons

The context: a (Twitcident-inspired) framework for filtering, searching, and analyzing information about incidents that people publish on Twitter. We have seen how to obtain
•  better filtering of Twitter messages for a given incident,
•  better search for relevant information about an incident within the filtered messages.

For these first steps in processing Twitter messages, the semantic interpretation is the key element that we need to understand for the given context.

77

Semantics for enrichment and linking

78

Semantics for enrichment and linkage

In [ESWC2011] we focused more on the semantics for enrichment and linkage to connect the tweets to background knowledge and thus enhance what we can learn from them.

79

SI Sportsman of the year: Surprise French Open champ Francesca Schiavone Thirty in women's tennis is primordially old, … news article

topic:Sports topic:Sports

topic:Tennis

person:Francesca_Schiavone

oc:SportsGame

event:FrenchOpen

francesca is becoming #sport idol of the year!

microblog post

user

enrichment enrichment

user modeling

linkage

Profile Topics of interest: - topic:Tennis - topic:Sports People of interest: - person:Francesca_Schiavone Events of interest: - event:FrenchOpen

Example: Semantic enrichment of Twitter posts

80

SI Sportsman of the year: Surprise French Open champ Francesca Schiavone Thirty in women's tennis is primordially old, … news article

francesca is becoming #sport idol of the year!

microblog post

user linkage

How?

Goal in this linkage discovery is to identify news resources that are related to a given Twitter message:
1.  Web resource has to be related to the given tweet
2.  Web resource has to be related to news

Linkage discovery

81

Francesca Schiavone is sportsman of the year #sport #tennis

Content-based

SI Sportsman of the year: Surprise French Open champ Francesca Schiavone Thirty in women's tennis is primordially old…

Francesca Schiavone is sportsman of the year #sport #tennis

Hashtag-based Petkovic & Goerges leading German tennis revival there are signs that German tennis is…


Linkage discovery strategies

82

nice! http://bit.ly/eiU33c URL-based

SI Sportsman of the year: Surprise French Open champ Francesca Schiavone Thirty in women's tennis is primordially old…

news article URL

Entity-based

Olympic champion and world number nine Elena Dementieva announced her retirement The 29-year-old Russian delivered the shock news after losing to Francesca Schiavone in the group stages of the season-ending tournamen …

news article

Entity-based

Francesca Schiavone is sportsman of the year #sport #tennis temporal constraint

Old news L publish date

publish date

•  URL-based (Strict): only consider content of the Twitter message

•  URL-based (Lenient): also consider reply or re-tweet messages

Linkage discovery strategies
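The URL-based and entity-based strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation from [ESWC2011]: entities are assumed to be pre-extracted, short URLs are matched directly instead of being resolved, and the article ids, dates, and the 3-day window are hypothetical.

```python
import re
from datetime import datetime, timedelta

URL_PATTERN = re.compile(r"https?://\S+")


def url_based(tweet, articles):
    """URL-based (strict): link to articles whose URL appears in the tweet text.
    A real system would first resolve shortened URLs; this sketch does not."""
    urls = set(URL_PATTERN.findall(tweet["text"]))
    return [a["id"] for a in articles if a["url"] in urls]


def entity_based(tweet, articles, window=None):
    """Entity-based: link to articles that share at least one entity with the
    tweet. If a window is given, apply it as a temporal constraint so that
    old news is not linked."""
    linked = []
    for article in articles:
        if not (tweet["entities"] & article["entities"]):
            continue
        if window is not None and abs(tweet["time"] - article["published"]) > window:
            continue
        linked.append(article["id"])
    return linked


# Hypothetical articles and tweets, modelled on the slide examples.
articles = [
    {"id": "si-sportsman", "url": "http://bit.ly/eiU33c",
     "entities": {"Francesca_Schiavone", "French_Open"},
     "published": datetime(2010, 12, 1)},
    {"id": "dementieva-retires", "url": "http://example.org/dementieva",
     "entities": {"Elena_Dementieva", "Francesca_Schiavone"},
     "published": datetime(2010, 10, 29)},
]
tweet_with_url = {"text": "nice! http://bit.ly/eiU33c", "entities": set(),
                  "time": datetime(2010, 12, 2)}
tweet_with_entity = {
    "text": "Francesca Schiavone is sportsman of the year #sport #tennis",
    "entities": {"Francesca_Schiavone"}, "time": datetime(2010, 12, 2)}

via_url = url_based(tweet_with_url, articles)
via_entity = entity_based(tweet_with_entity, articles)
via_entity_recent = entity_based(tweet_with_entity, articles, window=timedelta(days=3))
```

Without the temporal constraint the entity-based strategy also links the older retirement article; with the window only the recent article remains.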

83

Evaluation on linkage discovery

[Bar chart: precision of linkage discovery per strategy: Content-based (bag-of-words), Hashtag-based, Entity-based (without temporal constraints), Entity-based, URL-based (lenient), URL-based (strict); values range from about 0.10 to about 0.81]

URL-based strategies offer good linkage.

84

Analysis on linkage discovery and semantic enrichment

•  URL-based strategies: more than 10 tweet-news relations for c.a. more than 1000

•  Entity-based strategy: found a far higher number of tweet-news relations

•  Hashtag-based strategy failed for more than 79% of the users because of the limited usage of hashtags

•  Combination of all strategies: more than 10 tweet-news relations found for more than 20% of the users

Entity-based URL-based

Hashtag-based

Combination

Combined strategies perform better.

85

Lessons

There is good background knowledge out there, if we are able to understand how it connects to the domain and context we are considering. Many applications can share the same enrichment and linking, but not all. With common descriptions of the problem, we can share enrichment and linking (more) effectively.

86

Social Web for profiles

87

Challenge: Social web for profiles

An ambition often seen in conferences like this one is to exploit semantically enriched Social Web knowledge for the purpose of creating or enhancing user profiles. These profiles can then be used for adaptation and personalization.

88

Components for profiling

For applications such as personalized news recommendation, like in our [UMAP2011] work, components for profiling can be carefully selected and assembled. It can also help the development of the deeper understanding and theory about how to link the data to background knowledge and thus make sense of the data.

89

Library

GeniUS [JIST2011] is a topic and user modeling software library that

• produces semantically meaningful profiles, to enhance the interoperability of profiles between applications;

• provides functionality for aggregating relevant information about a user from the Social Web;

• generates domain-specific user profiles according to the information needs of different applications;

• is flexible and extensible to serve different applications.

90

GeniUS: Generic Topic and User Modeling Library for the Social Semantic Web

Item Fetcher Enrichment Weighting

Function

RDF Repository

Filter

Modeling Configuration

RDF Serialization

Social Web

Semantic Web

user data items

enriched items

semantic data

user profiles

interested in:

location product

91

1. Temporal Constraints: (a) time period (b) temporal patterns
2. Profile Type: (a) hashtag-based (b) entity-based (c) topic-based
3. Semantic Enrichment: (a) tweet-based (b) further enrichment
4. Weighting Scheme: (a) concept frequency

User Modeling Building Blocks

92

User modeling with rich semantics: interested in:

people topics events … linkage user profile construction

#sport

person:Francesca_Schiavone

topic:Sports

event:FrenchOpen

topic:Tennis

time

weekday weekend

Profile types

• hashtag-based • topic-based • entity-based

enrichment • tweet-only • exploitation of external news resources

temporal patterns

• specific time period • temporal pattern • no constraints

User profile construction
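The building blocks can be combined as in the following Python sketch: choose a profile type, optionally apply a weekday/weekend constraint, and weight concepts by their relative frequency. The function name, the item layout, and the sample items are assumptions for this illustration and not the GeniUS API.

```python
from collections import Counter
from datetime import datetime


def build_profile(items, profile_type="entity", day_filter=None):
    """Build a concept-frequency profile from enriched items.

    items: list of dicts with a 'time' (datetime) and one concept list per
           profile type ('hashtag', 'entity', 'topic').
    day_filter: None, 'weekday', or 'weekend' (a simple temporal pattern).
    """
    counts = Counter()
    for item in items:
        is_weekend = item["time"].weekday() >= 5  # Saturday=5, Sunday=6
        if day_filter == "weekend" and not is_weekend:
            continue
        if day_filter == "weekday" and is_weekend:
            continue
        counts.update(item.get(profile_type, []))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {concept: n / total for concept, n in counts.items()}


# Hypothetical enriched tweets; 2011-01-31 is a Monday, 2011-01-29 a Saturday.
items = [
    {"time": datetime(2011, 1, 31), "hashtag": ["#sport"],
     "entity": ["person:Francesca_Schiavone"],
     "topic": ["topic:Sports", "topic:Tennis"]},
    {"time": datetime(2011, 1, 29), "hashtag": [],
     "entity": ["person:Francesca_Schiavone", "event:FrenchOpen"],
     "topic": ["topic:Tennis"]},
]

entity_profile = build_profile(items, "entity")
weekend_topics = build_profile(items, "topic", day_filter="weekend")
```

Concept frequency is the only weighting scheme shown here, matching the single option listed on the slide.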

93

RDF Gears UI

94

RDF Gears Plugin Architecture

95

[Two log-scale plots over user profiles (1 to 1000), each comparing News-based and Tweet-based enrichment: entities per user profile, and distinct topics per user profile]

Entity-based profiles Topic-based profiles

profiles enriched with external news resource

profiles enriched with external news resource

By exploiting the linkage between tweets and news articles, we get more distinct entities / topics (semantics)!

Richer semantics through linking strategies.

Analysis of profile characteristics

96

Lessons

For profiles, we observed:
•  Semantic enrichment allows for richer user profiles.
•  Profiles change over time (hashtag-based ones more so): fresh profiles seem to better reflect current user demands.
•  Temporal patterns: weekend profiles differ significantly from weekday profiles (more than day/night).

For personalized news recommendation, we learned:
•  Best user modeling strategy: entity-based > topic-based > hashtag-based.
•  Semantic enrichment improves recommendation quality.
•  Adapting to temporal context helps for the topic-based strategy.

97

Social Web for augmentation

98

Augment with what is there

Systems can use technology to augment their knowledge with data from the Social Web. Lessons learned show that for adaptive systems on the Social Web there is a lot of knowledge (easily) available, from other systems and other domains. Understanding how to leverage it, even to a basic level, can bring a lot.

99

Cross-system augmentation

100

Cross-system profiles

An example to show the added value of ‘cross-system’ on the Social Web is the work in [UMUAI 2013] where interweaving of public profiles is studied.

101

User data on the Social Web

Cross-system user modeling on the Social Web

102

Input: Google Profile URI, e.g. http://google.com/profile/XY

1.  Account Mapping: get other accounts of the user (SocialGraph API)
2.  Social Web Aggregator: aggregate public profile data (social networking profiles, blog posts, bookmarks, other media)
3.  Profile Alignment: map profiles to target user model (FOAF, vCard)
4.  Semantic Enhancement: enrich data with semantics (WordNet®)
5.  Analysis and user modeling: generate user profiles

Output: aggregated, enriched profile (e.g., in RDF or vCard)

Interweaving public user data with Mypes

103

1.  Characteristics of distributed tag-based profiles:
•  Overlap of tag-based profiles, which an individual user creates at different services, is low
•  Aggregated profiles reveal significantly more information (regarding entropy) than service-specific profiles

2.  Performance of cross-system user modeling for cold-start recommendations:
•  Cross-system UM leads to tremendous (and significant) improvements of the tag and bookmark recommendation quality
•  To optimize the performance one has to adapt the cross-system strategies to the concrete application setting
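The two profile characteristics named in the first lesson can be computed as in this Python sketch. It assumes Shannon entropy over tag frequencies and Jaccard overlap of tag sets; the study in [UMUAI 2013] may define both measures differently, and the two service profiles are invented.

```python
import math
from collections import Counter


def entropy(profile):
    """Shannon entropy (in bits) of a tag-frequency profile."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in profile.values() if n > 0)


def overlap(profile_a, profile_b):
    """Jaccard overlap of the tag sets of two profiles (an assumed measure)."""
    tags_a, tags_b = set(profile_a), set(profile_b)
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0


# Hypothetical tag-based profiles of one user at two different services.
service_a = Counter({"semanticweb": 2, "python": 2})
service_b = Counter({"delft": 1, "python": 1})
aggregated = service_a + service_b

low_overlap = overlap(service_a, service_b)
gain = entropy(aggregated) - entropy(service_a)
```

In this toy case the two profiles share only one of three tags, and the aggregated profile has a higher entropy than either service-specific profile.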

http://persweb.org

Lessons

104

Location estimation

Another nice example follows from our work in the ImREAL project on augmentation (of adaptation) with the Social Web.

105

Improved location estimation by mixing Social Web streams

+ =

external data sources:

Enriching the image’s textual meta-data with the user’s tweets improves the accuracy of the location estimation.

106

Accuracy of social web metadata

This work has also drawn attention to the accuracy of Social Web metadata. There are many reasons why this data cannot be taken as the universal truth. In application- and domain-specific contexts, we need to understand the accuracy of social metadata. Also, the work of [Rout et al. 2013] on location estimation based on social ties shows the feasibility as well as the context-dependency.

107

Linked Open Data for augmentation

108

LOD and cross-system

With these results in hand, in our [ICWE2012] work, we considered cross-system modeling with Linked Open Data, with the aim to understand how Linked Open Data background knowledge can be leveraged for cross-system and cross-domain augmentation.

109

Johannes Vermeer

dbpedia:Louvre Looking forward to visit Paris next week!

dbpedia:Paris

The lacemaker

The astronomer

Recommending Points of Interest

110

[Diagram: user data from the Social Web yields concepts (c1, c2, c3, …); background knowledge from Linked Data (graph structures) connects them to further concepts (cx, cy, …); weighting strategies then produce a User Profile of concept weights (e.g. 0.4, 0.1, 0.2, …) for an application that demands a user interest profile regarding particular concepts.]

LOD-based User Modeling

111

tags: girl with pearl earring geo: The Hague

dbpedia:Girl_with_pearl_earring

A  

Artifact

B  

The lacemaker

C  

The astronomer

…  

rdf:type

Johannes Vermeer foaf:maker

foaf:maker

Strategies for exploiting the RDF-based background knowledge graph

dbpedia:The_Hague

dbpedia:Louvre dbpprop:location locatedIn
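One possible strategy for exploiting such a background knowledge graph is to spread weight from the concepts found in the user data to neighbouring concepts. The Python sketch below does this over a toy triple list; the triples, the `ex:locatedIn` property name, the hop limit, and the decay factor are all assumptions for illustration and not the strategies evaluated in [ICWE2012].

```python
from collections import defaultdict

# Toy triples inspired by the slide; identifiers are illustrative only.
TRIPLES = [
    ("dbpedia:Girl_with_pearl_earring", "foaf:maker", "dbpedia:Johannes_Vermeer"),
    ("dbpedia:The_Lacemaker", "foaf:maker", "dbpedia:Johannes_Vermeer"),
    ("dbpedia:The_Astronomer", "foaf:maker", "dbpedia:Johannes_Vermeer"),
    ("dbpedia:The_Lacemaker", "ex:locatedIn", "dbpedia:Louvre"),
    ("dbpedia:The_Astronomer", "ex:locatedIn", "dbpedia:Louvre"),
    ("dbpedia:Louvre", "dbpprop:location", "dbpedia:Paris"),
]


def build_graph(triples):
    """Undirected adjacency map over subjects and objects."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def expand_profile(seed, triples, hops=2, decay=0.5):
    """Spread seed weights to concepts up to `hops` edges away.
    Weight reaching a new concept from several paths is summed."""
    graph = build_graph(triples)
    profile = dict(seed)
    frontier = dict(seed)
    for _ in range(hops):
        reached = {}
        for concept, weight in frontier.items():
            for other in graph[concept]:
                if other in profile:
                    continue
                reached[other] = reached.get(other, 0.0) + weight * decay
        profile.update(reached)
        frontier = reached
    return profile


# Seed concepts: one from a photo tag, one from a tweet (both hypothetical).
combined = expand_profile(
    {"dbpedia:Girl_with_pearl_earring": 1.0, "dbpedia:Paris": 1.0}, TRIPLES)
tweet_only = expand_profile({"dbpedia:Paris": 1.0}, TRIPLES)
```

With both seeds, the two Vermeer paintings in the Louvre receive weight from two directions, which is one way to read why combining user data sources helps.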

112

Lessons

With LOD-based user modeling on the Social Web, different strategies for exploiting RDF-based background knowledge are possible.

Findings:
•  Combination of different user data sources (Flickr & Twitter) is beneficial for the user modeling performance.
•  User modeling quality increases the more background knowledge one considers.
•  Combination of strategies achieves the best performance.

To investigate further: dependency of strategies on entities and relationships, and temporal effects (e.g. temporal relationships or upcoming trends).

113

Interlinked online society

If you take a semantic technology perspective, then strong interlinking could be the direction to go. [Passant et al. 2009] study applying semantic technologies to social media, creating a Web where data is socially created and maintained through end-user interactions, but is also machine-readable and therefore open towards sophisticated queries and large-scale information integration. The result is “Social Semantic Information Spaces”, where any social data is a component in a worldwide collective intelligence ecosystem.

114

Origin of semantics

These social semantic spaces can trigger us in UMAP to articulate where we see the role and origin of semantics. Making all social data available ‘with semantics’ or observing that a lot of semantics is (only) effective in a specific domain or application? Experience showing the fine-grained nature of effects suggests the latter.

115

Human-enhanced

116

Humans & adaptive faceted search

An important element in the process of sense-making is its hybrid nature: humans involved in the sense-making. The control rooms have shown us that the human aspect in search is crucial, for judgment and interpretation. In our [ISWC2011] work, we looked at adaptive faceted search.

117

Adaptive faceted search framework

Adaptive Faceted Search

Twitter posts

Semantic Enrichment

User and Context Modeling

user

How to adapt the facet-value pair ranking to the

current demands of the user?

How to represent the content of a

tweet? facet extraction

118

Facet extraction and semantic enrichment

@bob: Julian Assange got arrested

Julian Assange

Julian Assange Tweet-based enrichment

Julian Assange arrested Julian Assange, the founder of WikiLeaks, is under arrest in London…

Link-based enrichment

Julian Assange

London

WikiLeaks

Julian Assange Julian Assange

London WikiLeaks


119

Impact of Link-based enrichment

Representation of tweets:

significantly more facets per tweet with link-based enrichment

120

Faceted search strategies

Goal: the most relevant facet-value pair (FVP) should appear at the top of the ranking.

Faceted search strategies:
1.  Occurrence frequency: count occurrence frequencies of FVPs
2.  Personalization: adapt ranking to user profile (e.g. user tweeting history)
3.  Diversification: increase variety among the top-ranked FVPs
4.  Time-sensitivity: adapt FVP ranking to temporal context

Semantic enrichment: (i) tweet-based and (ii) link-based enrichment

Locations 1.  Aachen 2.  Aalborg 3.  Aalesund 4.  Aarhus … 2145. Eindhoven

Locations 1.  Eindhoven 2.  Delft 3.  Amsterdam 4.  Rotterdam 5.  London …

Link-based enrichment and occurrence-based and personalized rankings have large effect.
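The difference between an occurrence-based and a personalized ranking of facet-value pairs can be illustrated with a short Python sketch. The scoring formula with its `boost` parameter, the tweets, and the user profile are assumptions made for this example; the [ISWC2011] work may score FVPs differently.

```python
from collections import Counter


def fvp_counts(tweets):
    """Count in how many tweets each facet-value pair (FVP) occurs."""
    return Counter(fvp for tweet in tweets for fvp in tweet)


def rank_by_frequency(tweets):
    """Occurrence frequency: most frequent FVPs first (ties broken by name)."""
    counts = fvp_counts(tweets)
    return sorted(counts, key=lambda fvp: (-counts[fvp], fvp))


def rank_personalized(tweets, user_profile, boost=2.0):
    """Personalization (sketch): scale each count by the user's interest."""
    counts = fvp_counts(tweets)

    def score(fvp):
        return counts[fvp] * (1.0 + boost * user_profile.get(fvp, 0.0))

    return sorted(counts, key=lambda fvp: (-score(fvp), fvp))


# Each tweet is represented by the set of FVPs extracted from it (hypothetical).
tweets = [
    {("Location", "Amsterdam"), ("Person", "Julian Assange")},
    {("Location", "Amsterdam")},
    {("Location", "Eindhoven")},
]
profile = {("Location", "Eindhoven"): 1.0}  # e.g. derived from tweeting history

by_frequency = rank_by_frequency(tweets)
personalized = rank_personalized(tweets, profile)
```

For a user whose history points to Eindhoven, that location moves to the top of the personalized ranking even though Amsterdam occurs more often.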

121

Twitcident.com Twitter-based crisis management system

Semantic enrichment allows for:
1.  Grouping tweets into incidents
2.  Faceted search
3.  Thematic views
4.  Analysis

122

Lessons

Semantic enrichment allows for structured representation of the content of tweets: a good basis for faceted search.

Faceted search performs significantly better than hashtag-based keyword search.

Different building blocks for making faceted search on Twitter adaptive improve the search quality:

•  Link-based enrichment: more discoverable tweets, better search performance.
•  Personalization leads to significant improvements.
•  Time-sensitivity improves performance as well.

123

Redundancy reduction

124

Duplicate detection

To reduce the volume of social data, it is important to categorize the social chatter and reduce redundancy in information. In our [WWW2013] work we have considered duplicate detection.

125

Twitter is more like a news medium. How do people search on Twitter? [Teevan et al. 2011] have shown how this is characterized by repeated queries & monitoring for new content.

Problems:

•  Short tweets → lots of similar information.
•  Few people produce content → many retweets, copied content.

Search and retrieval on Twitter

126

Near-duplicates in Twitter search

Analysis of the Tweets2011 corpus (TREC microblog track) [WWW2013]

[Chart: distribution of duplicate levels]
•  Exact copy: 1.89%
•  Nearly exact copy: 9.51%
•  Strong near-duplicate: 21.09%
•  Weak near-duplicate: 48.71%
•  Low overlapping: 18.80%

•  For the 49 topics (queries), 2,825 topic-tweet pairs are relevant.
•  We manually labeled 55,362 tweet pairs.
•  We found 2,745 pairs of duplicates at different levels.

127

Twinder Framework Search infrastructure

[Architecture diagram of the Twinder Search Engine. Components: Social Web Streams, Feature Extraction Task Broker, Cloud Computing Infrastructure, Feature Extraction (Syntactical Features, Semantic Features, Contextual Features, Further Enrichment), Index, Relevance Estimation (Keyword-based Relevance, Semantic-based Relevance), Duplicate Detection and Diversification, Search User Interface. Flows shown: messages, feature extraction tasks, query, results, feedback, users.]

128

Lessons

Analyzing duplicate content in Twitter, we inferred a model for categorizing different levels of duplication.

We developed a near-duplicate detection framework for microposts and for categorizing the duplication level of tweet pairs.

Given the duplicate detection framework, we performed extensive evaluations and analyses of different duplicate detection strategies.

Our approach enables search result diversification, which also helps to avoid ‘bubble effects’, and we analyzed the impact of the diversification on the search quality.

Follow Twinder progress: http://wis.ewi.tudelft.nl/twinder/
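As a very reduced illustration of near-duplicate detection followed by diversification, the Python sketch below uses only word overlap (Jaccard similarity) between tweets. The actual framework in [WWW2013] also draws on semantic and contextual features; the threshold of 0.6 and the sample tweets are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased word tokens, keeping hashtags and mentions."""
    return set(re.findall(r"[a-z0-9#@]+", text.lower()))


def jaccard(text_a, text_b):
    """Word-overlap similarity between two tweets."""
    a, b = tokens(text_a), tokens(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversify(ranked_tweets, threshold=0.6):
    """Walk down the ranking and drop tweets that are too similar to a
    higher-ranked tweet that was already kept."""
    kept = []
    for tweet in ranked_tweets:
        if all(jaccard(tweet, earlier) < threshold for earlier in kept):
            kept.append(tweet)
    return kept


# Hypothetical search results, best-ranked first.
t1 = "Julian Assange arrested in London"
t2 = "RT @bob: Julian Assange arrested in London"
t3 = "WikiLeaks founder appears in court"

results = diversify([t1, t2, t3])
```

The retweet is removed because it shares five of seven distinct tokens with the first result, while the third tweet adds new information and is kept.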

129

Take home from technology research

With semantics and humans, Social Web can help:
•  Semantics beneficial for filtering & search and enrichment & linking.
•  Semantic-enriched tweets beneficial for profiles and adaptation.
•  Social Web & Linked Data beneficial for cross-system augmentation.
•  Adaptive faceted search and duplicate detection beneficial for human-enhanced processing.

For adaptive systems that rely on profiling, Social Web is a fertile source for more knowledge.

ImREAL research & experiences elegantly show principles, as well as the detailed work in domain & application:
•  Social Web & LOD usage is context-specific.
•  Big Data in need of Small Interpretations.

130

APPLICATION

HUMANS FOR AUGMENTATION

USERS DOMAIN

DOMAIN Augmented with Web Semantics

USERS Augmented with Web Semantics

REAL DOMAIN

REAL USERS

131

Take home from technology research

The human intelligence is to be arranged differently:
•  We have moved from understanding the system a priori to understanding the system on the fly.
•  We have moved from careful manual analysis beforehand to machines doing the analysis on the fly.
•  The critical and context-specific approach to (small) data, about domain and users, is a part of process and system we now need to (re-)include.
•  This task of the designer has now shifted to a task for the human interpretation inside the hybrid system: human monitoring inside.

132

133

Challenges with sense-making

134

Not one truth

135

In reality, not one truth

In the beginning, social systems like Twitter were used as ‘the’ semantic source of knowledge with an implicit assumption that Twitter is one voice. Over time, researchers have begun to investigate how to identify and interpret different voices and viewpoints in such a source. Differences in viewpoints and opinions are a subject of study, but until now their leverage has been limited.

136

Diversity and beliefs

[Flock et al. 2011] study the different backgrounds, mindsets and biases of Wikipedia contributors, to understand the effects, positive and negative, of this diversity on the quality of the Wikipedia content, and on the sustainability of the overall project.
•  Analysis and approach for diversity-minded content management within Wikipedia.

[Bhattachanya et al. 2012] estimate beliefs from posts made on social media, to monitor the level of belief, disbelief and doubt related to specific propositions.

137

Include the negative

Diversity of viewpoints and opinions also suggests including negative links in the approach. [Symeonidis et al. 2010] give an example of how to include negative links into friend recommendation approaches, but this goes much further. The accuracy improvement they observe can be taken as a general principle: accuracy can be gained by using information about both positive and negative edges.

138

ViewS

Modelling Viewpoints in User Generated Content

[Diagram components: Text processing; Semantic enrichment; Ontology (activity aspects to analyse); Viewpoint extraction (attention focus); Viewpoint exploration]

139

Viewpoints in YouTube

Example viewpoints in user comments on job interview videos

Comparing the viewpoints around ‘anger’ of young users (left) and old users (right)

140

Not the truth

141

Truth is not always truth

Just as this source of knowledge is not a single voice, it is also clear that it may not consist of ‘true’ knowledge alone.

142

Malicious profiles

For example, profiles can be suspicious and made for the wrong reasons. In the context of online dating, [Pizzato et al. 2012] have observed the need to understand the sensitivity of recommender algorithms to scammers. With people being the items to recommend, fraudulent profiles can have a serious impact on recommender algorithms. Identifying and detecting fraudulent profiles is a new challenge for us.

143

Identity theft

Another aspect to ‘wrong profiles’ relates to identity disambiguation and theft.

[Rowe et al. 2010] consider malevolent web practices such as identity theft and lateral surveillance. They study techniques for web users to identify all web resources that cite them and, if necessary, remove the sensitive information.

144

Credibility of social content

The credibility of messages in social networks is, for example, studied in [Seth et al. 2010] on stories from Digg. Their model is based on theories developed in sociology, political science and information science. [Cramer et al. 2008] have nicely drawn attention to trust. The study of social content credibility and trust is important, and calls for cross-discipline effort.

145

Privacy

A lot can be said about privacy in these networks, for example Facebook. [Bachrach et al. 2012] show how users’ activity on Facebook (related to privacy) relates to their personality, as measured by the standard Five Factor Model. A nice example of understanding how Facebook features relate to interesting aspects of users and usage.

146

Cultural variations

147

Cultural diversity

Studying diversity is not just relevant for understanding how Twitter content is to be interpreted. It is also relevant for understanding how the Social Web is used and can be used with a purpose. Cultural diversity is here one of the most interesting aspects and perhaps also one of the most challenging ones.

148

Cultural diversity

A subject addressed in ImREAL. Components are made available as services in ImREAL for augmented user modeling, e.g. for simulation designers.

149

150

Hofstede’s cultural dimensions

Describes stereotypical cultural characteristics of nationalities, with scores relative to other nationalities.

Five core dimensions:

•  Individualism versus Collectivism (IDV)
•  Power Distance (PDI)
•  Masculinity versus Femininity (MAS)
•  Uncertainty Avoidance (UAI)
•  Long-Term Orientation (LTO)

geert-hofstede.com

151

Analysis

• Datasets
    •  Microblog data collected over a period of three months
    •  22 million microposts from Sina Weibo and 24 million from Twitter
    •  a sample of 2616 Sina Weibo users and 1200 Twitter users

• Analyze and compare user behavior
    •  on two levels: (i) the entire user population and (ii) individual users
    •  from different angles: (i) syntactic, (ii) semantic, (iii) sentiment and (iv) temporal analysis

152

Syntactic analysis

[Figure: average number of hashtags/URLs per post (logarithmic axis, 0.01 to 1) against the percentage of users (0% to 100%); curves: Hashtag-Weibo, URL-Weibo, Hashtag-Twitter, URL-Twitter]

Hashtags and URLs are less frequently applied on Sina Weibo than on Twitter.

Users on Twitter are more triggered by hashtags and URLs when propagating information than on Sina Weibo.

high collectivism in Weibo, a high individualism in Twitter
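The per-user measure in this syntactic analysis (average number of hashtags and URLs per post) can be sketched as follows. This is a minimal illustration, not the code used in the study: the regular expressions are simplified assumptions, the function name `avg_markers_per_post` is hypothetical, and platform-specific conventions (such as Sina Weibo's `#topic#` hashtag form) are only roughly covered.

```python
import re
from collections import defaultdict

# Simplified patterns (assumptions, not the ones used in the study).
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")


def avg_markers_per_post(posts):
    """Return {user_id: (avg_hashtags_per_post, avg_urls_per_post)}.

    posts: iterable of (user_id, text) pairs.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # hashtags, urls, post count
    for user, text in posts:
        urls = URL.findall(text)
        # Remove URLs first so that '#fragment' parts are not counted as hashtags.
        without_urls = URL.sub(" ", text)
        entry = totals[user]
        entry[0] += len(HASHTAG.findall(without_urls))
        entry[1] += len(urls)
        entry[2] += 1
    return {u: (h / n, k / n) for u, (h, k, n) in totals.items()}


sample = [
    ("a", "#umap rocks http://x.org/#frag"),
    ("a", "plain text only"),
    ("b", "#link #like"),
]
averages = avg_markers_per_post(sample)
```

Sorting users by these averages and plotting them against the cumulative percentage of users would give a curve of the kind shown on the slide.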

153

Semantic analysis

The topics that users discuss on Sina Weibo are to a large extent related to locations and persons. In contrast to Sina Weibo, users on Twitter are talking more about organizations (such as companies, political parties).

[Figure: average number of entities per post (logarithmic axis, 0.001 to 10) against the percentage of users (0% to 100%); curves: Weibo, Twitter]

low employee commitment to an organization in China - high long-term orientation.

154

Sentiment analysis

Sina Weibo users have a stronger tendency to publish positive messages than Twitter users.

[Figure: ratio of positive posts (0% to 100%) against the percentage of users (0% to 100%); curves: Weibo, Twitter; regions labelled ‘more negative posts’ and ‘more positive posts’]

high long-term orientation.

155

Combined semantic sentiment analysis

The difference is amplified when discussing ‘people’ or ‘location’, with Sina Weibo users even more positive and Twitter users more negative.

more long-term orientation in Weibo, more short-term orientation in Twitter

156

Temporal analysis

Twitter users repost messages faster than Sina Weibo users.

$\text{time distance} = t_{\text{repost}} - t_{\text{original post}}$

[Figure: time distance in hours (logarithmic axis, 0.1 to 1000) against the percentage of users (0% to 100%); curves: Weibo, Twitter]

large degree of power distance in Weibo, small one in Twitter
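The time distance used in this temporal analysis (time of the repost minus time of the original post) can be computed per user along these lines. A minimal sketch: the input layout and the function names are assumptions, and timestamps are taken to be comparable `datetime` values in the same time zone.

```python
from collections import defaultdict
from datetime import datetime


def time_distance_hours(t_original, t_repost):
    """Time distance = t_repost - t_original post, expressed in hours."""
    return (t_repost - t_original).total_seconds() / 3600.0


def avg_time_distance_per_user(reposts):
    """Return {user_id: average time distance in hours}.

    reposts: iterable of (user_id, t_original, t_repost) triples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, t_original, t_repost in reposts:
        sums[user] += time_distance_hours(t_original, t_repost)
        counts[user] += 1
    return {u: sums[u] / counts[u] for u in sums}


reposts = [
    ("u1", datetime(2012, 1, 1, 12, 0), datetime(2012, 1, 1, 13, 30)),
    ("u1", datetime(2012, 1, 2, 9, 0), datetime(2012, 1, 2, 9, 30)),
    ("u2", datetime(2012, 1, 1, 0, 0), datetime(2012, 1, 3, 0, 0)),
]
per_user = avg_time_distance_per_user(reposts)
```

Here user u1 reposts after 1.5 and 0.5 hours, an average of one hour, while u2 takes two days.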

157

Cultural differences in tagging

Other work confirms the findings, and their consistency with theories of cultural differences between Asian and Western cultures. [Dong et al. 2011] look at cultural differences in a tagging system and find that American and Chinese subjects differed in many ways:

• the number and types of tags they applied;
• the extent to which they applied suggested tags or entered new tags of their own; and
• how often they applied tags that originated from a different culture.

158

Cultural variations for Social Q&A

Another example is given by [Yang et al. 2011], who look at cultural differences in people’s social question asking behaviors across the United States, the United Kingdom, China, and India. They analyzed the questions people ask via social networking tools, and their motivations for asking and answering questions online. Results reveal culture as a consistently significant factor in predicting people’s social question and answer behavior.

159

Real-time variations

160

Understand the source

When using the knowledge from Twitter as a semantic source, especially if it is the only semantic source, there are a few things one needs to consider that relate to the real-time nature of social contributions. The ‘knowledge’ is not unambiguous: inconsistency, moods, etc. Real-time knowledge spreads and evolves fast.

161

Inconsistency & moods

Twitter is used as a semantic sensor, sometimes as the only semantic sensor, but consistency in user contributions like ratings is a concern. [Said et al. 2012] show how users are inconsistent in their ratings and tend to be more consistent for above-average ratings. [De Choudhury et al. 2012] report on the relation between moods and social activity, social relations and participatory patterns like link sharing and conversational engagement.

162

Understanding over time

While Twitter and the like were used in the beginning as ‘fixed’ sources of knowledge, researchers have become interested in the evolution over time. The nature and speed of the flow of content over time have become great objects of study. Two domains that have received fair attention in this light are diseases and (political) news.

163

Flow in disease information

The domain of diseases and outbreaks is getting fair attention. Works by [Gomide et al. 2011] on Dengue and [Diaz-Aviles et al. 2012] on EHEC show how people’s behavior on Twitter can be used for surveillance and tasks such as early warning and outbreak investigation.

164

Flow of news

From [Naveed et al. 2011] we learn how retweets reflect what the Twitter community considers interesting on a global scale. In [Backstrom et al. 2011] we see the differences between communication and observation in Facebook: communication involves a much higher focus of attention than observation activities. We see in [Lerman et al. 2010] how network structure affects the dynamics of how interest in news stories spreads among social networks in Digg and Twitter.

165

Flow in political news

Coming back to our observation of the multiple truths, political news is a great domain to look at. For the context of political speech, [Metaxas et al. 2010] discuss how the real-time nature of Twitter provides disproportionate exposure to personal opinions, fabricated content, unverified events, lies and misrepresentations, with viral spread as a consequence. To act upon that, [Lumezanu et al. 2012] identify extreme tweeting patterns that could characterize users who spread propaganda (political propagandists), e.g. sending high volumes of near-duplicate messages.
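One of the patterns mentioned here, high volumes of near-duplicate messages, can be illustrated with a small sketch. This is not the detection method of [Lumezanu et al. 2012]; it swaps in a common generic technique (Jaccard similarity over word shingles), and the shingle size `k`, the `threshold` and all function names are assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a message (lowercased, punctuation dropped)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_ratio(messages, threshold=0.7, k=3):
    """Fraction of a user's messages that nearly duplicate one of their earlier messages."""
    seen = []
    duplicates = 0
    for message in messages:
        current = shingles(message, k)
        if any(jaccard(current, earlier) >= threshold for earlier in seen):
            duplicates += 1
        seen.append(current)
    return duplicates / len(messages) if messages else 0.0


timeline = [
    "Vote for candidate X now, he is the best choice",
    "Vote for candidate X now, he is the best choice!",
    "Lovely weather in Rome today",
]
ratio = near_duplicate_ratio(timeline)
```

A user whose ratio stays high over a large number of messages would be a candidate for closer inspection; the pairwise comparison is quadratic in the number of messages, so a real pipeline would likely need hashing or indexing.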

166

Temporal effects

In our [WebSci2011] work, we have considered how user interests manifest over time. Most users who are interested in a news topic become interested within a few days.

Lifespan of users’ interest:

• Long-term adopters - continuously interested
• Short-term adopters - interested only for a short period in time (and influenced by “global trends”)

High overlap between early adopters and long-term adopters.

167

Temporal effects

On Twitter the importance of entities for a topic varies over time (long-term vs. short-term entities). In terms of user interests over time, the majority of users become interested in a topic quickly (within a few days). When using Twitter-based profiles for personalization, time-sensitive user modeling improves recommendation quality. Also, the selection of the user modeling strategy should take the type of user into account:

• Long-term adopters: hashtag-based
• Short-term adopters: entity-based
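As an illustration of time-sensitive user modeling, the sketch below builds a concept profile (hashtags or entities) in which recent mentions weigh more than old ones. The exponential decay with a half-life is an assumption chosen for simplicity, not necessarily the weighting used in the work described here; the function name and parameters are hypothetical.

```python
from collections import defaultdict


def time_sensitive_profile(observations, now, half_life_days=7.0):
    """Build a normalized concept profile with exponentially decayed weights.

    observations: iterable of (concept, day) pairs, where day is a number
    on the same scale as `now`. A mention that is `half_life_days` old
    counts half as much as a mention made today (assumed decay model).
    """
    weights = defaultdict(float)
    for concept, day in observations:
        age = max(0.0, now - day)
        weights[concept] += 0.5 ** (age / half_life_days)
    total = sum(weights.values())
    if total == 0:
        return {}
    return {concept: w / total for concept, w in weights.items()}


# A hashtag used today and another one used a week ago.
profile = time_sensitive_profile([("#umap", 10.0), ("#rome", 3.0)], now=10.0)
```

The same function can be fed hashtags for long-term adopters and extracted entities for short-term adopters, following the strategy selection on the slide.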

168

Twitter-based Trend and User Modeling Framework

[Diagram components: Twitter posts; current tweets of Twitter community; trends; user’s interests; time; Profile Type; Semantic Enrichment; Weighting Scheme; Aggregation; news recommender]

169

Temporal effects with trends

For the domain of personalized news recommendations, we have combined trend and user modeling in our framework.

• We have seen how user profiles change over time, under the influence of trends.
• Appropriate concept weighting strategies allow for the discovery of local trends.
• A time-sensitive weighting function is best for generating trend profiles.

Aggregation of trend and user profile can improve the performance of recommendations.
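The aggregation of a trend profile and a user profile can be sketched as a weighted combination of two concept-weight vectors, followed by ranking news items by similarity to the result. This is a hedged illustration only: the linear mix with parameter `alpha`, the cosine-based ranking and all names are assumptions, not the framework's actual implementation.

```python
import math


def aggregate_profiles(user_profile, trend_profile, alpha=0.5):
    """Linear mix of two concept-weight dicts; alpha is the weight of the user profile."""
    concepts = set(user_profile) | set(trend_profile)
    return {
        c: alpha * user_profile.get(c, 0.0) + (1 - alpha) * trend_profile.get(c, 0.0)
        for c in concepts
    }


def cosine(p, q):
    """Cosine similarity between two sparse concept-weight dicts."""
    dot = sum(w * q.get(c, 0.0) for c, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_news(profile, news_items):
    """Return news item ids ordered by decreasing similarity to the profile."""
    return sorted(news_items, key=lambda i: cosine(profile, news_items[i]), reverse=True)


user = {"politics": 1.0}
trend = {"football": 1.0}
combined = aggregate_profiles(user, trend, alpha=0.75)
news = {
    "n1": {"football": 1.0},
    "n2": {"politics": 1.0},
    "n3": {"cooking": 1.0},
}
ranking = rank_news(combined, news)
```

With the user profile alone, the trending football item n1 would score zero; after aggregation it ranks above the unrelated item n3 while the user's own interest still comes first.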

170

Validation

171

Check with the user

With all profiles based on augmentation, it becomes (even more) vital to follow the lessons of checking with the user: engaging with the user in a common process of validating the profile and the assumptions based on it.

172

Perico

Dialogue for Modelling Cultural Exposure using Linked Data

Initial User Model

•  Visited Countries
•  Estimated Cultural Exposure

[Diagram components: Social Web; Sensors; Perico Dialogue Agent; Cultural Fact Extractor; Quiz Generator; User Profile Generator; Dialogue Planner]

Updated User Model

•  Verified Visited Countries
•  Enhanced Cultural Exposure Score

174

Inspect and control

[Knijnenburg et al. 2012] consider how users of social recommender systems may want to inspect and control how their social relationships influence the recommendations they receive: friends are not always “nearest neighbors”. The results show that high inspectability and control indeed increase users’ perceived understanding of and control over the system, their rating of the recommendation quality, and their satisfaction with the system, and thus an overall better user experience.

175

Communities

176

Understanding communities

Attention is given to communities and their dynamics. [Chan et al. 2010] propose a method for analysing user communication roles in discussion forums.

[Schwagereit et al. 2011] study governance in web communities.

[Karnstedt et al. 2011] consider the relation between a user's value within a community - constituted from various user features - and the probability of a user churning.

[Yang et al. 2010] analyze users’ activity lifespan in online knowledge sharing communities: acknowledgement of contributions leads to user survival.

177

Involvement in communities

In order to understand how people behave on the Social Web and in communities, it is relevant to understand their engagement and involvement in more detail. [Lehmann et al. 2012] study how users engage with online services, and how to measure this engagement. [Freyne et al. 2009] look at how social networking sites rely on the contribution and participation of their members, with a focus on early interventions for engagement.

178

Communities and expertise

Understanding communities is also relevant as these communities can act as an additional resource. From finding evidence for profiles, we have seen recent attention shift towards finding people and expertise, for example to enable active engagement of people. For using expertise in UMAP, it is also important to be able to specify expertise, to enable reasoning about the expertise’s quality and fit.

179

Take home from challenges

The (Social) Web tells many stories:

•  Acknowledge multiple truths, opposing truths, and bad intentions.
•  Acknowledge multiple audiences and viewpoints.
•  Acknowledge cultural variations.

The (Social) Web moves fast:

•  Acknowledge the real-time nature of Web and applications.
•  Analyze and understand the flow of information.
•  Analyze and understand the nature of communities.

The (Social) Web includes people:

•  Involve the users actively in validation.
•  Involve (communities of) users in interpretation.

180

181

Social, Web & UMAP

182

Social & UMAP

Huge economic and societal potential for added value. Social Web is a fertile source of knowledge for augmentation.

•  Semantics can be beneficial for social-based augmentation.

•  Hybrid, human-enhanced approaches can be beneficial.

•  Technological feasibility of augmentation.

Research from specific cases towards general theory. Next on the agenda:

•  Describe added value for stakeholders, describe goals.

•  Share and compare research challenges and evaluations.

183

Web & UMAP

UMAP systems are Web systems:

•  The (Social) Web tells many stories.
•  The (Social) Web moves fast.
•  The (Social) Web includes people.

The Web is the real laboratory for UMAP systems. Next on the agenda:

•  Share and compare solutions, components, and systems.
•  Support more uniformity in methods and practices.

184

UMAP & Web

On the (Social) Web, systems are being made:

•  Take positions or prepare to take positions about bad intentions.
•  Take responsibility and recommend about future architectures.

On the (Social) Web, many systems are small:

•  Do (also) consider the specific problems of small and medium sized stakeholders: bring UMAP into practice.

185

UMAP & Social

In SWUMAP, human intelligence is arranged differently:

•  From careful manual analysis a priori, to machine analysis on the fly.

•  Critical and context-specific approach to data is part of the ‘in vivo’ system.

•  Human interpretation of data is inside the hybrid system.

It makes for a new type of system, and one of great value. And plenty of fun and diverse challenges for UMAP.

186

APPLICATION

HUMANS FOR AUGMENTATION

USERS DOMAIN

DOMAIN Augmented with Web Semantics

USERS Augmented with Web Semantics

REAL DOMAIN

REAL USERS

187

APPLICATION

HUMANS FOR AUGMENTATION

USERS DOMAIN

DOMAIN Augmented with Web Semantics

USERS Augmented with Web Semantics

SWUMAP

188

Thanks

Slides made with input from many,

including Alessandro, Claudia, Fabian, Ilknur, Jan, Jasper, Ke, Qi, and Richard from WIS in Delft,

and friends from ImREAL, Net2, SEALINCMedia, and Twitcident.