Ph.D. defense: semantic social network analysis
SEMANTIC SOCIAL NETWORK ANALYSIS
Guillaume Erétéo
Ph.D. thesis defense
Supervisors:
Michel Buffa, Kewi/I3S, UNSA/CNRS
Fabien Gandon, Edelweiss, INRIA Sophia Antipolis
Patrick Grohan, Orange Labs
OUTLINE
1. Context and Scientific Objectives
2. State of the Art on Social Network Analysis & Semantic Social Networks
3. SemSNA: Analysing Social Networks with Semantic Web Frameworks
4. Community Detection: SemTagP, Semantic Tag Propagation
CONTEXT
ISICIL: Information Semantic Integration through Communities of Intelligence onLine
enterprise 2.0, semantic web, business intelligence
multidisciplinary: ergonomists, sociologists, mathematicians, ontologists, computer scientists
ANR-08-CORD-011
SEMANTIC INTRANET OF PEOPLE
"the use of emergent social software platforms within companies, or between companies and their partners or customers" [McAfee 2006]
represent, exchange and analyse data across applications to deliver information in a way that matters to people and to their communities. [Berners-Lee et al., 2001]
SCIENTIFIC OBJECTIVES extend social network analysis with semantic formalisms to reveal and exploit the rich social structures embedded in the emerging social data of web 2.0 applications:
• how to represent, link and access online social networks across applications?
• how to enable classical operators of social network analysis to consider the semantics of these networks?
• how can these semantics be exploited to create new algorithms?
SOCIAL NETWORK ANALYSIS graph algorithms to characterize the structure of a social network, strategic positions/actors, and the distribution of networking activities.
applications:
• monitor information flow
• foster communication
• focus notifications in information systems
• create project teams
• identify experts
SOCIAL NETWORKS AND GRAPHS actors are represented by nodes and relations by edges: G=(V, E), n=|V|, m=|E|
[Figure: example graph with numeric labels (0.5, 1, 1.5, 2, 3, 4) and typed relations: collaborate, colleague, manages, follows, sameInterest]
NETWORK STRUCTURE e.g. density and diameter highlight cohesion of the network
$$\mathrm{diam}(G) = \mathrm{length}(g(e_1, e_2)) \quad \text{such that} \quad \forall e_3, e_4 \in E_G,\ \mathrm{length}(g(e_3, e_4)) \le \mathrm{length}(g(e_1, e_2))$$
[Scott 2000]
[Zachary 1977]
STRATEGIC POSITIONS & ACTORS directed degree differentiates support and influence
$$D_{in}(y) = \left|\{\, x \;;\; \exists (x, y) \in E \,\}\right|$$
[Nieminen 1973]
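The directed degrees can be illustrated with a minimal Python sketch. It assumes the network is given as a list of (source, target) arcs; the function name and the sample "follows" arcs are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def directed_degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of (x, y) arcs."""
    predecessors = defaultdict(set)  # y -> {x such that (x, y) is an arc}
    successors = defaultdict(set)    # x -> {y such that (x, y) is an arc}
    for x, y in edges:
        predecessors[y].add(x)
        successors[x].add(y)
    nodes = set(predecessors) | set(successors)
    in_degree = {n: len(predecessors[n]) for n in nodes}
    out_degree = {n: len(successors[n]) for n in nodes}
    return in_degree, out_degree

# Hypothetical "follows" arcs, for illustration only.
edges = [("ann", "bob"), ("carl", "bob"), ("bob", "dora")]
in_degree, out_degree = directed_degrees(edges)
```

Here `in_degree["bob"]` counts the distinct actors pointing to bob, and `out_degree["bob"]` the distinct actors bob points to.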
STRATEGIC POSITIONS & ACTORS betweenness centrality reveals intermediaries & brokers
[Freeman 1977]
highly strategic position in communication [Shimbel 1953] [Cohn & Marriott 1958] [Burt 1992]
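As a rough illustration, betweenness can be computed by counting geodesics: for every ordered pair (from, to), add the share of shortest paths that go through b. The sketch below is a brute-force version written for readability, not the algorithm used in the thesis; the adjacency-dict input format and the sample path graph are assumptions.

```python
from collections import deque

def count_geodesics(adjacency, source):
    """BFS from source: distance and number of shortest paths to each node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adjacency):
    """Sum over ordered pairs (s, t) of the share of geodesics through b."""
    info = {s: count_geodesics(adjacency, s) for s in adjacency}
    score = {b: 0.0 for b in adjacency}
    for s in adjacency:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for b in adjacency:
                if b in (s, t) or b not in dist_s:
                    continue
                dist_b, sigma_b = info[b]
                # b lies on a geodesic from s to t if the distances add up.
                if t in dist_b and dist_s[b] + dist_b[t] == dist_s[t]:
                    score[b] += sigma_s[b] * sigma_b[t] / sigma_s[t]
    return score

# Hypothetical undirected path a - b - c, stored as symmetric adjacency lists.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = betweenness(adjacency)
```

Because the sum runs over ordered pairs, an undirected link stored in both directions counts each pair twice; halve the scores if unordered pairs are wanted.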
ONLINE SOCIAL DATA ARE MORE COMPLEX TO REPRESENT multiple & spread roles, context, profile, etc. distributed across applications
LINK STRUCTURE IS NOT ENOUGH who has the best betweenness centrality?
[Figure: example network; relation labels: knows in passing, has met, works with, has supervisor]
SEMANTICS MATTER! how can we consider different types of relations?
[Figure: same example network; relation labels: knows in passing, has met, works with, has supervisor]
RESOURCE DESCRIPTION FRAMEWORK make assertions and describe resources with triples (subject, predicate, object), like "the subject, verb and object of an elementary sentence" [Berners-Lee 2001]
ONTOLOGY
"a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application"
[Gruber 1993] [Gruber 2009]
RESOURCE DESCRIPTION FRAMEWORK SCHEMA
set of primitives to define the classes of a domain of knowledge, their taxonomical relations, and the classes of resources to which properties apply
SPARQL PROTOCOL AND RDF QUERY LANGUAGE query language, protocol and format to send queries and exchange results across the web
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name
WHERE {
  ?person rdf:type foaf:Agent .
  ?person foaf:firstName ?name .
}
CLASSIC SNA ON SEMANTIC WEB rich graph representations reduced to simple un-typed graphs for analysis
[Paolillo & Wright 2006]
foaf:knows
foaf:interest
[San Martin & Gutierrez 2009]
[Figure: degree parameterized by a relation type, d<family>(guillaume) = 3; labels shown: properties knows, colleague, coworker, parent, sibling, mother, father, brother, sister; actors Gérard, Fabien, Mylène, Michel, Yvonne]
Comparison criteria: directed network, weighted network, labelled network, parameterized operators, network size
Graph Theory: ✔ ✔ ✔, 10^6 nodes, 10^7 edges
[Brandes 2009]: ✔ ✔ ✔, 10^4 nodes
[Paolillo & Wright 2006]: ✔ ✔, ~10^4 nodes, ~10^5 edges
[San Martin & Gutierrez 2009]: ✔ ✔, ~10^4 nodes, ~10^5 edges
SEMANTIC SNA FRAMEWORK exploit the semantics of social networks and parameterize SNA operators
• parameterized SNA operators
• SPARQL formalization of operators
• SemSNA ontology: annotate social data with results of analyses
PARAMETERIZED DENSITY proportion of the maximum possible number of properties of type <rel> (or subtype)
computed from:
• the number of actors of a given type (or subtype)
• the number of pairs of resources linked by a property of type <rel> (or subtype)
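A minimal Python sketch of a density parameterized by a relation type. It assumes the network is a list of (subject, property, object) triples, that subsumption is given as (child, parent) pairs, and that the maximum possible number of links is n(n-1) ordered pairs of distinct actors; that normalisation, the function names and the sample data are assumptions, not the thesis definition.

```python
def subproperties(rel, sub_property_of):
    """rel plus every property that is (transitively) a subproperty of it."""
    found = {rel}
    changed = True
    while changed:
        changed = False
        for child, parent in sub_property_of:
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def parameterized_density(triples, actors, rel, sub_property_of):
    """Share of ordered actor pairs linked by rel or one of its subtypes."""
    rels = subproperties(rel, sub_property_of)
    linked_pairs = {
        (s, o)
        for s, p, o in triples
        if p in rels and s in actors and o in actors and s != o
    }
    n = len(actors)
    max_pairs = n * (n - 1)  # assumed maximum: every ordered pair is linked
    return len(linked_pairs) / max_pairs if max_pairs else 0.0

# Hypothetical property hierarchy and triples, for illustration only.
sub_property_of = [
    ("parent", "family"), ("sibling", "family"),
    ("father", "parent"), ("mother", "parent"),
    ("sister", "sibling"), ("brother", "sibling"),
]
triples = [
    ("ann", "father", "bob"),
    ("ann", "sister", "eve"),
    ("ann", "colleague", "tom"),
]
actors = {"ann", "bob", "eve", "tom"}
family_density = parameterized_density(triples, actors, "family", sub_property_of)
colleague_density = parameterized_density(triples, actors, "colleague", sub_property_of)
```

Asking for `family` picks up the `father` and `sister` links through subsumption, while `colleague` only sees its own link.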
PARAMETERIZED N-DEGREE number of paths of properties of type <rel> (or subtype) having y at one end and with a length less than or equal to dist
parameterized path: a list of nodes of a graph G, each linked to the next by a relation of type <rel> (or subtype)
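The n-degree can be sketched in Python with a bounded breadth-first search. This is a simplification under stated assumptions: it counts the distinct resources joined to y by a directed parameterized path of length at most dist (in either direction) rather than enumerating individual paths, and it takes the set of accepted properties (rel and its subtypes) as already computed. All names and sample triples are hypothetical.

```python
from collections import defaultdict, deque

def reachable_within(adjacency, start, dist):
    """Nodes reachable from start in at most dist steps (start excluded)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if seen[u] == dist:
            continue  # do not expand beyond the allowed path length
        for v in adjacency[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return set(seen) - {start}

def parameterized_n_degree(triples, y, rels, dist):
    """Count resources linked to y by a path over rels of length <= dist."""
    forward, backward = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p in rels:
            forward[s].add(o)
            backward[o].add(s)
    outgoing = reachable_within(forward, y, dist)   # paths starting at y
    incoming = reachable_within(backward, y, dist)  # paths ending at y
    return len(outgoing | incoming)

# Hypothetical triples, for illustration only.
triples = [
    ("bob", "father", "ann"),
    ("eve", "mother", "ann"),
    ("ann", "sister", "zoe"),
    ("ann", "colleague", "tom"),
    ("tom", "colleague", "kim"),
]
family = {"family", "parent", "sibling", "father", "mother", "sister", "brother"}
family_degree = parameterized_n_degree(triples, "ann", family, 1)
colleague_1 = parameterized_n_degree(triples, "ann", {"colleague"}, 1)
colleague_2 = parameterized_n_degree(triples, "ann", {"colleague"}, 2)
```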
PARAMETERIZED DIAMETER length of the longest geodesic in the network for a property of type <rel> (or subtype)
geodesic: a shortest path between two resources for a given relation of type <rel> (or subtype)
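A short Python sketch of the diameter restricted to a set of properties: run a breadth-first search from every resource over the selected relations and keep the longest geodesic found. It assumes directed paths, ignores unreachable pairs, and takes the accepted properties (rel and its subtypes) as a precomputed set; the names and sample triples are hypothetical.

```python
from collections import defaultdict, deque

def parameterized_diameter(triples, rels):
    """Length of the longest shortest path using only properties in rels."""
    adjacency = defaultdict(set)
    nodes = set()
    for s, p, o in triples:
        if p in rels:
            adjacency[s].add(o)
            nodes.update((s, o))
    longest = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Farthest resource reachable from source along a geodesic.
        longest = max(longest, max(dist.values()))
    return longest

# Hypothetical triples, for illustration only.
triples = [
    ("a", "worksWith", "b"),
    ("b", "worksWith", "c"),
    ("c", "worksWith", "d"),
    ("a", "knows", "c"),
]
works_with_diameter = parameterized_diameter(triples, {"worksWith"})
all_relations_diameter = parameterized_diameter(triples, {"worksWith", "knows"})
```

Adding the `knows` link creates a shortcut from a to c, so the diameter over both relations is smaller than the diameter over `worksWith` alone.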
SPARQL FORMALIZATION OF PARAMETERIZED OPERATORS
• SPARQL is designed to query RDF data
• CORESE: semantic search engine implementing semantic web languages using graph-based representations
• automatic processing of semantic inference (e.g. subsumption)
• graph querying extension (e.g. paths)
[Corby et al 2004] [Corby 2008]
SPARQL FORMALIZATION parameterized density
SELECT merge count(?x) as ?nbactor
WHERE {
  ?x rdf:type param[type]
}

SELECT cardinality(?p) as ?card
WHERE {
  { ?p rdf:type rdf:Property
    filter(?p ^ param[rel]) }
  UNION
  { ?p rdfs:subPropertyOf ?parent
    filter(?parent ^ param[rel]) }
}
SPARQL FORMALIZATION parameterized n-degree
SELECT ?y count(?x) as ?degree
WHERE {
  { ?x (param[rel])*::$path ?y
    filter(pathLength($path) <= param[dist]) }
  UNION
  { ?y (param[rel])*::$path ?x
    filter(pathLength($path) <= param[dist]) }
}
GROUP BY ?y
SPARQL FORMALIZATION parameterized diameter
SELECT pathLength($path) as ?length
WHERE {
  ?y s (param[rel])*::$path ?to
}
ORDER BY desc(?length)
LIMIT 1
[Table: SPARQL formalizations of further operators: component, degree, in-degree, diameter, closeness centrality, betweenness centrality (number of geodesics between from and to; number of geodesics between from and to going through b)]
ANALYSED DATASET ipernity.com dataset extracted in RDF:
61 937 actors & 494 510 relationships:
– 18 771 family links between 8 047 actors
– 136 311 friend links involving 17 441 actors
– 339 428 favorite links for 61 425 actors
– 2 874 170 comments from 7 627 actors
– 795 949 messages exchanged by 22 500 actors
INTERPRETATIONS OF RESULTS validated with managers of ipernity.com
• friendOf, favorite, message, comment: small diameter, high density
• family, as expected: large diameter, low density
• favorite: highly centralized around the Ipernity animator
• friendOf, family, message, comment: power law of degrees and betweenness centralities, different strategic actors
• knows: analyze all relations using subsumption
PERFORMANCES & LIMITS
relation | time | projections
Knows | 0.71 s | 494 510
Favorite | 0.64 s | 339 428
Friend | 0.31 s | 136 311
Family | 0.03 s | 18 771
Message | 1.98 s | 795 949
Comment | 9.67 s | 2 874 170

relation | time | projections
Knows | 20.59 s | 989 020
Favorite | 18.73 s | 678 856
Friend | 1.31 s | 272 622
Family | 0.42 s | 37 542
Message | 16.03 s | 1 591 898
Comment | 28.98 s | 5 748 340

Shortest paths used to calculate
relation | path length | time | projections
Knows | <= 2 | 14m 50.69s | 100 000
Knows | <= 2 | 2h 56m 34.13s | 1 000 000
Knows | <= 2 | 7h 19m 15.18s | 2 000 000
Favorite | <= 2 | 5h 33m 18.43s | 2 000 000
Friend | <= 2 | 1m 12.18s | 1 000 000
Friend | <= 2 | 2m 7.98s | 2 000 000
Family | <= 2 | 27.23s | 1 000 000
Family | <= 2 | 2m 9.73s | 3 681 626
Family | <= 3 | 1m 10.71s | 1 000 000
Family | <= 4 | 1m 9.06s | 1 000 000
[Figure: example of social data annotated with SemSNA; labels shown: Degree, isDefinedForProperty, hasCentralityDistance 2, 4; relations colleague, supervisor, mother, father; actors Guillaume, Gérard, Fabien, Mylène, Michel, Yvonne, Philippe, Ivan, Peter]
Comparison criteria: directed networks, weighted networks, labelled network, parameterized operators, network size
Graph Theory: ✔ ✔ ✔, 10^6 nodes, 10^7 edges
[Brandes 2009]: ✔ ✔ ✔, 10^4 nodes
[Paolillo & Wright 2006]: ✔ ✔, ~10^4 nodes, ~10^5 edges
[San Martin & Gutierrez 2009]: ✔ ✔, ~10^4 nodes, ~10^4 - 10^5 edges
SEMSNA: ✔ … ✔ ✔, 10^4 nodes, ~10^5 edges
SEMSNA: CONCLUSION
• the directed typed graph structure of RDF/S is well suited to represent social knowledge & socially produced metadata across applications and networks
• parameterized SNA operators & their SPARQL formalization enable us to exploit the diversity and the semantic structure of social data
• the SemSNA ontology organizes and structures social data
COMMUNITY DETECTION helps understand the distribution of actors and activities in a social network
strategy of state-of-the-art algorithms: mine the link structure in order to detect densely connected groups of actors
HIERARCHICAL ALGORITHMS output a dendrogram: a hierarchical tree of denser and denser communities from top to bottom.
• agglomerative algorithms start from the leaves and group nodes into larger and larger communities: [Donetti & Munoz 2004] [Zhou & Lipowsky 2004] [Xu et al 2007] [Newman 2004]
• divisive algorithms start from the root of the tree and split the network into denser and denser communities: [Girvan & Newman 2002] [Radicchi et al 2004]
HEURISTIC BASED ALGORITHMS heuristics related to the community structure of networks and to community characteristics:
• similarity with electrical networks [Wu 2004]
• random walk [Dongen 2000] [Pons et al 2005]
• label propagation [Raghavan et al 2007]
MODULARITY MEASURES COMMUNITY PARTITION QUALITY fraction of the edges that fall within communities minus the expected such fraction if edges were distributed at random [Newman 2004]
$$Q = \frac{1}{m} \sum_{i,j \in V,\ c_i = c_j} \left[ A_{ij} - \frac{d_{<i>}\, d_{<j>}}{m} \right]$$
with:
• m the number of edges of the network
• d<i> the degree of vertex i
• A_ij the number of edges between i and j
• c_i the community of i
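A minimal Python sketch of this measure: the fraction of links inside communities minus the fraction expected at random. It reads the network as a list of directed arcs, with the degree of i taken as its out-degree and the degree of j as its in-degree; that reading, the function name and the sample data are assumptions. For an undirected network stored once per edge, the usual normalisation uses 2m instead of m.

```python
from collections import Counter

def modularity(arcs, community):
    """Fraction of arcs inside communities minus the expected fraction."""
    m = len(arcs)
    if m == 0:
        return 0.0
    internal = 0
    out_total, in_total = Counter(), Counter()
    for i, j in arcs:
        out_total[community[i]] += 1  # out-degree mass of i's community
        in_total[community[j]] += 1   # in-degree mass of j's community
        if community[i] == community[j]:
            internal += 1
    # Summing d<i> * d<j> / m over same-community pairs factorises per community.
    expected = sum(out_total[c] * in_total[c] for c in out_total) / m
    return (internal - expected) / m

# Hypothetical network: two reciprocal pairs, for illustration only.
arcs = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")]
split = {"a": 1, "b": 1, "c": 2, "d": 2}
merged = {"a": 1, "b": 1, "c": 1, "d": 1}
q_split = modularity(arcs, split)
q_merged = modularity(arcs, merged)
```

Putting every actor in one community gives a modularity of zero, while the partition that follows the two disconnected pairs scores higher.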
LABEL PROPAGATION / RAK [Raghavan et al 2007]
(1) assigns a unique random label to each node.
(2) each node n replaces its label by the label most used by its neighbours.
(3) if at least one node changed its label, go to step 2.
(4) else nodes that share the same label form a community.
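The four steps can be sketched in Python as follows. To keep the example reproducible, this sketch departs from the random choices of the original algorithm: each node starts with its own name as label, ties are broken alphabetically, a node keeps its label when it is already among the most used ones, and labels are updated in place in sorted node order. These choices, the iteration cap and the sample graph are assumptions.

```python
from collections import Counter

def label_propagation(adjacency, max_rounds=100):
    """Group nodes by repeatedly adopting the most used neighbour label."""
    labels = {node: node for node in adjacency}  # step 1: one label per node
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[n] for n in adjacency[node])
            if not counts:
                continue  # isolated node keeps its label
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # step 2: adopt a most used label unless the node already has one
            if labels[node] not in candidates:
                labels[node] = candidates[0]
                changed = True
        if not changed:  # step 3: stop when no label changed
            break
    communities = {}
    for node, label in labels.items():  # step 4: same label, same community
        communities.setdefault(label, set()).add(node)
    return list(communities.values())

# Hypothetical network made of two separate triangles.
adjacency = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "d": ["e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
communities = label_propagation(adjacency)
```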
opportunity: replace random labels with tags, in order to exploit not only the link structure but also the semantics of the actors' vocabulary!
FOLKSONOMIES each tag may represent a community of interest
[Figure: social tagging, flat folksonomy, thesaurus; tags shown: pollution, pollutions du sol (soil pollution), polluant (pollutant), énergie (energy); links shown: has narrower, related]
[Limpens 2010]
TAG PROPAGATION exploit folksonomy for label assignment
"interaction creates similarity, while similarity creates interaction" [Mika 2005]
[Figure: network of actors a to g, each labelled with a tag among wiki, sweetwiki, mediawiki, inria, isicil]
TAG PROPAGATION tag counts: wiki: 1, sweetwiki: 1, mediawiki: 1
[Figure: network of actors a to g, each labelled with a tag among wiki, sweetwiki, mediawiki, inria, isicil]
SEMANTIC TAG PROPAGATION tag counts: wiki: 3, sweetwiki: 1, mediawiki: 1
[Figure: same network, with the tag hierarchy wiki skos:narrower sweetwiki, mediawiki]
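The difference between the two counts can be reproduced with a small Python sketch: with plain counting each neighbour tag votes for itself only, while with semantic counting a tag also votes for its directly broader tag. The function names and the dict encoding of skos:narrower are assumptions made for the illustration.

```python
from collections import Counter

def plain_tag_counts(neighbor_tags):
    """Each neighbour tag counts only for itself."""
    return Counter(neighbor_tags)

def semantic_tag_counts(neighbor_tags, narrower):
    """Each neighbour tag also counts for the tags it is narrower than.

    narrower maps a tag to the set of its direct skos:narrower tags.
    """
    broader = {}
    for tag, children in narrower.items():
        for child in children:
            broader.setdefault(child, set()).add(tag)
    counts = Counter()
    for tag in neighbor_tags:
        counts[tag] += 1
        for parent in broader.get(tag, ()):
            counts[parent] += 1
    return counts

# Tags and hierarchy taken from the slide example.
neighbor_tags = ["wiki", "sweetwiki", "mediawiki"]
narrower = {"wiki": {"sweetwiki", "mediawiki"}}
plain = plain_tag_counts(neighbor_tags)
semantic = semantic_tag_counts(neighbor_tags, narrower)
```

With plain counting the three tags tie at one vote each; with the hierarchy, wiki collects three votes and becomes the most used tag.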
SEMANTIC TAG PROPAGATION 2 communities labelled with wiki & isicil
[Figure: same network after propagation: four actors labelled wiki, three labelled isicil; tag hierarchy wiki skos:narrower sweetwiki, mediawiki]
ALGORITHM SEMTAGP
Algorithm SemTagP(RDFGraph network, Type relationType)
DO
    old_network = copy(network)
    // propagate tags (i.e. compute new partitions)
    FOREACH user IN network.users
        user.tag = mostUsedNeighborTag(user, relationType)
    END FOREACH
WHILE modularity(network) > modularity(old_network)
RETURN old_network
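A compact Python sketch of the SemTagP loop, under several simplifying assumptions: each user carries a single tag, links are directed arcs, all users are updated from the same snapshot in each round, ties keep the current tag or fall back to alphabetical order, and only direct skos:narrower relations are used. The function names, the modularity variant and the sample network are hypothetical.

```python
from collections import Counter, defaultdict

def modularity(arcs, tag_of):
    """Share of arcs inside tag communities minus the expected share."""
    m = len(arcs)
    if m == 0:
        return 0.0
    internal = sum(1 for i, j in arcs if tag_of[i] == tag_of[j])
    out_total, in_total = Counter(), Counter()
    for i, j in arcs:
        out_total[tag_of[i]] += 1
        in_total[tag_of[j]] += 1
    expected = sum(out_total[t] * in_total[t] for t in out_total) / m
    return (internal - expected) / m

def propagate(arcs, tag_of, narrower):
    """Give every user the tag most used by its neighbours (with broader tags)."""
    broader = defaultdict(set)
    for tag, children in narrower.items():
        for child in children:
            broader[child].add(tag)
    votes = defaultdict(Counter)
    for user, neighbor in arcs:
        tag = tag_of[neighbor]
        votes[user][tag] += 1
        for parent in broader.get(tag, ()):  # a narrower tag backs its parent
            votes[user][parent] += 1
    new_tags = dict(tag_of)
    for user, counts in votes.items():
        best = max(counts.values())
        winners = sorted(t for t, c in counts.items() if c == best)
        if tag_of[user] not in winners:
            new_tags[user] = winners[0]
    return new_tags

def sem_tag_p(arcs, tags, narrower, max_rounds=50):
    """Propagate tags while modularity improves; return the last kept state."""
    current, best_q = dict(tags), modularity(arcs, tags)
    for _ in range(max_rounds):
        candidate = propagate(arcs, current, narrower)
        q = modularity(arcs, candidate)
        if q <= best_q:
            break
        current, best_q = candidate, q
    return current

# Hypothetical network: undirected links stored in both directions.
links = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("e", "f"), ("e", "g"), ("f", "g"), ("d", "e")]
arcs = links + [(y, x) for x, y in links]
tags = {"a": "wiki", "b": "sweetwiki", "c": "mediawiki", "d": "wiki",
        "e": "isicil", "f": "isicil", "g": "inria"}
narrower = {"wiki": {"sweetwiki", "mediawiki"}}
result = sem_tag_p(arcs, tags, narrower)
```

On this toy network the loop stops after one accepted round, leaving two communities labelled wiki and isicil.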
PARAMETERIZED SPARQL QUERY delegate all the semantic processing to a semantic graph engine to exploit semantic relations between tags and to parameterize the analyzed relation
SELECT ?user ?tag ?y
WHERE {
  ?user param[rel] ?neighbor .
  { { ?neighbor scot:hasTag ?tag }
    UNION
    { ?neighbor scot:hasTag ?tag2 .
      ?tag skos:narrower ?tag2
      filter(exists { ?x scot:hasTag ?tag }) } }
}
ORDER BY ?user ?tag
SOLUTION user control to disable the semantic relations of given tags, which strengthens other, narrower tags
nanotechnology
APPLIED TO ADEME PH.D. NETWORK
• 1,853 agents: 1,597 academic supervisors, 256 ADEME engineers
• 13,982 relationships: 10,246 rel:worksWith, 3,736 rel:colleagueOf
• 6,583 tags; 3,570 skos:narrower relations between 2,785 tags
MODULARITY LIMITS [Fortunato & Barthélemy 2007]
• "the 'optimal partition', imposed by mathematics, does not necessarily capture the actual community structure of the network", confirmed by experiments
• modularity optimization might miss important substructures when:
  – modules are very fuzzy
  – modules have fewer than $\sqrt{2m}$ edges (which is the case for half of ADEME's detected communities)
• perspectives: measuring the average quality of each community
RESULT
1. pollution
2. sustainable development
3. energy
4. chemistry
5. air pollution
6. metals
7. biomass
8. waste
[Figure legend: engineer, supervisor, community; node size = degree]
SEMTAGP: CONCLUSION
• SemTagP: semantic community detection and controlled labelling
• applied to reveal the distribution of ADEME Ph.D. funding
• many perspectives to integrate more semantics:
  – investigate other semantic relations, e.g. skos:related, skos:closeMatch
  – propagate tags through different types of relations
  – propagate multiple tags and detect overlapping communities
CONTRIBUTIONS
• leveraging online social networks to ontology-based representations
• extending social network analysis to ontology-based representations
• semantic community detection and labelling
PERSPECTIVES
• scaling to large networks: sampling, parallel, iterative algorithms
• considering temporal data in the analysis: representing and analysing temporal data
• enrich social activities with SemSNA results: better management of resources and relationships
QUESTIONS
International conference
• Erétéo G., Gandon F., Corby O., Buffa M., "Analysis of a Real Online Social Network Using Semantic Web Frameworks". ISWC 2009, Washington D.C., USA.
• Erétéo G., Gandon F., Corby O., Buffa M., "Semantic Social Network Analysis". Web Science 2009, Athens, Greece.
Book chapter
• Erétéo G., Buffa M., Gandon F., Leitzelman M., Limpens F., Sander P., "Semantic Social Network Analysis, a concrete case". Handbook of Research on Methods and Techniques for Studying Virtual Communities: Paradigms and Phenomena, edited by Ben Kei Daniel, IGI Global, 2011.
National conference
• Leitzelman M., Erétéo G., Grohan P., Herledan F., Buffa M., Gandon F., "De l'utilité d'un outil de veille d'entreprise de seconde génération". Poster, IC2009, Hammamet, Tunisia.
Workshop
• Erétéo G., Buffa M., Gandon F., Leitzelman M., Limpens F., "Leveraging Social data with Semantics". W3C Workshop on the Future of Social Networking, Barcelona, Spain.
• Erétéo G., Buffa M., Gandon F., Grohan P., Leitzelman M., Sander P., "A State of the Art on Social Network Analysis and its Applications on a Semantic Web". SDoW2008, Karlsruhe, Germany.