Identifying Communities on Twitter: Time, Topics & Clusters
Pascal Jürgens (@pascal)
Dept. of Communication, U of Mainz, Germany
Overview
Relevance / Why it’s interesting
The Basic Idea / Why it works
Limitations / When it works
Algorithms / How it works
Evaluations / How to tell whether it works
What Science are we in, anyways?
“The antireductionist catch-phrase, “the whole is more than the sum of its parts,” takes on increasing significance as new sciences such as chaos, systems biology, evolutionary economics, and network theory move beyond reductionism to explain how complex behavior can arise from large collections of simpler components.”
Mitchell, 2009 — Complexity: A Guided Tour
What Science are we in, anyways?
Interdisciplinary territory with distinct influences
20th century Sociology — small-scale social network analysis
Econometrics — time-series analysis, predictions & forecasting
Mass Communication — media effects
Theoretical Physics — abstract, high-level descriptions of networks; large-scale network analysis (Why is this even here?)
What is Community Detection?
“Communities are groups of vertices which probably share common properties and/or play similar roles within the graph.”
Fortunato & Castellano 2009 — Community Structure in Graphs in the Encyclopedia of Complexity and Systems Science
An exploratory method for partitioning a network into smaller pieces. In many ways it is comparable to cluster analysis.
(Caveat Emptor)
Community detection (CD) is a complex, fairly new set of statistical methods for building groups from data in an exploratory way
So why not use simpler, better-known methods such as clustering?
By all means, use simple methods! (But they do something different.)
Relevance
Networks are a fundamental structure of the world
There are global properties of networks (diameter &c.)
There are properties of nodes (centrality &c.)
However: networks are almost never homogeneous!
There is a structure hidden within the whole
[Figure: network with three labeled groups: Group A, Group B, Group C]
Applications
Identify separate groups within relevant population for further description
Captures “public sphere” better than aggregates such as #hashtags (users who share a #tag might have nothing in common)
Investigate relationship of communities (mesoscopic graph)
In general: more accurate, delivers more details
Terminology
Graph: A network, consisting of
nodes (or vertices) such as Twitter users, with degree = number of connections
links (or edges) such as relationships via @-messages, with weight = intensity of links
Partition: one way to split a network into a set of communities
(k-) Clique: a set of k completely connected nodes
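The terms above can be made concrete with a small sketch. The user names, the message counts and the helper names `degree` and `is_clique` are made up for illustration and are not from the talk.

```python
from itertools import combinations

# Hypothetical toy data: (user, user, number of @-messages exchanged).
edges = [("ann", "bob", 3), ("bob", "cat", 1), ("ann", "cat", 2), ("cat", "dan", 5)]

# Graph as a dict of dicts: graph[u][v] is the weight of the link u-v.
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

def degree(node):
    """Degree = number of connections of a node."""
    return len(graph[node])

def is_clique(nodes):
    """True if every pair of the given nodes is directly linked."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))

# A partition: one way to split all nodes into communities.
partition = [{"ann", "bob", "cat"}, {"dan"}]

print(degree("cat"))                    # connections of "cat"
print(graph["ann"]["bob"])              # weight of the link ann-bob
print(is_clique(["ann", "bob", "cat"])) # is this a 3-clique?
```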
The Basic Idea
Communities: Local structures within a network that differ in their structure from the surroundings
A good starting point: communities are better connected among themselves than with other communities
Opens up two obvious methods:
Add links between close nodes until some condition is met
Remove links between distant nodes until some condition is met
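The starting point above ("better connected among themselves than with other communities") can be checked for any candidate partition by counting links inside and between communities. This is only a sketch on a made-up network (two triangles joined by one link); the function name `link_counts` is an assumption for illustration.

```python
def link_counts(edges, partition):
    """Return (internal, external) link counts for a partition.

    internal: links whose two ends lie in the same community
    external: links that run between two different communities
    """
    community_of = {node: i for i, group in enumerate(partition) for node in group}
    internal = sum(1 for u, v in edges if community_of[u] == community_of[v])
    return internal, len(edges) - internal

# Toy network: triangle 1-2-3, triangle 4-5-6, one bridge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]

plausible = [{1, 2, 3}, {4, 5, 6}]
arbitrary = [{1, 2, 4}, {3, 5, 6}]

print(link_counts(edges, plausible))  # mostly internal links
print(link_counts(edges, arbitrary))  # mostly external links
```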
[Figure: possible partitions]
The Edge Betweenness Algorithm (Girvan / Newman)
Edge betweenness (EB): the number of shortest paths between pairs of nodes that pass through a given edge
High EB: the link is very important to fast information flow
Low EB: the link can easily be replaced by another route
The algorithm removes the link with the highest EB step by step, recalculating EB after each removal
An optimal cut can be selected from the resulting sequence of partitions
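A minimal pure-Python sketch of one splitting step, assuming an unweighted, undirected network without self-loops. Edge betweenness is computed with a breadth-first search from every node (the approach usually attributed to Brandes); the function names and the toy network are assumptions for illustration, not the speaker's code.

```python
from collections import deque

def build_adj(edges):
    """Adjacency as a dict: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_betweenness(adj):
    """Shortest paths through each edge, summed over all node pairs."""
    eb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {n: 0 for n in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {n: [] for n in adj}  # predecessors on shortest paths
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):     # farthest nodes first
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                eb[frozenset((v, w))] += share
                delta[v] += share
    # Every pair was counted from both of its ends.
    return {edge: value / 2 for edge, value in eb.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove highest-EB links until the network falls into more pieces."""
    adj = {n: set(nb) for n, nb in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        eb = edge_betweenness(adj)
        if not eb:
            break                                # no links left to remove
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Toy network: two triangles joined by the bridge 3-4.
adj = build_adj([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)])
eb = edge_betweenness(adj)
communities = split_once(adj)
print(sorted(sorted(c) for c in communities))
```

Calling `split_once` repeatedly on the result gives the sequence of partitions from which a cut can be chosen.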
[Figure: small network example]
[Figure: small network example, edge betweenness cluster]
Limitations: Technical
The number of potential ways to divide a network grows super-exponentially with the number of nodes (!)
Two critical performance parameters of algorithms: runtime (“Big-O”-notation) and memory
Networks up to 100s of nodes and/or edges — usually OK
Networks up to 10 000s of nodes and/or edges — buy a lot of memory (8GB upwards) and prepare to wait
Bigger networks: Ask a computer scientist
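To get a feeling for the growth mentioned in the first point: if every node belongs to exactly one community, the number of possible partitions of n nodes is the Bell number B(n). The sketch below computes it with the Bell triangle; the function name `bell_numbers` is an assumption for illustration.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], computed with the Bell triangle."""
    row = [1]
    bells = [1]                      # B(0) = 1
    for _ in range(n_max):
        new_row = [row[-1]]          # each row starts with the previous row's last entry
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

bells = bell_numbers(20)
for n in (5, 10, 20):
    print(n, "nodes:", bells[n], "possible partitions")
```

Even for a network of 20 nodes, trying out every partition is out of the question, which is why the algorithms rely on heuristics.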
Limitations: Methodological
Quality of partitions: algorithms don’t guarantee best results
Instability of partitions: algorithms can be non-deterministic and very sensitive to small changes
Evaluation / Comparison of partitions is near-impossible
Sometimes result is not one best but a whole set of partitions
Nodes can only belong to one community (!)
[Figure: large network example, edge betweenness cluster]
Notable Algorithms
The Edge Betweenness Algorithm (Girvan / Newman)
Markov Cluster Algorithm (MCL, van Dongen; this one is in Gephi)
Clique Percolation (CPM, Palla et al.)
Information-theoretic algorithm (Rosvall & Bergstrom; handles hierarchies and works with communities of very different sizes)
A Word about MCL
Marked as experimental in Gephi and hard to use (the clustering panel needs to be open before loading the dataset), but the only clustering algorithm available there
Based on the probability of link use: simulates flow through the network
Often less-than-stellar results
Connection probabilities seem an odd predictor for empirical connection habits
DEMO
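For readers without the live demo, the flow simulation behind MCL can be sketched in a few lines. This is a simplified pure-Python version of the basic scheme (self-loops, then alternating expansion and inflation), not the Gephi plugin; the parameter defaults, the threshold `1e-6` and the toy network are assumptions for illustration.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (transition probabilities)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(adjacency, inflation=2.0, iterations=50):
    """Simplified Markov clustering on an adjacency matrix (list of lists)."""
    n = len(adjacency)
    # Self-loops are commonly added before the first step.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                               # expansion: flow spreads out
        m = normalize_columns(
            [[x ** inflation for x in row] for row in m])  # inflation: strong flow wins
    # Rows that keep flow belong to "attractor" nodes; the columns with
    # flow in such a row are the members of that attractor's cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Toy network: nodes 0-1-2 and 3-4-5 form triangles, joined by the link 2-3.
n = 6
adjacency = [[0.0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    adjacency[u][v] = adjacency[v][u] = 1.0

clusters = mcl(adjacency)
print(sorted(sorted(c) for c in clusters))
```

A higher `inflation` value tends to produce more and smaller clusters, which is one reason results can vary a lot between runs with different settings.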
Clique Percolation
One among several new algorithms that address shortcomings
Intuitive mechanism
Nodes can be in several communities!
Works rather well for dense networks!
Clique Percolation
Idea: Find k-cliques in the network
Try to “move” each clique onto adjacent cliques (those sharing k-1 nodes) until it reaches a bottleneck that it can’t fit through
All the nodes covered by this “trail” are assigned to a community
Rather easy to implement in software (igraph); a free implementation is also available (CFinder, cfinder.org)
DEMO
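The mechanism can be sketched in plain Python for small networks: list all k-cliques, treat two cliques as adjacent when they share k-1 nodes, and merge chains of adjacent cliques into communities. This brute-force version is an illustration under those assumptions, not CFinder's implementation; the function names and the toy network are made up.

```python
from itertools import combinations

def build_adj(edges):
    """Adjacency as a dict: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def k_cliques(adj, k):
    """All sets of k nodes that are completely connected (brute force)."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def clique_percolation(adj, k=3):
    """Communities = unions of k-cliques reachable via shared (k-1)-node overlaps."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))   # union-find over cliques

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:   # the clique can "roll" from i to j
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())

# Toy network: triangles 1-2-3 and 2-3-4 share a link; triangle 4-5-6
# touches them only at node 4, which is the bottleneck.
adj = build_adj([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)])
communities = clique_percolation(adj, k=3)
print(sorted(sorted(c) for c in communities))
```

Note that node 4 ends up in both communities, and that nodes which are part of no k-clique are left unassigned.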
Clique Percolation by Example
Evaluation
Exploratory methods are notoriously difficult to assess (beyond rule-of-thumb judgements).
Two approaches allow rigorous examination:
Comparison of two partitions
Comparison of a partition against a baseline model (null model)
Effectively infeasible for non-mathematicians: pick a good algorithm and treat results with care
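Two simple measures give a first impression of both approaches. The talk does not name specific measures, so the choice here is an assumption: the Rand index for comparing two partitions, and Newman's modularity for comparing a partition against a random baseline with the same node degrees. The toy network and the function names are made up for illustration.

```python
from itertools import combinations

def rand_index(p1, p2):
    """Share of node pairs that both partitions treat the same way
    (together in both, or apart in both). 1.0 means identical partitions."""
    label1 = {n: i for i, group in enumerate(p1) for n in group}
    label2 = {n: i for i, group in enumerate(p2) for n in group}
    pairs = list(combinations(sorted(label1), 2))
    agree = sum((label1[u] == label1[v]) == (label2[u] == label2[v])
                for u, v in pairs)
    return agree / len(pairs)

def modularity(edges, partition):
    """Share of internal links minus the share expected if links were
    placed at random while keeping every node's degree."""
    m = len(edges)
    community_of = {n: i for i, group in enumerate(partition) for n in group}
    internal = [0] * len(partition)     # links inside each community
    degree_sum = [0] * len(partition)   # summed degrees of each community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2
               for l, d in zip(internal, degree_sum))

# Toy network: two triangles joined by the bridge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
a = [{1, 2, 3}, {4, 5, 6}]
b = [{1, 2}, {3, 4, 5, 6}]
everything = [{1, 2, 3, 4, 5, 6}]

print("Rand index a vs b:", rand_index(a, b))
print("Modularity of a:", modularity(edges, a))
print("Modularity of one big community:", modularity(edges, everything))
```

The plain Rand index is not corrected for chance agreement, so values should only be compared with each other, not read as absolute quality scores.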
What about user attributes?
What happens when we use empirical attributes to group users?
Example of the German General Election 2009: Measured party affiliation (wahlgetwitter hashtag +/- convention)
Turns out, users don’t cluster by party affiliation
But careful: this approach means measuring with two loose ends
Clustering baseline needs to be really, really solid
Takeaway
Community detection used to be hard but is pretty usable now
Think about the design and scope of a collected network beforehand! (timeframe, directed, size/scope etc.)
Watch the outliers (Justin Bieber will sink your analysis)
Choose an algorithm that
uses directed & weighted links, is understandable, robust
and one that produces meaningful, simple results!
Thanks!
Literature
Fortunato, Santo and Castellano, Claudio (2009): Community Structure in Graphs. In: Meyers, Robert A. (Ed.): Encyclopedia of Complexity and Systems Science. Springer.
Lancichinetti, Andrea and Fortunato, Santo: Community detection algorithms: a comparative analysis. Physical Review E.
Mitchell, Melanie (2009): Complexity: A Guided Tour.
Palla, Gergely; Barabási, Albert-László and Vicsek, Tamás: Quantifying social group evolution.