Topic-aware Network Visualisation toExplore Large Email...

4
Topic-aware Network Visualisation to Explore Large Email Corpora Tim Repke Hasso Plattner Institute Potsdam, Germany [email protected] Ralf Krestel Hasso Plattner Institute Potsdam, Germany [email protected] ABSTRACT Nowadays, more and more large datasets exhibit an intrinsic graph structure. While there exist special graph databases to han- dle ever increasing amounts of nodes and edges, visualising this data becomes infeasible quickly with growing data. In addition, looking at its structure is not sufficient to get an overview of a graph dataset. Indeed, visualising additional information about nodes or edges without cluttering the screen is essential. In this paper, we propose an interactive visualisation for social networks that positions individuals (nodes) on a two-dimensional canvas such that communities defined by social links (edges) are easily recognisable. Furthermore, we visualise topical relatedness be- tween individuals by analysing information about social links, in our case email communication. To this end, we utilise document embeddings, which project the content of an email message into a high dimensional semantic space and graph embeddings, which project nodes in a network graph into a latent space reflecting their relatedness. 1 INTRODUCTION In our modern information society we produce substantial amounts of data each day. A large portion of it comes from the communi- cation on social media platforms or through emails. Special graph databases enable the efficient storage of these large communica- tion networks and provide interfaces to query or analyse the data. Visualising networks in their entirety on the other hand is a very challenging task. Users investigating a communication network want to find information about when does who communicate with whom about what. These kind of networks can be found in many different shapes. Modern social networks, such as Twitter or Facebook exhibit similar structures as classic, offline social networks [20]. We investigate another type of social network: a collection of emails. Given the communication data over a year or more, it is prac- tically impossible to gain an overview or quick insights into the latent network structure with a basic approach as shown in 1. Also, in such a traditional network visualisation, information about the content of messages sent between individuals is lost. Besides these traditional systems, more exotic approaches use the metaphor of geographical maps [17] to visualise networks, for example using topology to reflect connectivity of densely connected social communities. The map analogy can also be used to visualise the contents of documents by embedding them into a high dimensional semantic space [15] and projecting it on the map as a document landscape. In order to highlight how relation- ships form and change based on the interactions, the metaphor of a growing tree ca be used (ContactTrees [18]). Although this reflects temporal aspects of dynamic networks well, it focuses on © 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. one person as the root, thus an overview of the entire network is lost. CactusTrees [6] on the other hand represent hierarchical structures with the goal of untangling overlaid bundles of inter- secting edges, making distant connections more apparent. As higher order dependencies may get lost in traditional visualisa- tions, HoNVis [21] adds nodes to encode dependencies in chains of interactions. Usually, a communication network has many nodes and overlapping connections already, so Yang et al. [23] rather focus on discovering overlapping cores to improve the identification of community boundaries to highlight global latent structures. Similarly, Gronemann et al. [11] use the metaphor of islands and hills to visualise clustered graphs, making densely connected communities clearly noticeable. The edges are bundled and follow valleys of the resulting topology, thus making rela- tionships between other communities hard to follow. MapSets [7] assume a graph that was laid out using embeddings reflecting communities. An algorithm then draws regions around clusters of nodes, such that the bounding shapes are contiguous and non-overlapping, but yet abstract. Another approach to visualise networks at full scale is to aggregate nodes based on their spa- tial distribution and thereby allowing for a simple exploration with contour lines and heatmap overlays to emphasise latent structures as proposed by Hildenbrand et al. [13]. Document visualisation aims to visualise the content, such that users gain quick insights into topics, latent phrases, or trends. Tiara [22] extracts topics and derives time-sensitive keywords to depict evolving subjects over time as stacked plots. Other approaches project documents into a latent space, using topic models or embeddings. Creating scatter-plots of embedded docu- ments of a large corpus may result in a very dense and unclear layout, so Chen et al. [4] developed an algorithm to reduce over- full visualisations by picking representative documents. A dif- ferent approach is taken by Fortuna et al. [8], who do not show documents directly, but generate a heatmap of the populated canvas and overlay it with salient phrases at more densely popu- lated areas from the underlying documents in that region. Friedl et al. [10] extend that concept by drawing clear lines between regions and colouring them. They also add edges between salient phrases based on co-occurrences in the texts. Most recently Car- tograph [19] was proposed, which is visually very similar to previous approaches, but uses pre-rendered information of dif- ferent resolution and map technology to enable a responsive interactive visualisation. Regions are coloured based on underly- ing ontologies from a knowledge-base. Our goal is to merge approaches for network and document visualisations in one interactive user interface. This means to integrate multiple dimensions of email datasets including time, interactions, users and topics into a 2D map representation. Giv- ing an overview over latent structures and topics in one map may significantly improve the exploration of a corpus by users 1

Transcript of Topic-aware Network Visualisation toExplore Large Email...

Page 1: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an

Topic-aware Network Visualisation toExplore Large Email Corpora

Tim RepkeHasso Plattner InstitutePotsdam, [email protected]

Ralf KrestelHasso Plattner InstitutePotsdam, [email protected]

ABSTRACTNowadays, more and more large datasets exhibit an intrinsicgraph structure. While there exist special graph databases to han-dle ever increasing amounts of nodes and edges, visualising thisdata becomes infeasible quickly with growing data. In addition,looking at its structure is not sufficient to get an overview of agraph dataset. Indeed, visualising additional information aboutnodes or edges without cluttering the screen is essential. In thispaper, we propose an interactive visualisation for social networksthat positions individuals (nodes) on a two-dimensional canvassuch that communities defined by social links (edges) are easilyrecognisable. Furthermore, we visualise topical relatedness be-tween individuals by analysing information about social links, inour case email communication. To this end, we utilise documentembeddings, which project the content of an email message intoa high dimensional semantic space and graph embeddings, whichproject nodes in a network graph into a latent space reflectingtheir relatedness.

1 INTRODUCTIONIn ourmodern information societywe produce substantial amountsof data each day. A large portion of it comes from the communi-cation on social media platforms or through emails. Special graphdatabases enable the efficient storage of these large communica-tion networks and provide interfaces to query or analyse the data.Visualising networks in their entirety on the other hand is a verychallenging task. Users investigating a communication networkwant to find information about when does who communicatewith whom about what. These kind of networks can be found inmany different shapes. Modern social networks, such as Twitteror Facebook exhibit similar structures as classic, offline socialnetworks [20]. We investigate another type of social network: acollection of emails.

Given the communication data over a year or more, it is prac-tically impossible to gain an overview or quick insights intothe latent network structure with a basic approach as shown in1. Also, in such a traditional network visualisation, informationabout the content of messages sent between individuals is lost.Besides these traditional systems, more exotic approaches usethe metaphor of geographical maps [17] to visualise networks,for example using topology to reflect connectivity of denselyconnected social communities. The map analogy can also be usedto visualise the contents of documents by embedding them intoa high dimensional semantic space [15] and projecting it on themap as a document landscape. In order to highlight how relation-ships form and change based on the interactions, the metaphorof a growing tree ca be used (ContactTrees [18]). Although thisreflects temporal aspects of dynamic networks well, it focuses on

© 2018 Copyright held by the owner/author(s). Published in the WorkshopProceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permittedunder the terms of the Creative Commons license CC-by-nc-nd 4.0.

one person as the root, thus an overview of the entire networkis lost. CactusTrees [6] on the other hand represent hierarchicalstructures with the goal of untangling overlaid bundles of inter-secting edges, making distant connections more apparent. Ashigher order dependencies may get lost in traditional visualisa-tions, HoNVis [21] adds nodes to encode dependencies in chainsof interactions. Usually, a communication network has manynodes and overlapping connections already, so Yang et al. [23]rather focus on discovering overlapping cores to improve theidentification of community boundaries to highlight global latentstructures. Similarly, Gronemann et al. [11] use the metaphor ofislands and hills to visualise clustered graphs, making denselyconnected communities clearly noticeable. The edges are bundledand follow valleys of the resulting topology, thus making rela-tionships between other communities hard to follow. MapSets [7]assume a graph that was laid out using embeddings reflectingcommunities. An algorithm then draws regions around clustersof nodes, such that the bounding shapes are contiguous andnon-overlapping, but yet abstract. Another approach to visualisenetworks at full scale is to aggregate nodes based on their spa-tial distribution and thereby allowing for a simple explorationwith contour lines and heatmap overlays to emphasise latentstructures as proposed by Hildenbrand et al. [13].

Document visualisation aims to visualise the content, such thatusers gain quick insights into topics, latent phrases, or trends.Tiara [22] extracts topics and derives time-sensitive keywordsto depict evolving subjects over time as stacked plots. Otherapproaches project documents into a latent space, using topicmodels or embeddings. Creating scatter-plots of embedded docu-ments of a large corpus may result in a very dense and unclearlayout, so Chen et al. [4] developed an algorithm to reduce over-full visualisations by picking representative documents. A dif-ferent approach is taken by Fortuna et al. [8], who do not showdocuments directly, but generate a heatmap of the populatedcanvas and overlay it with salient phrases at more densely popu-lated areas from the underlying documents in that region. Friedlet al. [10] extend that concept by drawing clear lines betweenregions and colouring them. They also add edges between salientphrases based on co-occurrences in the texts. Most recently Car-tograph [19] was proposed, which is visually very similar toprevious approaches, but uses pre-rendered information of dif-ferent resolution and map technology to enable a responsiveinteractive visualisation. Regions are coloured based on underly-ing ontologies from a knowledge-base.

Our goal is to merge approaches for network and documentvisualisations in one interactive user interface. This means tointegrate multiple dimensions of email datasets including time,interactions, users and topics into a 2D map representation. Giv-ing an overview over latent structures and topics in one mapmay significantly improve the exploration of a corpus by users

1

Page 2: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an

Figure 1: Traditional basic visualisation of a communication graph from 2000 emails using force layout

unfamiliar with the domain and terminology. Also domain ex-perts could benefit from such an overview, e.g. by easily beingable to identify global patterns in the data.

A specific application scenario that could benefit from suchintegrated, interactive visualisations is the analysis of large, un-structured, heterogeneous data collections. Data-driven journal-ism [5] often has to deal with leaked, unstructured, very hetero-geneous data, e.g. in the context of the Panama Papers, wherejournalists needed to untangle and order huge amounts of infor-mation, search entities, and visualise found patterns [3]. Similardatasets are of interest in the context of computational foren-sics [9]. Auditing firms and law enforcement need to sift throughhuge amounts of data to gather evidence of criminal activity,often involving communication networks and documents [14].

2 INTERACTIVE VISUALISATIONSystems for document exploration largely vary in what theydisplay and how users interact with them. This depends partlyon the available raw data, but also on information extracted frompre-processing or enrichment with external sources. 1 showsa basic visualisation of the network graph extracted from anemail corpus. Although it is an improvement over only listingconnections, large densely connected graphs quickly becomehard to read and information about the email contents is lost.

Exploring document collections can be seen as a top-downapproach, where the system provides abstract overviews of theentire document collection and users incrementally refine thesearch, narrowing the results to just a few documents of interest.Such a top-down approach may help users without prior knowl-edge to get a sense for the data by visualising high level latentstructures of communication networks or the topical distribu-tions.

In the scope of this work we primarily consider documents tobe emails or data attached to them. The sender, recipients, time,and content can directly be extracted from the raw data. We callthese – and results from further processing – dimensions that canbe visualised. From the contents one may infer named entities,topics, embeddings, or salient phrases, while the communicationnetwork spanned by sender-recipient pairs can be used to detectsalient structures and hierarchies. The temporal informationenables the previously mentioned data to be analysed over timeto detect evolving or changing patterns.

There are numerous ways to visualise each dimension onits own or in combination with others. The requirement of adimension and its priority in a visualisation is dictated by thesystem objective. From the wide range of possibilities, we strivefor a system which supports the exploration of a large collection

of documents without any prior knowledge about its content andindividuals involved.

In our system, we use the names and email addresses of sendersand recipients (individuals), communication network, semanticvector representations of email contents, and as part of an overlaythe timestamps of emails and propose a graph layout over adocument landscape that visually describes who talks with whomabout what at a given time period.

3 SYSTEM ARCHITECTUREVisualising communication networks in a topic-aware fashion toexplore documents and salient structures is not straightforwardDifferent layout objectives may produce contradicting results andthe challenges of processing big data need to be addressed [2]. Inthis section, we describe algorithmic approaches behind the sys-tem we are working on. For a discussion of engineering aspectson how to store, serve, and render the map-like data, we refer tothe Cartograph stack [19], as we will focus on the process howto get the information that the map is generated from.

We visualise the embedded emails as dots in a two-dimensionallandscape in which individuals are placed as nodes connected byedges. All emails between two individuals are reduced into oneedge reducing visual complexity and making it easier to detectsalient structures. However, that comes with the trade-off thatnodes and edges cannot be perfectly placed in the landscape tocover all semantic aspects of the communication between them,but rather an estimate. Our very early prototype placed someindividuals with no dominant topic in a crowded area in thecentre of the landscape as shown in 2, where colours of opaquedots for emails correspond to that of the sender. Although thenetwork visualisation at this point does not make connectionsmore clear than in 1, users can already distinguish individualswith similar or unrelated topics.

Our proposed algorithm to find a stable network layout hasthree stages, namely an (i) initialisation phase which createsthe landscape and roughly places nodes and connections, an(ii) update phase which iteratively updates the node placementtowards a better fit, and finally a (iii) post processing phase whereedges become splines to make latent structures more clear and amap topology is added.

Initialising the Landscape. To generate the document land-scape, we first process the network graph to roughly determineregions, where documents will be placed. Therefore we applynode2vec [12] to the communication network and embed each in-dividual’s node. We separate the graph into communities Pi ⊂ Pusing the kernel density of the resulting populated space at

Page 3: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an

0 03 35 56 68 89 90 03 3

14-14

5 56 68 89 9

20-20

3 35 56 68 89 930 303 33 3

33-33

34-34

35 3536 363 338 3839 390 0

41-41

3 3

45-45

6 68 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 6

63-63

64-64

65 6566 666 668 6869 690 03 35 56 68 89 980 808 88 883 838 8

85-85

86 868 888 8889 8990 909 99 993 93

94-94

95-95

96 969 9

98-98

99 9900 000 00 003 030 005 0506 060 0

108-108

09 090 0

111-111

3 35 56 6

118-118

9 90 03 3

124-124

5 56 6

127-127

8 89 930 30

131-131

3 333 333 335 3536 363 338 3839 390 0

143-143

145-145

6 68 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 6

162-162

63 636 665 6566 666 668 6869 690 03 3

175-175

176-176

8 89 980 80

181-181

8 883 838 885 8586 868 888 8889 8990 90

191-191

9 993 939 995 9596 969 998 9899 9900 000 00 003 030 005 0506 060 008 08

209-209

210-210

213-213

5 56 6

218-218

9 90 0

221-221

3 35 56 68 89 930 30

231-231

3 333 333 335 3536 363 338 3839 390 03 35 56 6

247-247

8 89 950 505 55 5

253-253

5 555 5556 56

257-257

58 5859 5960 606 66 663 63

264-264

65 65

266-266

267-267

68 6869 690 03 35 56 68 89 980 808 88 883 838 885 8586 868 888 8889 8990 909 99 993 93

294-294

95 9596 969 998 9899 99300 30030 3030 30303 30330 30

305-305

306 306

307-307

308 308309 3093 0 3 03 3

312-312

3 3 3 33 33 5 3 53 6 3 6

317-317

3 8 3 83 9 3 93 0 3 03 3

322-322

3 3 3 33 33 5 3 53 6 3 6

327-327

3 8 3 83 9 3 9330 33033 3333 33333 33333 33335 335336 33633 33338 338339 3393 0 3 03 3

342-342

3 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9350 35035 3535 35353 35335 35355 355356 35635 35358 358359 359360 36036 3636 36363 36336 36365 365366 36636 36368 368369 3693 0 3 03 33 33 3 3 3

374-374

3 5 3 53 6 3 63 33 8 3 83 9 3 9380 38038 3838 38383 38338 38385 385386 38638 38388 388389 389390 39039 3939 39393 39339 39395 395396 39639 39398 398399 39900 000 00 003 030 005 0506 060 008 0809 090 03 35 56 68 89 90 03 35 56 68 89 930 303 33 333 33

434-434

35 3536 363 338 38

439-439

0 0

442-442

3 35 56 6

448-448

9 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 663 636 665 6566 666 668 6869 690 0

472-472

3 3

474-474

5 56 68 89 980 80

481-481

8 883 838 8

485-485

86 868 888 8889 8990 909 99 993 939 995 9596 969 998 9899 99500 500

501-501

50 50503 50350 50505 505506 50650 50508 508

509-509

5 0 5 0

511-511

5 55 3 5 35 55 5 5 5

516-516

5 55 8 5 8

519-519

520-520

5 5

522-522

5 3 5 35 5

525-525

5 6 5 65 5

528-528

5 9 5 9530 53053 5353 53533 533

534-534

535 535536 53653 53538 538

539-539

5 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 85 9 5 9550 55055 5555 55553 55355 55555 555556 55655 55558 558559 559560 56056 5656 56563 56356 56565 565566 56656 56568 568569 5695 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 85 9 5 9

580-580

58 5858 58583 58358 58585 585586 58658 58588 588589 589590 59059 5959 59593 59359 59595 595596 59659 59598 598599 599600 60060 60

602-602

603 60360 60605 605606 60660 60608 608609 6096 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 66 8 6 8

619-619

6 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 66 8 6 86 9 6 9630 63063 6363 63633 63363 63635 635636 63663 63638 638639 6396 0 6 0

641-641

6 66 3 6 36 6

645-645

6 6 6 66 66 8 6 86 9 6 9650 65065 6565 65653 65365 65655 655656 65665 65658 658659 659660 66066 6666 66

663-663

66 66665 665666 666

667-667

668 668669 6696 0 6 06 66 66 3 6 3

674-674

6 5 6 5

676-676

6 66 8 6 86 9 6 9680 68068 6868 68683 68368 68685 685686 68668 68

688-688

689-689

690 69069 6969 69693 69369 69695 695

696-696

69 69698 698699 69900 000 00 003 03

704-704

05 0506 060 008 0809 090 03 35 56 68 8

719-719

0 0

722-722

3 3

725-725

6 6

727-727

8 89 930 303 3

732-732

33 333 335 3536 363 338 3839 390 0

741-741

3 35 5

746-746

8 89 950 505 5

752-752

753-753

5 555 5556 565 558 5859 5960 606 66 663 636 665 6566 666 6

768-768

69 690 03 35 56 6

778-778

9 980 808 88 883 838 885 8586 86

787-787

88 8889 8990 909 99 993 93

794-794

795-795

96 969 998 9899 99800 800

801-801

80 80803 80380 80805 805806 80680 80808 808

809-809

8 0 8 08 88 88 3 8 38 88 5 8 58 6 8 68 88 8 8 88 9 8 98 0 8 08 88 88 3 8 38 8

825-825

8 6 8 68 88 8 8 88 9 8 9830 83083 8383 83833 83383 83835 835836 83683 83838 838839 8398 0 8 08 88 88 3 8 3

844-844

8 5 8 58 6 8 68 8

848-848

8 9 8 9850 85085 8585 85853 85385 85855 855856 85685 85858 858859 859860 86086 8686 86863 86386 86865 865866 86686 86

868-868

869 8698 0 8 08 88 88 3 8 38 8

875-875876-876

8 8

878-878

8 9 8 9880 88088 8888 88883 88388 88885 885886 88688 88888 888889 889890 89089 89

892-892

893 89389 89895 895896 89689 89898 898899 899900 90090 9090 90903 903

904-904

905-905

906 90690 90908 908909 9099 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 9

918-918

9 9 9 99 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9930 93093 9393 93933 93393 93935 935936 93693 93938 938939 9399 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9950 950

951-951

95 95

953-953

95 95955 955956 95695 95958 958959 959960 96096 9696 96963 963

964-964

965 965966 96696 96968 968969 969

970-970

9 99 99 3 9 39 99 5 9 5

976-976

9 9

978-978

979-979

980-980

98 9898 98

983-983

98 98985 985986 98698 98988 988989 989990 99099 9999 99993 99399 99

995-995

996-996

99 99998 998999 999000 00000 0000 00003 00300 00005 005006 006

1007-1007

008 008009 0090 0 0 00 00 00 3 0 30 0

1015-1015

0 6 0 60 00 8 0 80 9 0 90 0 0 00 00 00 3 0 30 00 5 0 5

1026-1026

1027-1027

0 8 0 80 9 0 9030 03003 0303 03

1033-1033

03 03035 035036 03603 03

1038-1038

039 0390 0 0 00 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9050 05005 0505 05053 05305 05055 055056 05605 05058 058

1059-1059

060 060

1061-1061

06 06063 06306 06065 065066 06606 06

1068-1068

069 0690 0 0 0

1071-1071

1072-1072

0 3 0 30 00 5 0 5

1076-1076

0 0

1078-1078

0 9 0 9080 08008 0808 08083 08308 08085 085086 08608 08088 088089 089090 09009 0909 09093 09309 09095 095

1096-1096

09 09098 098099 099

1100-1100

0 00 003 03

1104-1104

1105-1105

06 060 008 0809 09

1110-1110

3 35 56 68 8

1119-1119

1120-1120

3 35 56 68 89 930 303 33 333 33

1134-1134

35 3536 36

1137-1137

38 3839 390 0

1143-1143

5 56 6

1147-1147

8 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 663 636 665 65

1166-1166

6 668 6869 690 0

1171-1171

3 35 56 68 89 980 808 88 883 838 885 8586 868 888 8889 89

1190-1190

9 9

1192-1192

1193-1193

9 9

1195-1195

96 969 998 9899 9900 00

1201-1201

0 003 030 005 0506 060 008 0809 090 03 35 56 6

1218-1218

9 90 03 3

1225-1225

6 68 89 930 303 33 333 333 335 3536 363 338 3839 390 03 35 56 6

1247-1247

8 89 950 505 55 5

1253-1253

5 555 5556 56

1257-1257

58 5859 5960 606 66 6

1263-1263

6 665 6566 666 668 6869 690 03 35 5

1276-1276

8 89 980 808 88 883 838 8

1285-1285

1286-1286

8 888 8889 8990 909 99 993 939 995 9596 969 998 9899 99300 300

1301-1301

30 30303 30330 30305 305306 30630 30308 308309 3093 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 93 0 3 03 33 3

1323-1323

3 33 5 3 53 6 3 63 33 8 3 83 9 3 9330 33033 3333 33333 33333 33335 335336 33633 33338 338

1339-1339

3 0 3 03 33 33 3 3 33 33 5 3 53 6 3 6

1347-1347

3 8 3 83 9 3 9350 35035 35

1352-1352

353 35335 35355 355356 35635 35358 358359 359360 36036 36

1362-1362

363 363

1364-1364

365 365366 36636 36368 368

1369-1369

3 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9

1380-1380

38 3838 38383 38338 38385 385386 386

1387-1387

388 388389 389390 39039 3939 39393 39339 39395 395396 39639 39398 398399 39900 000 0

1402-1402

03 030 0

1405-1405

06 060 008 0809 090 03 35 56 68 89 90 03 3

1425-1425

6 68 89 930 303 33 333 333 335 3536 36

1437-14371438-1438

39 390 03 35 56 68 89 950 505 5

1452-1452

53 53

1454-1454

55 5556 565 558 58

1459-1459

60 606 66 663 636 665 6566 666 668 6869 690 03 3

1474-1474

5 56 6

1478-1478

9 980 80

1481-1481

8 883 838 885 8586 868 888 8889 89

1490-1490

9 9

1492-1492

93 939 995 9596 969 9

1498-1498

99 99500 50050 50

1502-1502

503 50350 50505 505506 50650 50508 508509 5095 0 5 05 55 55 3 5 3

1514-1514

1515-1515

5 6 5 65 55 8 5 85 9 5 95 0 5 05 55 55 3 5 35 55 5 5 55 6 5 6

1527-1527

5 8 5 85 9 5 9530 53053 5353 53533 53353 53535 535536 536

1537-1537

538 538539 5395 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 8

1549-1549

550 55055 5555 55

1553-1553

1554-1554

555 555556 55655 55

1558-1558

559 559560 56056 5656 56563 56356 56

1565-1565

566 56656 56568 568569 5695 0 5 05 55 5

1573-1573

5 55 5 5 55 6 5 65 5

1578-1578

5 9 5 9580 580

1581-1581

1582-1582

583 58358 58585 585586 58658 58588 588589 589590 59059 5959 59593 59359 59595 595596 59659 59598 598

1599-1599

600 60060 6060 60

1603-1603

60 60605 605606 60660 60608 608609 6096 0 6 06 66 66 3 6 36 6

1615-1615

6 6 6 66 66 8 6 86 9 6 96 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 66 8 6 86 9 6 9630 630

1631-1631

63 63633 63363 63635 635636 636

1637-1637

638 638639 6396 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 6

1648-1648

6 9 6 9650 65065 65

1652-1652

653 65365 65

1655-1655

656 656

1657-1657

658 658659 659660 66066 6666 66663 66366 66665 665666 66666 66668 668669 6696 0 6 06 66 6

1673-1673

6 66 5 6 56 6 6 66 66 8 6 86 9 6 9

1680-1680

68 6868 68

1683-1683

1684-1684

685 685686 68668 68688 688

1689-1689

690 69069 6969 69693 69369 69695 695696 69669 69698 698699 69900 00

1701-1701

0 0

1703-1703

1704-1704

05 0506 060 0

1708-1708

09 090 03 35 56 6

1717-1717

8 8

1719-1719

1720-1720

3 3

1724-1724

5 56 68 89 9

1730-1730

3 33 333 333 335 35

1736-1736

3 338 3839 390 03 35 56 68 89 950 505 55 553 535 555 5556 565 558 5859 59

1760-1760

6 6

1762-1762

63 63

1764-1764

65 6566 666 6

1768-1768

69 69

1770-1770

3 3

1774-1774

5 56 68 89 980 808 88 883 838 885 8586 868 888 8889 8990 909 99 993 939 995 95

1796-1796

9 998 9899 99800 80080 8080 80803 80380 80805 805806 80680 80808 808809 8098 0 8 08 88 88 3 8 38 88 5 8 58 6 8 68 88 8 8 88 9 8 98 0 8 08 88 88 3 8 38 88 5 8 58 6 8 6

1827-1827

8 8 8 88 9 8 9

1830-1830

83 8383 83833 83383 83835 835836 836

1837-1837

838 838839 8398 0 8 08 88 88 3 8 38 88 5 8 58 6 8 6

1847-1847

8 8 8 88 9 8 9850 85085 8585 85853 85385 85855 855856 85685 85858 858859 859860 86086 8686 86863 863865 865866 86686 86868 868869 8698 0 8 08 88 88 3 8 38 88 5 8 5

1876-1876

8 88 8 8 88 9 8 9880 88088 8888 88883 88388 88885 885886 88688 88888 888889 889

1890-1890

89 8989 89893 89389 89895 895896 896

1897-1897

898 898899 899900 900

1901-1901

90 90903 903

1904-1904

905 905906 90690 90908 908909 9099 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9

1920-1920

1921-1921

9 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9930 93093 9393 93933 93393 93935 935936 93693 93938 938939 9399 0 9 0

1941-1941

9 99 3 9 39 99 5 9 59 6 9 69 99 8 9 8

1949-1949

950 95095 9595 95953 95395 95955 955

1956-1956

95 95958 958959 959960 96096 9696 96963 96396 96965 965966 96696 96968 968969 9699 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9980 98098 9898 98983 98398 98985 985986 986

1987-1987

988 988989 989990 990

1991-1991

99 99

1993-1993

99 99995 995996 99699 99998 998999 999000 00000 0000 00003 00300 00005 005006 00600 00008 008009 0090 0 0 00 0

2012-2012

0 3 0 3

2014-2014

0 5 0 50 6 0 60 0

2018-2018

0 9 0 90 0 0 00 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9

2030-2030

03 0303 03033 03303 03035 035036 03603 03038 038039 0390 0 0 00 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9050 05005 0505 05053 053

2054-2054

055 055056 05605 05058 058059 059060 06006 0606 06063 06306 06065 065066 06606 06068 068069 069

2070-2070

0 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9080 08008 0808 08083 08308 08085 085086 08608 08088 088089 089090 09009 09

2092-2092

2093-2093

09 09095 095

2096-2096

09 09098 098099 09900 000 00 003 030 005 0506 060 0

2108-2108

09 090 03 35 5

2116-2116

2117-2117

8 89 90 03 3

2124-2124

5 56 68 89 930 303 33 333 333 3

2135-2135

36 363 338 3839 390 03 35 56 68 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 6

2163-2163

6 665 6566 666 6

2168-2168

69 690 03 35 56 68 89 980 808 88 8

2183-2183

8 8

2185-2185

86 868 888 88

2189-2189

90 909 99 993 939 995 9596 969 9

2198-2198

99 99

2200-2200

0 00 003 030 005 0506 060 0

2208-2208

09 090 03 35 56 68 89 90 03 35 56 68 89 930 303 33 333 333 335 3536 363 338 3839 390 03 35 56 68 89 9

2250-2250

5 55 5

2253-2253

5 555 55

2256-2256

5 558 5859 5960 606 66 663 636 665 6566 666 668 6869 690 03 35 56 68 89 980 80

2281-2281

8 8

2283-2283

8 885 8586 868 888 8889 8990 909 99 993 939 995 9596 969 998 9899 99300 300

2301-2301

2302-2302

2303-2303

30 30305 305306 30630 30308 308309 3093 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 93 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9330 33033 3333 33333 33333 33335 335

2336-2336

33 33338 338339 3393 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9350 35035 3535 35353 35335 35355 355356 35635 35358 358359 359360 36036 3636 36363 36336 36365 365366 36636 36368 368369 369

2370-2370

3 3

2372-2372

3 3 3 33 3

2375-2375

3 6 3 63 3

2378-2378

3 9 3 9380 38038 3838 38383 38338 38385 385386 38638 38388 388389 389

2390-2390

39 3939 39393 39339 39395 395396 39639 39398 398399 399

2400-2400

0 00 003 03

2404-2404

2405-2405

06 060 008 08

2409-2409

0 0

2411-2411

2412-2412

3 35 56 6

2418-2418

2419-2419

0 0

2422-2422

3 35 56 6

2428-2428

9 930 303 33 333 333 335 3536 363 338 3839 39

2440-2440

3 35 56 68 89 9

2450-2450

5 55 553 53

2454-2454

55 5556 565 558 5859 59

2460-2460

6 66 663 636 6

2465-2465

66 666 668 68

2469-2469

0 03 35 56 68 89 980 808 88 883 838 885 8586 86

2487-2487

88 8889 89

2490-2490

9 99 993 939 995 9596 969 998 9899 99500 50050 5050 50503 50350 50505 505506 50650 50508 508

2509-2509

5 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 85 9 5 9

2520-2520

2521-2521

5 5

2523-2523

5 55 5 5 55 6 5 65 55 8 5 85 9 5 9530 53053 5353 53533 53353 53535 535536 53653 53538 538539 5395 0 5 05 55 55 3 5 35 55 5 5 55 6 5 6

2547-2547

5 8 5 85 9 5 9550 55055 5555 55553 55355 55555 555556 55655 55558 558559 559560 56056 5656 56563 56356 56565 565566 56656 56568 568569 5695 0 5 05 55 55 3 5 35 55 5 5 55 6 5 6

2577-2577

2578-2578

5 9 5 9580 58058 5858 58583 58358 58585 585586 58658 58

2588-2588

589 589590 59059 5959 59593 59359 59595 595596 59659 59598 598599 599

2600-2600

60 6060 60603 60360 60605 605606 60660 60608 608609 6096 0 6 06 66 66 3 6 3

2614-2614

6 5 6 56 6 6 66 66 8 6 86 9 6 9

2620-2620

6 6

2622-2622

6 3 6 36 6

2625-2625

6 6 6 66 66 8 6 8

2629-2629

630 63063 6363 63

2633-2633

63 63635 635636 63663 63638 638639 6396 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 6

2648-2648

6 9 6 9650 65065 6565 65653 65365 65655 655656 65665 65658 658

2659-2659

660 66066 6666 66

2663-2663

66 66

2665-2665

666 66666 66668 668669 6696 0 6 06 66 6

2673-2673

6 66 5 6 56 6 6 66 6

2678-2678

6 9 6 9

2680-2680

68 6868 68683 68368 68685 685686 68668 68

2688-2688

689 689690 690

2691-2691

69 69693 69369 69695 695696 69669 69698 698699 699

Figure 2: Rendered prototype output after landscape initialisation without prior community segmentation

threshold κ, where a higher κ results in more, but smaller com-munities. For each community Pi , pairwise neighbourhood simi-larities are calculated using euclidean distance between nodes,forming the triangular matrix Si , where skl is the similarity be-tween pk ,pl ∈ Pi .

Furthermore, we train document embeddings [15, 25] on allemails and use them to infer high dimensional semantic vec-tor representations. Let Mi be the set of emails that originatedin community Pi . For each email m ∈ Mi , the dimensionalityis reduced using t-SNE [16], which retains possible semanticclusterings of documents in the higher dimensional space. Theresulting two-dimensional vectors are then placed as dots on themap using the centre of embedded network communities as therespective origin, whereas the size is determined by the numberof related individuals.

We also initialise communication network’s layout. Thereby,the staring position of a node representing an individual is de-termined by the normalised sum of two-dimensional vectors ofall emails he or she has sent or received. This way, we implic-itly group semantically related individuals into communities asfrequent communication biases this normalised sum. Straightedges are added between the nodes if the respective individualsexchanged emails. Note, that many edges may only representa small number of emails. Applying a variable threshold σ canreduce the computational load in later stages, as these edges willnot impact the overall layout very much. They can be addedagain as the user requests a detailed visualisation by zooming inor through other interactions.

In the algorithm’s second stage, we iteratively try to improvethe layout of the communication network by finding a balancebetween the closeness of nodes to semantic context and denselyconnected neighbourhoods a node belongs to. Therefore, foreach individual pj ∈ P we use linear regression to fit a line m̂pjthough all two-dimensional vectors of emails he or she has sentor received. As a node is placed near this line, it remains in asemantically good position.

Adjusting the Network Layout. The first stage of our proposedalgorithm produces a fixed document landscape and roughly fitsthe communication network on top.We now aim to incrementallyadapt the layout of the graph to better reflect salient structuresin the network while keeping each individual’s node close to thereflective semantic area in the landscape.

Therefore we define a score quantifying how well the currentlayout fits these objectives:∑

pi ∈P

[η d (pi ,m̂pi ) +

∑pj ∈P

θ(si j − d (pi ,pj )

) ](1)

where d (·, ·) is the distance between two nodes (zero if no con-nection exists) or shortest distance from a node to its ideal line.To adjust the layout towards either a better semantic or structuralfit, we introduce parameters θ and η.

In order to minimise 1, we use stochastic gradient descent. Ineach iteration step, we can derive the direction and magnitudeeach node should be moved towards a better semantic fit andcloser proximity to it’s neighbourhood in the network.

The semantic gradient for pj ∈ P is defined by

δ⃗sj := (pmj − pj ) ∥pmj − pj ∥ (2)

where pmj is the closest point on m̂pj to pj and ∥·∥ denotes theeuclidean norm, while neighbourhood gradient is defined by

δ⃗nj :=∑pk ∈Ppj∼pk

(pj − pk )(∥pj − pk ∥ − sjk

)(3)

where pj ∼ pk denotes that an edge exists between pj and pk .With the definitions in 2 and 3, we can formulate the update

vector δ⃗j for node pj ∈ P as

δ⃗j := ξ(θδ⃗nj + ηδ⃗

sj

)(4)

where ξ is the learning rate and θ ,η as before parameters toweight between a better semantic or neighbourhood fit.

Most likely, complex network structures might prevent thestochastic gradient descent to find a stable minimum, so the scoreof the objective function should be monitored or intermediatelayouts be visually evaluated to determine a satisfactory result.

Post Processing. Lastly, we use the post processing stage toenhance the readability of our visualisation. Densely connectedcommunities in the graph are potentially hard to read, thus weapply edge bundling [1] to visually clear latent structures. Wealso apply MapSets [7] to separate the regions for each commu-nity. Since semantically similar emails may appear in differentcommunities, we apply colouring based on clusters in the originalglobal document embedding space to retain this aspect. Choosingthe colours depends on the number of latent topics that shouldbe depicted [24]. If the topic number exceeds 25-30 topics, group-ing topics and allowing for zooming within a two-level topic-hierarchy ensures distinguishable colors for up to 10 subtopics(25 × 10 = 25). In order to represent temporal aspects of the data,we calculate the kernel density of the document landscape forfixed time-intervals, which can be used to add heat-map overlaysthat users can select later on.

Page 4: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an

Figure 3: Semantic landscape of email contents and dominant communication patterns (drawn mockup)

4 CONCLUSION AND VISIONIn this paper, we described an algorithm to lay out a commu-nication network on top of a landscape of semantically embed-ded emails. This is still work in progress, thus 3 shows only amanually drawn mock-up of the visualisation we envision. In it,individuals are represented as nodes positioned such that denselyconnected communities are visually clustered. Edges describe theemail traffic, where the opacity and thickness is used to indicatethe frequency of messages between the nodes they connect.

The semantic representations of emails are used to place dotson a background layer which we call the document landscape.This landscape is used as additional input to the graph layoutalgorithm, aiming to place a node within corresponding semanticregions. The colouring of regions in the landscape is derived fromdensely connected communities in the communication graph. Op-tionally, representative words are selected for densely populatedareas in the landscape, so that users get a rough idea about sub-jects in that area. The aforementioned timestamps of emails canbe used to generate a heatmap overlay to show the activity in acertain time interval which is controlled by a slider. Similar tomodern geographical maps, zooming into a region reveals moredetails. In our case, less prominent individuals and their connec-tions are shown along with additional salient phrases from thedocument landscape. Selecting a node will not only highlight con-nected edges but may also temporarily show more edges whichwere previously hidden at that zoom level. The user will also beable to retrieve documents with the help of a selection rectangleor clicking dots in the document landscape.

In future work, we hope to evaluate this system using full-scale real-world data as well as practitioners from journalism andauditing. It may also be interesting to experimentwith embeddingmethods, which take both the emails and the network graph asinput and directly project the inferred representations into thetwo-dimensional landscape to simplify the proposed algorithm.

REFERENCES[1] Benjamin Bach, Nathalie Henry Riche, Christophe Hurter, Kim Marriott, and

Tim Dwyer. 2017. Towards unambiguous edge bundling: Investigating con-fluent drawings for network visualization. Transactions on Visualization andComputer Graphics 23, 1 (2017), 541–550.

[2] Nikos Bikakis and Timos Sellis. 2016. Exploration and visualization in theweb of big linked data: A survey of the state of the art. In Proceedings of theWorkshops of the EDBT/ICDT 2016 Joint Conference, Vol. 1558. CEUR-WS.org.

[3] Marie-Anne Chabin. 2017. Panama papers: a case study for records manage-ment? Brazilian Journal of Information Science: Research Trends 11, 4 (2017).

[4] Yanhua Chen, Lijun Wang, Ming Dong, and Jing Hua. 2009. Exemplar-basedvisualization of large document corpus. Transactions on Visualization andComputer Graphics 15, 6 (2009), 1161–1168.

[5] Mark Coddington. 2015. Clarifying journalismâĂŹs quantitative turn: A typol-ogy for evaluating data journalism, computational journalism, and computer-assisted reporting. Digital Journalism 3, 3 (2015), 331–348.

[6] Tommy Dang and Angus Forbes. 2017. CactusTree: A tree drawing approachfor hierarchical edge bundling. In Proc. of the Pacific Visualization Symposium.IEEE, 210–214.

[7] Alon Efrat, Yifan Hu, Stephen G Kobourov, and Sergey Pupyrev. 2015. MapSets:Visualizing Embedded and Clustered Graphs. Journal of Graph Algorithmsand Applications 19, 2 (2015), 571–593.

[8] Blaz Fortuna, Marko Grobelnik, and Dunja Mladenic. 2005. Visualization oftext document corpus. Informatica 29, 4 (2005), 497–502.

[9] Katrin Franke and Sargur N Srihari. 2007. Computational forensics: Towardshybrid-intelligent crime investigation. In International Symposium on Infor-mation Assurance and Security. IEEE, 383–386.

[10] Daniel Fried and Stephen G Kobourov. 2014. Maps of computer science. InProc. of the Pacific Visualization Symposium. IEEE, 113–120.

[11] Martin Gronemann and Michael Jünger. 2012. Drawing clustered graphs astopographic maps. In Proc. of the Symposium on Graph Drawing and NetworkVisualization. Springer, 426–438.

[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learningfor networks. In Proc. of the Conference on Knowledge Discovery and DataMining. ACM, 855–864.

[13] Jan Hildenbrand, Arlind Nocaj, and Ulrik Brandes. 2016. Flexible Level-of-Detail Rendering for Large Graphs. In Proc. of the Symposium on Graph Draw-ing and Network Visualization. Springer, 625–627.

[14] Mukundan Karthik, Mariappan Marikkannan, and Arputharaj Kannan. 2008.An intelligent system for semantic information retrieval information fromtextual web documents. In International Workshop on Computational Forensics.Springer, 135–146.

[15] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentencesand documents. In Proc. of the International Conference on Machine Learning.PMLR, 1188–1196.

[16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data usingt-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605.

[17] Patrick Cheong-Iao Pang, Robert P Biuk-Aghai, Muye Yang, and Bin Pang.2017. Creating realistic map-like visualisations: Results from user studies.Journal of Visual Languages and Computing 43 (2017), 60–70.

[18] Arnaud Sallaberry, Yang-chih Fu, Hwai-Chung Ho, and Kwan-Liu Ma. 2016.Contact trees: Network visualization beyond nodes and edges. PLOS ONE 11,1 (2016), 1–23.

[19] Shilad Sen, Anja Beth Swoap, Qisheng Li, Brooke Boatman, Ilse Dippenaar,Rebecca Gold, Monica Ngo, Sarah Pujol, Bret Jackson, and Brent Hecht. 2017.Cartograph: Unlocking Spatial Visualization Through Semantic Enhancement.In Proc. of Conference on Intelligence User Interfaces. ACM, 179–190.

[20] Kaveri Subrahmanyam, Stephanie M Reich, Natalia Waechter, and GuadalupeEspinoza. 2008. Online and offline social networks: Use of social networkingsites by emerging adults. Journal of Applied Developmental Psychology 29, 6(2008), 420–433.

[21] Jun Tao, Jian Xu, Chaoli Wang, and Nitesh V Chawla. 2017. HoNVis: Visualiz-ing and Exploring Higher-Order Networks. In Proc. of the Pacific VisualizationSymposium. IEEE, 1–10.

[22] Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X Zhou, WeihongQian, Lei Shi, Li Tan, and Qiang Zhang. 2010. TIARA: a visual exploratorytext analytic system. In Proc. of the Conference on Knowledge Discovery andData Mining. ACM, 153–162.

[23] Jaewon Yang and Jure Leskovec. 2014. Overlapping communities explaincore–periphery publisher of networks. Proc. IEEE 102, 12 (2014), 1892–1902.

[24] Achim Zeileis, Kurt Hornik, and Paul Murrell. 2009. Escaping RGBland: se-lecting colors for statistical graphics. Computational Statistics & Data Analysis53, 9 (2009), 3259–3270.

[25] Zhaocheng Zhu and Junfeng Hu. 2017. Context Aware Document Embedding.CoRR abs/1707.01521 (2017).