Topic-aware Network Visualisation toExplore Large Email...
Transcript of Topic-aware Network Visualisation toExplore Large Email...
![Page 1: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an](https://reader033.fdocuments.in/reader033/viewer/2022060317/5f0c51007e708231d434cbb8/html5/thumbnails/1.jpg)
Topic-aware Network Visualisation toExplore Large Email Corpora
Tim RepkeHasso Plattner InstitutePotsdam, [email protected]
Ralf KrestelHasso Plattner InstitutePotsdam, [email protected]
ABSTRACTNowadays, more and more large datasets exhibit an intrinsicgraph structure. While there exist special graph databases to han-dle ever increasing amounts of nodes and edges, visualising thisdata becomes infeasible quickly with growing data. In addition,looking at its structure is not sufficient to get an overview of agraph dataset. Indeed, visualising additional information aboutnodes or edges without cluttering the screen is essential. In thispaper, we propose an interactive visualisation for social networksthat positions individuals (nodes) on a two-dimensional canvassuch that communities defined by social links (edges) are easilyrecognisable. Furthermore, we visualise topical relatedness be-tween individuals by analysing information about social links, inour case email communication. To this end, we utilise documentembeddings, which project the content of an email message intoa high dimensional semantic space and graph embeddings, whichproject nodes in a network graph into a latent space reflectingtheir relatedness.
1 INTRODUCTIONIn ourmodern information societywe produce substantial amountsof data each day. A large portion of it comes from the communi-cation on social media platforms or through emails. Special graphdatabases enable the efficient storage of these large communica-tion networks and provide interfaces to query or analyse the data.Visualising networks in their entirety on the other hand is a verychallenging task. Users investigating a communication networkwant to find information about when does who communicatewith whom about what. These kind of networks can be found inmany different shapes. Modern social networks, such as Twitteror Facebook exhibit similar structures as classic, offline socialnetworks [20]. We investigate another type of social network: acollection of emails.
Given the communication data over a year or more, it is prac-tically impossible to gain an overview or quick insights intothe latent network structure with a basic approach as shown in1. Also, in such a traditional network visualisation, informationabout the content of messages sent between individuals is lost.Besides these traditional systems, more exotic approaches usethe metaphor of geographical maps [17] to visualise networks,for example using topology to reflect connectivity of denselyconnected social communities. The map analogy can also be usedto visualise the contents of documents by embedding them intoa high dimensional semantic space [15] and projecting it on themap as a document landscape. In order to highlight how relation-ships form and change based on the interactions, the metaphorof a growing tree ca be used (ContactTrees [18]). Although thisreflects temporal aspects of dynamic networks well, it focuses on
© 2018 Copyright held by the owner/author(s). Published in the WorkshopProceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permittedunder the terms of the Creative Commons license CC-by-nc-nd 4.0.
one person as the root, thus an overview of the entire networkis lost. CactusTrees [6] on the other hand represent hierarchicalstructures with the goal of untangling overlaid bundles of inter-secting edges, making distant connections more apparent. Ashigher order dependencies may get lost in traditional visualisa-tions, HoNVis [21] adds nodes to encode dependencies in chainsof interactions. Usually, a communication network has manynodes and overlapping connections already, so Yang et al. [23]rather focus on discovering overlapping cores to improve theidentification of community boundaries to highlight global latentstructures. Similarly, Gronemann et al. [11] use the metaphor ofislands and hills to visualise clustered graphs, making denselyconnected communities clearly noticeable. The edges are bundledand follow valleys of the resulting topology, thus making rela-tionships between other communities hard to follow. MapSets [7]assume a graph that was laid out using embeddings reflectingcommunities. An algorithm then draws regions around clustersof nodes, such that the bounding shapes are contiguous andnon-overlapping, but yet abstract. Another approach to visualisenetworks at full scale is to aggregate nodes based on their spa-tial distribution and thereby allowing for a simple explorationwith contour lines and heatmap overlays to emphasise latentstructures as proposed by Hildenbrand et al. [13].
Document visualisation aims to visualise the content, such thatusers gain quick insights into topics, latent phrases, or trends.Tiara [22] extracts topics and derives time-sensitive keywordsto depict evolving subjects over time as stacked plots. Otherapproaches project documents into a latent space, using topicmodels or embeddings. Creating scatter-plots of embedded docu-ments of a large corpus may result in a very dense and unclearlayout, so Chen et al. [4] developed an algorithm to reduce over-full visualisations by picking representative documents. A dif-ferent approach is taken by Fortuna et al. [8], who do not showdocuments directly, but generate a heatmap of the populatedcanvas and overlay it with salient phrases at more densely popu-lated areas from the underlying documents in that region. Friedlet al. [10] extend that concept by drawing clear lines betweenregions and colouring them. They also add edges between salientphrases based on co-occurrences in the texts. Most recently Car-tograph [19] was proposed, which is visually very similar toprevious approaches, but uses pre-rendered information of dif-ferent resolution and map technology to enable a responsiveinteractive visualisation. Regions are coloured based on underly-ing ontologies from a knowledge-base.
Our goal is to merge approaches for network and documentvisualisations in one interactive user interface. This means tointegrate multiple dimensions of email datasets including time,interactions, users and topics into a 2D map representation. Giv-ing an overview over latent structures and topics in one mapmay significantly improve the exploration of a corpus by users
1
![Page 2: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an](https://reader033.fdocuments.in/reader033/viewer/2022060317/5f0c51007e708231d434cbb8/html5/thumbnails/2.jpg)
Figure 1: Traditional basic visualisation of a communication graph from 2000 emails using force layout
unfamiliar with the domain and terminology. Also domain ex-perts could benefit from such an overview, e.g. by easily beingable to identify global patterns in the data.
A specific application scenario that could benefit from suchintegrated, interactive visualisations is the analysis of large, un-structured, heterogeneous data collections. Data-driven journal-ism [5] often has to deal with leaked, unstructured, very hetero-geneous data, e.g. in the context of the Panama Papers, wherejournalists needed to untangle and order huge amounts of infor-mation, search entities, and visualise found patterns [3]. Similardatasets are of interest in the context of computational foren-sics [9]. Auditing firms and law enforcement need to sift throughhuge amounts of data to gather evidence of criminal activity,often involving communication networks and documents [14].
2 INTERACTIVE VISUALISATIONSystems for document exploration largely vary in what theydisplay and how users interact with them. This depends partlyon the available raw data, but also on information extracted frompre-processing or enrichment with external sources. 1 showsa basic visualisation of the network graph extracted from anemail corpus. Although it is an improvement over only listingconnections, large densely connected graphs quickly becomehard to read and information about the email contents is lost.
Exploring document collections can be seen as a top-downapproach, where the system provides abstract overviews of theentire document collection and users incrementally refine thesearch, narrowing the results to just a few documents of interest.Such a top-down approach may help users without prior knowl-edge to get a sense for the data by visualising high level latentstructures of communication networks or the topical distribu-tions.
In the scope of this work we primarily consider documents tobe emails or data attached to them. The sender, recipients, time,and content can directly be extracted from the raw data. We callthese – and results from further processing – dimensions that canbe visualised. From the contents one may infer named entities,topics, embeddings, or salient phrases, while the communicationnetwork spanned by sender-recipient pairs can be used to detectsalient structures and hierarchies. The temporal informationenables the previously mentioned data to be analysed over timeto detect evolving or changing patterns.
There are numerous ways to visualise each dimension onits own or in combination with others. The requirement of adimension and its priority in a visualisation is dictated by thesystem objective. From the wide range of possibilities, we strivefor a system which supports the exploration of a large collection
of documents without any prior knowledge about its content andindividuals involved.
In our system, we use the names and email addresses of sendersand recipients (individuals), communication network, semanticvector representations of email contents, and as part of an overlaythe timestamps of emails and propose a graph layout over adocument landscape that visually describes who talks with whomabout what at a given time period.
3 SYSTEM ARCHITECTUREVisualising communication networks in a topic-aware fashion toexplore documents and salient structures is not straightforwardDifferent layout objectives may produce contradicting results andthe challenges of processing big data need to be addressed [2]. Inthis section, we describe algorithmic approaches behind the sys-tem we are working on. For a discussion of engineering aspectson how to store, serve, and render the map-like data, we refer tothe Cartograph stack [19], as we will focus on the process howto get the information that the map is generated from.
We visualise the embedded emails as dots in a two-dimensionallandscape in which individuals are placed as nodes connected byedges. All emails between two individuals are reduced into oneedge reducing visual complexity and making it easier to detectsalient structures. However, that comes with the trade-off thatnodes and edges cannot be perfectly placed in the landscape tocover all semantic aspects of the communication between them,but rather an estimate. Our very early prototype placed someindividuals with no dominant topic in a crowded area in thecentre of the landscape as shown in 2, where colours of opaquedots for emails correspond to that of the sender. Although thenetwork visualisation at this point does not make connectionsmore clear than in 1, users can already distinguish individualswith similar or unrelated topics.
Our proposed algorithm to find a stable network layout hasthree stages, namely an (i) initialisation phase which createsthe landscape and roughly places nodes and connections, an(ii) update phase which iteratively updates the node placementtowards a better fit, and finally a (iii) post processing phase whereedges become splines to make latent structures more clear and amap topology is added.
Initialising the Landscape. To generate the document land-scape, we first process the network graph to roughly determineregions, where documents will be placed. Therefore we applynode2vec [12] to the communication network and embed each in-dividual’s node. We separate the graph into communities Pi ⊂ Pusing the kernel density of the resulting populated space at
![Page 3: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an](https://reader033.fdocuments.in/reader033/viewer/2022060317/5f0c51007e708231d434cbb8/html5/thumbnails/3.jpg)
0 03 35 56 68 89 90 03 3
14-14
5 56 68 89 9
20-20
3 35 56 68 89 930 303 33 3
33-33
34-34
35 3536 363 338 3839 390 0
41-41
3 3
45-45
6 68 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 6
63-63
64-64
65 6566 666 668 6869 690 03 35 56 68 89 980 808 88 883 838 8
85-85
86 868 888 8889 8990 909 99 993 93
94-94
95-95
96 969 9
98-98
99 9900 000 00 003 030 005 0506 060 0
108-108
09 090 0
111-111
3 35 56 6
118-118
9 90 03 3
124-124
5 56 6
127-127
8 89 930 30
131-131
3 333 333 335 3536 363 338 3839 390 0
143-143
145-145
6 68 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 6
162-162
63 636 665 6566 666 668 6869 690 03 3
175-175
176-176
8 89 980 80
181-181
8 883 838 885 8586 868 888 8889 8990 90
191-191
9 993 939 995 9596 969 998 9899 9900 000 00 003 030 005 0506 060 008 08
209-209
210-210
213-213
5 56 6
218-218
9 90 0
221-221
3 35 56 68 89 930 30
231-231
3 333 333 335 3536 363 338 3839 390 03 35 56 6
247-247
8 89 950 505 55 5
253-253
5 555 5556 56
257-257
58 5859 5960 606 66 663 63
264-264
65 65
266-266
267-267
68 6869 690 03 35 56 68 89 980 808 88 883 838 885 8586 868 888 8889 8990 909 99 993 93
294-294
95 9596 969 998 9899 99300 30030 3030 30303 30330 30
305-305
306 306
307-307
308 308309 3093 0 3 03 3
312-312
3 3 3 33 33 5 3 53 6 3 6
317-317
3 8 3 83 9 3 93 0 3 03 3
322-322
3 3 3 33 33 5 3 53 6 3 6
327-327
3 8 3 83 9 3 9330 33033 3333 33333 33333 33335 335336 33633 33338 338339 3393 0 3 03 3
342-342
3 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9350 35035 3535 35353 35335 35355 355356 35635 35358 358359 359360 36036 3636 36363 36336 36365 365366 36636 36368 368369 3693 0 3 03 33 33 3 3 3
374-374
3 5 3 53 6 3 63 33 8 3 83 9 3 9380 38038 3838 38383 38338 38385 385386 38638 38388 388389 389390 39039 3939 39393 39339 39395 395396 39639 39398 398399 39900 000 00 003 030 005 0506 060 008 0809 090 03 35 56 68 89 90 03 35 56 68 89 930 303 33 333 33
434-434
35 3536 363 338 38
439-439
0 0
442-442
3 35 56 6
448-448
9 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 663 636 665 6566 666 668 6869 690 0
472-472
3 3
474-474
5 56 68 89 980 80
481-481
8 883 838 8
485-485
86 868 888 8889 8990 909 99 993 939 995 9596 969 998 9899 99500 500
501-501
50 50503 50350 50505 505506 50650 50508 508
509-509
5 0 5 0
511-511
5 55 3 5 35 55 5 5 5
516-516
5 55 8 5 8
519-519
520-520
5 5
522-522
5 3 5 35 5
525-525
5 6 5 65 5
528-528
5 9 5 9530 53053 5353 53533 533
534-534
535 535536 53653 53538 538
539-539
5 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 85 9 5 9550 55055 5555 55553 55355 55555 555556 55655 55558 558559 559560 56056 5656 56563 56356 56565 565566 56656 56568 568569 5695 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 85 9 5 9
580-580
58 5858 58583 58358 58585 585586 58658 58588 588589 589590 59059 5959 59593 59359 59595 595596 59659 59598 598599 599600 60060 60
602-602
603 60360 60605 605606 60660 60608 608609 6096 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 66 8 6 8
619-619
6 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 66 8 6 86 9 6 9630 63063 6363 63633 63363 63635 635636 63663 63638 638639 6396 0 6 0
641-641
6 66 3 6 36 6
645-645
6 6 6 66 66 8 6 86 9 6 9650 65065 6565 65653 65365 65655 655656 65665 65658 658659 659660 66066 6666 66
663-663
66 66665 665666 666
667-667
668 668669 6696 0 6 06 66 66 3 6 3
674-674
6 5 6 5
676-676
6 66 8 6 86 9 6 9680 68068 6868 68683 68368 68685 685686 68668 68
688-688
689-689
690 69069 6969 69693 69369 69695 695
696-696
69 69698 698699 69900 000 00 003 03
704-704
05 0506 060 008 0809 090 03 35 56 68 8
719-719
0 0
722-722
3 3
725-725
6 6
727-727
8 89 930 303 3
732-732
33 333 335 3536 363 338 3839 390 0
741-741
3 35 5
746-746
8 89 950 505 5
752-752
753-753
5 555 5556 565 558 5859 5960 606 66 663 636 665 6566 666 6
768-768
69 690 03 35 56 6
778-778
9 980 808 88 883 838 885 8586 86
787-787
88 8889 8990 909 99 993 93
794-794
795-795
96 969 998 9899 99800 800
801-801
80 80803 80380 80805 805806 80680 80808 808
809-809
8 0 8 08 88 88 3 8 38 88 5 8 58 6 8 68 88 8 8 88 9 8 98 0 8 08 88 88 3 8 38 8
825-825
8 6 8 68 88 8 8 88 9 8 9830 83083 8383 83833 83383 83835 835836 83683 83838 838839 8398 0 8 08 88 88 3 8 3
844-844
8 5 8 58 6 8 68 8
848-848
8 9 8 9850 85085 8585 85853 85385 85855 855856 85685 85858 858859 859860 86086 8686 86863 86386 86865 865866 86686 86
868-868
869 8698 0 8 08 88 88 3 8 38 8
875-875876-876
8 8
878-878
8 9 8 9880 88088 8888 88883 88388 88885 885886 88688 88888 888889 889890 89089 89
892-892
893 89389 89895 895896 89689 89898 898899 899900 90090 9090 90903 903
904-904
905-905
906 90690 90908 908909 9099 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 9
918-918
9 9 9 99 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9930 93093 9393 93933 93393 93935 935936 93693 93938 938939 9399 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9950 950
951-951
95 95
953-953
95 95955 955956 95695 95958 958959 959960 96096 9696 96963 963
964-964
965 965966 96696 96968 968969 969
970-970
9 99 99 3 9 39 99 5 9 5
976-976
9 9
978-978
979-979
980-980
98 9898 98
983-983
98 98985 985986 98698 98988 988989 989990 99099 9999 99993 99399 99
995-995
996-996
99 99998 998999 999000 00000 0000 00003 00300 00005 005006 006
1007-1007
008 008009 0090 0 0 00 00 00 3 0 30 0
1015-1015
0 6 0 60 00 8 0 80 9 0 90 0 0 00 00 00 3 0 30 00 5 0 5
1026-1026
1027-1027
0 8 0 80 9 0 9030 03003 0303 03
1033-1033
03 03035 035036 03603 03
1038-1038
039 0390 0 0 00 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9050 05005 0505 05053 05305 05055 055056 05605 05058 058
1059-1059
060 060
1061-1061
06 06063 06306 06065 065066 06606 06
1068-1068
069 0690 0 0 0
1071-1071
1072-1072
0 3 0 30 00 5 0 5
1076-1076
0 0
1078-1078
0 9 0 9080 08008 0808 08083 08308 08085 085086 08608 08088 088089 089090 09009 0909 09093 09309 09095 095
1096-1096
09 09098 098099 099
1100-1100
0 00 003 03
1104-1104
1105-1105
06 060 008 0809 09
1110-1110
3 35 56 68 8
1119-1119
1120-1120
3 35 56 68 89 930 303 33 333 33
1134-1134
35 3536 36
1137-1137
38 3839 390 0
1143-1143
5 56 6
1147-1147
8 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 663 636 665 65
1166-1166
6 668 6869 690 0
1171-1171
3 35 56 68 89 980 808 88 883 838 885 8586 868 888 8889 89
1190-1190
9 9
1192-1192
1193-1193
9 9
1195-1195
96 969 998 9899 9900 00
1201-1201
0 003 030 005 0506 060 008 0809 090 03 35 56 6
1218-1218
9 90 03 3
1225-1225
6 68 89 930 303 33 333 333 335 3536 363 338 3839 390 03 35 56 6
1247-1247
8 89 950 505 55 5
1253-1253
5 555 5556 56
1257-1257
58 5859 5960 606 66 6
1263-1263
6 665 6566 666 668 6869 690 03 35 5
1276-1276
8 89 980 808 88 883 838 8
1285-1285
1286-1286
8 888 8889 8990 909 99 993 939 995 9596 969 998 9899 99300 300
1301-1301
30 30303 30330 30305 305306 30630 30308 308309 3093 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 93 0 3 03 33 3
1323-1323
3 33 5 3 53 6 3 63 33 8 3 83 9 3 9330 33033 3333 33333 33333 33335 335336 33633 33338 338
1339-1339
3 0 3 03 33 33 3 3 33 33 5 3 53 6 3 6
1347-1347
3 8 3 83 9 3 9350 35035 35
1352-1352
353 35335 35355 355356 35635 35358 358359 359360 36036 36
1362-1362
363 363
1364-1364
365 365366 36636 36368 368
1369-1369
3 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9
1380-1380
38 3838 38383 38338 38385 385386 386
1387-1387
388 388389 389390 39039 3939 39393 39339 39395 395396 39639 39398 398399 39900 000 0
1402-1402
03 030 0
1405-1405
06 060 008 0809 090 03 35 56 68 89 90 03 3
1425-1425
6 68 89 930 303 33 333 333 335 3536 36
1437-14371438-1438
39 390 03 35 56 68 89 950 505 5
1452-1452
53 53
1454-1454
55 5556 565 558 58
1459-1459
60 606 66 663 636 665 6566 666 668 6869 690 03 3
1474-1474
5 56 6
1478-1478
9 980 80
1481-1481
8 883 838 885 8586 868 888 8889 89
1490-1490
9 9
1492-1492
93 939 995 9596 969 9
1498-1498
99 99500 50050 50
1502-1502
503 50350 50505 505506 50650 50508 508509 5095 0 5 05 55 55 3 5 3
1514-1514
1515-1515
5 6 5 65 55 8 5 85 9 5 95 0 5 05 55 55 3 5 35 55 5 5 55 6 5 6
1527-1527
5 8 5 85 9 5 9530 53053 5353 53533 53353 53535 535536 536
1537-1537
538 538539 5395 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 8
1549-1549
550 55055 5555 55
1553-1553
1554-1554
555 555556 55655 55
1558-1558
559 559560 56056 5656 56563 56356 56
1565-1565
566 56656 56568 568569 5695 0 5 05 55 5
1573-1573
5 55 5 5 55 6 5 65 5
1578-1578
5 9 5 9580 580
1581-1581
1582-1582
583 58358 58585 585586 58658 58588 588589 589590 59059 5959 59593 59359 59595 595596 59659 59598 598
1599-1599
600 60060 6060 60
1603-1603
60 60605 605606 60660 60608 608609 6096 0 6 06 66 66 3 6 36 6
1615-1615
6 6 6 66 66 8 6 86 9 6 96 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 66 8 6 86 9 6 9630 630
1631-1631
63 63633 63363 63635 635636 636
1637-1637
638 638639 6396 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 6
1648-1648
6 9 6 9650 65065 65
1652-1652
653 65365 65
1655-1655
656 656
1657-1657
658 658659 659660 66066 6666 66663 66366 66665 665666 66666 66668 668669 6696 0 6 06 66 6
1673-1673
6 66 5 6 56 6 6 66 66 8 6 86 9 6 9
1680-1680
68 6868 68
1683-1683
1684-1684
685 685686 68668 68688 688
1689-1689
690 69069 6969 69693 69369 69695 695696 69669 69698 698699 69900 00
1701-1701
0 0
1703-1703
1704-1704
05 0506 060 0
1708-1708
09 090 03 35 56 6
1717-1717
8 8
1719-1719
1720-1720
3 3
1724-1724
5 56 68 89 9
1730-1730
3 33 333 333 335 35
1736-1736
3 338 3839 390 03 35 56 68 89 950 505 55 553 535 555 5556 565 558 5859 59
1760-1760
6 6
1762-1762
63 63
1764-1764
65 6566 666 6
1768-1768
69 69
1770-1770
3 3
1774-1774
5 56 68 89 980 808 88 883 838 885 8586 868 888 8889 8990 909 99 993 939 995 95
1796-1796
9 998 9899 99800 80080 8080 80803 80380 80805 805806 80680 80808 808809 8098 0 8 08 88 88 3 8 38 88 5 8 58 6 8 68 88 8 8 88 9 8 98 0 8 08 88 88 3 8 38 88 5 8 58 6 8 6
1827-1827
8 8 8 88 9 8 9
1830-1830
83 8383 83833 83383 83835 835836 836
1837-1837
838 838839 8398 0 8 08 88 88 3 8 38 88 5 8 58 6 8 6
1847-1847
8 8 8 88 9 8 9850 85085 8585 85853 85385 85855 855856 85685 85858 858859 859860 86086 8686 86863 863865 865866 86686 86868 868869 8698 0 8 08 88 88 3 8 38 88 5 8 5
1876-1876
8 88 8 8 88 9 8 9880 88088 8888 88883 88388 88885 885886 88688 88888 888889 889
1890-1890
89 8989 89893 89389 89895 895896 896
1897-1897
898 898899 899900 900
1901-1901
90 90903 903
1904-1904
905 905906 90690 90908 908909 9099 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9
1920-1920
1921-1921
9 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9930 93093 9393 93933 93393 93935 935936 93693 93938 938939 9399 0 9 0
1941-1941
9 99 3 9 39 99 5 9 59 6 9 69 99 8 9 8
1949-1949
950 95095 9595 95953 95395 95955 955
1956-1956
95 95958 958959 959960 96096 9696 96963 96396 96965 965966 96696 96968 968969 9699 0 9 09 99 99 3 9 39 99 5 9 59 6 9 69 99 8 9 89 9 9 9980 98098 9898 98983 98398 98985 985986 986
1987-1987
988 988989 989990 990
1991-1991
99 99
1993-1993
99 99995 995996 99699 99998 998999 999000 00000 0000 00003 00300 00005 005006 00600 00008 008009 0090 0 0 00 0
2012-2012
0 3 0 3
2014-2014
0 5 0 50 6 0 60 0
2018-2018
0 9 0 90 0 0 00 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9
2030-2030
03 0303 03033 03303 03035 035036 03603 03038 038039 0390 0 0 00 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9050 05005 0505 05053 053
2054-2054
055 055056 05605 05058 058059 059060 06006 0606 06063 06306 06065 065066 06606 06068 068069 069
2070-2070
0 00 00 3 0 30 00 5 0 50 6 0 60 00 8 0 80 9 0 9080 08008 0808 08083 08308 08085 085086 08608 08088 088089 089090 09009 09
2092-2092
2093-2093
09 09095 095
2096-2096
09 09098 098099 09900 000 00 003 030 005 0506 060 0
2108-2108
09 090 03 35 5
2116-2116
2117-2117
8 89 90 03 3
2124-2124
5 56 68 89 930 303 33 333 333 3
2135-2135
36 363 338 3839 390 03 35 56 68 89 950 505 55 553 535 555 5556 565 558 5859 5960 606 66 6
2163-2163
6 665 6566 666 6
2168-2168
69 690 03 35 56 68 89 980 808 88 8
2183-2183
8 8
2185-2185
86 868 888 88
2189-2189
90 909 99 993 939 995 9596 969 9
2198-2198
99 99
2200-2200
0 00 003 030 005 0506 060 0
2208-2208
09 090 03 35 56 68 89 90 03 35 56 68 89 930 303 33 333 333 335 3536 363 338 3839 390 03 35 56 68 89 9
2250-2250
5 55 5
2253-2253
5 555 55
2256-2256
5 558 5859 5960 606 66 663 636 665 6566 666 668 6869 690 03 35 56 68 89 980 80
2281-2281
8 8
2283-2283
8 885 8586 868 888 8889 8990 909 99 993 939 995 9596 969 998 9899 99300 300
2301-2301
2302-2302
2303-2303
30 30305 305306 30630 30308 308309 3093 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 93 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9330 33033 3333 33333 33333 33335 335
2336-2336
33 33338 338339 3393 0 3 03 33 33 3 3 33 33 5 3 53 6 3 63 33 8 3 83 9 3 9350 35035 3535 35353 35335 35355 355356 35635 35358 358359 359360 36036 3636 36363 36336 36365 365366 36636 36368 368369 369
2370-2370
3 3
2372-2372
3 3 3 33 3
2375-2375
3 6 3 63 3
2378-2378
3 9 3 9380 38038 3838 38383 38338 38385 385386 38638 38388 388389 389
2390-2390
39 3939 39393 39339 39395 395396 39639 39398 398399 399
2400-2400
0 00 003 03
2404-2404
2405-2405
06 060 008 08
2409-2409
0 0
2411-2411
2412-2412
3 35 56 6
2418-2418
2419-2419
0 0
2422-2422
3 35 56 6
2428-2428
9 930 303 33 333 333 335 3536 363 338 3839 39
2440-2440
3 35 56 68 89 9
2450-2450
5 55 553 53
2454-2454
55 5556 565 558 5859 59
2460-2460
6 66 663 636 6
2465-2465
66 666 668 68
2469-2469
0 03 35 56 68 89 980 808 88 883 838 885 8586 86
2487-2487
88 8889 89
2490-2490
9 99 993 939 995 9596 969 998 9899 99500 50050 5050 50503 50350 50505 505506 50650 50508 508
2509-2509
5 0 5 05 55 55 3 5 35 55 5 5 55 6 5 65 55 8 5 85 9 5 9
2520-2520
2521-2521
5 5
2523-2523
5 55 5 5 55 6 5 65 55 8 5 85 9 5 9530 53053 5353 53533 53353 53535 535536 53653 53538 538539 5395 0 5 05 55 55 3 5 35 55 5 5 55 6 5 6
2547-2547
5 8 5 85 9 5 9550 55055 5555 55553 55355 55555 555556 55655 55558 558559 559560 56056 5656 56563 56356 56565 565566 56656 56568 568569 5695 0 5 05 55 55 3 5 35 55 5 5 55 6 5 6
2577-2577
2578-2578
5 9 5 9580 58058 5858 58583 58358 58585 585586 58658 58
2588-2588
589 589590 59059 5959 59593 59359 59595 595596 59659 59598 598599 599
2600-2600
60 6060 60603 60360 60605 605606 60660 60608 608609 6096 0 6 06 66 66 3 6 3
2614-2614
6 5 6 56 6 6 66 66 8 6 86 9 6 9
2620-2620
6 6
2622-2622
6 3 6 36 6
2625-2625
6 6 6 66 66 8 6 8
2629-2629
630 63063 6363 63
2633-2633
63 63635 635636 63663 63638 638639 6396 0 6 06 66 66 3 6 36 66 5 6 56 6 6 66 6
2648-2648
6 9 6 9650 65065 6565 65653 65365 65655 655656 65665 65658 658
2659-2659
660 66066 6666 66
2663-2663
66 66
2665-2665
666 66666 66668 668669 6696 0 6 06 66 6
2673-2673
6 66 5 6 56 6 6 66 6
2678-2678
6 9 6 9
2680-2680
68 6868 68683 68368 68685 685686 68668 68
2688-2688
689 689690 690
2691-2691
69 69693 69369 69695 695696 69669 69698 698699 699
Figure 2: Rendered prototype output after landscape initialisation without prior community segmentation
threshold κ, where a higher κ results in more, but smaller com-munities. For each community Pi , pairwise neighbourhood simi-larities are calculated using euclidean distance between nodes,forming the triangular matrix Si , where skl is the similarity be-tween pk ,pl ∈ Pi .
Furthermore, we train document embeddings [15, 25] on allemails and use them to infer high dimensional semantic vec-tor representations. Let Mi be the set of emails that originatedin community Pi . For each email m ∈ Mi , the dimensionalityis reduced using t-SNE [16], which retains possible semanticclusterings of documents in the higher dimensional space. Theresulting two-dimensional vectors are then placed as dots on themap using the centre of embedded network communities as therespective origin, whereas the size is determined by the numberof related individuals.
We also initialise communication network’s layout. Thereby,the staring position of a node representing an individual is de-termined by the normalised sum of two-dimensional vectors ofall emails he or she has sent or received. This way, we implic-itly group semantically related individuals into communities asfrequent communication biases this normalised sum. Straightedges are added between the nodes if the respective individualsexchanged emails. Note, that many edges may only representa small number of emails. Applying a variable threshold σ canreduce the computational load in later stages, as these edges willnot impact the overall layout very much. They can be addedagain as the user requests a detailed visualisation by zooming inor through other interactions.
In the algorithm’s second stage, we iteratively try to improvethe layout of the communication network by finding a balancebetween the closeness of nodes to semantic context and denselyconnected neighbourhoods a node belongs to. Therefore, foreach individual pj ∈ P we use linear regression to fit a line m̂pjthough all two-dimensional vectors of emails he or she has sentor received. As a node is placed near this line, it remains in asemantically good position.
Adjusting the Network Layout. The first stage of our proposedalgorithm produces a fixed document landscape and roughly fitsthe communication network on top.We now aim to incrementallyadapt the layout of the graph to better reflect salient structuresin the network while keeping each individual’s node close to thereflective semantic area in the landscape.
Therefore we define a score quantifying how well the currentlayout fits these objectives:∑
pi ∈P
[η d (pi ,m̂pi ) +
∑pj ∈P
θ(si j − d (pi ,pj )
) ](1)
where d (·, ·) is the distance between two nodes (zero if no con-nection exists) or shortest distance from a node to its ideal line.To adjust the layout towards either a better semantic or structuralfit, we introduce parameters θ and η.
In order to minimise 1, we use stochastic gradient descent. Ineach iteration step, we can derive the direction and magnitudeeach node should be moved towards a better semantic fit andcloser proximity to it’s neighbourhood in the network.
The semantic gradient for pj ∈ P is defined by
δ⃗sj := (pmj − pj ) ∥pmj − pj ∥ (2)
where pmj is the closest point on m̂pj to pj and ∥·∥ denotes theeuclidean norm, while neighbourhood gradient is defined by
δ⃗nj :=∑pk ∈Ppj∼pk
(pj − pk )(∥pj − pk ∥ − sjk
)(3)
where pj ∼ pk denotes that an edge exists between pj and pk .With the definitions in 2 and 3, we can formulate the update
vector δ⃗j for node pj ∈ P as
δ⃗j := ξ(θδ⃗nj + ηδ⃗
sj
)(4)
where ξ is the learning rate and θ ,η as before parameters toweight between a better semantic or neighbourhood fit.
Most likely, complex network structures might prevent thestochastic gradient descent to find a stable minimum, so the scoreof the objective function should be monitored or intermediatelayouts be visually evaluated to determine a satisfactory result.
Post Processing. Lastly, we use the post processing stage toenhance the readability of our visualisation. Densely connectedcommunities in the graph are potentially hard to read, thus weapply edge bundling [1] to visually clear latent structures. Wealso apply MapSets [7] to separate the regions for each commu-nity. Since semantically similar emails may appear in differentcommunities, we apply colouring based on clusters in the originalglobal document embedding space to retain this aspect. Choosingthe colours depends on the number of latent topics that shouldbe depicted [24]. If the topic number exceeds 25-30 topics, group-ing topics and allowing for zooming within a two-level topic-hierarchy ensures distinguishable colors for up to 10 subtopics(25 × 10 = 25). In order to represent temporal aspects of the data,we calculate the kernel density of the document landscape forfixed time-intervals, which can be used to add heat-map overlaysthat users can select later on.
![Page 4: Topic-aware Network Visualisation toExplore Large Email ...bigvis2018.imis.athena-innovation.gr/papers/BigVis_2018_paper_5.pdf · Nowadays, more and more large datasets exhibit an](https://reader033.fdocuments.in/reader033/viewer/2022060317/5f0c51007e708231d434cbb8/html5/thumbnails/4.jpg)
Figure 3: Semantic landscape of email contents and dominant communication patterns (drawn mockup)
4 CONCLUSION AND VISIONIn this paper, we described an algorithm to lay out a commu-nication network on top of a landscape of semantically embed-ded emails. This is still work in progress, thus 3 shows only amanually drawn mock-up of the visualisation we envision. In it,individuals are represented as nodes positioned such that denselyconnected communities are visually clustered. Edges describe theemail traffic, where the opacity and thickness is used to indicatethe frequency of messages between the nodes they connect.
The semantic representations of emails are used to place dotson a background layer which we call the document landscape.This landscape is used as additional input to the graph layoutalgorithm, aiming to place a node within corresponding semanticregions. The colouring of regions in the landscape is derived fromdensely connected communities in the communication graph. Op-tionally, representative words are selected for densely populatedareas in the landscape, so that users get a rough idea about sub-jects in that area. The aforementioned timestamps of emails canbe used to generate a heatmap overlay to show the activity in acertain time interval which is controlled by a slider. Similar tomodern geographical maps, zooming into a region reveals moredetails. In our case, less prominent individuals and their connec-tions are shown along with additional salient phrases from thedocument landscape. Selecting a node will not only highlight con-nected edges but may also temporarily show more edges whichwere previously hidden at that zoom level. The user will also beable to retrieve documents with the help of a selection rectangleor clicking dots in the document landscape.
In future work, we hope to evaluate this system using full-scale real-world data as well as practitioners from journalism andauditing. It may also be interesting to experimentwith embeddingmethods, which take both the emails and the network graph asinput and directly project the inferred representations into thetwo-dimensional landscape to simplify the proposed algorithm.
REFERENCES[1] Benjamin Bach, Nathalie Henry Riche, Christophe Hurter, Kim Marriott, and
Tim Dwyer. 2017. Towards unambiguous edge bundling: Investigating con-fluent drawings for network visualization. Transactions on Visualization andComputer Graphics 23, 1 (2017), 541–550.
[2] Nikos Bikakis and Timos Sellis. 2016. Exploration and visualization in theweb of big linked data: A survey of the state of the art. In Proceedings of theWorkshops of the EDBT/ICDT 2016 Joint Conference, Vol. 1558. CEUR-WS.org.
[3] Marie-Anne Chabin. 2017. Panama papers: a case study for records manage-ment? Brazilian Journal of Information Science: Research Trends 11, 4 (2017).
[4] Yanhua Chen, Lijun Wang, Ming Dong, and Jing Hua. 2009. Exemplar-basedvisualization of large document corpus. Transactions on Visualization andComputer Graphics 15, 6 (2009), 1161–1168.
[5] Mark Coddington. 2015. Clarifying journalismâĂŹs quantitative turn: A typol-ogy for evaluating data journalism, computational journalism, and computer-assisted reporting. Digital Journalism 3, 3 (2015), 331–348.
[6] Tommy Dang and Angus Forbes. 2017. CactusTree: A tree drawing approachfor hierarchical edge bundling. In Proc. of the Pacific Visualization Symposium.IEEE, 210–214.
[7] Alon Efrat, Yifan Hu, Stephen G Kobourov, and Sergey Pupyrev. 2015. MapSets:Visualizing Embedded and Clustered Graphs. Journal of Graph Algorithmsand Applications 19, 2 (2015), 571–593.
[8] Blaz Fortuna, Marko Grobelnik, and Dunja Mladenic. 2005. Visualization oftext document corpus. Informatica 29, 4 (2005), 497–502.
[9] Katrin Franke and Sargur N Srihari. 2007. Computational forensics: Towardshybrid-intelligent crime investigation. In International Symposium on Infor-mation Assurance and Security. IEEE, 383–386.
[10] Daniel Fried and Stephen G Kobourov. 2014. Maps of computer science. InProc. of the Pacific Visualization Symposium. IEEE, 113–120.
[11] Martin Gronemann and Michael Jünger. 2012. Drawing clustered graphs astopographic maps. In Proc. of the Symposium on Graph Drawing and NetworkVisualization. Springer, 426–438.
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learningfor networks. In Proc. of the Conference on Knowledge Discovery and DataMining. ACM, 855–864.
[13] Jan Hildenbrand, Arlind Nocaj, and Ulrik Brandes. 2016. Flexible Level-of-Detail Rendering for Large Graphs. In Proc. of the Symposium on Graph Draw-ing and Network Visualization. Springer, 625–627.
[14] Mukundan Karthik, Mariappan Marikkannan, and Arputharaj Kannan. 2008.An intelligent system for semantic information retrieval information fromtextual web documents. In International Workshop on Computational Forensics.Springer, 135–146.
[15] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentencesand documents. In Proc. of the International Conference on Machine Learning.PMLR, 1188–1196.
[16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data usingt-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605.
[17] Patrick Cheong-Iao Pang, Robert P Biuk-Aghai, Muye Yang, and Bin Pang.2017. Creating realistic map-like visualisations: Results from user studies.Journal of Visual Languages and Computing 43 (2017), 60–70.
[18] Arnaud Sallaberry, Yang-chih Fu, Hwai-Chung Ho, and Kwan-Liu Ma. 2016.Contact trees: Network visualization beyond nodes and edges. PLOS ONE 11,1 (2016), 1–23.
[19] Shilad Sen, Anja Beth Swoap, Qisheng Li, Brooke Boatman, Ilse Dippenaar,Rebecca Gold, Monica Ngo, Sarah Pujol, Bret Jackson, and Brent Hecht. 2017.Cartograph: Unlocking Spatial Visualization Through Semantic Enhancement.In Proc. of Conference on Intelligence User Interfaces. ACM, 179–190.
[20] Kaveri Subrahmanyam, Stephanie M Reich, Natalia Waechter, and GuadalupeEspinoza. 2008. Online and offline social networks: Use of social networkingsites by emerging adults. Journal of Applied Developmental Psychology 29, 6(2008), 420–433.
[21] Jun Tao, Jian Xu, Chaoli Wang, and Nitesh V Chawla. 2017. HoNVis: Visualiz-ing and Exploring Higher-Order Networks. In Proc. of the Pacific VisualizationSymposium. IEEE, 1–10.
[22] Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X Zhou, WeihongQian, Lei Shi, Li Tan, and Qiang Zhang. 2010. TIARA: a visual exploratorytext analytic system. In Proc. of the Conference on Knowledge Discovery andData Mining. ACM, 153–162.
[23] Jaewon Yang and Jure Leskovec. 2014. Overlapping communities explaincore–periphery publisher of networks. Proc. IEEE 102, 12 (2014), 1892–1902.
[24] Achim Zeileis, Kurt Hornik, and Paul Murrell. 2009. Escaping RGBland: se-lecting colors for statistical graphics. Computational Statistics & Data Analysis53, 9 (2009), 3259–3270.
[25] Zhaocheng Zhu and Junfeng Hu. 2017. Context Aware Document Embedding.CoRR abs/1707.01521 (2017).