Wordonoi: Visualizing the Structure and Textual Contents ...elm/projects/wordonoi/wordonoi.pdf ·...

Eurographics Conference on Visualization (EuroVis) 2013B. Preim, P. Rheingans, and H. Theisel(Guest Editors)

Volume 32 (2013), Number 3

Wordonoi: Visualizing the Structure and Textual Contents ofKnowledge Networks

Submission ID: 238

Figure 1: The top 26 topics in a Wordonoi for a large research organization for homeland security consisting of 2,000 nodesand 5,000 edges. The cells have been colored based on categories to allow distinguishing between different and adjacent topics.

AbstractDatasets with both relationships and textual content are becoming increasingly common; examples include hyper-text documents, rich social networks, and scientific authorship. We call this type of datasets knowledge networks,and present a novel and interactive visualization technique called Wordonoi to visualize them. Wordonoi visual-izes both the textual and relational components of knowledge networks by spatializing them into a multi-scale 2Dvisualization using a Voronoi tessellation and then mapping keywords onto the different cells. Because knowledgenetworks are often large, we also provide aggregation mechanisms for summarization. We explore and implementseveral interactions like interactive coloring, semantic zooming, and searching. We also validate the techniquewith three examples, including a research organizational structure, a hypertext network, and NSF funding data.

Categories and Subject Descriptors (according to ACM CCS): H.5.1 [Information Systems]: Multimedia InformationSystems—; H.5.2 [Information Systems]: User Interfaces—

1. Introduction

Text is one of the most important and common types ofdata in the world today [Shn96], and there exists a multitudeof tools (e.g., [DZG∗07, vHWV09, VWF09]) for visualizingsuch data. However, as we go beyond simple text corpora tomore complex datasets, one particular class of data emergesthat combines textual labels with their relationships (e.g., asgraphs). We denote such datasets knowledge networks be-cause they exhibit a graph structure with textual data for

the nodes and links. Examples include dictionaries, whereeach word has a definition and relationships to synonyms,antonyms, or modifiers; the web, where each webpage con-sists of text and hyperlinks to other pages; and research fund-ing networks, where text describes projects and relationshipscapture investigators, institutions, and program officers.

While several techniques exist for visualizing graphor text alone, visualizing their combination is challeng-ing [KKEE11]. Furthermore, in such graphs sometimes the

submitted to Eurographics Conference on Visualization (EuroVis) (2013)

2 Submission ID 238 / Wordonoi: Visualizing the Structure and Textual Contents of Knowledge Networks

textual content is more important, and sometimes the re-lationship structure is more important. For example, con-sider crime network data consisting of crime reports andtheir relationship. In this case, the user may either want tofocus on how the various crimes occurred (i.e., the con-tents of the reports), or may want to see associations be-tween particular crimes and individuals (i.e., relationshipsbetween reports). Some tools focus only on textual contentsand make little use of connections (e.g., [CVW09]), and viceversa (e.g., [vHWV09]). Only a handful of tools [KKEE11,SGL08] visualize both features simultaneously.

Another problem with knowledge networks is their size.Not only are the graphs themselves generally large, but soare their textual content. Therefore, summarization [EF10]techniques are required to provide effective overview. Thereexist techniques [DZG∗07, VWF09] that visualize large sizetextual data by extracting important tags and patterns fromthem and visualize those tags. However, these tools do notshow the overall landscape or effective summary of thewhole data in one screen. In our case, we need to summa-rize both the networks and their textual contents.

In this work, drawing from previous work such as self-organizing maps (SOMs) [Koh82], WordBridge [KKEE11],and Wordle [VWF09], we propose a visualization techniquethat we call WORDONOI for visualizing both the relationsand textual contents of knowledge networks. Our techniqueis a multi-scale and space-filling 2D text visualization thatsupports hierarchical aggregation [EF10] to allow the userto interactively explore the knowledge network. The contri-bution of this work is the ability to show a summary of anentire textual and relational dataset in a single screen.

We have implemented a Wordonoi prototype that acceptsknowledge networks as input and renders an interactive visu-alization. Our framework initially calculates node positionsusing a graph algorithm and then uses these positions tocompute a Voronoi tessellation of the space. Each Voronoicell represents a node, and the text associated with the nodeis shown inside the cell as tags extracted from the text. Inour implementation, we explore several aspects of the Wor-donoi design space, including aggregation, text visualiza-tion in Voronoi cells, coloring schemes, and interaction tech-niques such as semantic zooming, panning, querying, etc.

We validate our technique by applying it to three exam-ples. The first is a large research structure for relationshipsbetween persons, projects, institutions, and centers, whereeach node contains details about the research. The secondexample is a hypertext network where the relationships arelinks and the text is the document contents. In the last exam-ple, we study a knowledge network of NSF funding data con-taining relationships between PIs and projects, and where thetext is the project descriptions. All three examples show thatour technique works well for practical knowledge networksby summarizing their textual contents while simultaneouslyconsidering their relationship structure.

2. Related Work

Digital technology has made text data ubiquitous. However,staying abreast of this onslaught of textual streams—such asnews articles, academic papers, crime reports, etc—is im-possible [Hea09]. Text visualization uses interactive visualrepresentations to summarize, highlight, and characterize thecontents of textual data [DZG∗07]. However, this method iscomplicated by the fact that text data is categorical, unstruc-tured, and high-dimensional [SWL∗10]. Below we outlinethe general approaches in the literature.

2.1. Frequency-based Visualization

Extracting important words from text according to their fre-quency and visualizing that metric is a common text vi-sualization technique [Hea09]. Most famous among thesetechniques is tag clouds (or word clouds) [HR08], and iscommonly used on the Internet by Web 2.0 and social me-dia websites. Despite their popularity, tag clouds have sev-eral problems, such as attributing too much attention onlonger words [VWF09] and not making efficient use of thespatial dimension. The Wordle technique [VWF09] over-comes many of the problems associated with tag clouds andproduces highly aesthetic and compact clouds. ManiWor-dle [KLKS10] proposes several improvements and allowsthe user to interactively control the cloud layout. Finally, atechnique called clustered word clouds [Cla] use word relat-edness to control positioning in a tag cloud layout and thusdisplay co-occurring and related words in close proximity.

2.2. Visual Concordances

A concordance is an alphabetical index of all the words ina text together with their context [Hea09]. Several text visu-alizations have been designed to visualize texts in this way.SeeSoft [Eic94] visualizes text documents by representingeach document as a vertical column and text in them as acolor-coded row of pixels. Similarly, TextArc [Pal02] dis-plays the lines of text in elliptical layout with frequently oc-curring words placed in the center. Selecting the central worddisplays its connections to lines of text containing it. Finally,DocuBurst [CCP09] and WordTree [WV08] are examples ofdocument concordances built using hierarchies.

2.3. Combining Text with Other Visualizations

Some existing work blends text visualization with other vi-sualization techniques to show patterns in the text. The-meRiver [HHWN02] uses thematic variations over time tovisualize the frequencies of topics extracted from the text.NameVoyager [Wat06] uses stacked bar graphs to show fre-quencies of baby names across time. TIARA [WLS∗10] in-tegrates trend graphs into tag clouds to show important pat-terns over time. Another text visualization technique calledSparkClouds [LRKC10] integrates sparklines into tag cloudsto show the trend of each word over time.


Submission ID 238 / Wordonoi: Visualizing the Structure and Textual Contents of Knowledge Networks 3

2.4. Visualizing Relationships in Text

Visualizing the relationships of the words in a text collectionis becoming very important, and numerous visualizationshave been proposed to show the structure of the text. Ar-cDiagrams [Wat02] displays patterns of repetition in stringdata. IN-SPIRE [WTP∗95] shows the relationships betweenthe documents in the corpus using a themescape, which is atopic-based projection of document concepts and keywordsonto a 2D space. The Word Tree [WV08], a hierarchical doc-ument concordance, shows the context of each word by dis-playing phrases commonly following or preceding a givenword. Finally, Phrase Nets [vHWV09] shows a graph of re-lated words based on a user-specified relation.

Themail [VGD06] parses email conversations and por-trays relationships between individuals by extracting andanalyzing keywords. Parallel Tag Clouds (PTCs) [CVW09]extends Themail by combining parallel coordinates [Ins85]with tag clouds, enabling faceted browsing of textual docu-ments and comparisons across facets.

Another common approach is to use node-link diagramsto show relations in the text. Wong et al. [WMP∗05] describea novel method of displaying dynamic text in the place oflinks in a node-link diagram. TextArc [Pal02] uses node-link diagrams to show all the contexts in which a word ap-pears. WordBridge [KKEE11] replaces the nodes and linksin a graph with node and link tag clouds that convey notjust connectivity but also the content of the relations. Fac-etAtlas [CSL∗10] combines node-link diagrams with densitymaps to visualize entity relationships in a text.

In particular, Gansner et al. [GHK10a, GHK10b,GHKV09, GHN12] combine geographic maps with node-link diagrams to increase the visual appeal of a graph. Theseworks are closely related to our work from a visual designpoint of view, but have a very different data model and ap-proach. Wordonoi visualizes bodies of text associated with anode cluster in the area assigned to the cluster’s cell, whileGansner’s work focuses on clustering nodes into larger re-gions and rendering their node labels. In other words, Wor-donoi primarily visualizes knowledge networks with textualdata, while Gansner uses the geographical map approach toincrease the visual appeal of the node-link diagram.

All of these techniques focus on showing the relationshipor structure of the text within a text corpus. In contrast, ourWordonoi technique shows the relationships of nodes—eachwith textual content—in a knowledge network. Below, wewill see what impact this focus will have on the technique.

3. Generating Knowledge Networks

We define knowledge networks simply as graphs with asso-ciated textual content for the nodes and edges in that net-work, where the textual data provides some form of seman-tic meaning to the graph structure. In such networks, the ca-

pacity to understand not just the relationship between enti-ties, but also the semantics of the connections is important.For example, a standard citation network shows whether ornot particular authors have collaborated or cited each other,whereas a knowledge network constructed from such datamay be able to tell us the nature of their collaboration or cita-tions. We here propose a mechanism for constructing knowl-edge networks from standard graph and textual datasets.

The most straightforward way to generate knowledge net-works is by deriving multidimensional graphs from tab-ular data, e.g., as described by Liu et al. [LNS11]. Theprocess is heavily dependent on the application domain,but often involves recovering an entity-relationship (E-R)model [Che76] from the data before extracting the knowl-edge network. In such a model, each entity (a person, pub-lication, organization, physical object, or concept) becomesa node in the network, and the relationships between entitytypes are used to generate the links between nodes. For ex-ample, relational databases are built from E-R models, soextracting this mode from tables and their keys is relativelystraightforward (although the user still has to make selec-tion on which entities to include). In other cases, the E-Rmodel must be explicitly specified by the user; for example,in a collection of crime reports, we may first have to identifyentities in the text (similar to how Jigsaw [SGL08] extractsentities) and then decide on how to generate the relationships(co-occurrence, distance in text, semantic meaning).

Given a basic E-R model and network data extracted fromthe original dataset, we must now augment the network withtextual information describing the semantics of the entitiesand their relationship. Again, this process is application-specific and depends on which semantics should be high-lighted. A common approach is to summarize all of the tex-tual information available in the original database and inte-grate it within the knowledge network. This can be achievedusing text mining techniques such as counting word fre-quencies, calculating tf-idf [Jon72] and related metrics, orautomatically extracting text summaries [Hea99]. The ex-tracted information—ranging from keywords, phrases, orentire texts—is then used as node and edge attributes.

4. Wordonoi: Design Space

Wordonoi is an interactive visual representation for knowl-edge networks that combines both relational and textualcontent. Figure 2 depicts the Wordonoi pipeline that takesa knowledge network as input, processes the network instages, and yields an interactive visualization. Below we de-scribe these stages and explore the Wordonoi design space.

Figure 2: From knowledge network to visualization.



Figure 3: Spatializing knowledge networks using Wordonoi: (a) graph layout; (b) Voronoi tessellation; and (c) cell generation.

4.1. Spatialization

The first step in visualizing knowledge networks is to as-sign space on the 2D visual substrate to each entity (node) inthe knowledge network. The intuition is to project the high-dimensional network structure onto 2D space so that the tex-tual component associated with each node can be displayed.This spatialization technique should fill the available spacein the viewport by allocating disjoint 2D regions to eachnetwork node based on the graph structure. To achieve this,we abandon the explicit display of the graph structure itselfin favor of conveying the textual content of the knowledgenetwork while maintaining connectivity. Figure 3 illustratesthe three parts of the spatialization process: (a) generating agraph layout of the network; (b) tessellating the space intodisjoint cells; and (c) allocating a cell to each node.

Graphs are high-dimensional datasets, and graph layoutalgorithms are concerned with finding projections of suchdatasets into 2D (or 3D) space by calculating the positionof each node in a way that optimizes some metric (typicallyreadability). However, the primary purpose of the layout forthe Wordonoi technique is to find a 2D mapping that mimicsthe structure of the underlying graph, i.e., that places highlyconnected nodes in close proximity. Any layout algorithmcan be used; we prefer Noack’s lin-log algorithm [Noa05]because it can cluster nodes based on connectivity.

Having found a graph layout, we now convert the nodesinto 2D regions on the viewport where textual content can bevisualized. For this purpose, we use a Voronoi tessellationthat subdivides the space into disjoint subspaces, or cells,based on node positions (each cell being points closest toeach node). Each cell has an associated node, and its areacan be used for the node’s textual contents.

Several design decisions were made in arriving at theabove spatialization approach. The main tradeoff here isclearly that we are sacrificing some of the graph structurefrom a traditional node-link diagram in order to be able toconvey more textual content in the visualization. This is dif-ferent from techniques such as WordBridge [KKEE11] andPhraseNets [vHWV09] that also combine graph and textvisualization, but retain more of the relationship structurein the representation. The drawback for such approaches isthat less space is available for visualizing textual content,and, as with any node-link diagram, they do not scale wellwith graph size. As we shall see in Section 4.5, the Wor-

donoi space-filling representation not only allows devotingvirtually the entire space to textual content, it is also highlyamenable to hierarchical aggregation to manage scale.

4.2. Text Visualization

Spatialization has subdivided the viewport into cells basedon network topology, yielding one cell per node in theknowledge network. The next step is to use the 2D cells tovisualize the textual content associated with each node:

• Most important keyword: Scale the most important key-word (e.g., most frequent) to fit as a single label.

• Repeat keyword: Again, use the most important key-word, but fill the cell completely using the keyword.

• Word cloud: Draw a word cloud of the tags belonging tothe cell using the global frequency of each keyword.

We use all three strategies depending upon the size of thecell on the screen. For small-sized regions, only the mostimportant tag is displayed; for medium-sized regions, thetag is repeatedly displayed; and for large-sized regions, aword cloud is displayed. This choice of visual representationchanges as the user zooms in and out in the visualization,giving rise to a form of semantic zooming [BGM04].

Most word cloud layout algorithms are designed for rect-angular spaces, so an irregular (non-convex) cell may causeparts of keywords to fall outside of the cell polygon. Wetherefore provide an interaction where users can drill downin any region to see its text without clipping.

4.3. Utilizing Color

Color is a free parameter in our design space, and can beused for features such as topology, categories, and affinity:

• Random assignment: Cells are assigned random colorsto allow for differentiation.

• Graph coloring: A graph coloring algorithm is used tocolor cells such that no two adjacent cells have the same.

• Categories: Node type or textual content can be used tocategorize nodes (and their cells), and cells can then beassigned colors based on their category.

• Color scale: A color scale (such as gray scale or heatscale) can be used to show quantitative information abouteach cell. Some examples of such quantitative features in-clude node centrality, cohesiveness, and connectedness.



In Figure 6, a green color scale has been used to assigncolors to cells based on cohesion (the ratio of internal edgesto total edges associated with a cell). Thus, light green cellsrepresent nodes that have few internal connections, whereasdark green cells have higher connectivity. This can be usedto find communities in the data.

4.4. Recovering Network Topology

Even though the spatialization is based on the topology ofthe underlying knowledge network, the Wordonoi represen-tation still sacrifices some of the graph structure in favor ofthe semantic content of the network. We provide two waysfor users to recover topology information:

• Cell Border Visualization: A Wordonoi visualization nolonger displays the edges in the original node-link dia-gram that formed its basic structure, but there is nothingto prevent us from decorating the cell representation withthis information. More specifically, we can use the bor-ders between adjacent cells to convey information aboutthe connectivity of those cells; for example, by varyingthe thickness or color of the border.

• Interactive Color Diffusion: We can also use color to ex-plicitly show the network topology through an interactivepaint metaphor where users assign a color to a cell, andthe color is then diffused through the representation basedon topology: (1) to all adjacent neighbors of a colorednode; or (2) to all cells that are connected in the network.The diffusion proceeds in a breadth-first fashion with di-minishing amounts of color for each step. The amount ofcolor also depends on the relational strength between thesource and the destination cells; the stronger the relations(i.e., the more edges between the cells), the more color isdiffused. We use an alpha blending model where all re-gions are initially white, and iteratively gets blended withother colors. This technique is incremental, i.e., differentcolors can be assigned to different cells, each blendingwith existing colors across the representation (Figure 5).

4.5. Aggregation

Knowledge networks are often large, and therefore requiresummarization. Popular solutions to achieve this includeclustering, filtering, or sampling the graph, but none of theseapproaches are well-suited in combination with textual data.

For the Wordonoi technique, we simply take advantage ofthe space-filling visual representation of disjoint cells, oneper node, resulting from the spatialization stage by designinga hierarchical aggregation technique [EF10] that incremen-tally agglomerates adjacent cells together until only one—the sum of all cells—remains. For our Voronoi cells, an ag-glomeration of two cells is simply the union of the 2D spaceof each cell, and the textual content of each correspondingnode is also combined. This results in a binary clusteringtree that can be expanded to any level depending on the user.

Choosing a good distance metric is key to any hierarchicalaggregation [EF10]. Examples of possibilities include met-rics based on network topology or layout geometry.

4.5.1. Topology-based Distance Metrics

Distance metrics based on network topology use graph struc-ture to determine the order of agglomeration for cells:

• Degree: Merge the regions for associated nodes that com-bine to result in the highest (or lowest) degree.

• Edge weights: Combine regions for the nodes that areconnected by edges with the highest (or lowest) weight.

• Cohesion: We define the cohesion (clustering affinity)of a region with other regions as the ratio of commonedges between these regions to its degree. This metric willmerge regions that are the most (or least) cohesive.

4.5.2. Spatialization-based Distance Metrics

These metrics use the spatialization data to define distancesbetween cells. While these do not depend on the graph topol-ogy, they result in a more optimized visual representation:

• Minimum area: Merge the two adjacent regions that havethe minimum combined area. This would converge to-wards unifying cell size at each level of the aggregation.

• Maximum Area: Merge the two adjacent regions withthe maximum combined area. This preserves small cells,which for many graph layout algorithms are highly con-nected, as well as central nodes in the center of the space.

• Rectangle completion: To avoid irregular polygons, usea distance metric based on how closely two cells form acomplete rectangle (calculated as the ratio between thecombined area and that of their bounding box).

5. Wordonoi: Implementation

We implemented a Wordonoi prototype consisting of twocomponents: a preprocessor that performs off-line spatializa-tion, and an interactive tool that displays the visualization.

5.1. Preprocessing

The preprocessing tool loads knowledge networks inGraphML format, generated from some earlier generationstage (Section 3), and performs the spatialization processpresented in Section 4.1. Our implementation uses Noack’slin-log graph layout [Noa05] to find a 2D layout for thenodes, and then tessellates the space using a standardVoronoi implementation. We also make sure to preserve thetextual content (extracted while generating the knowledgenetwork) from the GraphML input file in the representation.Finally, the tool computes the complete aggregation hierar-chy (several distance metrics are available for use) and savesthe cell shapes, the text summaries, and the aggregation hier-archy to a custom XML format. Typical computing time forthe preprocessor on a network consisting of approximately



(a) Search and highlighting. (b) Show text visualization without clipping.

(c) Show underlying node-link diagram. (d) Show child cells.

Figure 4: Interactions in our Wordonoi prototype implementation.

2,000 nodes and 5,000 edges is on the order of 5 seconds.The resulting XML file format, which includes the aggre-gation hierarchy and the Voronoi tesselation at all levels, isapproximately 5 MB in size for the above dataset.

Many stages in our implementation are computationallyexpensive. By precalculating the aggregation hierarchy, weavoid long run-times and can quickly render all the compo-nents at a particular hierarchy level without new computa-tion. The result is a smooth and interactive visualization.

5.2. Visualization

The interactive tool is built using the Piccolo [BGM04]toolkit for 2D vector graphics. The tool loads the prepro-cessed data, including the 2D cell shapes vectors and the ag-gregation hierarchy, and visualizes it as an interactive appli-cation. Inside each cell, the tool visualizes the textual contentof the cell (and any aggregated children) using a method de-pendant on the screen space allocation (Section 4.2). We usea deterministic Wordle layout [KKEE11] to avoid the ran-dom and unstable layouts of the original Wordle [VWF09].

5.3. Interactions

Users can interact with the Wordonoi prototype as follows:

• Search: The user can type in a query, and cells that matchwill be highlighted while others are dimmed (Figure 4(a)).

• Pan & zoom: Users can pan and zoom in the visualiza-tion. Semantic zooming will change the visual represen-tation of text depending upon the screen size of each cell.

• Aggregation: A slider (or mouse wheel) controls howmany cells to display. Changing this setting will dynami-cally drill down or roll up the visual aggregation.

• Show text: Disable clipping of the text visualization andshow the full contents of the current cell (Figure 4(b)).

• Show node-link: Display the node-link diagram for thecurrent cell, as well as for neighbors (Figure 4(c)).

• Show children: Cells may be aggregates of multiple cells.This interaction mode will show all of the child cells of thecurrent cell under the mouse cursor (Figure 4(d)).

• Interactive coloring: There are two options available forinteractive coloring to recover network topology:

– Hover diffusion mode: Color is dynamically assignedto the clicked cell and diffused to its neighbors.

– Full color diffusion mode: When starting this mode, all



Figure 5: Interactive color diffusion where cells have beencolored in the order shown as red, blue, light green, darkgreen, and cyan, and then diffused based on connectivity.

cells become white. Users can then iteratively select acolor from a palette, assign it to a cell, and the colorwill be diffused across the representation (Figure 5).

• Cohesion color mapping: This mode switches from thedefault categorical color coding to coloring nodes basedon the cohesion of each cell, i.e., the ratio of internal edgesto the total number of edges for the cell (Figure 6). Thishelps in finding communities in the knowledge network.

• Reset: A reset option is also available to the user to revertthe visualization to its original appearance.

Figure 6: Color scale example showing regions havinghigher cohesion (more internal edges) as darker green.

6. Examples

We showcase Wordonoi by applying it to three examples: aresearch organization, hypertext documents, and NSF fund-ing data. Below we explain these examples in detail.

6.1. Research Organization

The motivation for this example is to support policymakersand researchers alike in understanding the size, scope, andresearch topics in a research organization for homeland se-curity. The original dataset is an SQL database consisting ofa network of research centers, the associated faculty and stu-dents, their institutions, their publications, and the projectsthey work on. Some of the tasks that a program managermight want to perform include the following:

• T1: What research is happening in the organization?• T2: What research is done at particular centers?• T3: What research is done by particular researchers?• T4: How well are the partners collaborating?• T5: Which reports deal with specific keywords?

Several fulltexts are available that characterize the net-work, including project descriptions, paper abstracts, centermission statements, and investigator websites. In a mannersimilar to that described by Liu et al. [LNS11], we extract aknowledge network by mapping the tabular data to an entity-relationship model and summarizing the descriptive text foreach entity using tf-idf. The resulting GraphML file consistsof approximately 2,000 nodes and 5,000 edges.

The file is then used as input to the Wordonoi pipeline.Figure 1 shows a screenshot of the top 26 nodes from the in-teractive visualization of this knowledge network. It servesas an overview of the research done in the organization,showing that the focus is on topics such as food protec-tion, transportation safety, health and disease management,communication, and training (T1). Users can navigate, drilldown, and explore this dataset further using the interactionsdescribed in the previous section. This will allow the audi-ence to see the research landscape at all levels of scale, iden-tify gaps, and find commonalities between projects.

For example, if a program manager searches for a par-ticular center, the summary of topics in the resulting regionsgives the idea of research done in these centers (T2). For taskT3, the user can use search and show node-link interactions.Interactive coloring and show node-link find the collabora-tions, and cohesion color mapping interaction indicates theamount of collaboration (T4). On searching a specific key-words, papers or project reports corresponding to resultingcells can be opened (T5).

6.2. Hypertext Documents

Hypertext document networks are extremely common, withthe web being the canonical hypertext collection, but visu-alizing Internet content is also notoriously difficult [Car96].



Figure 7: Top 50 nodes for a network starting from theWikipedia article on visualization.

We use a straightforward approach to generate knowledgenetworks from the Internet by implementing a simple webcrawler that takes the URL of a webpage as a starting pointand processes all of the outgoing links from that webpagein a breadth-first fashion. Each processed webpage is addedas a connected node with edges for all of the hyperlinks,and we also extract the text on the page by extracting title,meta-data, and document content (the latter summarized us-ing tf-idf). The crawl is stopped at some given size (we use500 documents in these examples), and the resulting data isstored as a GraphML file that can later be visualized. Someof the tasks the user might want to perform on such knowl-edge networks include the following:

• T1: Summarize a given webpage and its neighbors.• T2: Summarize all webpages related to a keyword.• T3: Find webpages with few or many links to other pages.• T4: Given a keyword, find related keywords.• T5: Open all webpages related to a keyword.

Figure 7 shows a Wordonoi using the Wikipedia articleon visualization as root. The top 50 cells shown here conveythe gist of the knowledge network through concepts such asgraphics, maps, graph, data, crime, theory, and charts.

Users can not only see an overview of the whole dataset,but the Wordonoi technique allows them to see details andsearch for a particular page or words to see a summary of textrelated to it (T1 & T2). They can also use interactions such asshow node-link or interactive coloring to see relationships.Color cohesion mapping interaction will help users in an-swering T3. Aggregation and search will help in answeringT4. For example, in our network fields related to visualiza-tion are graphics, data, crime, police, maps etc. Searchingfor specific keyword and open all the webpages associatedto resulting regions will help in answering T5.

While we make no claims as to the utility of Wordonoifor hypertext documents in comparison to other web visual-ization techniques (surveyed by Card [Car96]), we do thinkthat the technique provides some very useful new perspec-tives on hypertext collections. For example, Wordonoi couldbe applied to web browsing history data to see a summary ofvisited websites and their relationships.

6.3. NSF Funding Data

Federal funding agencies around the world typically maketheir funding data publically available, and the U.S. Na-tional Science Foundation is no exception; the NSF awardsearch at http://nsf.gov/awardsearch/ providesfully searchable funding information on more than 300,000awards from 1976 to today. Analyzing and understandingthis portfolio of funded projects has many and diverse appli-cations: for an investigator, this data may yield informationon important topics and previous work; for a program officer,coverage and gaps in their funding portfolio; and for a pol-icy maker, the scope of research being funded. Some tasksinvolving such knowledge networks include the following:

• T1: What are the major research areas funded?• T2: Find the areas in which a person is getting funding.• T3: Open all grant proposals related to a keyword.

To be able to apply our Wordonoi visualization to thisdata, we first downloaded the full set of current awards forour own university. Based on our generation process (Sec-tion 3), we mapped this tabular data to an entity-relationshipmodel consisting of projects, investigators, co-investigators,directorates, and program officers. Project descriptions wereused as the textual data characterizing the network, and wesummarized this using standard tf-idf. We further generatededges between projects based on the co-occurrence of key-words specified for each project, as well as co-occurring con-cepts derived from the tf-idf process. The size of the networkis approximately 1,000 nodes and 5,000 edges.

Figure 8 shows the Wordonoi for this NSF funding net-work for our university. The aggregation level is again cho-sen to show about 50 nodes to yield a high-level overview.It is clear from the visualization that most of the grants arerelated to engineering—this is also accurate given the sizeand prominence of the college of engineering at the univer-sity. Other research topics include chemistry, earthquakes,and agriculture, which are also accurate. Another trend isthe prominence of workshops and conferences, presumablyfor awards used to fund such scientific meetings.

7. Discussion and Limitations

Our aim in this work is to visualize both the textual andstructural content of knowledge networks. One potential lim-itation of the space-filling Wordonoi visualization is that thetechnique does not explicitly maintain relationships in the


http://nsf.gov/awardsearch/


Figure 8: NSF funding example with rectangle aggregation.

visual representation. On the other hand, the tessellation it-self is derived from the network topology, and we have alsopresented an array of interactive and visual methods for re-covering topology while benefiting from all of the advan-tages afforded by the space-filling representation.

With this in mind, it is clear that the choice of graph lay-out algorithm has considerable impact on the resulting Wor-donoi tessellation. A layout algorithm that clusters highlyconnected nodes in close proximity will result in a more co-hesive representation, meaning that the generated cells at alllevels of aggregation will be more faithful to the networktopology. The lin-log layout we use has this property, but oneof its drawbacks is that it is non-deterministic, i.e., it gener-ates different actual layouts for each invocation. This may bedetrimental for users trying to maintain a mental map of theWordonoi representation across spatializations.

Our choice in text visualization is solely based on tagclouds, and there may be other and more efficient text vi-sualization techniques to use for this purpose. Nevertheless,we think that any text visualization will benefit from havinga maximum of available space inside the Wordonoi cells, andshould therefore be possible to simply plug into the existinglayout. Similarly, our use of tf-idf for text mining and ex-traction should not be seen as indicative of limitations in theWordonoi technique itself; we certainly think more advancedtext analytics algorithm can be used in its stead.

It should also be noted that we provide distance metricsfor aggregation based both on spatial information as well asnetwork topology. Different metrics have different strengthsand weaknesses. While topology-based metrics are clearlythe most faithful to the original network structure, they mayyield aggregations that are difficult to use efficiently for textvisualization. Furthermore, even such spatially based met-rics may yield less-than-ideal layouts. These effects were thereason we devised the rectangular completion distance met-

ric where cell aggregates tend towards a rectangular—andthus more efficient—shape.

Another limitation with our implementation—althoughnot necessarily with the general technique—is that we donot group or even link keywords that refer to the sameconcept. For example, the keywords “first responder” and“emergency personnel” are almost synonyms, but our naïveco-occurrence mapping would not catch this distinction. Amore sophisticated approach—left for future work—woulduse a word ontology to make such connections.

Finally, there exists several additional examples where theWordonoi technique can be applied. For example, social me-dia data from sites such as Facebook, Twitter, and MySpacecontain both relations and large amounts of textual data, andare therefore potential applications for the technique. An-other example could be for crime and incident reports tosummarize types of crime, relations between different inci-dents, and the crime trend in a particular area.

8. Conclusion and Future Work

We have presented a novel visualization technique calledWordonoi that combines both the structure and textual con-tent in knowledge networks. While this approach sacrificessome of the structure from the original network in favor oftextual content, it is highly amenable to hierarchical aggre-gation to combat large scale, and we also provide multipleinteractive and visual methods for recovering this lost struc-ture. We have also demonstrated the utility of the Wordonoitechnique in three examples of knowledge networks.

Several potential future directions exist. We plan to deployand evaluate the system in a large research organization. Wewill bring in more text mining and analytics, such as topicmodeling and word ontologies, to improve the text visual-ization component.

References[BGM04] BEDERSON B. B., GROSJEAN J., MEYER J.: Toolkit

design for interactive structured graphics. IEEE Transactions onSoftware Engineering 30, 8 (2004), 535–546. 4, 6

[Car96] CARD S. K.: Visualizing retrieved information: A survey.IEEE Computer Graphics and Applications 16, 2 (Mar. 1996),63–67. 7, 8

[CCP09] COLLINS C., CARPENDALE M. S. T., PENN G.:DocuBurst: Visualizing document content using language struc-ture. Computer Graphics Forum 28, 3 (2009), 1039–1046. 2

[Che76] CHEN P. P.-S.: The entity-relationship model—towarda unified view of data. ACM Transactions on Database Systems1, 1 (1976), 9–36. 3

[Cla] CLARK J.: Clustered word clouds. http://neoformix.com/2008/ClusteredWordClouds.html. Oct 2008. 2

[CSL∗10] CAO N., SUN J., LIN Y., GOTZ D., LIU S., QUH.: FacetAtlas: Multifaceted visualization for rich text corpora.IEEE Transactions on Visualization and Computer Graphics 16,6 (2010), 1172–1181. 3


http://neoformix.com/2008/ClusteredWordClouds.htmlhttp://neoformix.com/2008/ClusteredWordClouds.htmlhttp://neoformix.com/2008/ClusteredWordClouds.html


[CVW09] COLLINS C., VIÉGAS F. B., WATTENBERG M.: Par-allel tag clouds to explore faceted text corpora. In Proceedings ofthe IEEE Symposium on Visual Analytics Science and Technology(2009), pp. 91–98. 2, 3

[DZG∗07] DON A., ZHELEVA E., GREGORY M., TARKAN S.,AUVIL L., CLEMENT T., SHNEIDERMAN B., PLAISANT C.:Discovering interesting usage patterns in text collections: inte-grating text mining with visualization. In Proceedings of theACM Conference on Information and Knowledge Management(2007), pp. 213–222. 1, 2

[EF10] ELMQVIST N., FEKETE J.-D.: Hierarchical aggregationfor information visualization: Overview, techniques and designguidelines. IEEE Transactions on Visualization and ComputerGraphics 16, 3 (2010), 439–454. 2, 5

[Eic94] EICK S. G.: Graphically displaying text. Journal of Com-putational and Graphical Statistics 3, 2 (1994), 127–142. 2

[GHK10a] GANSNER E., HU Y., KOBOUROV S.: GMap: Visu-alizing graphs and clusters as maps. In Proceedings of the IEEEPacific Visualization Symposium (2010), pp. 201–208. 3

[GHK10b] GANSNER E., HU Y., KOBOUROV S.: Visualizinggraphs and clusters as maps. Computer Graphics and Applica-tions 30, 6 (2010), 54–66. 3

[GHKV09] GANSNER E., HU Y., KOBOUROV S., VOLINSKYC.: Putting recommendations on the map: visualizing clustersand relations. In Proceedings of the ACM Conference on Recom-mender Systems (2009), pp. 345–348. 3

[GHN12] GANSNER E. R., HU Y., NORTH S. C.: Visualizingstreaming text data with dynamic maps. CoRR abs/1206.3980(2012). 3

[Hea99] HEARST M. A.: Untangling text data mining. In Pro-ceedings of the Annual Meeting of the Association for Compu-tational Linguistics on Computational Linguistics (1999), pp. 3–10. 3

[Hea09] HEARST M.: Search user interfaces. Cambridge Univer-sity Press, 2009. 2

[HHWN02] HAVRE S., HETZLER E., WHITNEY P., NOWELLL.: ThemeRiver: Visualizing thematic changes in large documentcollections. IEEE Transactions on Visualization and ComputerGraphics 8, 1 (Jan. 2002), 9–20. 2

[HR08] HEARST M. A., ROSNER D. K.: Tag clouds: Data anal-ysis tool or social signaller? In Proceedings of the Hawaii In-ternational Conference on System Sciences (2008), pp. 160–160.2

[Ins85] INSELBERG A.: The plane with parallel coordinates. TheVisual Computer 1, 2 (1985), 69–91. 3

[Jon72] JONES S. K.: A statistical interpretation of term speci-ficity and its application in retrieval. Journal of Documentation28 (1972), 11–21. 3

[KKEE11] KIM K., KO S., ELMQVIST N., EBERT D. S.: Word-Bridge: using composite tag clouds in node-link diagrams for vi-sualizing content and relations in text corpora. In Proceedings ofthe Hawaii International Conference on System Sciences (2011),pp. 1–8. 1, 2, 3, 4, 6

[KLKS10] KOH K., LEE B., KIM B., SEO J.: ManiWordle: Pro-viding flexible control over wordle. IEEE Transactions on Visu-alization and Computer Graphics 16, 6 (2010), 1190–1197. 2

[Koh82] KOHONEN T.: Self-organized formation of topologicallycorrect feature maps. Biological Cybernetics 43, 1 (1982), 59–69. 2

[LNS11] LIU Z., NAVATHE S. B., STASKO J. T.: Network-based visual analysis of tabular data. In Proceedings of the IEEE

Symposium on Visual Analytics Science and Technology (2011),pp. 39–48. 3, 7

[LRKC10] LEE B., RICHE N. H., KARLSON A. K., CARPEN-DALE S.: SparkClouds: visualizing trends in tag clouds. IEEETransactions on Visualization and Computer Graphics 16, 6(2010), 1182–1189. 2

[Noa05] NOACK A.: Energy-based clustering of graphs withnonuniform degrees. In Proceedings of the International Sym-posium on Graph Drawing (2005), pp. 309–320. 4, 5

[Pal02] PALEY W. B.: TextArc: Showing word frequency and dis-tribution in text. In Poster Proceedigns of the IEEE Symposiumon Information Visualization (2002). 2, 3

[SGL08] STASKO J. T., GÖRG C., LIU Z.: Jigsaw: supportinginvestigative analysis through interactive visualization. Informa-tion Visualization 7, 2 (2008), 118–132. 2, 3

[Shn96] SHNEIDERMAN B.: The eyes have it: A task by data typetaxonomy for information visualizations. In Proceedings of theIEEE Symposium on Visual Languages (1996), pp. 336–343. 1

[SWL∗10] SHI L., WEI F., LIU S., TAN L., LIAN X., ZHOUM. X.: Understanding text corpora with multiple facets. In Pro-ceedings of the IEEE Symposium on Visual Analytics Science andTechnology (2010), pp. 99–106. 2

[VGD06] VIÉGAS F. B., GOLDER S., DONATH J.: Visualizingemail content: portraying relationships from conversational his-tories. In Proceedings of the ACM Conference on Human Factorsin Computing Systems (2006), pp. 979–988. 3

[vHWV09] VAN HAM F., WATTENBERG M., VIÉGAS F. B.:Mapping text with phrase nets. IEEE Transactions on Visual-ization and Computer Graphics 15, 6 (2009), 1169–1176. 1, 2,3, 4

[VWF09] VIÉGAS F. B., WATTENBERG M., FEINBERG J.: Par-ticipatory visualization with Wordle. IEEE Transactions on Vi-sualization and Computer Graphics 15, 6 (2009), 1137–1144. 1,2, 6

[Wat02] WATTENBERG M.: Arc diagrams: Visualizing structurein strings. In Proceedings of the IEEE Symposium on InformationVisualization (2002), pp. 110–116. 3

[Wat06] WATTENBERG M.: Visual exploration of multivariategraphs. In Proceedings of the ACM Conference on Human Fac-tors in Computing Systems (2006), pp. 811–819. 2

[WLS∗10] WEI F., LIU S., SONG Y., PAN S., ZHOU M., QIANW., SHI L., TAN L., ZHANG Q.: TIARA: a visual exploratorytext analytic system. In Proceedings of the ACM InternationalConference on Knowledge Discovery and Data Mining (2010),pp. 153–162. 2

[WMP∗05] WONG P. C., MACKEY P., PERRINE K., EAGAN J.,FOOTE H., THOMAS J.: Dynamic visualization of graphs withextended labels. In Proceedings of IEEE Symposium on Informa-tion Visualization (2005), pp. 73–80. 3

[WTP∗95] WISE J. A., THOMAS J. J., PENNOCK K., LANTRIPD., POTTIER M., SCHUR A., CROW V.: Visualizing the non-visual: Spatial analysis and interaction with information from textdocuments. In Proceedings of the IEEE Symposium on Informa-tion Visualization (1995), pp. 51–58. 3

[WV08] WATTENBERG M., VIÉGAS F. B.: The word tree, aninteractive visual concordance. IEEE Transactions on Visualiza-tion and Computer Graphics 14, 6 (Nov./Dec. 2008), 1221–1228.2, 3


Wordonoi: Visualizing the Structure and Textual Contents ...elm/projects/wordonoi/wordonoi.pdf ·...

Documents

Transcript of Wordonoi: Visualizing the Structure and Textual Contents ...elm/projects/wordonoi/wordonoi.pdf ·...