VISUAL ANALYSIS OF DOCUMENT TRIAGE DATA

Zhao Geng, Robert S. Laramee
Visual Computing Group, Computer Science Department, Swansea University, UK

Fernando Loizides, George Buchanan
Centre for HCI Design, City University London, UK

Keywords: Information visualization, document triage, evaluation

Abstract: As part of the information seeking process, a large amount of effort is invested in studying and understanding how information seekers search through documents in order to assess their relevance. This search and assessment of document relevance, known as document triage, is an important information seeking process, but is not yet well understood. Human-computer interaction (HCI) and digital library scientists have undertaken a series of user studies involving information seeking and collected a large amount of data describing information seekers' behavior during document search. At the same time, we have witnessed a rapid increase in the number of off-the-shelf visualization tools which can benefit document triage study. Here we set out to utilize existing information visualization techniques and tools in order to gain a better understanding of the large amount of user-study data collected by HCI and digital library researchers. We describe the range of available tools and visualizations we use in order to increase our knowledge of document triage. Treemap, parallel coordinates, stack graph, matrix chart, as well as other visualization methods, prove to be insightful in exploring, analyzing and presenting user behavior during document triage. Our findings and visualizations are evaluated by HCI and digital library researchers studying this problem.

1 INTRODUCTION

Document triage is an important stage of the information seeking process. It focuses on user behavior with respect to skimming, evaluating and organizing documents when searching for information. Various studies have been conducted (Buchanan and Loizides, 2007; Loizides and Buchanan, 2009) to explore users' behavior during document triage. Over the course of these studies, a large amount of qualitative and quantitative data is collected. However, understanding and analyzing this data is difficult in its raw form. Conventionally, these experimental data are analyzed by statistical methods and simple visualizations, such as bar charts, line graphs and pie charts. These simple visualizations are useful, but of limited help for high dimensional data. Thus, there is a great demand for summarizing and presenting the data in a more insightful way that HCI scientists can better utilize. This motivates the exploitation of more advanced information visualization techniques. In recent years, we have witnessed a rapid increase in the number of visualization tools for general use, such as XMDV (Ward, 1994), Mondrian (Theus, 2002), TopCat (Mark Taylor, 2005) and ManyEyes (Viegas et al., 2007). We investigate how well data collected by HCI and digital library researchers can be visualized by existing off-the-shelf information visualization tools and how well each can be applied. Results show that the amount of time spent on documents, pages and document features, as depicted by some of our visualizations, such as the treemap, parallel coordinates, stack graph and matrix chart, can help HCI and digital library scientists understand and explore user behaviors during document triage. On the other hand, we also learn that some of the visualizations, such as the 2D bar chart and the 2D and 3D scatterplots, are limited in their applicability to this problem. The advantages and disadvantages of the most promising visualization techniques are compared and evaluated.

In this paper we contribute the following:

• We present a novel attempt to systematically visualize experimental document triage data studying human behaviors using state-of-the-art information visualization methods.

• We survey the variety of state-of-the-art, off-the-shelf information visualization tools in order to help researchers from another domain gain insight into their experimental data.

• We compare and evaluate the various tools and visualizations with respect to their effectiveness in solving a given problem.

• The results of our investigation are evaluated by HCI and digital library researchers studying document triage.

The result of our study also provides the reader with a concise introduction to free, off-the-shelf information visualization applications and their features.

The rest of this paper is organized as follows. In Section 3 we briefly review the past and related work in the study of document triage. In Section 4, we describe in detail the data collected during the document triage study. In Section 6, we investigate different visualization techniques from various tools and evaluate the usefulness of each. Section 7 presents a summary of the interactivity and scalability of the visualization software. Section 8 contains feedback from the domain experts studying this problem. Section 9 wraps up with conclusions and the specifications for the visualization tools for HCI researchers.

2 Exploratory Specifications

The data provided by the HCI group, as discussed in Section 4, is quantitative in nature, including timings and numerical ratings from participants. The main aim of a visualization for the HCI researchers is a fast overview of the data in order to formulate hypotheses on a) relationships between document properties, times and ratings and b) common recurring patterns over all three areas mentioned in part (a). These can then be tested empirically for statistical significance. Thus far, hypotheses are inferred before the study from previous results or by observing the individual behavior of participants as they perform a specific task. Indeed, many hypotheses are speculative and are sometimes based on curiosity rather than evidence. We hope that visualizations will greatly decrease the time taken to formulate more grounded hypotheses and dismiss non-substantive data patterns. Furthermore, we hope to be able to test for patterns and relationships which may previously have gone unnoticed without visualization of the information.

3 Related work

Document triage is a highly manual process, ultimately leading to a relevance decision from the user. It is unlike the automatic information retrieval process, where the decision is made by the search engine. The document triage process begins after the automatic information retrieval process and before in-depth reading. The usual starting point of the document triage process is a results list. Currently, there is limited further assistance for information seekers after the information retrieval process. A typical results list includes a title and a small description of the contents of the document (whether that is a webpage or an academic paper). In general, document triage is a fast process. From previous research we see that when presented with a results list, information seekers will make fast decisions on the relevance of documents to their information need rather than scrutinizing the full documents further (Buchanan and Loizides, 2007). Numerous previous studies have examined the document triage behavior of information seekers (Loizides and Buchanan, 2009; Cool et al., 1993). Recent work has been geared toward summarization of documents for relevance overview. One specific area that has been gaining attention is that of tag clouds for document summarization (Gottron, 2009; Bateman et al., 2008).

Up to now, various visual analysis tools for document triage have been developed (Jonker et al., 2005; Bae et al., 2008), but these tools do not help HCI researchers visualize their user-study data. Because of the experimental nature of the study, we find it necessary to introduce more advanced existing information visualization tools to the HCI community, and help them better present and explore their various data sets.

There are many general purpose information visualization tools developed for industry, such as Eureka, SpotFire and InfoZoom (Kobsa, 2001). They provide various interactions that enable the "Visual Information Seeking Mantra": overview first, zoom and filter, details on demand (Keim, 2002; Shneiderman, 1996). Advanced tools such as OpenViz (ADVANCED VISUAL SYSTEMS INC., 2009) and ILog Discovery (Baudel, 2004) are integrated with multiple visualization techniques to handle complex data sets and queries. These tools are for commercial use. Our study focuses on free, off-the-shelf visualization software. As stated by Kobsa (Kobsa, 2001), when solving a specific problem, users, especially from other domains, might have great difficulties in selecting the most effective visualizations out of numerous choices. Also, the task questions posed by the domain experts often affect how they derive information from a visualization (Ziemkiewicz and Kosara, 2008). It is important to note that although there have been several general user-study evaluations of information visualizations (Kobsa, 2001; Ziemkiewicz and Kosara, 2008), the work presented here is not a general user-study but a very specialized investigation for a focused audience, namely, document triage researchers, although we do believe the work conducted here can benefit other users as well. Our work is to help document triage experts search for the visualizations that are most beneficial for their experimental data.

4 Background: User-Study Data

The data sets used in our visualizations were collected by our collaborating HCI researchers Buchanan and Loizides from their document triage user-study (Loizides and Buchanan, 2009). Their experiment aims to investigate human behaviors in the process of reading documents, searching for information and evaluating document relevance. During the study, 20 participants performed document triage on a closed corpus of electronic PDF documents, evaluating each for its suitability for two tasks. Log data of their interactions is captured. The documents provided for search range from short papers (2 pages) to full journal papers (29 pages). There are 6 documents, TABLET to TABLET5, in Task 1, and 10 documents, HCI to HCI9, in Task 2. In Task 1, the goal is to find material on the interfaces of tablet PCs. In Task 2, the goal is to find papers on specific CHI evaluation methods. The participants' ages range from 21 to 28. They are studying at postgraduate level in the computer science discipline, and all have experience with PDF reader software. In this section we briefly describe the data collected during the study.

1. Pre-study questionnaire: Study participants filled out a questionnaire before the experiments indicating their: (1) age, (2) number of years electronic document readers have been used, (3) average number of academic documents triaged per day, (4) average amount of time per day spent searching documents. Participants were also asked to rate the importance of the following document attributes on a scale from 1 to 10 (1 meaning "very irrelevant" and 10 meaning "very relevant"): main title, headings, introduction, plain text, conclusion, references, images and figures, highlighted and emphasized text. Participants also indicated their preference for searching on paper versus using a computer.

2. Data recorded for each participant during the study: For each participant, the total amount of time spent viewing each page of each electronic document was recorded to an accuracy of 1/5 of a second. From this, the total time viewing each document can be calculated. There are 10 types of document features appearing in the study. We abbreviate the document features in order to optimize the space for visualizations, as shown in Table 1.

Number  Feature          Abbreviation
1       Heading          He
2       Abstract         Ab
3       Keywords         Kw
4       General Term     Gt
5       Emphasized Text  Em
6       Figure           Fi
7       Conclusion       Co
8       Reference        Re
9       Picture          Pi
10      Plain Text       Pl

Table 1: Abbreviations of document features. Plain text means a page contains none of the features from 1 to 9. Emphasized text includes bullet points, bold text, italic text and underlined text. The abstract, keywords and general terms are features here, but were not included in previous literature (Loizides and Buchanan, 2009).

During the study, participants' viewing time on pages is logged. Although eye-trackers are not used in the current document triage study, the HCI researchers still want to see how document features would potentially influence participants' reading patterns. Thus the viewing time on document features is inferred based on the HCI researchers' hypothesis that more time spent on a page suggests more interest in the features on that page. The new findings or hypotheses obtained will then be used for further experiment design aided by eye-trackers. The frequency of a document feature on each page can be defined as $F_{i,k,q} = \frac{n_{i,k,q}}{N_{k,q}}$, where $n_{i,k,q}$ represents the number of occurrences of feature $i$ on page $k$ of document $q$, and $N_{k,q}$ represents the total number of occurrences of all features on page $k$ of document $q$.

Thus the viewing time on each feature can be estimated as $t_{i,k,q} = T_{k,q} \times F_{i,k,q}$, where $T_{k,q}$ represents the participants' average viewing time on page $k$ of document $q$.
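The two estimates above can be illustrated with a minimal Python sketch. It assumes each page is described by a list of feature abbreviations, one entry per occurrence; the function names and the sample page are hypothetical and are not taken from the study's logging software.

```python
from collections import Counter


def feature_frequencies(features_on_page):
    """Return F_i = n_i / N for each feature i on one page.

    features_on_page: list of feature abbreviations, one entry per
    occurrence (two headings means "He" appears twice).
    """
    counts = Counter(features_on_page)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: n / total for feature, n in counts.items()}


def estimated_feature_times(features_on_page, avg_page_time):
    """Return t_i = T * F_i, splitting the page's average viewing time."""
    freqs = feature_frequencies(features_on_page)
    return {feature: avg_page_time * f for feature, f in freqs.items()}


# Hypothetical page: two headings, one abstract, one keyword block,
# viewed for 12.0 seconds on average (made-up numbers).
page_features = ["He", "He", "Ab", "Kw"]
print(estimated_feature_times(page_features, 12.0))
# {'He': 6.0, 'Ab': 3.0, 'Kw': 3.0}
```

By construction the estimated feature times on a page always sum to the page's average viewing time, so the estimate redistributes logged time rather than measuring attention directly.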

3. Participant rating of document relevance: After each search task, participants were asked to assign each document a relevance score (1 meaning "worst" or very irrelevant and 10 meaning "best" or most relevant). Also, for each document, the significance of the following document features is recorded: (1) headings, (2) pictures, (3) figures, (4) emphasized text.

5 Objective Relevance Metrics

As part of this investigation we attempt to derive some objective document relevance metrics based on the TF × IDF score of key terms in the document (Lee et al., 1997). These objective metrics may then be used to gain insight into how effective participants are in their search for relevant information and can also be compared with subjective metrics. The comparison of objective and subjective document relevance metrics is shown in Figure 1 and Figure 2. The details can be found in the supplementary PDF file (due to space limitations).
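As a rough illustration of a TF × IDF based relevance score, the sketch below sums the weights of a set of key terms per document. It uses a common textbook formulation (relative term frequency times the logarithm of inverse document frequency), which is an assumption: the exact weighting used in this study is given only in the supplementary material. The toy corpus and key terms are hypothetical.

```python
import math


def tf_idf_scores(documents, key_terms):
    """Score each document by the summed TF x IDF weight of the key terms.

    documents: dict mapping a document name to its list of lowercase tokens.
    key_terms: terms describing the information need (hypothetical here).
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = {
        term: sum(1 for tokens in documents.values() if term in tokens)
        for term in key_terms
    }
    scores = {}
    for name, tokens in documents.items():
        score = 0.0
        for term in key_terms:
            if doc_freq[term] == 0 or not tokens:
                continue  # term absent from the corpus, or empty document
            tf = tokens.count(term) / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scores[name] = score
    return scores


# Toy corpus with made-up content (not the study's documents).
corpus = {
    "TABLET": "tablet pen interface tablet".split(),
    "HCI": "evaluation method study".split(),
    "HCI1": "interface evaluation".split(),
}
scores = tf_idf_scores(corpus, ["tablet", "interface"])
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

A score of this kind depends only on the text, which is what allows it to be set against the participants' subjective ratings.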

Figure 1: We can plot the subjective and objective ratings described in Sections 4 and 5 onto the bubble chart in ManyEyes (Viegas et al., 2007). The top chart presents the objective ratings and the bottom chart shows the subjective ratings. From the bottom bubble chart, we can observe that the difference in scores between documents is very slight. These two bubble charts highlight the discrepancies between the objective and subjective relevance metrics.

Figure 2: This figure shows the line graph plot of the subjective and objective rating scores. Documents in the order from HCI to HCI9, TABLET to TABLET6 are mapped to the x-axis. We can observe that, except for documents HCI1, HCI7, HCI8 and TABLET4, the subjective rating (the line above) and the objective score (the line below) correspond in a linear fashion.

6 Visualization

In this section, we utilize various existing visualization techniques and tools to investigate document triage data. The following list summarizes the tools and the visualizations we have experimented with:

• The ManyEyes (Viegas et al., 2007) application with the following visualizations: Wordle, Tag Cloud, TreeMaps, Line Graph, Stack Graph, Bar Chart, Bubble Chart, Scatterplot, and Matrix Chart

• The XMDV (Ward, 1994) application with the following: Parallel Coordinates, Scatterplot Matrix, Star Glyphs and Dimensional Stacking

• The Mondrian (Theus, 2002) application with the following visualizations: Bar Charts, Histograms, Parallel Coordinates, Boxplots, Scatterplot Matrix

• The Treemap Application 4.1 (Kobsa, 2004): TreeMaps

• The Topcat Application (Mark Taylor, 2005): 3D Scatterplots, Histogram, Sky, Lines and Density

• Microsoft Office 2007 (off, 2007): 3D Bar Chart, Radar Chart, 3D Line Graph, 3D Bubble Chart. Although Excel is a commercial application, we include it as an exception because we have a university license for this product.

• The Tableau Application (Heer et al., 2008): The free trial version only contains a few basic visualizations. Advanced options such as parallel coordinates are not available for use in the free trial.

In the following subsections, we describe the applications we used along with the visualizations of document triage data and evaluate the usefulness of each. We tried over 17 visualizations with several different variations for a total of 110 images. Each tool was systematically applied by visualization researchers to the same data described in Section 4 (Loizides and Buchanan, 2009). For each tool, we (1) reformatted the data to match the application's input requirements, (2) tried out each of the visualizations offered by the tool, and (3) evaluated the utility based on the domain experts' feedback. The visualizations are assessed by domain experts, namely the HCI scientists who carried out the user-study (Loizides and Buchanan, 2009). Due to space limitations, we cannot describe every visualization we tried out, but only those most relevant and beneficial to the investigation. The beneficial visualizations are able to provide more insight into the data set and help the HCI researchers form new findings and hypotheses. We also provide some of the less insightful visualizations as supplementary material. Some of the less beneficial visualizations include bar charts, bubble charts, and 2D and 3D scatterplots. We provide a tour through all of the visualizations in a supplementary video. Each visualization image, plus the supplementary video and PDF file, is stored in its original resolution on the supplementary website http://cs.swan.ac.uk/~cszg/docTriage.

Figure 3: The top image shows the 2D stack graph visualization of document triage data in ManyEyes (Viegas et al., 2007). The X-axis represents documents in both Task 1 and Task 2. The Y-axis represents the average viewing time on each page over all participants in each document. The strips in different colors represent the viewing time trend for individual pages. The number in every strip indicates the page number. The bottom image illustrates the 3D stack graph plotted in Microsoft Office Excel 2007. The X-axis is mapped to the page number, the Y-axis to the viewing time and the Z-axis to the documents. Compared with the 2D stack graph, we can gain an overview of all documents' time and page distribution and compare them more intuitively.

6.1 Stack Graph Visualization

The stack graph in ManyEyes is used to visualize the total change of a group of quantities over time (Viegas et al., 2007). During the document triage study, HCI scientists observe that a user's triage process can proceed in a linear fashion, starting with the first document and then reading and scoring every subsequent document (Buchanan and Loizides, 2007). The sequence of documents in Task 1 and 2 can be mapped to the time parameter of the stack graph. For each document, we can observe the changes in viewing time spent on individual pages from the top image in Figure 3. This visualization shows that participants spent most of their time viewing page one. Also, users spent less time on pages near the end of the documents. From the peak of each document, we can rank the documents by viewing time, e.g. HCI receives the most time and HCI(6) the least. Furthermore, we can compare individual pages of different documents; for example, all pages of TABLET(2) receive more viewing time than those of adjacent documents.

Figure 4: This figure shows the treemaps from ManyEyes (Viegas et al., 2007). The top image shows a Task-Document-Page-Feature hierarchy. The top row of the visualization shows the current tree hierarchy. Each document name in black bold characters is manually annotated. Different colors represent different documents. The document features on each page are mapped to the leaves of the tree. The bottom image shows a Task-Feature-Document-Page structure. Different colors represent distinct document features in both tasks. The feature names are manually annotated. The pages that features appear on are visualized as leaves of the structure.

The 2D stack graph utilizes accurate graphical perception encodings (Cleveland and McGill, 1985), such as position, length, area, angle, slope and color, to convey multiple data attributes to the user simultaneously. Also, an additional variate, namely documents, can be included in the visualization, as opposed to just two dimensions (page and time) in the bar chart, line graph or pie chart. However, too many pages in this visualization lead to problems such as very thin strips or degenerate line strips. It is difficult to discern the last few pages of longer documents (more than 10 pages), thus the length of such documents is difficult to infer.

The problems of degenerate or overlapping strips can be reduced in the 3D stack graph, which is available in Microsoft Office Excel 2007, as shown at the bottom of Figure 3. Compared with the 2D stack graph, the length of each document is clearly shown in 3D space. Also, we can see a general trend of participants' viewing time on documents and pages, which the 2D stack graph cannot offer. Although the 3D stack graph suffers from occlusion and perspective distortion (Shneiderman, 2003), with the help of interaction techniques such as zoom, pan, rotate and shading, the benefits provided by 3D outweigh the drawbacks in this particular case.

6.2 Treemap

HCI researchers study how document features influence user behaviors when searching documents. The relationship of viewing time between pages and document features may unveil user navigation patterns during document triage. In order to optimize the space to display more information, we abbreviate the nodes in the tree structure. Pg1, Pg2 etc. are page numbers. TA, TA1 etc. and HC, HC1 etc. represent documents in Task 1 and 2. The abbreviations of document features are given in Table 1. Each page includes features, such as headings, abstract, pictures, etc. Each feature is associated with the average viewing time. The treemap is an alternative representation of a tree diagram, introduced by Johnson and Shneiderman (Johnson and Shneiderman, 1991; Shneiderman, 1992). ManyEyes offers squarified treemaps, which use rectangles with an aspect ratio close to 1 that are ordered by size (Bruls et al., 2000). It also provides various navigation features, such as smooth zooming, hierarchy reordering and color mapping, for users to interact with different levels of the tree structure.

Figure 5: This image shows a matrix chart in ManyEyes (Viegas et al., 2007). Rows are mapped to the document features, columns to the documents, colors to page number and the size of each bar to time. Rows, columns and colors can only accept categorical data. This visualization depicts four variates at a time and displays a general view across the four variates.

We can create the treemap using a Task-Document-Page-Feature-Time hierarchy. We annotate the document names in the resulting treemap visualization to indicate the intermediate nodes, as shown at the top of Figure 4. In this visualization, page one, including its most frequent features, such as abstract (Ab), keyword (Kw) and headings (He), covers the most area in all documents except "HC6". Participants distribute their viewing time almost evenly across the documents in Task 1 and across some groups of documents in Task 2, even though the documents' lengths vary from 5 to 29 pages. We can hypothesize that participants' viewing time is mostly affected by page one, not by the document length. With the treemap, only one level in the tree structure can be displayed at a time. To compare different variates, we need to switch frequently between various tree depths, which is tedious and error-prone. Some authors try to visualize the changes of hierarchy in the treemap (Tu and Shen, 2007; Blanch and Lecolinet, 2007), but such attempts are not available in the tools presented in this paper. In order to further explore participants' viewing patterns, we need a visualization which can combine documents, pages and features together in just one view. This motivates the use of the matrix chart in Section 6.3.

The treemap Task-Document-Page-Feature hierarchy can be switched to a Task-Feature-Document-Page order. Each task contains several distinct document features. Each feature appears in different documents. We manually annotate the document features, as shown at the bottom of Figure 4. From this visualization, we can observe the distribution of document features. Visual components, such as figures, pictures and emphasized text, have a weaker impact than headings in terms of their population in documents and frequency in pages. Compared with Task 2, the area of plain text (Pl) in Task 1 is dramatically reduced. This might be because featureless pages appear in 6 documents in Task 2, but in only 2 documents in Task 1. We also observe that pictures and figures in Task 1 cover a much larger proportion than in Task 2. This might be because such features are spread over more pages in Task 1 than in Task 2. In order to further explore the influence of the feature distribution, we manually calculate participants' average viewing time on documents, and on pages with figures, pictures and plain text in both tasks. The result shows that, on average, participants spent more time viewing documents in Task 1 than in Task 2. Moreover, the viewing time on pages with figures and pictures is larger than on pages that only contain plain text. From these visual clues, we can form a hypothesis that pages rich in visual features, such as pictures and figures, might draw more attention from participants than pages only containing plain text. During the analysis, we need aggregation of numerical attributes: for example, the total and average viewing time for variates at levels higher than features, such as pages, documents and tasks, should be calculated and displayed. This function is not supported by the treemap in ManyEyes, but is supported in TreeMap 4.1 (Kobsa, 2004).
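The kind of aggregation described above can be sketched in a few lines of Python. This is only an illustration of rolling leaf values up a Task-Document-Page-Feature hierarchy, not a description of how TreeMap 4.1 implements it; the record layout, function name and sample values are hypothetical.

```python
from collections import defaultdict


def aggregate_times(records, level):
    """Total and average viewing time for every node at a hierarchy level.

    records: (task, document, page, feature, time) tuples, i.e. the leaves.
    level: number of leading fields that identify a node
           (1 = task, 2 = document, 3 = page).
    Returns {node_key: (total, average over the node's leaves)}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        key = record[:level]
        totals[key] += record[4]
        counts[key] += 1
    return {key: (totals[key], totals[key] / counts[key]) for key in totals}


# Made-up leaf records using the node abbreviations from this section.
records = [
    ("Task1", "TA", "Pg1", "He", 4.0),
    ("Task1", "TA", "Pg1", "Ab", 6.0),
    ("Task1", "TA", "Pg2", "Fi", 2.0),
    ("Task2", "HC", "Pg1", "He", 3.0),
]
print(aggregate_times(records, 2))
# {('Task1', 'TA'): (12.0, 4.0), ('Task2', 'HC'): (3.0, 3.0)}
```

Note that the average here is taken over leaf records; averaging over pages or participants instead would need a different denominator.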

Compared with the stack graph, the treemap is able to correlate four variates. It gives the researcher a clear outline of the elements in question and hierarchically classifies them by importance.

6.3 Matrix Chart Visualization

The matrix chart was introduced by Marsh (Marsh, 1992). It maintains the tabular organization of the data, but uses bars or bubbles to represent each of the elements in the table. The matrix chart is supported in ManyEyes (Viegas et al., 2007), and variates can be mapped to four visual attributes: rows, columns, size of the bubble or bar, and color, as shown in Figure 5. We adopt the bar chart for every row/column combination, because viewers can interpret changes in length more accurately than changes in area (Cleveland and McGill, 1985). From this visualization, we can gain an overview of four variates simultaneously. Headings (He) are the most popular features in most documents and pages. Page one often does not contain figures and pictures. Document "HC6" stands out as an outlier which contains the fewest pages and document features. We observe that document "TABLET2" is rich in visual document features, such as emphasized text (Em), pictures (Pi) and figures (Fi). Although this document only has 7 pages, it receives the second largest viewing time on average from participants. Also, we find that page one in document "TABLET2" receives less viewing time than most of the initial pages of other documents. This might extend our hypothesis drawn from Section 6.2: participants' viewing time is not only affected by the initial page, but also by the existence of visual document features in pages, such as emphasized text, pictures and figures. We also notice that pages containing a conclusion (Co) across all documents receive only a little average viewing time from all participants. This seems to contradict the HCI researchers' hypothesis, which suggests that participants pay more attention to the document's conclusion during the document triage process.

The matrix chart can present the same data set as the treemap. Although it is unable to depict the hierarchies, it offers a broad view encompassing all data attributes (Marsh, 1992). Compared with the treemap in Section 6.2, it provides aggregation to calculate average and total values of numerical attributes, which makes it more convenient for us to explore the anomalies and patterns among four variates.

Figure 7: This image shows a parallel coordinates plot in XMDV (Ward, 1994). From left to right, the first four axes show the percentage of each participant's viewing time on plain text, pictures, figures and page one respectively. The last three axes show the number of conclusions each participant viewed, and the percentage of each participant's viewing time on conclusions and headings respectively.

6.4 Parallel Coordinates Visualization

Parallel coordinates are used for displaying high-dimensional data (Inselberg and Dimsdale, 1990). During the document triage study, for each participant, the percentage of their viewing time on pages with pictures, plain text, figures, conclusions and headings is calculated. The percentage of viewing time on page one and the number of conclusions each participant viewed are also recorded. This multivariate data can be plotted on seven axes of parallel coordinates in XMDV (Ward, 1994). By reordering the axes, we can find several patterns of value, as shown in Figure 7. There are 20 polylines in the figure, each representing one participant's reading behavior on pages with document features. From this visualization, an inverse correlation between viewing time on page one and the number of viewed conclusions is clearly revealed. This implies that as participants spend more time on page one, they are likely to overlook the conclusions, and vice versa. Also, the number of conclusions being viewed and the viewing time they receive are correlated. The viewing times on figures, pictures and page one show a general trend toward inverse correlations. It could be that page one often does not contain pictures and figures, as discussed in Section 6.3, such that more time viewing page one means less time is spent on figures and pictures.
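A pattern read off adjacent parallel-coordinate axes can be checked numerically, in line with the goal of testing visually formed hypotheses empirically. The sketch below computes a Pearson correlation coefficient between two axes; the choice of Pearson's r is our assumption, and the five sample values are made up for illustration and are not the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five participants (not the study's data):
# share of viewing time on page one, and number of conclusions viewed.
page_one_share = [0.60, 0.45, 0.30, 0.25, 0.10]
conclusions_viewed = [1, 3, 5, 6, 9]
r = pearson(page_one_share, conclusions_viewed)
print(f"r = {r:.2f}")  # strongly negative for this made-up sample
```

A value of r near -1 corresponds to the crossing line pattern between two neighboring axes that signals an inverse correlation in a parallel coordinates plot.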

Parallel coordinates are able to unveil the correlations between the viewing time of conclusions, page one, figures and pictures: observations we were unable to make with previous visualizations. A disadvantage of parallel coordinates is that large data sets might cause clutter, which makes interpretation more difficult.

Figure 6: This figure shows a combination of bar charts, parallel coordinates and a scatterplot matrix in Mondrian (Martin and Simon, 2008; Theus, 2002). There are five variates in the visualization: task, document, page number, document feature and viewing time. Plots are fully linked to one another. From these visualizations, we can observe the distribution of the highlighted feature heading (He) across documents and pages respectively.

6.5 Coordinated, Multiple Views Visualization

Mondrian is a general purpose information visualization system. It allows multiple displays to represent one data set and links them by brushing and selection (Theus, 2002; Martin and Simon, 2008). Figure 6 shows 5 coordinated views using bar charts, parallel coordinates and a scatterplot matrix on our five-variate data: task, document, page number, document feature and viewing time. The viewing time on documents, pages and features in the three bar charts is sorted in ascending order. Picture (Pi) and figure (Fi) have nearly equal importance. The Page/Time bar chart reveals that participants focus on the first few pages and quickly skip over the last pages. As we brush heading (He) in the parallel coordinates, the other views are updated. However, the multiple views can only deal with a single table at one time. If we need to compare the participants' subjective scores and their estimated viewing time on the document features, we have to work in parallel with tables describing the pre-study questionnaire.

The power of Mondrian is its ability to visualize arbitrary dimensions of the data set separately. Due to the limitations of screen resolution, it may be difficult to display and interact with multiple views of a large data set simultaneously in Mondrian. Also, it can be difficult to infer which combination of visualizations is suitable and sufficient for HCI researchers to analyze their experimental data and answer their queries.

6.6 3D Bar Chart Visualization

There are 11 distinct visualization techniques in Microsoft Excel, including various 3D visualizations for general use, and each of them has multiple variations. HCI researchers may use Excel to organize their raw data, in which arbitrary table columns can be easily mapped to visual attributes. As shown in Figure 8, the 3D bar chart provides an interesting bird's-eye overview of all the participants' reading behaviors on documents and pages in the experiment. From this visualization, each individual participant's reading pattern can be displayed. However, it suffers from occlusion problems. These might be addressed by user interactions such as selection, smooth zooming, rotation and panning. Although there is a lot of debate on 3D interfaces (Shneiderman, 2003; Teyseyre and Campo, 2009), our data set is semantically rich, containing documents, pages, participants and viewing time. We therefore believe that the 3D bar chart is a way to further explore individual participants' reading patterns, provided that the software offers enough interaction support.

7 A Brief Summary of Interactions in the Tools

Our goal in this paper is not a general comparison of information visualization tools, but rather a specialized comparison dedicated solely to the investigation of document triage data. The beneficial visualization tools presented in this paper are XMDV (Ward, 1994), Mondrian (Martin and Simon, 2008), ManyEyes (Viegas et al., 2007), TreeMap 4.1 (Kobsa, 2004), TopCat (Mark Taylor, 2005) and Microsoft Excel 2007 (Microsoft, 2007). They provide a variety of visualizations and integrate different interaction options. The interaction designs of each tool are systematically applied to every visualization component within that tool. According to the taxonomies of Shneiderman (Shneiderman, 1996) and Keim (Keim, 2002), the data types to be visualized can be categorized as 1-, 2- and 3-dimensional (most of the tools use color to depict the third dimension), hyper-dimensional, text, tree and network data. In addition, based on the work of Keim (Keim, 2002) and Kosara et al. (Kosara et al., 2003) and on the need for visual exploration of document triage data, the most frequently used interaction techniques include filtering, brushing, linking, dimension manipulation (which here includes dimension reduction and re-ordering options) and dynamic projection. Filtering can be achieved either by a direct selection of the desired subset (browsing) or by a specification of properties of the desired subset (querying) (Keim, 2002). Dynamic projection refers to dynamically changing the projection of multi-dimensional data, as in the Matrix Chart in ManyEyes and the Scatterplot Matrix in XMDV and Mondrian (Ward, 1994; Theus, 2002). In this section, we present a brief summary of the tools introduced in this paper. Our summary is based on the tools' interaction designs and their scalability to various data types, as shown in the table in Figure 9.

Figure 8: This figure displays an overview of all 20 participants' status during the experiment in Excel 2007 (Microsoft, 2007). The X-axis is mapped to the document, the Y-axis to the participants and the Z-axis to the time spent viewing documents. This visualization provides an interesting overview of the data.

Figure 9: For each tool, we summarize its interaction techniques and supported data types. In addition, whether a tool contains 3D visualizations is also recorded. In every cell of the table, "Y" denotes that the specific interaction or data type is supported in that tool, and "N" denotes that it is not supported.
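The two forms of filtering distinguished in this section, browsing and querying, differ only in how the desired subset is specified. The following is a minimal sketch of that distinction; the row layout and the function names `filter_by_browsing` and `filter_by_querying` are hypothetical and do not correspond to the API of any of the tools discussed.

```python
def filter_by_browsing(rows, selected_ids):
    """Browsing: keep exactly the rows the user selected directly."""
    wanted = set(selected_ids)
    return [row for row in rows if row["id"] in wanted]

def filter_by_querying(rows, predicate):
    """Querying: keep the rows whose properties satisfy a stated condition."""
    return [row for row in rows if predicate(row)]

# Hypothetical viewing records (NOT the study data).
rows = [
    {"id": 1, "participant": "P1", "page": 1, "time": 12.0},
    {"id": 2, "participant": "P1", "page": 2, "time": 3.0},
    {"id": 3, "participant": "P2", "page": 1, "time": 9.5},
    {"id": 4, "participant": "P2", "page": 5, "time": 1.0},
]

# Browsing: the user clicks on two specific marks in a plot.
browsed = filter_by_browsing(rows, [1, 4])

# Querying: the user specifies a property of the desired subset.
queried = filter_by_querying(rows, lambda r: r["page"] == 1 and r["time"] > 5.0)

print([r["id"] for r in browsed], [r["id"] for r in queried])
```

Browsing requires the user to see and pick the items, which suits small displays, while querying scales to subsets that are too large or too scattered to select by hand.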

ManyEyes is able to handle all of the listed data types except high-dimensional data. During the document triage study, the data gathered would usually come from an Excel spreadsheet, an XML document or a text file. ManyEyes provides a good precedent to build upon regarding raw data input for custom visualizations. Since ManyEyes is deployed on the web, it saves the user a lot of time on software installation and configuration compared with desktop applications. To use this application, all we need is a username and password. In terms of ease of use, ManyEyes is without doubt the best of the six tools in our investigation. However, because of its social and collaborative nature, data uploaded to ManyEyes becomes visible to the public. This limits its usage with respect to data privacy.

XMDV and Mondrian, as complements to ManyEyes, are proficient in visualizing high-dimensional data. XMDV features interactive, proximity-based clustering, which is effective for reducing the clutter caused by large data sets. However, the structure-based brushing can be complicated for HCI researchers to use. Mondrian offers coordinated multiple views (CMV), which effectively unveil different facets of the data. Compared with XMDV, it provides a greater choice of visualizations. Mondrian is quite effective in plotting not only high-dimensional data but also large, low-dimensional data. Also, the input data format in Mondrian is more flexible. However, apart from changing the alpha value, Mondrian does not provide more advanced clutter reduction techniques, such as the clustering offered by XMDV (Ward, 1994).

TreeMap 4.1 (Kobsa, 2004) is specifically designed to implement the treemap. Compared with ManyEyes, it offers many more interaction options, such as numerical aggregation, various layouts and filtering. However, with respect to the aesthetic feel of the visualization, the HCI experts prefer ManyEyes, which provides more aesthetically pleasing color coding and animation when traversing the different hierarchies.

Excel and TopCat are the only tools presented in this paper that offer 3D visualization. Although there is a lot of debate on 3D visualization (Shneiderman, 2003), it is surprising that the HCI researchers show a stronger preference for the 3D scatterplot and bar chart shown in Figures 8 and 3. However, in order to fully exploit the potential of 3D visualizations, interaction support is very important. Although changes of viewing perspective, such as rotation, zooming and panning, are provided in Excel and TopCat, shading, which can effectively depict depth information, is missing in both tools.

The first factor that became apparent is that no single visualization or tool on its own can identify all the patterns and behaviors that need to be tested. Also, from the table in Figure 9, we can see that no tool is able to support all of the data types and interactions. Furthermore, some visualizations, like the bubble chart, can cause the researcher to miss patterns and make false inferences, as discussed in Section 8. In light of this, it would be reasonable for a bespoke tool to include several visualizations in parallel. Therefore, a coordinated multiple view application is needed that allows for a) several visualizations of the same data and b) one visualization with different data sets. Ideally, visualizations for the document triage data would include the line graph, 3D stack graph, treemaps and parallel coordinates.

Overall, visualizations are underused in the HCI community as a means of interacting with extracted data sets. In this research we have explored the ways in which visualizations enrich the exploration of relationships between different data sets of the same study. As an exploratory tool, these visualizations provide insight for formulating hypotheses about our data that are not evident from the raw material. Future work aims to apply the visualizations presented here, as well as further visualizations, not only to help decipher the raw data, but also to assist users performing triage in making inferences about their material.

8 Domain Expert Review

We, the researchers of the document triage study, are impressed to see a multitude of visualizations that can represent our data. What became immediately evident was the ease and speed with which these visualizations could be produced. We systematically went through the visualizations, identifying the immediate inferences that would have been possible before statistical analysis, but also factors that may prevent useful hypotheses from being formed. Analysis thus far has relied on statistical scrutiny such as t-tests. Although these are necessary for verifying a relationship or pattern, they do not provide a good means of exploring the data. Here, we discuss the most significant observations and compare some of the visualizations presented in the paper.

The first visualization that caught our attention was the bubble chart in Figure 1, which compares the participants' subjective scores for the relevance of the documents with the TFxIDF scores (Lee et al., 1997). By simply viewing the bubble chart, we are mistakenly led to think that, because the bubbles differ in size, there is no correlation between the occurrence of popular terms in the documents and the participant rankings. We note that this is a test that was not performed when looking at the data without the help of visualizations. We were then surprised to observe that the line graph visualization in Figure 2 revealed a relationship between term occurrence and document ratings. It seems that, although the subjective bubbles in the first visualization differ in size from the objective bubbles, the size proportions between the corresponding documents in each category are positively correlated. We can see from this that the bubble visualization would be useful to our work when comparing two groups, but care needs to be taken to match the correct type of variables for the two groups. Due to this limitation and the risk of misinterpreting the data, we consider it quite difficult to apply the bubble chart visualization to effective exploratory research in our work. We are, however, convinced that the line graph visualization can produce much more accurate overviews of relations between groups and patterns.
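For readers unfamiliar with the TFxIDF score used as the objective measure above, the following is a minimal sketch of one common variant, in which a term's weight in a document is its raw count multiplied by log(N / df). This variant, the whitespace tokenization and the function name `tf_idf` are assumptions made for illustration; the weighting actually used in the study follows Lee et al. (1997) and may differ in its details.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    tf is the raw count of the term in the document, N the number of
    documents and df the number of documents containing the term.
    """
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights

# Hypothetical three-document collection (NOT the study documents).
docs = [
    "document triage triage",
    "document visualization",
    "document treemap",
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as "document" here, receives a weight of zero, while a term concentrated in one document is weighted up. This is why the score can serve as an objective estimate of how strongly a document matches a topic.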

Closely related in information representation is the treemap visualization in Figure 4. The flexibility that this visualization offers in manually changing the hierarchy of the data to be processed gives it an advantage. Furthermore, the representation areas give a much clearer means of comparing features and timings. However, as hypothesized, this visualization lacks specific detail. For it to succeed at providing useful specific information to an HCI researcher, it would need further interaction capabilities.

Another beneficial representation of the data is found in the 3D stack graph in Figure 3. Beyond giving more information than the closely related 2D stack graph (which basically gives us the average values of all the pages), it allows us to detect further interesting behaviors worth exploring. For example, we notice the importance of the first page, but also the steady decline in attention as the page count increases. We also observe peaks close to the ends, which require further scrutiny, but also that the decline in attention is mostly steady. The 'anomalies' in the decline attest to one of two things: a) a sharp drop in attention at a specific point in the document or b) an increased amount of attention. We can therefore infer that further features also attract attention, and test for the impact each feature has on attention.

One of the most interesting visualizations we came across was that of parallel coordinates on the right in Figure 7. The features of a document and the influence they have, time-wise, on participants constitute a very important part of our data pool. This visualization gives us a clear image of the percentage of time spent on those features in an easy-to-compare format. Although there is great potential for this specific visualization, we do have two criticisms. The first concerns the upper and lower limits of the vertical axes. For every feature, the maximum percentage time is set to the upper limit of its axis, which gives a false comparison between the feature values. This should be remedied in order to facilitate clearer comparison. Another improvement which would increase comparative ability between data sets would be the ability to produce superimposed average values, standard deviations and multiple side-by-side visualizations of data sets.
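The axis-limit criticism above can be made concrete by contrasting two ways of scaling percentage data before plotting. The sketch below is a hypothetical illustration, not how XMDV computes its axes; the function names `scale_per_axis` and `scale_shared`, the feature names and the values are assumptions.

```python
def scale_per_axis(table):
    """Scale each axis by its own maximum, so every axis tops out at 1.0."""
    scaled = {}
    for axis, values in table.items():
        top = max(values)
        # Guard against an all-zero axis to avoid division by zero.
        scaled[axis] = [v / top if top else 0.0 for v in values]
    return scaled

def scale_shared(table, upper=100.0):
    """Scale every axis by the same upper limit (100 for percentages)."""
    return {axis: [v / upper for v in values] for axis, values in table.items()}

# Hypothetical percentages of viewing time for three participants.
table = {
    "headings": [2.0, 4.0, 5.0],
    "plain_text": [40.0, 60.0, 80.0],
}

per_axis = scale_per_axis(table)
shared = scale_shared(table)

# Per-axis scaling puts both maxima at the top of their axes, so a 5%
# share looks as large as an 80% share; shared scaling keeps them apart.
print(max(per_axis["headings"]), max(per_axis["plain_text"]))
print(max(shared["headings"]), max(shared["plain_text"]))
```

With a shared upper limit, positions on different axes become directly comparable, at the cost of compressing axes whose values are all small.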

9 Conclusions And Future Work

We have surveyed a range of off-the-shelf, freely available information visualization tools for the visual analysis and investigation of document triage. Although there are many options available, only a select few visualizations are useful for this particular application. The beneficial visualizations are able to reveal relationships within, and provide more insight into, the document triage experimental data sets, which have high dimensions and contain both categorical and numerical data types. In this paper, we have carried out exploratory tasks on these data sets. A range of less beneficial visualizations, and the full set of all visualizations, are provided in supplementary material and a video available at our URL. Our study also serves as a useful resource for readers interested in gaining an overview of existing, free, state-of-the-art information visualization tools. We also report positive and constructive feedback from experts in the HCI and digital library domain. Since summarized representations of documents rely heavily on text presented to the user, in the future we will focus on another interesting visualization that has potential in the document triage process, namely text visualization, such as Wordle (Viegas et al., 2009), which highlights the most frequently occurring terms and so hints at the importance of the document to the information need of a seeker.

REFERENCES

Microsoft (2007). Microsoft Office Excel 2007 product guide. Microsoft Office.

ADVANCED VISUAL SYSTEMS INC. (2009). OpenViz. 300 Fifth Avenue, Waltham, MA 02451. http://www.avs.com.

Bae, S., Hsieh, H., Kim, D., Marshall, C., Meintanis, K., Moore, M., Zacchi, A., and Shipman, F. (2008). Supporting Document Triage via Annotation-based Visualizations. In Proceedings of the American Society for Information Science and Technology, volume 45, pages 1–16.

Bateman, S., Gutwin, C., and Nacenta, M. (2008). Seeing Things in the Clouds: The Effect of Visual Features on Tag Cloud Selections. In Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, pages 193–202, New York, NY, USA. ACM.

Baudel, T. (2004). Browsing Through an Information Visualization Design Space. In Proceedings of ACM CHI Conference on Human Factors in Computing Systems, volume 2 of Demonstrations, pages 765–766.

Blanch, R. and Lecolinet, E. (2007). Browsing Zoomable Treemaps: Structure-Aware Multi-Scale Navigation Techniques. IEEE Transactions on Visualization and Computer Graphics, 13(6):1248–1253.

Bruls, M., Huizing, K., and van Wijk, J. J. (2000). Squarified Treemaps. In Proceedings of Joint Eurographics/IEEE TVCG symposium Visualization, pages 33–42.

Buchanan, G. and Loizides, F. (2007). Investigating Document Triage On Paper And Electronic Media. In Proceedings of the European Conference on Research and advanced Technology for Digital Libraries, 4675:416–427.

Cleveland, W. S. and McGill, R. (1985). Graphical Perception and Graphical Methods for Analyzing Scientific Data. Science, 229(4716):828–833.

Cool, C., Belkin, N. J., Frieder, O., and Kantor, P. (1993). Characteristics of Texts Affecting Relevance Judgments. In 14th National Online Meeting, pages 77–84.

Gottron, T. (2009). Document Word Clouds: Visualising Web Documents as Tag Clouds to Aid Users in Relevance Decisions. In Research and Advanced Technology for Digital Libraries, 13th European Conference, Proceedings, volume 5714 of Lecture Notes in Computer Science, pages 94–105. Springer.

Heer, J., Mackinlay, J. D., Stolte, C., and Agrawala, M. (2008). Graphical Histories for Visualization: Supporting Analysis, Communication, and Evaluation. IEEE Transactions on Visualization and Computer Graphics, 14(6):1189–1196.

Inselberg, A. and Dimsdale, B. (1990). Parallel Coordinates: A Tool for Visualizing Multi-dimensional Geometry. In Proceedings of IEEE Visualization, pages 361–378.

Johnson, B. and Shneiderman, B. (1991). Tree Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures. In Proceedings of IEEE Visualization, pages 284–291.

Jonker, D., Wright, W., Schroh, D., Proulx, P., and Cort, B. (2005). Information Triage With Trist. In Proceedings of Intelligence Analysis, pages 1–6.

Keim, D. A. (2002). Information Visualization and Visual Data Mining. IEEE Transactions on Visualization and Computer Graphics, 8:1–8.

Kobsa, A. (2001). An Empirical Comparison of Three Commercial Information Visualization Systems. In Proceedings of IEEE Symposium on Information Visualization, San Diego, CA, pages 123–130.

Kobsa, A. (2004). User Experiments with Tree Visualization Systems. In Proceedings of IEEE Symposium on Information Visualization, pages 9–16. IEEE Computer Society.

Kosara, R., Hauser, H., and Gresh, D. (2003). An Interaction View on Information Visualization. In Proceedings of Eurographics, pages 123–137.

Lee, D. L., Chuang, H., and Seamons, K. E. (1997). Document Ranking and the Vector-Space Model. IEEE Software, 14(2):67–75.

Loizides, F. and Buchanan, G. (2009). An Empirical Study of User Navigation during Document Triage. In Proceedings of Research and Advanced Technology for Digital Libraries, 13th European Conference, volume 5714 of Lecture Notes in Computer Science, pages 138–149. Springer.

Mark Taylor (2005). TOPCAT - Tool for OPerations on Catalogues And Tables Version 3.4-3. Starlink development.

Marsh, S. (1992). The Interactive Matrix Chart. ACM SIGCHI Bulletin, 24(4):32–38.

Martin, T. and Simon, U. (2008). Interactive Graphics for Data Analysis: Principles and Examples (Computer Science and Data Analysis). Chapman & Hall/CRC.

Shneiderman, B. (1992). Tree Visualization With Treemaps: a 2-d Space-filling Approach. ACM Transactions on Graphics, 11(1):92–99.

Shneiderman, B. (1996). The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of IEEE Symposium on Visual Languages, pages 336–343.

Shneiderman, B. (2003). Why Not Make Interfaces Better than 3D Reality? IEEE Computer Graphics and Applications, 23(6):12–15.

Teyseyre, A. R. and Campo, M. R. (2009). An Overview of 3D Software Visualization. IEEE Transactions on Visualization and Computer Graphics, 15(1):87–105.

Theus, M. (2002). Interactive Data Visualization Using Mondrian. Journal of Statistical Software, 7(11):1–9.

Tu, Y. and Shen, H.-W. (2007). Visualizing Changes of Hierarchical Data using Treemaps. IEEE Transactions on Visualization and Computer Graphics, 13(6):1286–1293.

Viegas, F. B., Wattenberg, M., and Feinberg, J. (2009). Participatory Visualization with Wordle. IEEE Transactions on Visualization and Computer Graphics, 15(6):1137–1144.

Viegas, F. B., Wattenberg, M., van Ham, F., Kriss, J., and Mckeon, M. (2007). ManyEyes: A Site for Visualization at Internet Scale. IEEE Transactions on Visualization and Computer Graphics, 13(6):1121–1128.

Ward, M. O. (1994). XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data. In Proceedings of IEEE on Visualization, pages 326–336. IEEE Computer Society Press.

Ziemkiewicz, C. and Kosara, R. (2008). The Shaping of Information by Visual Metaphors. IEEE Transactions on Visualization and Computer Graphics, 14(6):1269–1276.
