EXPRESSIVE FORMS OF TOPIC MODELING TO SUPPORT DIGITAL HUMANITIES
Samah H. Gad
Dissertation submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science and Applications
Naren Ramakrishnan, Chair
Andrea L. Kavanaugh
Christopher L. North
Eli Tilevich
Niklas L. Elmqvist
September 8, 2014
Blacksburg, Virginia
Keywords: Topic Modeling, LDA, Time Series Segmentation, Visual Analytics
Copyright ©2014, Samah H. Gad
Expressive Forms of Topic Modeling to Support Digital Humanities
Samah H. Gad
(ABSTRACT)
Unstructured textual data is rapidly growing and practitioners from diverse disciplines are experiencing
a need to structure this massive amount of data. Topic modeling is one of the most used
techniques for analyzing and understanding the latent structure of large text collections. Probabilistic
graphical models are the main building block behind topic modeling and they are used to express
assumptions about the latent structure of complex data. This dissertation addresses four problems
related to drawing structure from high dimensional data and improving the text mining process.
Studying the ebb and flow of ideas during critical events, e.g., an epidemic, is very important
to understanding the reporting or coverage around the event or the impact of the event on
society. This can be accomplished by capturing the dynamic evolution of topics underlying a
text corpus. We propose an approach to this problem by identifying segment boundaries that
detect significant shifts of topic coverage. In order to identify segment boundaries, we embed a
temporal segmentation algorithm around a topic modeling algorithm to capture such significant
shifts of coverage. A key advantage of our approach is that it integrates with existing topic modeling
algorithms in a transparent manner; thus, more sophisticated algorithms can be readily plugged in as
research in topic modeling evolves. We apply this algorithm to data from the iNeighbors
system, covering six neighborhoods (three economically advantaged and three
economically disadvantaged), to evaluate differences in conversations for statistical significance.
Our findings suggest that social technologies may afford opportunities for democratic engagement
in contexts that are otherwise less likely to support opportunities for deliberation and participatory
democracy. We also examine the progression in coverage of historical newspapers about the 1918
influenza epidemic by applying our algorithm to the Washington Times archives. The algorithm is
successful in identifying important qualitative features of news coverage of the pandemic.
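The boundary-detection idea above can be illustrated with a minimal sketch. It assumes each time window has already been summarized as a vector of per-topic counts by some topic model, and it flags a boundary wherever the topic distributions of two adjacent windows look statistically dependent on the window. The Pearson chi-square statistic, the fixed threshold, and the names `chi_square_statistic` and `find_boundaries` are illustrative assumptions, not the procedure developed in this dissertation.

```python
from typing import List, Sequence


def chi_square_statistic(window_a: Sequence[int], window_b: Sequence[int]) -> float:
    """Pearson chi-square statistic for a 2 x K contingency table.

    Row 1 holds per-topic counts for one window, row 2 for the adjacent window.
    (Hypothetical helper; the choice of statistic is an assumption.)
    """
    if len(window_a) != len(window_b):
        raise ValueError("both windows must cover the same set of topics")
    total_a, total_b = sum(window_a), sum(window_b)
    grand_total = total_a + total_b
    # An empty window carries no evidence of a shift.
    if total_a == 0 or total_b == 0:
        return 0.0
    statistic = 0.0
    for count_a, count_b in zip(window_a, window_b):
        column_total = count_a + count_b
        if column_total == 0:
            continue  # topic absent from both windows
        for observed, row_total in ((count_a, total_a), (count_b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def find_boundaries(windows: List[Sequence[int]], threshold: float) -> List[int]:
    """Return indices i where window i differs significantly from window i - 1."""
    return [
        i
        for i in range(1, len(windows))
        if chi_square_statistic(windows[i - 1], windows[i]) > threshold
    ]


# Three windows over three topics; coverage shifts sharply in the last window.
topic_counts = [[30, 5, 5], [28, 6, 6], [5, 30, 5]]
# 5.991 is the chi-square critical value for 2 degrees of freedom at alpha = 0.05.
print(find_boundaries(topic_counts, threshold=5.991))
```

Because the sketch only consumes per-window topic counts, any topic modeling algorithm that produces such counts could sit underneath it.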
Visually convincing results from data mining algorithms and models are crucial to analyzing and
drawing conclusions from the algorithms. We develop ThemeDelta, a visual analytics system for
extracting and visualizing temporal trends, clustering, and reorganization in time-indexed textual
datasets. ThemeDelta is supported by a dynamic temporal segmentation algorithm that integrates
with topic modeling algorithms to identify change points where significant shifts in topics occur.
This algorithm detects not only the clustering and associations of keywords in a time period, but
also their convergence into topics (groups of keywords) that may later diverge into new groups.
The visual representation of ThemeDelta uses sinuous, variable-width lines to show this evolution
on a timeline, utilizing color for categories, and line width for keyword strength. We demonstrate
how interaction with ThemeDelta helps capture the rise and fall of topics by analyzing archives of
historical newspapers, of U.S. presidential campaign speeches, and of social messages collected
through iNeighbors. ThemeDelta is evaluated using a qualitative expert user study involving three
researchers from rhetoric and history using the historical newspapers corpus.
Time and location are key parameters in any event; neglecting them while discovering topics
from a collection of documents results in missing valuable information. We propose a dynamic
spatial topic model (DSTM), a true spatio-temporal model that enables disaggregating a corpus’s
coverage into location-based reporting, and understanding how such coverage varies over time.
DSTM naturally generalizes traditional spatial and temporal topic models so that many existing
formalisms can be viewed as special cases of DSTM. We demonstrate a successful application of
DSTM to multiple newspapers from the Chronicling America repository. We demonstrate how our
approach helps uncover key differences in the coverage of the flu as it spread through the nation,
and provide possible explanations for such differences.
Major events that can change the flow of people’s lives are important to predict, especially
when we have powerful models and sufficient data available at our fingertips. The problem of
embedding the DSTM in a predictive setting is the last part of this dissertation. To predict events
and their locations across time, we present a predictive dynamic spatial topic model that can predict
future topics and their locations from unseen documents. We show the applicability of our
proposed approach by applying it to streaming tweets from Latin America. The prediction approach
was successful in identifying major events and their locations.
Contents
1 Introduction
1.1 Motivation and Research Questions
1.2 Contributions
1.3 Outline of the Dissertation
2 Datasets
2.1 iNeighbors
2.2 Chronicling America Historical Newspapers
2.3 Presidential Campaigns Press Releases
2.4 Tweets from Latin America
3 Survey of Related Research
3.1 Temporal Topic Modeling
3.2 Topic Modeling other Extensions
3.2.1 Topic Modeling for Short Text
3.2.2 Syntactic Topic Modeling
3.2.3 Sentiment Analysis and Topic Modeling
3.2.4 Author Topic Model
3.2.5 Spatial Topic Modeling
3.3 Temporal Text Visualization
4 Dynamic Temporal Topic Modeling
4.1 Segmentation Algorithm
4.2 Algorithm Applications
4.2.1 Bridging the Divide in Democratic Engagement: Studying Conversation Patterns in Advantaged and Disadvantaged Communities
4.2.2 Digging into Historical Newspaper Archives using Dynamic Temporal Segmentations over Topic Models
4.3 Summary
5 New Visual Analytic Representations
5.1 ThemeDelta Overview
5.1.1 Data Format
5.1.2 Implementation
5.2 ThemeDelta: Visual Representation
5.2.1 Visual Design
5.2.2 Interaction
5.2.3 Layout
5.3 Domain Specific Applications
5.3.1 U.S. 2012 Presidential Campaign
5.3.2 i-Neighbors Social Messages
5.3.3 Historical U.S. Newspapers
5.4 Qualitative User Study
5.4.1 Method
5.4.2 Results
5.5 Summary
6 Dynamic Spatial Topic Model
6.1 Proposed Model
6.2 Parameter Approximation
6.3 Model Evaluation
6.4 Model Applications
6.4.1 East, west, midwest 1918-1919 news coverage
6.4.2 1918-1919 Influenza related tones, topics, and locations
6.5 Summary
7 Predictive Analysis
7.1 Prediction Approach
7.2 Latin America Unrest Prediction
7.3 Summary
8 Conclusion
8.1 Dynamic Temporal Segmentations over Topic Models
8.2 New Visual Analytics Representations
8.3 Dynamic Spatial Topic Model
8.4 Predictive Analysis
Bibliography
List of Figures
1.1 Sample pages from Bisbee Daily Review-AZ (December 17, 1918), the New York Tribune-NY (September 01, 1918), and Red Lake News-MN (November 01, 1918).
2.1 i-Neighbors: Social networking service connecting residents of geographic neighborhoods [iNe, 2012].
2.2 Distribution of messages across neighborhoods.
2.3 Distribution of messages across neighborhoods.
2.4 Distribution of influenza reporting over the years 1918 and 1919.
2.5 Reporting concentration across locations and time.
4.1 Contingency table used to evaluate independence of topic distributions for two adjacent windows [Gad et al., 2012].
4.2 Partial segmentation output from a low-poverty neighborhood.
4.3 Partial segmentation output from a high-poverty neighborhood.
4.4 Partial segmentation output from a low-poverty neighborhood.
4.5 Durations of segments in advantaged (low poverty) and disadvantaged (high poverty) neighborhoods.
4.6 Example clusters of discovered segments across neighborhoods.
4.7 Segmentation results for The Washington Times Influenza paragraphs from September 1918 to December 1918.
4.8 Segmentation results for The Washington Times front pages from September 1918 to December 1918.
5.1 ThemeDelta visualization for Barack Obama campaign speeches during the U.S. 2012 presidential election (until September 10, 2012). Green lines are shared terms between Obama and Romney. Data from “The American Presidency Project” at UCSB (http://www.presidency.ucsb.edu/).
5.2 Basic visual representation used by ThemeDelta.
5.3 ThemeDelta visualization after performing a filtering operation, based on the keyword “energy”, in the visualization presented in Figure 5.1.
5.4 Comparison of different stages of the layout sorting algorithm used for the ThemeDelta technique.
5.5 ThemeDelta visualization for Mitt Romney campaign speeches for the U.S. 2012 presidential election (as of September 10, 2012). Green lines are shared terms between Obama and Romney speeches. Data from the American Presidency Project at UCSB (http://www.presidency.ucsb.edu/).
5.6 Result of searching for the word “watch” in low-poverty neighborhood.
5.7 Partial output from a high-poverty neighborhood.
5.8 Partial output from a low-poverty neighborhood.
5.9 ThemeDelta visualization for newspaper paragraphs during the period September to December in 1918. Color transparency for different trendlines signifies the global frequency for that keyword.
6.1 Graphical model representation of the DSTM for three consecutive time slices.
6.2 Perplexity as a function of number of topics.
6.3 Perplexity as a function of vocabulary size.
6.4 New York Tribune, NY DSTM Output.
6.5 The Evening Missourian, MO DSTM Output.
6.6 Bisbee Daily Review, AZ DSTM Output.
6.7 Tones distribution over Influenza reporting.
6.8 East coast newspapers discovered topics and locations grouped by tones.
7.1 Experimental setup for predicting topics and their locations from streaming data.
7.2 Predicted topics and their locations from the 8th day of June 2013.
7.3 Predicted topics and their locations from June 29th, 2013.
List of Tables
2.1 The six neighborhoods studied in our experiments.
2.2 Daily newspapers summary.
4.1 Event Timeline created from Front Pages of The Washington Times (1918).
6.1 DSTM notation
7.1 Sample topic-document (tweet) assignment
7.2 Predicted topic assignment counts for June 8th, 2013.
7.3 Predicted topic assignment counts for June 29th, 2013.
Chapter 1
Introduction
Historians and humanists are rapidly embracing the notion of ‘big data’ [Grossman, 2012] as a
context to pose and investigate their research questions. The application of algorithmic techniques
enables them to systematically explore a broad repository of data and identify qualitative features of
a phenomenon (response, sentiment, and associations) in the small scale as well as the genealogy of
information flow in the large scale.
The field of humanities has traditionally relied on close reading of documents on a topic of
interest. The increasing availability of electronic document archives and their rapid growth has
ushered in the era of digital humanities, and what is referred to as distant reading. Distant reading
entails the comprehension of literature ‘not by studying particular texts, but by aggregating and
analyzing massive amounts of data’ [Schulz, 2011].
A key area that can benefit from distant reading of hundreds of texts is in the comprehension
of newspaper coverage of significant events, such as the 1918 ‘Spanish’ flu. Understanding the
coverage of reported infected locations across local and national newspapers (see Fig. 1.1) is a key
step that can help us understand how news propagated through time and space in those early times,
when newspapers were the only widely used information resource.
A different medium, also relevant to modern digital humanities research, spans personal
[Figure 1.1: Sample pages from Bisbee Daily Review-AZ (December 17, 1918), the New York Tribune-NY (September 01, 1918), and Red Lake News-MN (November 01, 1918).]
place for themselves in their re¬spective communities, having as¬sumed at home the graver respon¬sibilities of life in many spheres,looking back upon honorable rec¬ords in civil and industrial life,they will realize as perhaps noothers could how entirely theirown fortunes and the fortunes ofall whom they love are put atstake in this war for right andwill know that the very recordsthey have made render this newduty the commanding duty oftheir lives. They know how sure¬ly this is the nation's war, howimperatively it demands the mob¬ilization and massing of all ourresources of every kind. Theywill regard this call as the supremecall of their day and will answerit accordingly.
Only a portion of those whoregister will be called upon tobear arms. Those who are notphysically fit will be excused;those exempted by alien allegi¬ance; those who should not be re¬lieved of their present responsi¬bilities; above all those who can¬not be spared from the civil andindustrial tasks at home uponwhich the success of our armiesdepends as much as upon thefighting at the front. But allmust be registered in order thatthe selection for military servicemay be niade intelligently andwith full information. This willbe our final demonstration of loy¬alty, democracy and the will towin, our solemn notice to all theworld that we stand absolutely to¬gether in a common resolutionand purpose. It is the call toduty to which every true man inthe country will respond withpride and with the consciousnessthat in doing so he plays his partin vindication of a great cause atwhose summons every true heartoffers its supreme service.
Peace ManiaSweeps Berlin;Hertling To GoDr. Solf Expected to Suc¬
ceed Chancellor; Sol¬diers Mutiny
LONDON, Aug. 31. It is rumornd inBerlin, according to a dispatch fromAmsterdam to the Central NewsAgency, that Chancellor von Hertlingshortly will retire owing to his" ad¬vanced age and wilt be succeeded byDr. W. S. Snlf, the German ColonialSecretary.The Germans have been seized with
a sort of "peace mania," according tothe frontier correspondent of the Am¬sterdam "Telegraaf." The events inFrance have made such a 'profoundimpression that the Germans onemeets along the frontier are indif¬ferent to the prospect of the defeato the Central Empires, and only wishto get peace as quickly as possible.The correspondent declares that two
German regiments in Russia refused togo to the Western front and that 130soldiers were shot. Seven hundred ofthe bodyguards at Munich refused togo to the front and barricaded them¬selves in their barracks until they werecompelled to surrender, the correspon¬dent _nys.
Count Georg von Hertling is seventy-five years old, having been born inDarmstadt in 1843. He succeeded to theChancellorship late in October, 1917,and by rallying around him other con¬servatives in the Clerical party, suc¬ceeded in breaking up the anti-govern¬ment bloc in the Reichstag.The mentioning of Dr. Solf as hissuccessor may be looked upon as a newstep in the German peace offensive, asthe. Colonial Secretary has shown by hisrecent answer to Lloyd George and inother utterances a more conciliatory at¬titude toward Allied war aims than thepresent Chancellor has ever exhibited.
......?---.
German Submarine SinksAnother Spanish Vessel
PARIS, Aug. 31. Another Spanishship, the Alexandrine, has been tor-pedoed, according to a Madrid di.patchto the "Journal." \ _h
Russians FillDepletedRanksOf Hun ArmyAllied Intervention HaltsFlow of Large Body ofReserve» to Germany(-Special Dispatch to 'Ihr Tribune)
WASHINGTON,- Aug. 31. Fourmonths ago, according to offlcinl in¬telligence received here, Germany wn»
recruiting large numbers of Russiansfor service in the German army, andit is only now that the flow of freshtroops from Russian provinces has beenarrested.The situation threatened afone time
to furnish to Germany all the reservesshe might need, making the solution ofher man-power problem appear com¬
paratively simple. The defeat of theenemy project for drawing upon Rus¬sia for men to fill the enormous gapsin the German armies ig attributed tothe intervention of tho Allies in Rus¬sia and to the action in Northern Rus¬sia rather than in Siberia.The danger of the revival of this re¬
cruiting in Russia has not yet beenended definitely, but it i? believed thatthe larger the contact mule by the Al¬lies with Russia JUie less soldiers Ger¬many will obtain%rom that country.One of the reasons which made Ger¬
man recruiting in Russia comparativelyeasy was the fact that the former sol¬diers of the Russian army, without em¬
ployment and without food, were will¬ing to accept any occupation, ccn thatof fighting with the enemies of Russiam order to obtain the means of living.So far as is known here, no Russians
serving in the German army have beenidentified on the Western front, and itis assumed that they have been used torelieve Germans heretofore employedin war industries for active service.
It is doubted that the Russians wouldfight efficiently and happily against theAllies, although under the German dis¬cipline and if mixed with Germantroops it is thought that they mightserve as effectively as some of the olderclasse^' enlisted among the Germantroops.'
Downs9Enemy'Planes in LarkWhen on Leave!Texas Lieutenant Recom¬mended for Victoria Cross
and Congress Medal
Compass as BombUsed to Fool Foe
Forced to Ground, He Capt¬ures a German and Res¬cues French Officer
LONDON, Aug. 31..First LieutenantEdmund G. Chamberlain, of San An¬tonio, Tex-, a graduate of Princetonand the University of Texas and an
aviator attached to the United StatesMarine Corps, has received simultane¬ous recommendations for the VictoriaCross and the Congressional Medal ofHonor for an exploit in which he fig¬ured on July 28.On that day, over the British front,
Lieutenant Chamberlain took part inan aerial battle with twelve Germanmachines. He destroyed five of them,damaged two others and, sweepingearthward witv i dam. »ed 'plane, scat-tereú a detaci men*« v.
' German sol-I dlers. After lending hv bluffed threeothers into bclievi.,g his compass wasa bomb and captured one of them. Hethen carried a wounded French officer
j back to safety, and finally refused togive his name to the British officer in
i command of aerial forces in that sec¬tion of the front, because of his fearof being reprimanded.The story, which is one of the most
thrilling chapters in the drama of the jwar, also has been cabled to Americaby the London office of the Committeeon Public Information.
Appears at British CampLieutenant Chamberlain appeared at
a British aviation camp on July 27 andinformed the major in command that jhe had personal, hut not official, per-mission to visit the camp. This isborní« out by tho young man's aupe-rior, who says Lieutenant Chamber-lain had asked to be permitted to goup near the front (luring a furloughbecause ho desired to get some more
¡experience before resuming hi» workThe British commander wai in need
lof aviators, and Ol there wm n honth-Ing squadron obout to leave told Lieutenant Chamberlain he could go along,On this fliifht the young Americanbrought down ene Gorman airplane inflamea and sent another whirling downout of control.The next day came Lieutenant Cham
berlain's wonderful exploit. He wasone of > dotachment ol thirty aviator*who went out over the battlefieldthrough which the Germans were beingdriven by the Allies. As the thirty mu
j chines circled about over the fleeingTeutons they were attacked by «n equalnumber of German machines. It. was ahurricane battle from the lirst, and al¬most at the inception of the combatthe British lost three "planes.
His Knglne DamagedIn (he tempest of machine gun bul¬
lets that roared about his machineLieutenant Chamberlain's engine wasdamaged. One of his machine guns be¬came jammed, and he seemed to be outof the action. But instead of startingfor home he remained to offer assist¬ance to two other airplanes which hadbeen attacked by twelve German ma¬chines.
His machine had lost altitude, owingto engine trouble, but when he was at¬tacked by a German he opened such ahot fire that tho enemy went into adive toward the earth.
His two companions were now en¬gaged in a life and death struggle, andLieutenant Chamberlain went to theirassistance. His action probably savedthe lives of the two Englishmen.
His engine was now working better.He climbed up toward the enemy, and,
Foe ThrowingNew MassesAgainstYanksHeavy Artillery Effort IsBeing Used on Franco-
American Front
Ludendorff Forced. To Use Best Reserves!
i
Soissons Conflict DevelopsInto Desperate Struggle,With Enemy Losing
By Wilbur Forrest{Special Cable to The Tribune)
(Copyright, 1918. by The Tribune Association.New York Tribune)
WITH THE AMERICANFORCES NORTH OF SOISSONS,Aug. 30 (delayed)..The fluctuatingconflict which began with thesweeping advance of the Franco-American troops north of SoissonsThursday morning has developedinto a stubborn combat and a hardstruggle. The enemy is fightingwith the desperation of de$pair.Knowing the strategic issue of the
operation the Germans have gar¬nished the old lines'in this regionwith an enormous number of ma¬chine guns. In addition fresh Prus¬sian troops are employing heavyartillery with concentrations alongthe entire Franco-American battlefront.Very few prisoners have been
taken in the American sector, wherethe doughboys are fighting alongsidesome of France's elite units. Thismorning the Americans werechecked on the ridge above the vil¬lage of Juvigny, which was defendedby hundreds of machine guns' andthe intervening fire of scores of Ger¬man batteries.The Americans, however, have
learned in previous encounters thatan impetuous advance against suchopposition is entirely unwise, andlate to-day the doughboys were let¬ting the artillery slowly batter thevillage into a rock heap before at¬tempting to advance. Toward sun¬down I saw hundreds of shells perminute throwing smoke and dusthigh In the air as destruction pro¬gressed, It seemed that the enemymachine gunners who had been fir¬ing from a nest around the villageand from the houses would neversurvive the inferno,
American« Saving MenThis element of caution which the
American troops have now injectedinto their warfare is not only man-saving, but with a system of usinghigh explosives whenever possibleforces the enemy to employ machineguns in ever increasing numbers toreplace badly worn effectives.The Allied advance, though slow,
is sure, and the importance of to¬day's struggle is that the enemy isbeing forced to use his best andfreshest effectives, who have suf¬fered very heavy losses. Prisonersaffirm that all the units have beenordered to hold at all costs.With losses such as the Franco-
Americans may be able to inflict onthem, the wastage of German man¬power promises to become a highlyimportant point in the Allies' favor,Ludendorff must continue to throwhis best into the furnace, and thequestion is hpw long his best will beavailable. ¡The spirit of the American dough-
boy was shown on every turn of to¬day's battle. In the advanced dress-
(Continued on page three) (Continued on page three)
Paper Saving Sunday, Too'T'O-DAY'S issue of the Tribune is the first Sunday
number published under the regulations of theU. S. Government for the conservation of printpaper. h
Germans Now BlameLack of Spy System
WASHINGTON, Aug. 31..A newexplanation from the German
newspapers of what is happening inFrance and Flanders came to-day inan official dispatch from Switzerland.
It says the German press now as¬serts that Germany has never knownhow to organize her system of es¬pionage, and that it is to the mis¬takes made by her secret service thatshe owes her unpleasant experienceson the western front.
YanksinThickOf Big BattleOn Vesle LineDesperateResistance of FoeMakes Ailette-Aisne Dis¬
trict Sea of Fire
(By The Associated Press)WITH THE AMERICAN ARMY IN
FRANCE, Aug. 31..Between the Ai-lette and the Aisne, and far to thesoutheastward along the line of theVesle, the battlefield is one vast pano¬rama of fire. Here at the moment theGermans are offering the most desper¬ate resistance, since the issue in thissector has a graver strategic bearingthan anywhere else along the wholefront.With General Mangin's men already
across the Ailette on either side of thevillage of Champs, the enemy's hold onCoucy-le-Château is threatened. Coucy-le-Chateau is highly important to theGermans as a distributing centre oftroops falling back from Noyon andthose fighting stoutly on the left bankof the Ailette.From the crest of the plateau north
of Soissons shells can be seen burstinglike surf against the German lines.
Americans Fighting HardAmerican troops, in the centre, are
still fighting to overcome the difficultentanglement of ravines before them.There has been no close fighting yet inthese valleys.A wounded prisoner was encountered
to-day in the road near the battlefield.He said: "They told me that the Amer¬icans murdered their prisoners."When asked if he believed that
charge, he answered: "One does notmake a great nation out of men likethat."German troops attempted to raidAmerican advanced posts in the Vosges
sector early this morning. Their ar¬tillery and mine-throwing activity hadcaved in one American dugout, burningtwelve men and wounding two othersslightly, before the enemy made his at-tack.The ten unwounded men dug them-
selves free as soon as the artillery firestopped. They drove off between thirtyor forty Germans and killed at least'one. The body of this man will bebrought into the American lines forburial as soon as it can be rescuedfrom the German machine guns, whichare keeping up a steady fire all around... _
Conflans AgainBombed by U. S.;Longuyon Attacked]WITH THE AMERICAN ARMY IN jFRANCE, Aug. 31 American bomb-
ing machines again yesterday morn-ing successfully attacked railwayyards and buildings at Conflans.Several direct bursts were observedand enemy pursuit 'planes followedthe invading Americans back to theirlines, but did not attack them.At noon American airmen dropped |
bombs on the railway yards at Lon-guyon, scoring several direct hits. Latein the afternoon Conflans was againraided, but poor visibility made it dif-ficult to ascertain whether the bombingwas effective. Enemy anti-aircraftguns were active against the Americanraiders in all three of the day's ex-cursions. All of our machines re-turned.One American aviator yesterday at-|tacked a German who was diving at a
French balloon. Despite the fact thatthere were six Germans above him, theAmerican forced the German machineinto a nose dive. The six other Ger¬mans then attacked the American andforced him to descend. He landed be¬hind the American lines uninjured.Americans Now in SightOf Laon Cathedral TowersPARIS, Aug. 31 (1:10 p. m.). The
positions won yesterday by the Ameri¬can forces northwest of Soissons, "LeLiberte" point3 out, give them a fineview along the Chemin ,des Dames.The Americans can now see the towersof the Laon Cathedral.
Canadian TroopsEncircle Péronne;Town Near Fall
1,500 Germans Taken Prisoner at Mt. St. Quen¬tin and Feuillaucourt When Gen. Haig's MenLaunch Heavy Attack Near the Sommeand Surround Ludendorff Stronghold-
|Gen. Mangin Crosses Canal du NordAnd Occupies Three More Towns_
French and Americans Sweep Through Juvignyand Crouy and Approach Southern Bastion of
Old Hindenburg Line; Campagne AlsoTaken by Victorious Foch Army .
The British in Flanders yesterday drove steadily againslthe retreating Germans on a twenty-mile front south of Yprespushing ahead for gains of more than two miles at severapoints. They regained the dominating height of Mount Kern-imel, besides four villages.
Defeated along the whole line- further south and dreadinga new Allied offensive in the Lys Valley, the enemy is with«
i drawing rapidly from his hard-won positions here to a moreeasily defended line.
North and west of Péronne the Australians advanced morethan a mile, almost completing the encirclement of that city,I capturing 1,500 prisoners, with only slight losses to themselves,and wresting from the enemy the hill and village of Mt. St.Quentin and the town of Feuillaucourt. Mt. St. Quentin is onlyn mile north of Péronne.
Mangin Gains North of SoissonsIn bitter fighting north of Noyon the French stormed for¬
ward against stiffened German resistance. New forces thrownacross the Canal du Nord and the Oise captured the village ofCampagne and advanced up the slopes of the plateau north ofHapplincourt and Morlincourt.
General Mangin's Franco-American army struck at twopoints north of Soissons and pushed deeper into the enemy'sflank north of the Aisne-Vesle line. A thrust beyond the Ailetteforced the Germans back nearly to Coucy-le-Chateau, a bastionin the old Hindenburg line. Further south the French capturedJuvigny and Crouy in heavy fighting and reached the outskirtsof Leury.German Counter Attacks Break Down
At numerous points along the battleline the foe is counterattacking heavily, but ineffectively. Successive attacks againstthe British before Bapaume and before Arras were batteredback by the guns of Haig's men, who held their gains at allpoints.
Foe Caught in Perilous PositionBv British Advance Near Peronne
(By The Associated Press)WITH THE BRITISH ARMY IN
FRANCE, Aug. 31. With Mont St.Quentin, which fell to-day. in Britishpossession, the Germans to the northand south for a considerable distanceare placed in a precarious position.Péronne itself must be evacuated, andif this is not done quickly, ¿he foe willlose many more men here.
Starting out from east of Cleryabout 5 o'clock in the morning theAustralians fought their way forwarddespite the heavy fire from the Bochemachine guns *nd swarmed into Feuil-laucourt. They captured 200 Germans.
Germans Taken by SurpriseAbout the same time another body of
Australians "silently"- which meansthat they were unaided by artilleryattacked Mont St. Quentin. The Ger¬mans were taken completely by sur¬prise, for they had no idea that theAustralians would dare attempt such afeat. By 8 o'clock the Australians hadfought their way to the top of themount, and soon after that signalledits capture.Mont St. Quentin was alive with Ger¬
mans, who came from everywhere andcried "Kamerad." Those who did not
j were driven from their retreats oikilled with grenades and bombs. Hundreds of prisoners were captured alj this place.
I While the hill was being mopped ujBritish guns, which had ¿>een move«ud close to the river, cut loose anc
began pounding a torrent of steel backof Mont St. Quentin as a reminder tothe Germans that they had better startmoving quickly. The Australians mutthave worked with great, swiftness tomake so much progress in so short atime.
Enemy Retreats from the Lye(Noon).British successes on the
Lys .valient sector of the battlefronthave caused the Germans to startretreat from the neighborhood of Kem-mel tc opposite Bethune. The with¬drawal is progressing rapidly.
Field Marshal Haig's men to-day arcattacking near Marrienes Wood, b<_-tween Bapaume and the River Somme,which position is strongly held by th_enemy.
British Make Slight GainsAdvances have been made here ñítá
there by British forces along the bat¬tlefront, but they generally have beanslight. The night was comparativelyquiet throughout the zone, but fightingagain became heavy after dawn thismorning.The enemy has delivered vicious
counter attacks with powerful forcessouth of the Arras-Cambrai road. A*Jjjut result of one of these counterbîo«.« the British withdrew to the edgeof Riencourt-les-Cagnicourt,The Germans also are in *oi__strength sputh of the railway belowBu'.Iecourt, and they are now being a1-tacked by the British. The outskirtsof Ecoust-St. Mein, from where th_
.N
£-
RED LAKE NEWS A newspaper devoted to the interests of
the Red Lake Chippewa Indians. MONTHLY SEPT. 1, TO JULY 15.
Subscription 75c a year Entered as second class matter Septem-
ber 1, 1912, at the postoffice at Red Lake, Minn., under the act of March 3, 1879.
Address all communications to— RED LAKE NEWS
Red Lake, Minn.
With the turning back of Carlisle by the Interior Department to the War Department, sentiment plays havoc with the feelings of hundreds of Indian students throughout the country to which that in-stitution has become Alma Mater since 1882.
Our contemporaries have commented, some at length, upon the reign of this well known institu-tion of Indian education. Reciting its history down to date in such commendatory manner that we hesi-tate, at this late hour, to add our squib. Red Lake has its returned Carlisle students, and the news of its evacuation by Indians and accommodations to convalescing soldiers was received here with mingled regret and pride.
HERE IS HOW TO FIGHT OFF SPANISH INFLUENZA
The following suggestions for the prevention and treatment of influenza have been issued by the Chicago emergency medical committee:
To Avoid Influenza First—Avoid contact with other people so far
as possible. Especially avoid crowds. Second—Avoid persons suffering from "cold,"
sore throats and coughs. Third—Avoid chilling of the body or living in
rooms with temperature below 65 degrees or above 72.
Fourth—Sle'ep and work in clean, fresh air. Fifth—Keep your hands clean and keep them
out of your mouth. Sixth—Avoid expectorating in public places and
see that others do likewise. Seventh—Avoid visting the sick. Eighth—Eat plain, nourishing food and avoid
alcoholic stimulants. .. Ninth—Cover your nose with your handkerchief
when you sneeze and your mouth when you cough. Change handkerchiefs frequently. Promptly dis-infect soiled handkerchiefs by boiling or washing with soap and water.
Tenth—Don't worry, and keep your feet warm. Wet feet demand prompt attention. Wet clothes are dangerous and must be removed as soon as possible.
To Treat Influenza Oftentimes it is impossible to tell a cold from
mild influenza. Therefore: First—If you got a cold go to bed in a well
ventilated room. Keep warm. Second—Keep away from other people. Do not
kiss anyone. Use individual towels, handkerchiefs, soaps, wash basin and knives, forks, spoons, plates and cups. -
Third—-Every case of influenza should go to bed at once under the care of a physician. The patient should stay in bed at least three days after fever has disappeared and until convalescence is well established.
Fourth—Patient must not cough or sneeze ex-cept when a mask or handkerchief is held before the face.
Fifth—Patient should be in a warm and. well ventnateoTToom.
Sixth—There is no specific for the disease. Symptoms should be met as they arise.
Seventh—The great danger is from pneumonia. Avoid it by staying in bed while actually ill and until convalescence is fully established.
Eigth—The after effects of influenza are worse than the disease. Take care of yourself.
BULLETIN ON SPANISH INFLUENZA. The Surgeon General of the United States Public
Health Service has just issued a publication dealing with Spanish influenza, which contains all known available information regarding this disease. Sim-ple methods relative to its prevention, manner of spread, and care of patients, are also given. Readers may obtain copies of the pamphlet free of charge by writing to the "Surgeon General, U. S. Public Health Service, Washington, D. C, or to this paper.
WAR SAYINGS SALES NEAR BILLION MARK
Including cash received in the Treasury Depart-ment on October 21 from the sale of War Savings securities, the total Treasury receipts from this source amounted to $801,453,415.86. This repre-sents the purchase of War Savings Stamps to the total maturity value of approximately $950,824,-474.10.
PEYOTE -The introduction of peyote into
this reservation and its use within the reservation is forbidden by law under penalty of imprisonment for not less than SO days. A reward of $5.00 will be paid to the party or parties furnishing information lead-ing to the conviction of any violator of the above law.
ANOTHER LIBERTY LOAN COMING Secretary of the Treasury McAdoo has announced
that, no matter what the results of the pending-overtures for peace may be, there will be another Liberty loan. To use his expression, "We are going to have to finance peace for a while just as we have had to finance war."
There are over 2,000,000 United States soldiers abroad. If we transport these men back to the United States at the rate of 300,000 a month, it will be over half a year before they are all returned. Our army, therefore, must be maintained, victualed and clothed for many months after peace is an actuality.
The Arnerican people, therefore, having support-ed the Liberty loan with a patriotism that future historians will love to extol, will have an appor-tunity to show the same patriotism in financing the just and conclusive victorious peace whenever it comes.
Not for a moment, however, is the Treasury act-ing on any assumption that peace is to come soon. Until peace is actually assured the attitude of the Treasury and the attitude of the whole United States Government is for the most vigorous prose-cution of the war, and the motto of force against Germany without stint or limit will be acted up to until peace is an absolute accomplished fact.
One more Liberty loan, at least, is certain. The fourth, loan was popularly called the "Fighting Loan"; the next loan may be a fighting loan, too, or it may be a peace loan. Whatever the condi-tions, the loan must be prepared for and its suc-cess rendered certain and absolute. Begin now to prepare to support it.
H. Christianson —Dealers in—
GENERAL MERCHANDISE Gocdridge, Mina.
L. P. ECKSTRUM Plumbing, Steam and Hot Water Heating .
Phones 55J5 and 3 0 9
320 Beltrami Ave., Bemidji, Minn.
FARMERS CASH MARKET TOP PRICES paid every day for Chickens, Ducks, Geese, Turkeys, Cream, Dressed Calves, Hogs, Mutton, Wool, Cattle Hides, Horse Hides, Pelts, Purs, Muskrat, Skunk, Beans, Ra-bbits. Get our price list before selling. Make more money by shipping here. Write us now for quotation*, tags, and how to ship. THE R. E. COBB CO., E. 3rd St., St. Paul, Minn. Licensed by U. S. Government.
HIDES AND FURS Bring them to our meat mar-ket any time ami get the high-est market priee. We want ali the hides you have to seti and if given a chance we will prove our prices are right.
ONE GERMAN EXHIBIT IN THE "BRITISH MUSEUM"
ft
& • :
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
^Sui. i ££si '
Figure 1.1: Sample pages from Bisbee Daily Review-AZ (December 17,1918), the New YorkTribune-NY (September 01, 1918), and Red Lake News-MN (November 01, 1918).
blogs, Facebook posts, tweets, product reviews, and any other information shared online by organizations
or individuals. Mining social media in any of its forms is very important to social science
researchers for many different reasons. Text mining is the concept of deriving high-quality features
from text [Hotho et al., 2005]. One of the currently most promising lines of research in text mining
is topic modeling in the formalism of Latent Dirichlet Allocation (LDA) [Blei et al., 2003], where
documents are modeled as distributions (mixtures) over topics, and topics in turn are distributions
over the vocabulary used in the corpus. LDA is considered a generalization of Probabilistic Latent
Semantic Analysis (PLSA), proposed by Hofmann [Hofmann, 1999a]. (The difference between
LDA and PLSA is that the topic distributions in LDA are assumed to be distributed according to a
Dirichlet prior.)
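
The generative view described above can be illustrated with a minimal sketch. It is not the inference procedure used in this dissertation; it only samples one toy document under assumed settings. The vocabulary, the number of topics, the Dirichlet parameters, and the helper names (sample_dirichlet, sample_categorical, generate_document) are all hypothetical, and drawing the topics themselves from a Dirichlet prior corresponds to the smoothed variant of LDA.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, alpha, length, rng):
    """Draw a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-level mixture over topics
    word_ids = []
    for _ in range(length):
        z = sample_categorical(theta, rng)              # topic for this position
        word_ids.append(sample_categorical(topics[z], rng))  # word from that topic
    return theta, word_ids

rng = random.Random(0)
vocabulary = ["draft", "army", "influenza", "fever", "loan", "treasury"]  # toy vocabulary
num_topics = 3  # assumed for illustration
# Each topic is a distribution over the vocabulary (assumed symmetric prior of 0.5).
topics = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(num_topics)]
theta, word_ids = generate_document(topics, alpha=[0.5] * num_topics, length=8, rng=rng)
document = [vocabulary[i] for i in word_ids]
print(theta, document)
```

Fitting LDA to a corpus reverses this process: given only the observed words, inference recovers plausible values for the document mixtures and the topics.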
Through text mining, a great number of social theories can be examined. For example,
the detection of deliberation and common interests can be compared across different groups with
specific demographics. Blogs, Facebook feeds, and tweets are great venues for characterizing public
interest and opinions about a specific issue.
In the rest of this chapter, the motivation behind each part of this dissertation along with the
specific research questions will be presented. Then, contributions of different parts will be clearly
stated. The last section of this chapter is an outline for the rest of this dissertation.
1.1 Motivation and Research Questions
Classic topic modeling has been applied in a great number of fields. Extensions and modifications
have also been proposed in the literature. Some added a temporal aspect to topic models and
others added structure to the discovered topics. The previously mentioned applications were a great
motivation to build on and extend the classic topic models. In this section the motivation behind each
part of this dissertation will be discussed and a short overview will be provided. This dissertation is
divided into four major parts: Dynamic Temporal Segmentations over Topic Models, new visual
analytic representations, Dynamic Spatial Topic Models (DSTM), and predictive analysis.
Dynamic Temporal Segmentations over Topic Models: The first part, Dynamic Temporal
Segmentations over Topic Models, is motivated by significant ongoing research in capturing the dynamic
evolution of topics underlying a text corpus. Most of these efforts are focused on extending the
classical probabilistic model of Latent Dirichlet Allocation (LDA) [Blei et al., 2003] to a time-indexed
context. Our temporal topic modeling approach is differentiated by its emphasis on automatically
identifying segments where topic distribution is uniform and segment boundaries around which
significant changes are occurring. We embed a temporal segmentation algorithm around a topic
modeling algorithm to capture such significant shifts of coverage. A key advantage of our approach
is that it integrates with existing topic modeling algorithms in a transparent manner; thus, more
sophisticated algorithms can be readily plugged in as research in topic modeling evolves.
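
To make the idea of wrapping segmentation around a topic model concrete, the following is a minimal sketch under strong simplifying assumptions; it is not the segmentation algorithm developed in this dissertation. It assumes that some topic model has already produced one topic distribution per time slice over a shared set of topics, and it marks a boundary wherever the Jensen-Shannon divergence between adjacent slices exceeds a threshold. The function names, the example proportions, and the threshold value are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def find_boundaries(slices, threshold):
    """Return each index i where slice i differs from slice i-1 by more than threshold."""
    return [i for i in range(1, len(slices))
            if js_divergence(slices[i - 1], slices[i]) > threshold]

# Hypothetical topic proportions for six consecutive time slices (three topics).
slices = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.72, 0.18, 0.10],
    [0.10, 0.20, 0.70],
    [0.12, 0.18, 0.70],
    [0.10, 0.22, 0.68],
]
boundaries = find_boundaries(slices, threshold=0.1)
print(boundaries)  # [3]
```

Because the sketch consumes only topic distributions, the topic model that produces them can be swapped without touching the boundary detection, which mirrors the transparency argument made above.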
New Visual Analytic Representations: Several visual analytic applications require the analysis
of dynamically changing trends over time. Example contexts include studies of idea diffusion in
scientific communities, the ebb and flow of news on global, national, and local levels, and the
meandering patterns of communication in social networks. Key patterns of interest to an analyst are
trends, each representing a particular keyword or concept, that converge into topics at different
points in time and then, just as unpredictably, diverge into newly defined topics at a later time. Both
experts and casual users alike need mechanisms for understanding such evolving trends for analysis,
prediction, and decision making.
We present ThemeDelta, a visual analytics system for accurately extracting and portraying
how individual trends gather with other trends to form ad hoc groups of trends at specific points in
time. Such gathering is inevitably followed by scattering, where trends diverge or fork to form new
groupings. Understanding the interplay between these two behaviors provides significant insight into
the temporal evolution of a dataset. Existing visualization techniques such as ThemeRiver [Havre
et al., 2002] and streamgraphs [Byron and Wattenberg, 2008] aim to capture overall trends
in textual corpora but fail to capture their branching and merging nature. Our ThemeDelta temporal
topic modeling approach is differentiated by its emphasis on automatically identifying segments
where topic distribution is uniform and segment boundaries around which significant changes are
occurring.
Dynamic Spatial Topic Models (DSTM): Temporal topic models have become quite standardized
[Blei and Lafferty, 2006, Wang and McCallum, 2006, AlSumait et al., 2008, Gohr et al., 2009, Zhang
et al., 2010, Hoffman et al., 2010, Hong et al., 2011]. Spatial topic models capture the notion of
location but thus far have used location as a proxy for similarity [Pan and Mitra, 2011, Wang et al.,
2009] (i.e., words closer in space are more similar to each other). In modeling newspapers that report
events from across the country, we require topic models to be decomposable into specific topics for
specific locations which are then aggregated in different ways to form news stories. Modeling such
decompositions and tracking their evolution over time leads to a holistic understanding of coverage
of large-scale events such as the Spanish flu.
In the third part of this work, we propose a new dynamic spatial topic model (DSTM) that
incorporates reporting locations of inferred topics and captures their evolution over time. Topics
(distributions over terms) are associated with locations and documents are comprised of multiple
topics, i.e., coverage of several locations. The main goal behind building this model is to assist
in the comprehension of newspaper coverage of significant events, such as the 1918 ‘Spanish’ flu.
Understanding the coverage of reported infected locations across local and national newspapers is a
key step that can help us understand how news propagated through time and space in those early
times, when newspapers were the only widely used information resource.
Predictive Analysis: The fourth and last part of this work is concerned with enabling powerful
models to predict future topics. Enabling DSTM for predictive analysis will allow us to predict
what, where, and when a major event will happen. We adapted the work of Wang et al. [Wang et al., 2012],
where the idea is to train a basic topic model (LDA) on past data and to calculate a topic distribution
transition parameter from the discovered topics. This transition parameter is then used to predict future
topic distributions for unseen data. The transition parameter needs to be updated every time new
data is streamed. This work has two limitations. First, it relies on the vanilla LDA formulation,
i.e., a non-dynamic and non-spatial topic model. Second, updating the transition parameter is
computationally intensive. In this part of the dissertation we overcome those drawbacks by training
the model using our DSTM approach. The inherent dynamism in our model circumvents the need
to update the transition parameters explicitly. Furthermore, the use of DSTM over LDA enables
predicting the locations of topics in addition to topics. We demonstrate the use of this approach in
forecasting civil unrest events (including their locations) in Latin America.
In summary, the research questions that will be explored in the four different parts of this
dissertation are:
1. Dynamic Temporal Segmentations over Topic Models:
• How do we identify segment boundaries that detect significant shifts of topic coverage?
2. New Visual Analytic Representations:
• How can a visual analytics tool based on the segmentation algorithm facilitate dataset
exploration?
3. Dynamic Spatial Topic Model:
• How can we generalize the basic topic modeling framework to accommodate location
and temporal distinctions in large document sets?
4. Predictive Analysis:
• How can we use the DSTM for predicting attributes of future events?
5. General research question:
• Will the above modifications and extensions to classic LDA-based topic modeling help
extract greater information from data and improve the utility of the text mining process?
Our goal is to increase the expressiveness of topic models as a text analysis tool. Classic topic
modeling focuses only on word/token-level analysis. These modifications to LDA embed more
structure and render the discovered topics much more meaningful. To support this claim, the presented
work is applied to a number of applications.
1.2 Contributions
As presented earlier, this dissertation is divided into four major parts and each part has a set of
contributions. The first part is the Dynamic Temporal Topic Modeling and our specific contributions
in this part are:
• A time series segmentation algorithm where segment boundaries detect significant shifts of
topic coverage. To this purpose, we embed a topic modeling algorithm inside a segmentation
algorithm and optimize for segment boundaries that reflect significant shifts of topic content.
• A novel application to studying Internet use in communities using the i-Neighbors system.
The voluntary participation of i-Neighbors users enables us to gain significant insight into
questions of engagement and deliberation.
• Qualitative as well as quantitative summaries of distinctions observed between advantaged
and disadvantaged communities. These results lead to an understanding of how engagement
and deliberation practices relate to access and uses of new communication technologies.
• A novel application to understanding the progression in coverage about the 1918 influenza
from historical newspapers and a successful application of our algorithm to archives of the
Washington Times. By studying the ebb and flow of ideas in the Fall of 1918 we illustrate
how our algorithm extracts important qualitative features of news coverage of the pandemic.
The second part relates to new visual analytics representations and our key contributions can be
summarized as follows:
• We present a visual analytics system, ThemeDelta, for accurately extracting and portraying
how individual trends gather with other trends to form ad hoc groups of trends at specific
points in time. Such gathering is inevitably followed by scattering, where trends diverge
or fork to form new groupings. Understanding the interplay between these two behaviors
provides significant insight into the temporal evolution of a dataset.
• We demonstrated several potential usage scenarios for our novel ThemeDelta system. The
scenarios are: historical U.S. newspaper data from four months in the year 1918 during the
second wave of the Spanish flu pandemic; the similarities and differences in trends and themes
being discussed by the two candidates in the U.S. 2012 presidential campaign; and social
messages exchanged between virtual communities via the i-Neighbors web-based applica-
tion [iNe, 2012]. These applications are intended to demonstrate that ThemeDelta provides
an interesting insight into datasets not immediately apparent through other representations.
In the Dynamic Spatial Topic Model (DSTM), the third part of this thesis, our key contributions can
be summarized as follows:
• DSTM is a true spatio-temporal model and enables disaggregating a newspaper’s coverage
into location based reporting, and how such coverage varies over time.
• DSTM naturally generalizes traditional spatial and temporal topic models so that many
existing formalisms are special cases of DSTM. Conceptually, DSTM is closest to author-
topic models [Rosen-Zvi et al., 2004] but where the notion of author is instead replaced by
location.
• We demonstrate a successful application of DSTM to multiple newspapers from the Chroni-
cling America repository. We demonstrate how our approach helps uncover key differences in
the coverage of the flu as it spread through the nation, and provide possible explanations for
such differences.
In the fourth and last part of this dissertation, Predictive Analysis, our main contributions are as follows:
• A predictive dynamic spatial topic model that can predict future topics and their locations from
unseen documents by adapting the work proposed by [Wang et al., 2012] and overcoming
two main drawbacks of their approach.
• We show the applicability of our proposed approach to unrest prediction from Latin
American tweets.
1.3 Outline of the Dissertation
The rest of this dissertation is organized as follows:
• Chapter 2: Datasets
• Chapter 3: Survey of Related Research
• Chapter 5: New Visual Analytic Representations
• Chapter 6: Dynamic Spatial Topic Model
• Chapter 7: Predictive Analysis
• Chapter 8: Conclusions
Chapter 2
Datasets
This chapter describes the datasets used in the four parts of this dissertation.
The work presented here is applied to four different datasets. These datasets were collected
from the following sources: iNeighbors, Chronicling America, the US presidential campaign repository,
and Twitter. In the Dynamic Temporal Segmentations over Topic Models part, the iNeighbors
and Chronicling America datasets were used. To evaluate the applicability of the newly proposed
visual analytic representation (ThemeDelta), we applied the system to the iNeighbors, Chronicling
America, and presidential campaign datasets. In the Dynamic Spatial Topic Model part, the model
was applied to partial datasets derived from the Chronicling America dataset. To evaluate the predictive
analysis approach, we used the Twitter dataset (comprising tweets from Latin America). In the
following sections, we review each dataset in detail.
2.1 iNeighbors
The iNeighbors system, shown in Figure 2.1, was created as part of a university research project first
run from the Massachusetts Institute of Technology and later from the University of Pennsylvania
that has been operational since 2004 [Hampton, 2010].
Figure 2.1: i-Neighbors: Social networking service connecting residents of geographic neighborhoods [iNe, 2012].
The site allows anyone in the United States
or Canada to join and create a virtual community that matches their geographic neighborhood.
Users who join the website agree to a Terms of Use, as approved by the Institutional Review
Board (IRB). Through the Terms of Use, users are informed that participation is voluntary and that
logs of user activity will be recorded and analyzed. The iNeighbors project was designed as a
naturalistic experiment; there was no attempt to provide training or to encourage any individual user
or community to participate. The website offers the following services:
• Discussion forum / email list: each neighborhood has a discussion forum that allows users to
contribute and comment by email.
• Directory: a list of all group members and their profile information.
• Events calendar: a group calendar.
• Photo gallery: a group photo gallery.
• Reviews: user contributed reviews of local companies and services.
• Polls: surveys administered to other group members.
• Documents: storage for shared documents and links.
As of 2012, the i-Neighbors website has attracted over 110,000 users who have registered
over 15,000 neighborhoods. The size of each group and the number of active groups vary from
month to month. In a typical month, over 1,000 neighborhoods are active and over 7,000 unique
messages are collectively contributed to neighborhood discussion forums, which in turn are viewed
over 1 million times. This analysis focuses on the adoption pattern of the most active i-Neighbors
communities, based on measures of the concentration of poverty, and the content of messages
contributed to their respective discussion forums.
The percentage of families below the poverty level in geographic areas represented by the
20 most active i-Neighbors groups, shown in Figure 2.2, ranges from a low of 3.2% to a high of
47.6%. Forty percent of the most active neighborhoods are in areas of concentrated poverty. Given that 15%
of Americans live below the poverty level [Kneebone and Nadeau, 2011], the fact that 40% of the most
active i-Neighbors groups are in areas where more than 20% of families are in poverty indicates
adoption by high-poverty neighborhoods at a higher rate than would be expected at random.
In this dataset, we ranked neighborhoods based on the number of unique comments that
members posted to their neighborhood’s discussion forum over a one year period that started on
October 1, 2010. For each neighborhood group, we identified the poverty rate, as defined by the US
Census [cen, 2012], based on Census tract data collected as part of the 2009 American Community
Survey (US Census Bureau). In Figure 2.3, the same neighborhoods shown in Figure 2.2 were
rearranged based on poverty level. While recognizing that the selection of any absolute threshold
will have its shortcomings, consistent with previous research, we used a 20% poverty rate as an
indicator of an area of high-poverty [Kneebone and Nadeau, 2011].
We limited the scope of this dataset to the three most active i-Neighbors groups above our
20% poverty level threshold, and the three most active below the threshold. While we recognize
that there are a number of potential sampling approaches, including sampling groups from similar
or diverse geographic areas, we chose to maximize the available data for topic modeling. However,
our approach also served to provide a sample that was geographically diverse, with the six groups
used for our topic analysis representing six different U.S. states, as shown in Table 2.1.
Table 2.1: The six neighborhoods studied in our experiments.
Neighborhood ID   Number of Members   Number of Messages   State            Poverty
High1             440                 2122                 Ohio             47.60%
High2             334                 3466                 New York         26.30%
High3             539                 2969                 Maryland         24.90%
Low3              378                 2472                 Texas            6.60%
Low2              324                 3534                 Georgia          3.90%
Low1              371                 2523                 North Carolina   3.20%
Figure 2.2: Distribution of messages across neighborhoods. (Axes: neighborhood versus number of messages.)
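The selection rule described above (the three most active groups on each side of a 20% poverty threshold) can be sketched as follows. The rows are taken from Table 2.1; the function name, the tuple layout, and the strict "greater than" comparison are assumptions for illustration, since the source does not say how a group exactly at the threshold would be treated.

```python
# (neighborhood id, number of messages, poverty rate in percent), from Table 2.1.
groups = [
    ("High1", 2122, 47.6),
    ("High2", 3466, 26.3),
    ("High3", 2969, 24.9),
    ("Low3", 2472, 6.6),
    ("Low2", 3534, 3.9),
    ("Low1", 2523, 3.2),
]

def select_most_active(groups, threshold=20.0, k=3):
    """Split groups at the poverty threshold and keep the k most active on each side."""
    by_activity = sorted(groups, key=lambda g: g[1], reverse=True)
    high = [g for g in by_activity if g[2] > threshold][:k]
    low = [g for g in by_activity if g[2] <= threshold][:k]
    return high, low

high_poverty, low_poverty = select_most_active(groups)
```

In the full study the input would be all ranked neighborhoods rather than only the six that were kept, so the truncation to `k` groups would actually discard candidates.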
2.2 Chronicling America Historical Newspapers
Chronicling America, which is sponsored by the National Endowment for the Humanities and the
Library of Congress, is a great example of an open source digital library of historical newspapers.
It provides an Internet-based, searchable database of historical U.S. newspapers. The website is
maintained by the National Digital Newspaper Program (NDNP).
Figure 2.3: Distribution of messages across neighborhoods.
Example newspapers included in
this dataset are: The Washington Times (Washington, DC), Evening Public Ledger (Philadelphia,
PA), The Evening Missourian (Columbia, MO), El Paso Herald (El Paso, TX), and The Holt County
Sentinel (Oregon, MO). Data collected from these newspapers are stored as pages. Each page has a
record in the dataset and the following information is available for each page: page OCR text, page
number, newspaper name, and publish date.
We built this dataset by crawling the publicly accessible archive of the Chronicling America
website. During the period we are interested in, 104 newspapers were available. The focus of
this work is on the 1918 influenza epidemic. For this purpose, we extracted all paragraphs that contain
one or more of the following words: influenza, grip, flu, epidemic, and grippe. Several sub-datasets
were extracted from this dataset.
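A minimal sketch of this keyword filter is shown below. The keyword list comes from the text; whole-word, case-insensitive matching is an assumption, since the source does not specify how matches were determined, and the function name is hypothetical.

```python
import re

# Keywords listed in the text; whole-word matching is an assumption.
KEYWORDS = ("influenza", "grip", "flu", "epidemic", "grippe")
KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)

def mentions_influenza(paragraph):
    """Return True if the paragraph contains at least one filtering keyword."""
    return KEYWORD_RE.search(paragraph) is not None

paragraphs = [
    "The grippe spread quickly through the camp.",
    "An influential senator spoke on Tuesday.",
    "INFLUENZA closes schools across the city.",
]
kept = [p for p in paragraphs if mentions_influenza(p)]
```

Note that a word such as "grip" is ambiguous (a sentence about losing one's grip would also match), so a filter of this kind trades precision for recall.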
One sub-dataset focused on 14 daily newspapers, from which we extracted the influenza-related
paragraphs. Those paragraphs were extracted by searching the OCR text of newspaper
pages. A summary of the daily newspapers included in this sub-dataset is shown in Table 2.2. In
the table, the period column indicates the period for which Chronicling America provides data for a
specific newspaper. The pages column provides the number of pages available for a newspaper. The
number of pages that contain one or more of the filtering keywords is shown in the relative pages
column. The last column, paragraphs, gives the number of paragraphs extracted
from a specific newspaper during the January 1918 to December 1919 period. Paragraph
extraction from the daily newspapers resulted in 47,650 paragraphs.
For another sub-dataset, we ran a location detection script to label paragraphs with locations
mentioned within their text. We provided the script with a list of all the cities and counties in all 50
U.S. states, together with military camps. We discarded paragraphs without location mentions. Paragraphs
here are considered documents, and each is composed of five sentences: the main sentence that contains
one or more of the filtering keywords, plus the two sentences before and the two sentences after it.
Stop words, punctuation, and non-alphabetic characters were
removed from paragraphs. We then divided the dataset of extracted paragraphs into months. As a
result, our data consists of 24 time slices from two years' worth of data. Time slice sizes should vary
based on the application. In a historical newspaper dataset, monthly time slices are appropriate
because news did not travel as fast as it does today and because we are interested in major events
that have a monthly granularity. Each monthly slice is treated as a standalone dataset.
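The five-sentence windowing and the monthly slicing described above could look roughly like the sketch below. The sentence splitter is deliberately naive (OCR text with abbreviations would need something more robust), and all function names and the sample text are hypothetical.

```python
import re
from collections import defaultdict
from datetime import date

KEYWORD_RE = re.compile(r"\b(influenza|grip|flu|epidemic|grippe)\b", re.IGNORECASE)

def split_sentences(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def keyword_windows(text, before=2, after=2):
    """Return one document per keyword sentence: the sentence plus its neighbors."""
    sentences = split_sentences(text)
    windows = []
    for i, sentence in enumerate(sentences):
        if KEYWORD_RE.search(sentence):
            lo = max(0, i - before)
            hi = min(len(sentences), i + after + 1)
            windows.append(" ".join(sentences[lo:hi]))
    return windows

def slice_by_month(dated_docs):
    """Group (publish_date, text) pairs into {(year, month): [texts]}."""
    slices = defaultdict(list)
    for published, text in dated_docs:
        slices[(published.year, published.month)].append(text)
    return dict(slices)

page_text = "One. Two. Three. The flu arrived. Five. Six. Seven."
windows = keyword_windows(page_text)
slices = slice_by_month([
    (date(1918, 10, 3), windows[0]),
    (date(1918, 11, 9), "Later text."),
])
```

Windows near the start or end of a page are simply truncated in this sketch; the source does not say how such edge cases were handled.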
Figure 2.4 shows the distribution of influenza reporting in the west, midwest, and east
of the country over the years 1918 and 1919. Figure 2.5 displays the concentration of reporting
for three different parts of the country. Columns in this grid represent the reporting percentage
with respect to the other parts of the country. For each part, we used three different shades of the
same color to show different levels of concentration. Concentration levels range from low to high
and are represented by light to dark shades of the same color. From this grid, we concluded that the
midwest has stable reporting on influenza compared to the east and west. The concentration of
reporting in the east, midwest, and west around the peaks of influenza is consistent with the spread
of influenza. During September and October 1918, the east was reporting more than the midwest
and the west. The midwest reporting started to rise in November after a low concentration through
the previous months. Similarly, the west reported with a high concentration in November 1918 and
continued with the same concentration through January 1919.
Figure 2.4: Distribution of influenza reporting over the years 1918 and 1919. (Plot of normalized reporting numbers, 0 to 0.3, by month from January 1918 to December 1919, for the East, West, and Midwest.)
Figure 2.5: Reporting concentration across locations and time. (Grid with rows East, Midwest, and West and monthly columns from June 1918 to August 1919.)
2.3 Presidential Campaigns Press Releases
Political speeches, especially during an election campaign, are particularly interesting document
collections to analyze because the political discourse tends to change and evolve as different
candidates respond and challenge each other over the course of the campaign.
The U.S. presidential election takes place every four years (starting in 1792) in November
(the 2012 election day is November 6), and is an indirect vote on members of the U.S. Electoral
College, who then directly elect the president and vice president. Electoral College members can
vote for any eligible candidate, but are typically pledged to a particular candidate that has been
officially nominated by a political party.
Table 2.2: Daily newspapers summary.
Newspaper                                          Period         Pages    Relative Pages   Paragraphs
East Coast Newspapers
New-York tribune. (New York, NY)                   1842-1922      313953   2768             1586
The Washington times. (Washington, DC)             1894-1939      143520   2755             9764
Evening public ledger. (Philadelphia, PA)          1914-1942      57602    2693             2230
The sun. (New York, NY)                            1859-1920      225723   3027             5794
Midwest Newspapers
The Evening Missourian. (Columbia, MO)             1917-1920      4294     661              2518
El Paso herald. (El Paso, TX)                      1901-1931      55154    2502             5096
The Corpus Christi caller. (Corpus Christi, TX)    1918-1987      8521     1087             2562
The Bismarck tribune. (Bismarck, ND)               1873-Current   15437    1044             2844
Tulsa daily world. (Tulsa, Indian Territory, OK)   1905-1919      38379    2020             5116
The Bemidji daily pioneer. (Bemidji, MN)           1904-1971      30037    891              2356
The evening herald. (Albuquerque, NM)              1914-1922      23714    1719             3600
West Coast Newspapers
Rogue River courier. (Grants Pass, OR)             1913-1918      9093     224              494
The Tacoma times. (Tacoma, WA)                     1903-1949      25172    187              336
Bisbee daily review. (Bisbee, AZ)                  1901-1971      52713    998              3354
In 2012, the Republican and Democratic (the two dominant parties in the United States,
representing conservative vs. liberal agendas) conventions were held in the weeks of August 27 and
September 3, respectively. The two opposing candidates were Republican nominee Mitt Romney,
and Democratic nominee Barack Obama (incumbent President of the United States).
In collecting data for the United States presidential election, campaign speech transcripts for
both candidates were collected from the UCSB American Presidency Project. For Mitt Romney,
transcripts from 46 speeches over a 62-week period were used: from announcing candidacy on June
2, 2011, to September 10, 2012. This corpus included speeches from the Republican primary
election (settled on May 14, 2012 as the main competing nominee Ron Paul withdrew). For Barack
Obama, transcripts from 41 speeches over a 7-week period were used: July 5, 2012 to September
10, 2012. This time period was much shorter due to the UCSB website only containing campaign
speeches from July and onwards (earlier speeches were presumably given in an official capacity as
sitting president). Significantly, no speeches were included from the Democratic primary elections,
which Barack Obama secured on April 3, 2012.
2.4 Tweets from Latin America
This dataset consists of tweets collected from Latin American countries. Examples of countries in-
cluded in this dataset are Brazil, Honduras, Colombia, Mexico, El Salvador, Costa Rica, Guatemala,
Chile, Paraguay, Argentina, Venezuela, and Ecuador. This data was provided by EMBERS (Early
Model-Based Event Recognition with Surrogates), a Virginia Tech project funded by the Intelligence
Advanced Research Projects Activity (IARPA) OSI (Open Source Indicators) program. The main
goal of EMBERS is to develop early warning indicators of significant population-level events such
as civil unrest in countries of interest.
Tweets from these countries are filtered using a keyword list that helps indicate the relationship
of the tweet to any civil unrest event. Metadata about each tweet is captured, including the
tweet location, language, date, tweet author, and the number of author friends. Tweets are crawled
daily and saved in JavaScript Object Notation (JSON) format. Each tweet is represented as a
JSON object.
Tweets in this dataset were enriched with extra metadata. Each tweet JSON object includes,
but is not limited to, the following tags:
• City: the city where the tweet originated from.
• Country: the country where the tweet originated from.
• Longitude and latitude: the geolocation of the tweet (if available).
• Interaction: this tag keeps track of the mentions in the tweet.
• Basis Enrichment: this tag includes automatically detected/extracted noun phrases and
part-of-speech tags for each word in the tweet.
• Date: the date of the tweet.
• Twitter: this is the original tweet data that was crawled from Twitter without enrichment.
• EmbersId: a unique id for the tweet.
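To make the record layout concrete, the sketch below parses one made-up tweet object and pulls out a few of the tags listed above. The exact key names and their spelling in the EMBERS data are not given in the text, so every key used here (`embersId`, `city`, `country`, `date`, `longitude`, `latitude`, `twitter`) is an assumption, and the sample record is invented for illustration.

```python
import json

# Made-up sample record; key names are assumptions based on the tag list above.
raw_tweet = """
{
  "embersId": "example-0001",
  "city": "Caracas",
  "country": "Venezuela",
  "date": "2013-02-01",
  "twitter": {"text": "example tweet text"}
}
"""

def summarize_tweet(record):
    """Return (id, city, country, date, geolocation), with geolocation None if absent."""
    lon, lat = record.get("longitude"), record.get("latitude")
    geo = (lon, lat) if lon is not None and lat is not None else None
    return (
        record["embersId"],
        record.get("city"),
        record.get("country"),
        record.get("date"),
        geo,
    )

tweet = json.loads(raw_tweet)
summary = summarize_tweet(tweet)
```

The geolocation is treated as optional because the tag list notes that it is present only if available.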
Along with the daily tweets, the EMBERS project also supplies a dataset of daily reported
civil unrest events and their locations. The events serve as a ground truth source for validating the
results of the developed models and algorithms. Reference event metadata includes, but is not limited
to, the following information:
• Event Id: a unique id assigned for each event.
• Entry/Revision Date: when the event was recorded.
• Country: the country in which this event took place.
• State: the state in which this event took place.
• City: the city in which this event took place.
• Population: type of population that is involved in this event. Population types include:
General, Business, Labor, Agriculture, Religious, Medical, Education, and Legal.
• Date: the actual date of the event.
• Earliest reported date: the date the event was reported.
• News source: the source where the event was publicized.
• Headline: the headline used when the event was publicized.
• Event Description: a full description of the event.
Chapter 3
Survey of Related Research
This chapter surveys research related to the different parts of this dissertation. The
most widely used topic modeling approach is LDA, first presented by Blei et al. [Blei et al., 2003],
where documents are modeled as distributions (mixtures) over topics, and topics in turn are distributions over
the vocabulary used in the corpus.
LDA assumes that each document is generated in two stages. First, a distribution over topics
is chosen at random for the document. Then, for each word in the document, a topic is randomly chosen
from that distribution and a word is randomly chosen from the corresponding topic's distribution over
the vocabulary. The generative process
for LDA can be expressed as a joint probability distribution over the observed and hidden variables. Words
in documents are the only observed variables; the topic-term distributions, document-topic
distributions, topic assignments, topic priors, and document priors are all latent variables in the LDA
model.
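The generative story can be simulated directly, which is a useful way to check one's understanding of the model. The sketch below draws a small synthetic corpus with NumPy; the function name, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and the fixed document length are assumptions chosen for illustration.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha=0.5, beta=0.5, seed=0):
    """Sample documents from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # One distribution over the vocabulary per topic (K x V).
    topic_term = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs, assignments = [], []
    for _ in range(n_docs):
        # Stage 1: choose this document's distribution over topics.
        doc_topic = rng.dirichlet(np.full(n_topics, alpha))
        # Stage 2: for each word position, choose a topic, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=doc_topic)
        w = np.array([rng.choice(vocab_size, p=topic_term[k]) for k in z])
        docs.append(w)
        assignments.append(z)
    return topic_term, docs, assignments

topic_term, docs, assignments = generate_corpus(
    n_docs=5, doc_len=8, n_topics=3, vocab_size=20
)
```

Only `docs` would be observed in practice; `topic_term` and `assignments` are exactly the hidden quantities that inference tries to recover.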
The main goal of this model is to calculate the posterior probability of the topic assignment for
each word in the observed document conditioned on all other variables. This probability is
intractable to compute exactly, so it is approximated using collapsed Gibbs sampling.
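A compact collapsed Gibbs sampler for LDA is sketched below to show what resampling a topic assignment conditioned on all other assignments means in practice. It is a textbook-style illustration with symmetric priors, not the implementation used in this dissertation; the function name, hyperparameter defaults, and toy documents are assumptions.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, beta=0.1, n_iters=50, seed=0):
    """Resample each word's topic from its conditional given all other assignments."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # topic counts per document
    nkw = np.zeros((n_topics, vocab_size))  # word counts per topic
    nk = np.zeros(n_topics)                 # total words per topic
    z = []
    for d, doc in enumerate(docs):          # random initialization
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # Conditional over topics, with theta and phi integrated out.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return z, ndk, nkw

toy_docs = [[0, 1, 2, 0], [3, 4, 3, 4], [0, 1, 3, 4]]  # word ids in a 5-word vocabulary
z, ndk, nkw = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=5, n_iters=20)
```

Normalizing the rows of `ndk + alpha` and `nkw + beta` after sampling gives point estimates of the document-topic and topic-term distributions.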
There are three main research areas surveyed here: Temporal Topic Modeling, Topic
Modeling Extensions, and Temporal Text Visualization. In the following sections, each area will be
addressed in detail.
3.1 Temporal Topic Modeling
Temporal topic modeling algorithms started to appear around 2006, most of them being generaliza-
tions of static topic models. The difference between our goals and those of previous work is that we
aim to automatically identify segment boundaries that denote shifts of coverage, and, in this manner,
extract temporal relationships for examination. Therefore, we are not proposing a new topic model
but instead proposing how we can “wrap” around existing topic modeling algorithms to segment
time-stamped data.
Classical work in this space was done by Blei and Lafferty [Blei and Lafferty, 2006], who
extended traditional state space models to identify a statistical model of topic evolution. They also
developed techniques for approximating the posterior inference for detecting topic evolution in a
document collection.
In Blei and Lafferty's model, documents are modeled independently at each time slice. Topics
modeled at time t evolve from the topics modeled at time t−1. They replaced the Dirichlet
distribution used in classic LDA to draw the topic distribution over the vocabulary with a Gaussian
distribution. The word in a document is the only observed variable in their model. Latent variables
in their model are: topic assignment, topic-term distribution, document-topic distribution, and the
Dirichlet prior for the document-topic distribution.
The generative process for documents at time t is similar to the LDA generative process,
but the difference is that here the priors for each topic are chained in a state space model and
evolve with Gaussian noise. Also, the document-topic distributions are drawn from a logistic normal
distribution to express uncertainty over proportions. They presented a variational method for
approximate posterior inference.
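The chaining of topic parameters across time slices can be illustrated with a short simulation: each topic's unnormalized parameters take a Gaussian random-walk step per slice and are mapped to a distribution over the vocabulary with a softmax. This is only a sketch of the state space idea, not of the inference procedure; the function names and the noise scale `sigma` are assumptions.

```python
import numpy as np

def softmax(x):
    """Map real-valued parameters to a probability distribution along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def evolve_topics(n_slices, n_topics, vocab_size, sigma=0.1, seed=0):
    """Simulate topic-term distributions that drift with Gaussian noise over time."""
    rng = np.random.default_rng(seed)
    params = rng.normal(size=(n_topics, vocab_size))  # parameters at the first slice
    trajectory = [softmax(params)]
    for _ in range(1, n_slices):
        # Parameters at time t are those at time t-1 plus Gaussian noise.
        params = params + rng.normal(scale=sigma, size=params.shape)
        trajectory.append(softmax(params))
    return np.stack(trajectory)  # shape: (n_slices, n_topics, vocab_size)

trajectory = evolve_topics(n_slices=4, n_topics=2, vocab_size=6)
```

A small `sigma` yields topics that change smoothly from slice to slice, which is the behavior the chained prior is meant to encourage.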
Wang and McCallum [Wang and McCallum, 2006] proposed a non-Markov model for
detecting topic evolution over time. They assume that topics are associated with a continuous
distribution over timestamps and that the mixture distribution over topics that represents documents
is influenced by both word co-occurrence relationships and the document timestamp. In their model,
thus, topics generate both observed timestamps and words. Iwata and Yamada [Iwata et al., 2010]
also proposed a topic model that enables sequential analysis of the dynamics of multiple time scale
topics. In their proposed model, topic distributions over words are assumed to be generated based
on estimates of multiple timescale word distributions of the previous time period. Finally, Wang
and Blei [Wang et al., 2008] have recently proposed a model that replaces the discrete state space
that was originally proposed by Blei and Lafferty [Blei and Lafferty, 2006] with a Brownian motion
law [Lawler, 1995] to model topic evolution. They assume that topics are divided into sequential
groups so that topics in each slice are assumed to evolve from the previous slice. This line of
research has been extended to mining text streams, e.g., as done in [Wang et al., 2013]. Here,
the authors study the problem of mining evolving multi-branch topic trees inside a text stream
by proposing an evolutionary multi-branch tree clustering method. In their method, they adopt
Bayesian rose trees to build multi-branch trees and use a conditional prior over tree structures to
keep the information from previous trees as well as maintain tree smoothness over time. To keep
the consistency of trees over time, they define a constraint tree from triples and fans to compute the
tree structure differences.
Some recent papers have targeted the goal of modeling multiple information sources along
with capturing topic evolution. Zhang et al. [Zhang et al., 2010] have proposed an evolutionary
hierarchical Dirichlet process (EvoHDP) model which extends the hierarchical Dirichlet process
(HDP) to take time into account [Teh et al., 2004]. Inference of EvoHDPs is conducted through
a cascade Gibbs sampling strategy. Hong, Dom, Gurumurthy, and Tsioutsiouliklis [Hong et al.,
2011] have also addressed multiple streams and the temporal dynamics of topics detected from
these streams. They tackle the multiple stream problem by allowing each text stream to have both
local topics and shared topics. Each topic is associated with a time-dependent function that
characterizes the topic's popularity over time.
Other work focused on multiple text streams. Wang et al. [Wang et al., 2009] and [Wang
et al., 2007] aim to “align” time series streams so as to identify correlated and/or common topics
across disparate streams. Our algorithm was designed to analyze a single time-indexed corpus as
opposed to multiple (and asynchronous) text streams. The work of Leskovec et al. [Leskovec
et al., 2009] strings together individual tweets (memes) into a thread. The authors place additional
constraints in identifying these threads (e.g., that they originate in a single meme). The unit
of analysis in our work is the cluster rather than the tweet, and the desired output is rich scatter-gather
relations between clusters rather than simple branching patterns. The work presented by Gao et
al. [Gao et al., 2011] is the closest to our work. The authors conduct topic modeling, and investigate
both scattering and gathering possibilities of cluster organization. The difference is that our work
automatically determines segmentation boundaries where significant shifts of topic distributions
occur whereas the work of Gao et al. incrementally clusters every time point separately and then
aims to make splitting and merging decisions.
3.2 Other Topic Modeling Extensions
3.2.1 Topic Modeling for Short Text
Short text (microblogs) like tweets and Facebook feeds are a great source for mining public opinions
and interests. A number of studies have sought to make classic LDA work with short text.
An empirical study of topic modeling in microblogging environments
was introduced by [Hong and Davison, 2010]. No changes to the topic models were introduced;
instead, the authors proposed ways to apply existing topic models to short text. Through
extensive qualitative and quantitative experiments on three proposed
schemes based on classic LDA, they demonstrated that aggregating short messages is a valid
way to address the dependency of topic models on document length.
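The aggregation idea can be illustrated with a minimal sketch. It assumes that messages are pooled by author, which is only one possible pooling key; the field names `user` and `text` and the helper `aggregate_messages` are hypothetical and do not come from the cited study.

```python
from collections import defaultdict

def aggregate_messages(messages, key="user"):
    """Merge short messages that share `key` into one pseudo-document each."""
    pooled = defaultdict(list)
    for msg in messages:
        pooled[msg[key]].append(msg["text"])
    # Join in arrival order so each pseudo-document keeps its messages' sequence.
    return {k: " ".join(texts) for k, texts in pooled.items()}

# Hypothetical toy messages; a real corpus would hold many posts per author.
messages = [
    {"user": "a", "text": "flu shots downtown"},
    {"user": "b", "text": "school closed today"},
    {"user": "a", "text": "clinic open late"},
]
docs = aggregate_messages(messages)
```

The resulting pseudo-documents are longer than the individual messages and can be passed to an unmodified LDA implementation.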
Another model, Temporal-LDA (TM-LDA), was proposed by [Wang et al., 2012] for mining
streams of social text. The main idea was based on modeling the topics and topic transitions in data.
The proposed model detects transitions between topics by minimizing the prediction error on the topic
distribution of subsequent postings. It can also predict the expected topic distribution of future posts.
An algorithm for updating the transition parameters was also developed to make prediction more efficient
in online streaming settings.
3.2.2 Syntactic Topic Modeling
The first to take syntactic level information into account in topic modeling was [Boyd-Graber and
Blei, 2008]. They developed a nonparametric Bayesian model of parsed documents. The topics
discovered with their model were based on thematic and syntactic constraints. In their model, word
order was assumed to be generated according to the words' order in a parse tree.
Another study by Wallach [Wallach, 2008] considered language and document
structure in building topic models. Low-level structures, including word order and syntax, were
considered, as were higher-level structures such as relationships between documents.
Latent topics were combined with information about document structure, ranging from local sentence
structure to inter-document relationships. Three different structured topic models were introduced.
The first is a topic-based language model that captures both word order and latent topics. The second
is a dependency parsing topic model that is based on a Bayesian reinterpretation of a dependency
parsing model. The third is a topic model of high-level relationships between documents that
uses nonparametric Bayesian priors and Markov chain Monte Carlo methods to infer topic-based
document clusters.
Guo et al. in [Guo and Ramakrishnan, 2010] also developed a Latent Dirichlet Allocation
(LDA) based model that uses linguistic dependency information as a replacement for the features extracted from
Bag-of-Words (BOW) representations. They applied the proposed algorithm to movie reviews for
spoiler detection.
3.2.3 Sentiment Analysis and Topic Modeling
Research on combining sentiment analysis and topic modeling can be divided into two main
tracks. First, topic modeling was used to detect sentiment. Second, sentiment was extracted
simultaneously while discovering topics.
Some research does both; for example, Mei et al. [Mei et al., 2007] proposed a
probabilistic model called topic-sentiment mixture to capture topics and sentiments simultaneously.
Other research links sentiment analysis with topic modeling from the perspective of
using topic modeling for feature extraction in order to detect sentiment accurately. Liu provides a survey
of such work in his book chapter [Liu, 2010].
Opinion mining is a field that is closely related to sentiment analysis. Lu and Zhai [Lu and
Zhai, 2008] used expert reviews to mine text data for ordinary opinions. The main challenge
they faced was the need to align ordinary opinions to an expert review and separate similar and
supplementary opinions. For this purpose, they used a semi-supervised topic modeling approach.
Lin and He [Lin and He, 2009] proposed the joint sentiment/topic model (JST), a probabilistic
modeling framework that is based on LDA. It can detect sentiment and topics simultaneously from
text. They focused on document-level sentiment, and in order to detect sentiment they added a sentiment
detection layer on top of classic LDA.
3.2.4 Author Topic Model
Rosen-Zvi et al. [Rosen-Zvi et al., 2004] proposed a topic model that extracts topic and author
information from a text collection. The model generates two major distributions. The first is the
topic-terms distribution, which represents each topic's distribution over the terms in the text collection
vocabulary. The second is the author-topic distribution, which is each author's distribution over the
discovered topics. In their model, the only observed variables are document authors and the words
in a document. According to their model, documents are generated as follows: first, an author's
distribution over topics is randomly chosen. Then, for each topic, a distribution over the vocabulary
is randomly chosen. In a multi-author document, the probability distribution over topics
is a mixture of the distributions associated with the authors. Finally, for each word in a document,
an author is randomly chosen, then a topic is randomly drawn from that author's author-topic distribution,
and then a word is drawn from the chosen topic. The main quantity they try to calculate is the
probability of assigning a word in a document to a topic and an author, conditioned
on all other variables. This conditional probability is approximated using a
Markov chain Monte Carlo algorithm. They applied their model to three datasets: emails from
a corporation, abstracts from the CiteSeer library, and papers from the Neural Information Processing
Conference. They quantitatively evaluated their model by calculating perplexity and comparing it
to other models.
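The per-word generative steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the author-topic and topic-word distributions are fixed by hand here rather than randomly drawn, co-authors are picked uniformly, and the names `sample_index` and `generate_document` are hypothetical.

```python
import random

def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(authors, author_topic, topic_word, vocab, length, rng):
    """Generate `length` words for a document written by `authors`."""
    words = []
    for _ in range(length):
        author = rng.choice(authors)                     # pick one co-author (uniformly, by assumption)
        topic = sample_index(author_topic[author], rng)  # topic from that author's distribution
        word = vocab[sample_index(topic_word[topic], rng)]  # word from the chosen topic
        words.append(word)
    return words

rng = random.Random(0)
vocab = ["model", "inference", "email", "meeting"]
# Assumed toy parameters: two authors, two topics.
author_topic = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}
topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
doc = generate_document(["alice", "bob"], author_topic, topic_word, vocab, 10, rng)
```

Inference runs in the opposite direction: only `doc` and its author list are observed, and the topic and author assignment of each word must be estimated.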
3.2.5 Spatial Topic Modeling
Space is another aspect that has caught the interest of topic modeling researchers. Two algorithms
were introduced by Pan et al. [Pan and Mitra, 2011] for spatio-temporal event modeling from
text. The first was a three-step LDA algorithm, based on combining the popular LDA model
with temporal segmentation and spatial clustering. Here, LDA is first used to obtain document
distributions over topics, and temporal and spatial references in documents within each topic are
then resolved. The second algorithm, a space-time LDA algorithm was based on work introduced
in [Wang et al., 2009]. This approach was originally used to encode spatial structure among visual
words for image segmentation. The generative procedure of this algorithm partitions visual
words that are close in space into the same documents. The main difference between this work and
our DSTM is that our proposed DSTM uses a classical discrete time approach to capturing topic
evolution, but words are reliant on both topic distributions and location distributions.
3.3 Temporal Text Visualization
Text visualization uses interactive visual representations of text that go beyond merely the words
themselves to show important features of a large dataset [Don et al., 2007]. The simplest feature is
word frequency; tag clouds (or word clouds) [Hearst and Rosner, 2008, Lohmann et al., 2009] are
accordingly the most common text visualization technique, and use graphical variables like font,
color, weight, and the size of the word to convey its importance in the source text. Variations include
Wordle [Viégas et al., 2009] and ManiWordle [Koh et al., 2010] for producing more compact and
aesthetically pleasing layouts.
More advanced text visualization techniques convey not only word frequency, but also
structural features of the text corpus (or corpora), often using a graph structure. Examples include
WordBridge [Kim et al., 2011], Phrase Nets [van Ham et al., 2009], and GreenArrow [Wong et al.,
2005], which all use a variant of a dynamic graph to convey the structure in the document. Clustered
word clouds [?, Hassan-Montero and Herrero-Solana, 2006] take another approach by using word
structure to modify the layout in a tag cloud, which is similar to the themescape representation used
by IN-SPIRE [Wise et al., 1995].
Text data may often be analyzed over time to expose aspects of the data that evolved during
a time period; for example, recent studies have highlighted the importance of time and causality in
investigative analysis [Kwon et al., 2012]. While less saturated than general text visualization, there
exist several pieces of work that visualize text over time in this field.
Perhaps one of the earliest and most seminal of these is ThemeRiver [Havre et al., 2002],
which essentially can be thought of as a horizontal arrangement of a large number of one-dimensional
tag clouds over time; the keywords become bands in a stacked graph evolving on a timeline. Byron
and Wattenberg [Byron and Wattenberg, 2008] later formalized these representations into so-called
streamgraphs, and applied them to many other types of data.
Several approaches exist that are similar to the basic ThemeRiver stacked chart. Tufte [Tufte,
1983] showed an illustration of changing themes of music through the ages. NewsLab [Ghoniem
et al., 2007] applies a ThemeRiver representation to visualize a collection of thousands of news
videos over time. NameVoyager [Wattenberg, 2006] uses streamgraphs to show temporal frequencies
of baby names. TIARA [Wei et al., 2010] blends tag clouds onto temporal stacked charts for
important themes. However, because stacked charts group themes into a single visual entity, none
of these techniques is capable of conveying the structural features of the corpus.
A select few techniques provide both temporal and structural information. Parallel tag clouds
(PTCs) [Collins et al., 2009b] integrate one-dimensional tag cloud layouts on a set of parallel axes,
each of which could potentially be used for a time or date. However, while visually similar to our
ThemeDelta technique, PTCs do not explicitly convey the grouping of words into topics. Another
similar design is TimeNets [Kim et al., 2010], which uses colored lines to show grouping over time
for genealogical data. In fact, “Movie Narrative Charts” (comic 657) of the web comic XKCD1
also uses sinuous lines to convey groupings of characters in time and space for famous movies.
Finally, Turkay et al. [Turkay et al., 2011] present two techniques for visualizing structural changes
of temporal clusters that are reminiscent of ThemeDelta; while not specifically designed for text,
the clusters used in their work could very well stem from textual corpora. ThemeDelta takes a
similar visual metaphor, but focuses on text and is intrinsically tied to the scatter/gather temporal
segmentation component as well.
A very relevant work is ParallelTopics, a visual analytics system that integrates LDA with
interactive visualization [Dou et al., 2011]. The system uses the parallel coordinate metaphor to
present document-topic distributions, with applications to exploring National Science Foundation
awarded proposals, VAST publications, as well as tweets. This system presents the underlying
probabilistic distributions in the LDA model from a temporal perspective using multiple aggregation
strategies and interactions. The system can capture the topics and their evolution over time, but only
using fixed time frames. In contrast, our ThemeDelta approach discovers time frames automatically
based on topic reorganizations across time.
Finally, TextFlow [Cui et al., 2011] is perhaps the closest related work to ThemeDelta, and
uses tightly integrated visualization and topic mining algorithms to show an evolving text corpus
over time. However, whereas we draw upon the same basic visual representation as TextFlow, our
focus in this work is segmenting time based on topic shifts and then interfacing with standard topic
modeling using a novel algorithm. Furthermore, ThemeDelta does not aggregate keywords into
stacks or glyphs, and puts more emphasis on interactive layout.
1 http://www.xkcd.com/
Chapter 4
Dynamic Temporal Topic Modeling
The main goal behind the work presented here is to examine the ability to identify segment
boundaries that detect significant shifts of topic coverage. The motivation that drove this work is
the significant research done in detecting topic evolution in text corpora. Research in the literature
focused on extending Latent Dirichlet Allocation (LDA), the classic topic model proposed
by [Blei et al., 2003].
In this chapter, we present a time-series segmentation algorithm that identifies segmentation
boundaries. These segmentation boundaries are detected when a significant shift of topic
coverage occurs. To detect shifts in topics, we embed a topic modeling algorithm within a segmentation
algorithm. In contrast with the work reviewed in Section 3.1, the goal is
not simply to track the temporal evolution of topics, but to identify segments that denote significant
shifts in their content (distribution).
We use the algorithm to study Internet use in advantaged and disadvantaged communities.
The dataset used for this application was the i-Neighbors dataset (Section 2.1). We also applied the algorithm
to paragraphs extracted from The Washington Times newspaper. The newspaper data was extracted
from the Historical Newspaper dataset presented in Section 2.2. This application focused on studying the
coverage of the influenza epidemic in 1918.
4.1 Segmentation Algorithm
Our segmentation algorithm expects the input data to be in a bag-of-words format. The preprocessing
needed is thus to tokenize the text into individual words, followed by standard processing steps
such as lower case conversion, stemming, stop word removal, spell checking, and punctuation
removal. The main task of the segmentation algorithm is to automatically partition the total time
period defined by the documents in the collection such that segment boundaries indicate important
periods of temporal evolution and re-organization.
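A minimal sketch of part of this preprocessing is shown below, using only the Python standard library. It covers lower case conversion, tokenization, punctuation removal, and stop word removal; stemming and spell checking are omitted because they need external resources. The stop word list and the function name `to_bag_of_words` are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real application would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def to_bag_of_words(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop punctuation and stop words, then count terms."""
    lowered = text.lower()
    # Keep alphabetic runs only, which also discards punctuation and digits.
    tokens = re.findall(r"[a-z]+", lowered)
    return Counter(t for t in tokens if t not in stop_words)

bag = to_bag_of_words("The influenza epidemic: schools closed, and the theaters closed.")
```

Each document becomes a term-count mapping, which is the form a bag-of-words topic model consumes.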
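[Figure 4.1 shows two adjacent segments with three topics each. Segment 1: Z1 = {w5, w1, w3, w6}, Z2 = {w5, w1, w3}, Z3 = {w7, w2, w4, w8}. Segment 2: Z′1 = {w6, w2, w1, w4}, Z′2 = {w5, w2, w3, w6}, Z′3 = {w1, w2, w3, w4, w8}. Contingency table of term overlaps (rows Z′1 to Z′3, columns Z1 to Z3): 2 1 2 / 3 2 1 / 2 2 3.]
Figure 4.1: Contingency table used to evaluate independence of topic distributions for two adjacent windows [Gad et al., 2012].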
The algorithm moves across the data by time and evaluates two adjacent windows assuming
a given segmentation granularity (e.g., discrete days, weeks, or months). This granularity varies
from one application to another and is decided by domain experts. We evaluate adjacent windows by
comparing their underlying topic distributions and quantifying common terms and their probabilities.
We chose to quantify common terms based on the overlap between them. The overlap can be
captured using a contingency table. Figure 4.1 shows a simplified example of two segments, each
comprising three topics, and the corresponding contingency table measuring the overlap between
these distributions. For example, topic 1 (Z1) in segment 1 and topic 1 (Z′1) in segment 2 overlap
in w1 and w6. This results in a count of 2 in the contingency table cell that corresponds to
these two topics. We would like the topic models
of the two adjacent windows to be maximally independent, which will happen if the table entries
are near uniform.
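The overlap counts can be reproduced with a short sketch. The term sets below are read off Figure 4.1; their assignment to individual topics is an assumption of this sketch (it is consistent with the stated overlap of w1 and w6 between Z1 and Z′1), and the function name `contingency_table` is hypothetical.

```python
def contingency_table(topics_a, topics_b):
    """Entry [i][j] counts the terms shared by topic i of window A and topic j of window B."""
    return [[len(set(za) & set(zb)) for zb in topics_b] for za in topics_a]

# Top terms per topic, as read off Figure 4.1 (assumed topic assignment).
segment1 = [["w5", "w1", "w3", "w6"], ["w5", "w1", "w3"], ["w7", "w2", "w4", "w8"]]
segment2 = [["w6", "w2", "w1", "w4"], ["w5", "w2", "w3", "w6"], ["w1", "w2", "w3", "w4", "w8"]]

table = contingency_table(segment1, segment2)
# table[0][0] is the Z1 / Z'1 cell: the shared terms are w1 and w6.
```

Here rows index the topics of segment 1 and columns those of segment 2; swapping the arguments gives the transposed table.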
Formally, given the input data indexed over a time series $T = \{t_1, t_2, \ldots, t_t\}$, the
segmentation problem we are trying to tackle is to express $T$ as a sequence of segments or windows:
$S_T = (S_{t_1}^{t_a}, S_{t_{a+1}}^{t_b}, \ldots, S_{t_k}^{t_l})$ where each of the windows $S_{t_s}^{t_e}$, $t_s \le t_e$, denotes a contiguous sequence of
time points with $t_s$ as the beginning time point and $t_e$ as the ending time point.
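To make the notation concrete, the sketch below assigns time-stamped documents to windows given as inclusive (start, end) pairs. Integer time points and the helper name `split_into_windows` are assumptions made for illustration.

```python
def split_into_windows(documents, boundaries):
    """Group (time_point, text) pairs by the inclusive (t_s, t_e) window they fall in."""
    windows = {b: [] for b in boundaries}
    for t, text in documents:
        for t_s, t_e in boundaries:
            if t_s <= t <= t_e:
                windows[(t_s, t_e)].append(text)
                break  # windows are contiguous and disjoint, so one match is enough
    return windows

# Hypothetical toy corpus indexed by integer time points.
documents = [(1, "doc a"), (2, "doc b"), (5, "doc c")]
windows = split_into_windows(documents, [(1, 3), (4, 6)])
```

Each list of documents is then the input to a separate topic model for that window.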
Each window $S_{t_s}^{t_e}$ has a set of topics that is discovered from the set of documents that fall
within this window. The topics are discovered by applying LDA (Latent Dirichlet Allocation) [Blei
et al., 2003]. Applying this algorithm results in two main distributions: the document-topic
distribution (representing the distribution of the discovered topics over the documents) and the topic-terms
distribution (representing the distribution of the discovered topics over the vocabulary).
Topics within each window are represented as $S_{t_s}^{t_e} = \{z_1, z_2, \ldots, z_n\}$ where $n$ is the number of
top topics $z$ discovered. Each topic is represented by a set of terms $w$ as follows: $z_i = \{w_1, w_2, \ldots, w_m\}$
where $m$ is the number of top terms extracted from the topic-terms distribution that results from
applying LDA to the documents within a window. The number of top topics $n$ and the number of top terms
$m$ representing a topic vary from one application to another.
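Extracting the top $m$ terms of a topic amounts to ranking its topic-terms distribution. The sketch below assumes each topic is available as a term-to-probability mapping; the name `top_terms` and the alphabetical tie-break are illustrative choices.

```python
def top_terms(topic_term_probs, m):
    """Return the m most probable terms of one topic, ties broken alphabetically."""
    ranked = sorted(topic_term_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:m]]

# Hypothetical topic-terms distribution for a single topic.
topic = {"flu": 0.4, "school": 0.3, "mask": 0.2, "train": 0.1}
z = top_terms(topic, 2)
```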
We represent two adjacent windows as $S_{t_{s1}}^{t_{e1}}$ and $S_{t_{s2}}^{t_{e2}}$. To evaluate two adjacent windows, we
construct the contingency table for the two windows. The contingency table is of size $r \times c$ where rows
$r$ denote topics in one window and columns $c$ denote topics in the other window. Entry $n_{ij}$ in cell
$(i, j)$ of the table represents the overlap of terms between topic $i$ of $S_{t_{s1}}^{t_{e1}}$ and topic $j$ of $S_{t_{s2}}^{t_{e2}}$.
We used a contingency table because it enables the replacement of LDA with any emerging
topic modeling variants. As presented in [M. Shahriar Hossain, 2013] we can embed any vector
quantization clustering algorithm in a contingency table framework. For instance, distributions
inferred from a more sophisticated model can be compared using the contingency table formulation
introduced here.
Then to check the uniformity of the table, three steps should be accomplished:
First, calculate the following two quantities:
• Column-wise sums (over the columns $j$ of row $i$): $n_{i\cdot} = \sum_j n_{ij}$
• Row-wise sums (over the rows $i$ of column $j$): $n_{\cdot j} = \sum_i n_{ij}$
These two quantities will be used to quantify the overlap between the topics discovered from
two adjacent windows. In our implementation for this step, each topic is represented by its top
assigned terms. The contingency table is created from these terms (here we chose 20 terms and the
choice of the number of terms is inherently heuristic and specific to the application). A probabilistic
similarity measure such as the KL- or JS-divergence between the distributions being compared is
another possibility.
Second, we define two probability distributions, one for each row and one for each column:
$$p(R_i = j) = \frac{n_{ij}}{n_{i\cdot}}, \quad (1 \le j \le c) \qquad (4.1)$$
$$p(C_j = i) = \frac{n_{ij}}{n_{\cdot j}}, \quad (1 \le i \le r) \qquad (4.2)$$
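Third, we calculate the objective function F to capture the deviation of these row-wise and
column-wise distributions with regard to the uniform distribution.
The objective function is defined as follows:
$$F = \frac{1}{r} \sum_{i=1}^{r} D_{KL}\left(R_i \,\|\, U\left(\tfrac{1}{c}\right)\right) + \frac{1}{c} \sum_{j=1}^{c} D_{KL}\left(C_j \,\|\, U\left(\tfrac{1}{r}\right)\right) \qquad (4.3)$$
where
$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \qquad (4.4)$$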
This objective function can reach a local minimum, which is acceptable given that we are
trying to segment time based on shifts in topics and this approach captures the first shift in topics
(as opposed to detecting an optimal segmentation which would require a more exhaustive search
through breakpoint layouts).
Algorithm 1. Topic Modeling Based Segmentation
Input: T = {t0, t1, t2, t3, . . . , tt}
       x = min. window size.
       y = max. window size.
Output: ST = {} //Set of all segments between t0 and tt
W1Start = t0
W1Size = x
F = Initialize objective function with a large number.
while W1Start + W1Size + x ≤ tt and W1Size ≤ y do
    //x is added to W1 to take into account the data availability for W2.
    Conversion = False
    //Reset start and size of W2.
    W2Start = W1Start + W1Size + 1day
    W2Size = x
    while W2Start + W2Size ≤ tt and W2Size ≤ y do
        Apply LDA separately on W1 and W2
        Calculate F′ for W1 and W2
        if F′ > F or W1Size == y or W2Size == y do
            //Convergence or max. window size limit reached.
            Add W1 and W2 to ST
            W1Start = W2Start + W2Size + 1day
            W1Size = x
            Conversion = True
            Break
        F = F′
        W2Size += x //Expand W2.
    if !Conversion do
        W1Size += x
if leftover data exists do
    //leftover data starts at W1Start and ends at tt.
    Apply LDA on leftover data.
    Add window of leftover data to ST.
return ST
Here, DKL denotes the KL-divergence that is used to calculate the distance between the
row-wise and the uniform distribution. Likewise, it is used to calculate the distance between the
column-wise distributions and the uniform distribution. Then the values resulting from using the
DKL will be used in calculating the objective function F .
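The three steps can be sketched in a few lines. This illustration uses the standard definition of KL divergence with natural logarithms, skips zero-count cells (their contribution is zero), and treats an all-zero row or column as contributing nothing; that last choice and the function names are assumptions, not part of the original formulation.

```python
import math

def kl_to_uniform(counts):
    """KL divergence from the normalized counts to the uniform distribution."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an empty row or column contributes nothing
    k = len(counts)
    # P(i) * log(P(i) / (1/k)), skipping cells where P(i) = 0.
    return sum((n / total) * math.log((n / total) * k) for n in counts if n > 0)

def objective(table):
    """F: mean row-wise plus mean column-wise divergence from uniform."""
    rows = table
    cols = list(zip(*table))
    row_term = sum(kl_to_uniform(r) for r in rows) / len(rows)
    col_term = sum(kl_to_uniform(c) for c in cols) / len(cols)
    return row_term + col_term

uniform = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]    # every topic overlaps equally with every other
diagonal = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]   # each topic matches exactly one topic
f_uniform = objective(uniform)
f_diagonal = objective(diagonal)
```

A perfectly uniform table gives F = 0, while a one-to-one matching between topics gives the largest value for a table of that size, here 2 log 3.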
The algorithm repeats the above mentioned steps for all permutations of the two sliding
window sizes. The goal is to minimize F , in which case the distributions observed in the contingency
table are as close to a uniform distribution as possible, in turn implying that the topics are maximally
dissimilar.
There are two stopping conditions for this algorithm: (1) convergence of F is achieved,
or (2) the maximum size for both windows is reached. A detailed description of the algorithm
is shown in Algorithm 1. In the following section, two applications of this algorithm are
presented.
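The control flow of Algorithm 1 can be sketched as follows. This is an illustrative transcription under stated assumptions: time points are integers, one step stands for one day, each window is a (start, size) pair, and the hypothetical callback `score(w1, w2)` stands in for applying LDA to both windows and computing F′. As in the listing, F is not reset after a pair of windows is committed.

```python
def segment(t_end, x, y, score, t_start=0, step=1):
    """Greedy segmentation of integer time points t_start..t_end.

    x and y are the minimum and maximum window sizes; windows are (start, size) pairs.
    """
    segments = []
    w1_start, w1_size = t_start, x
    f = float("inf")  # objective initialized with a large number
    # x is added so that enough data remains for the second window.
    while w1_start + w1_size + x <= t_end and w1_size <= y:
        converged = False
        w2_start, w2_size = w1_start + w1_size + step, x
        while w2_start + w2_size <= t_end and w2_size <= y:
            f_new = score((w1_start, w1_size), (w2_start, w2_size))
            if f_new > f or w1_size == y or w2_size == y:
                # F stopped decreasing, or a window hit the maximum size.
                segments.append((w1_start, w1_size))
                segments.append((w2_start, w2_size))
                w1_start = w2_start + w2_size + step
                w1_size = x
                converged = True
                break
            f = f_new
            w2_size += x  # expand the second window
        if not converged:
            w1_size += x  # expand the first window
    if w1_start <= t_end:
        # Leftover data starts at w1_start and ends at t_end.
        segments.append((w1_start, t_end - w1_start))
    return segments

def toy_score(w1, w2):
    """Hypothetical stand-in for F': smallest when the second window has size 4."""
    return abs(w2[1] - 4)

result = segment(t_end=20, x=2, y=6, score=toy_score)
```

Passing the scoring step in as a function mirrors the point made earlier that LDA can be swapped for another topic model without changing the segmentation loop.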
4.2 Algorithm Applications
4.2.1 Bridging the Divide in Democratic Engagement: Studying Conversation Patterns in Advantaged and Disadvantaged Communities
This work was done in collaboration with Naren Ramakrishnan (Department of Computer Science,
Virginia Tech), Keith N. Hampton (School of Communication and Information, Rutgers University)
and Andrea Kavanaugh (Department of Computer Science, Virginia Tech). It was published in
the ACM Social Informatics 2012 [Gad et al., 2012].
The Internet offers opportunities for informal deliberation, and civic and civil engagement.
However, social inequalities have traditionally meant that some communities, where there is a
concentration of poverty, are both less likely to exhibit these democratic behaviors and less likely to
benefit from any additional boost as a result of technology use. We argue that some new technologies
afford opportunities for communication that bridge this divide. Using temporal topic modeling,
we compare informal conversational activity that takes place online in communities of high and
low poverty. Our analysis is based on data collected through iNeighbors, a community website
that provides neighborhood discussion forums. We examine the adoption of iNeighbors by poverty
level, and apply our algorithm to six neighborhoods (three economically advantaged and three
economically disadvantaged) and evaluate differences in conversations for statistical significance.
Our findings suggest that social technologies may afford opportunities for democratic engagement
in contexts that are otherwise less likely to support opportunities for deliberation and participatory
democracy.
Democratic engagement, at both the individual and community levels, is one of the strongest
predictors of well-being [Helliwell and Putnam, 2004]. While political behaviors, such as voting,
are among the most studied aspects of democratic engagement, they are only a small subset of
the behaviors that contribute to a democracy. Participation in a democracy involves more than the
occasional selection of representatives. Citizens and their communities benefit from individual
and collective action to address issues of common concern through activities outside of elections
and government [Carpini and Keeter, 1996]. Participatory democracy includes a range of civic
behaviors, including membership in institutions that address public issues, such as a neighborhood
watch [Putnam, 2000], as well as civil behaviors, such as helping a neighbor in an emergency
[Klinenberg, 2002]. These behaviors are intertwined with casual conversations, that, although not
overtly deliberative or political, are a part of the “incomplete” [Fishkin and Stone, 1995] forms of
political deliberation that are key to shaping social identities, friendships, and trust [Walsh, 1992].
This combination of informal participation and casual, public deliberation provides for the social
mixing that is important for opinion formation, awareness of common interests, social tolerance,
and the ability to act on collective goals [Dewey, 1927]. Unfortunately, like so many forms of
democratic engagement, civic and civil behaviors and informal opportunities for deliberation are
unequally distributed.
Civic and civil behaviors, including opportunities for informal deliberation, are stratified
by class [Uslaner and Brown, 2005]. Those of lower income are significantly less likely to
exhibit attitudes and behaviors for democratic engagement [Carpini and Keeter, 1996]. In addition,
inequality is not equally distributed across the country, but is concentrated in geographic areas of
concentrated disadvantage: neighborhoods that are high in poverty, racial segregation, and social
problems, such as crime [Sampson, 2011]. The concentration of inequalities is associated with
structural instability that reduces the ability of residents to form the local social bonds necessary
for collective action [Sampson, 2011]. As a result, those communities with the greatest need for
informal discussion and participatory democracy are typically those where it is most absent.
Research on the role of new information and communication technologies (ICTs) and
democratic engagement has generally found positive relationships between exposure to online
political information and democratic behaviors [Shah et al., 2005, Boulianne, 2009]. Participation
in online activities that support informal deliberation, such as social networking services, has also
been found to contribute to political participation [Hampton et al., 2011]. However, there is almost
no evidence that the use of ICTs overcomes existing socioeconomic inequalities associated with
democratic engagement [Hargittai and Shaw, 2011]. Indeed, there may be a “Matthew effect”
[Merton, 1968], such that those who are already the most likely to express democratic behaviors
gain further as a result of new ICTs, while those who have little gain little as a result of ICT use.
We argue for an alternative theory. We believe that new ICTs, specifically social media, offer new
affordances for group interaction, informal deliberation and democratic engagement [Kavanaugh,
2013]. Unlike some other Internet technologies, social media afford contact in contexts where
individuals have a shared affinity – through geography, political interests, or other interest – but
previously lacked the means or ease of access for connectivity (in-person or online). We focus on
how these affordances reduce the cost of communication for urban communities with concentrated
inequalities.
This reduction in the cost of communication helps residents overcome established structural
barriers to social tie formation, informal deliberation and participatory democracy. The result is a
set of opportunities for democratic engagement among people and in areas previously constrained
by structural barriers to collective action. When such social media that are designed to bring local
people together are made available to people in urban neighborhoods with high socioeconomic
inequalities, we expect to find democratic engagement that is as high as what is typical of areas
where such inequalities are less concentrated.
Specifically, our goal is to study the adoption of a tool for informal deliberation at the
neighborhood level and to compare conversation patterns across advantaged and disadvantaged
communities based on their level of concentrated poverty. Our aim is to characterize differences
in informal deliberation, if any, between these advantaged and disadvantaged neighborhoods, as
well as to detect common interests between them. This will provide insight into how neighborhoods
with different poverty levels use ICTs for informal deliberation.
In order to be able to detect deliberation and common interests, we applied our temporal
segmentation algorithm. The objective of applying the algorithm is to detect segments where there
are significant concordances of topics, but such that segment boundaries identify significant shifts
in topics.
Once a neighborhood discussion is characterized in this manner, we can: compare the time
duration of topics in neighborhoods with different poverty levels, identify differences in topics
discussed between neighborhoods of different poverty levels, and identify differences in topics
discussed between neighborhoods of similar poverty levels.
Our goal is to identify segments that denote significant shifts of content (distributions). In
turn, this will help to detect differences in deliberation and common interests between advantaged
and disadvantaged neighborhoods. This requires us to capture similarities and distinctions between
neighborhoods based on: the amount of time neighborhoods with different poverty levels spent
discussing the same topics, average similarity in topics discussed between neighborhoods with
different poverty levels, and average similarity in topics discussed between neighborhoods with the
same poverty levels.
Using the segmentation algorithm we aim to identify segmentations such that segment
boundaries indicate qualitative changes in topic distributions. Every neighborhood in the analysis is
characterized in this manner and the resulting segmentations are then clustered with a view toward
identifying enrichments that hold (or do not) at different poverty levels.
Internet use in communities
This study builds on prior research that explores the relationship between Internet use and local
engagement [Hampton and Wellman, 2003, Hampton, 2007, Kavanaugh et al., 2000, Kavanaugh
et al., 2007, Kavanaugh et al., 2008, Hampton, 2010]. In particular, we focus on the uneven impacts
that Internet use may have on participatory democracy and informal deliberation for communities
with a concentration of poverty.
A number of studies have demonstrated that the availability of a relatively simple neigh-
borhood website and discussion forum can increase local tie formation, informal deliberation, and
civil and civic behaviors [Hampton, 2007, Hampton and Wellman, 2003, Hampton, 2010]. For
example, a longitudinal study of how local social networks changed as a result of a neighborhood
email list found that the average person gained over four new local social ties for each year that
they used the intervention [Hampton, 2007]. Moreover, the type of discussion that was common in
these forums was found to promote collective action and civic engagement [Hampton and Wellman,
2003, Hampton, 2007]. A recent, large, random survey of American adults found that of those who
use an online neighborhood discussion forum, 60% know all or most of their neighbors, 79% talk
with neighbors in person at least once a month, and 70% had listened to a neighbor's problems in the
previous six months. This compares to the average American, 40% of whom knew their neighbors,
61% talked in person, and 40% listened to a neighbor's problems [Hampton et al., 2009].
Characterizing Neighborhoods
We used our segmentation algorithm to track discussions across each individual neighborhood; the
next step is to compare such segmentations across neighborhoods.
Recall that LDA topics are characterized in terms of distributions over terms, $p(w \mid z_n)$,
and that such distributions are weighted to yield the joint distribution:
$$p(w, z_n) = p(z_n)\, p(w \mid z_n) \qquad (4.5)$$
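As a small worked illustration of Equation 4.5, the sketch below weights each topic's term distribution by the topic's probability. The toy numbers and the name `joint_distribution` are assumptions made for this example.

```python
def joint_distribution(topic_weights, topic_term_probs):
    """Return {(term, topic_index): p(z_n) * p(term | z_n)}."""
    return {
        (term, n): topic_weights[n] * prob
        for n, term_probs in enumerate(topic_term_probs)
        for term, prob in term_probs.items()
    }

# Hypothetical segment with two topics.
topic_weights = [0.75, 0.25]                      # p(z_n)
topic_term_probs = [
    {"school": 0.5, "meeting": 0.5},              # p(w | z_0)
    {"police": 1.0},                              # p(w | z_1)
]
joint = joint_distribution(topic_weights, topic_term_probs)
```

Because the topic weights and each topic's term probabilities sum to one, the joint values also sum to one.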
These distributions (one for each segment of each neighborhood) must now be compared with
an aim toward identifying commonalities and discrepancies. However, before we capture distinctions
between such distributions, we must ensure that the underlying distributions are expressed over the
same vocabulary (terms). To this end, we use the superset of terms from both distributions as the
sample space over which two segments induce their respective distributions.
Most clustering algorithms require a symmetric measure of association and we employ the
Jensen-Shannon Divergence (JSD):
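\[ \mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M) \tag{4.6} \]
where
\[ M = \tfrac{1}{2}(P + Q) \tag{4.7} \]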
Note that the Jensen-Shannon divergence is just a symmetrized version of the KL-divergence.
The dissimilarity matrix constructed in this manner can be used as input to any clustering algorithm;
here we use agglomerative clustering with the single-linkage criterion.
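The comparison step can be illustrated with a minimal Python sketch. It is not the implementation used in this work: it assumes each segment has already been reduced to a single dictionary mapping terms to probabilities (how the joint distributions over terms and topics are aggregated per segment is left open here), it uses base-2 logarithms, and the function names and toy segments are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; entries with p[i] == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p_terms, q_terms):
    """Jensen-Shannon divergence between two term distributions (dict: term -> prob)."""
    # Express both distributions over the superset of their vocabularies.
    vocab = sorted(set(p_terms) | set(q_terms))
    p = [p_terms.get(t, 0.0) for t in vocab]
    q = [q_terms.get(t, 0.0) for t in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def dissimilarity_matrix(segments):
    """Symmetric matrix of pairwise JSD values."""
    n = len(segments)
    return [[jsd(segments[i], segments[j]) for j in range(n)] for i in range(n)]

def single_linkage(dist, num_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose closest pair of members is nearest (single-linkage criterion)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy segments (not data from the study).
segments = [
    {"water": 0.5, "pipes": 0.3, "bills": 0.2},
    {"water": 0.4, "pipes": 0.4, "shelter": 0.2},
    {"school": 0.6, "daycare": 0.4},
]
dist = dissimilarity_matrix(segments)
clusters = single_linkage(dist, num_clusters=2)
print(clusters)
```

In this toy input the two water-related segments share most of their probability mass and are merged first, while the school segment, which shares no terms with the others, stays in its own cluster.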
Qualitative Methods
To test our hypothesis, that social media can afford democratic engagement in areas of concentrated
poverty, we focus our analysis on where the iNeighbors intervention has been a success. By focusing
on the 20 most active iNeighbors groups, previously described in 2.1, we identify local areas that
have successfully adopted social media for civic and civil engagement. Traditionally, we would
expect to find very few examples of engagement in areas where poverty rates are high; nearly all
successful iNeighbors groups should be in areas where there is little concentration of inequality.
However, our hypothesis runs counter to this traditional expectation: we expect social media to
afford successful democratic engagement in areas where poverty rates are high.
Figure 4.2: Partial segmentation output from a low-poverty neighborhood.
Neighborhood shown: Low 1.
Segment boundaries: 2009/10/03, 2009/12/03, 2010/08/04, 2010/09/05, 2010/12/06, 2010/12/29.
Topic summaries shown in the figure:
- Dogs waste issue; elementary and middle schools related discussions (e.g. daycare services, celebrations); home owners meeting setup.
- Announcements about fitness/workout classes; users trade things; users sharing doctors contacts information.
- Smashed and stolen pumpkins; users share their email in discussions; cars broken into, police reports.
- Holidays greetings; encourage donations for troops; donations for families in need.
- Home owner association discussions about new buildings issues; corruption acts by contractor who works for HOA; handover HOA to a new management.
To test our hypothesis that informal deliberation in areas of high poverty would be similar to
deliberation that takes place in areas where poverty is low, we modeled how long neighborhoods
with different poverty levels spent discussing topics, the average similarity in topics discussed
between neighborhoods with different poverty levels, and the average similarity in topics discussed
between neighborhoods of similar poverty levels. For this application, we used
the dataset presented in 2.1. This dataset consists of six neighborhoods, three advantaged and three
disadvantaged.
Our goal is to study two basic questions:
• How long do neighborhoods with different poverty levels spend discussing topics?
• What is the average similarity in topics discussed between neighborhoods with different
poverty levels, and the average similarity in topics discussed between neighborhoods with
similar poverty levels?
Figure 4.3: Partial segmentation output from a high-poverty neighborhood.
Neighborhood shown: High 3.
Segment boundaries: 2009/01/01, 2009/02/01, 2009/10/02, 2009/12/03, 2010/08/04.
Topic summaries shown in the figure:
- Sustainability plan draft discussions; water leakage issues; budgets discussions; elementary and middle schools events and renovation; arrange civic association and city delegation meeting.
- Water related discussions (e.g. toxins and pressure); pets related discussions (e.g. lost pets and shelters).
- Discussions about recycling; pets shelters and animal rescue; water infrastructure discussions; asking for volunteers.
- Trash schedule; problems with neighborhood youth (e.g., crime); water bills and new pipes; animal shelters.
Figure 4.4: Partial segmentation output from a low-poverty neighborhood.
Neighborhood shown: Low 3.
Segment boundaries: 2009/01/28, 2009/02/28, 2009/11/01, 2010/01/02.
Topic summaries shown in the figure:
- Neighborhood watch meeting setup; petition for commercial vehicles parking; several cars break-ins.
- New development company building low income rentals; discussion related to the legality of soliciting.
- Christmas greetings and announcements that Santa will be at the clubhouse; bad homes built by a contractor causing bad publicity for the neighborhood.
Findings
We applied our temporal segmentation algorithm on the six selected neighborhoods. The output of
the algorithm is a set of segments from each neighborhood, a dissimilarity matrix, and a dendrogram
depicting the clustering of all segments across neighborhoods. Some segments were examined
manually, by checking the original text to validate the segmentation output. A partial segmentation
output is shown in Fig. 4.3 for a disadvantaged neighborhood and in Fig. 4.2 and Fig. 4.4 for two more
advantaged neighborhoods.
• Characterizing Segment Durations
Figure 4.5: Durations of segments in advantaged (low poverty) and disadvantaged (high poverty) neighborhoods.
Time axis ticks: 2009/01/01, 2010/01/01, 2010/12/30. Segment boundaries per neighborhood:
Low 2:   2009/09/10, 2010/01/11, 2010/09/11, 2010/10/12, 2010/12/13, 2010/12/29
Low 1:   2009/01/01, 2009/02/01, 2009/10/02, 2009/12/03, 2009/08/04, 2009/09/05, 2009/12/06, 2009/12/29
Low 3:   2009/01/28, 2009/02/28, 2009/11/01, 2010/01/02, 2010/09/03, 2010/11/04, 2010/12/22, 2010/12/05
High 2:  2009/01/01, 2009/05/01, 2010/01/02, 2010/02/03, 2010/10/04, 2010/11/05, 2010/12/06, 2010/12/30
High 1:  2009/01/01, 2009/02/01, 2009/10/02, 2009/11/03, 2010/07/04, 2010/10/05, 2010/12/06, 2010/12/29
High 3:  2009/01/01, 2009/02/01, 2009/10/02, 2009/12/03, 2010/08/04, 2010/11/05, 2010/12/06, 2010/12/30
Fig. 4.5 depicts the segmentation outputs for the six disadvantaged and advantaged neighbor-
hoods for the one year period in which messages were exchanged within the communities.
The segmentation algorithm was applied on each neighborhood separately to identify shifts
in topics. Segments identified from each neighborhood are aligned so that vertical ordinates
denote the same time point globally. The dashed vertical lines in each segmentation denote
the algorithm-picked boundaries. There is not a significant difference in segment durations
across the two classes of neighborhoods. The average length of segments from advantaged
neighborhoods is 3.24 months, whereas the average length of segments from disadvantaged
neighborhoods is 3.38 months. (Note that a segment features a collection of topics during
its time span, but this does not mean that all of these topics were discussed during the entire duration
of the segment.)
• Characterizing Topical Content of Segments
We used our inferred topic models to construct the dissimilarity matrix across neighborhood
segments using the approach described earlier. Dissimilarity between segments ranged from 0 to
4.43, where zero means that the two segments are identical.
If discussion topics within disadvantaged neighborhoods were substantively different from
topics within neighborhoods that have lower poverty levels, the divergence coefficient would
be significantly higher between advantaged and disadvantaged neighborhoods than it is
within neighborhoods that are similar in poverty. That is, we would expect topics within
neighborhoods of similar poverty level to be more similar to each other than they are with
neighborhoods that are substantively different in poverty.
Figure 4.6: Example clusters of discovered segments across neighborhoods.
- High3 [2009-01-01 - 2009-02-01], Low3 [2009-12-04 - 2010-08-04]: “Elementary and middle school (events and issues).”
- Low3 [2009-03-01 - 2009-11-01], High2 [2009-01-01 - 2009-05-01]: “Setup a neighborhood watch meeting and petitions (for commercial vehicles parking and saving a theater).”
- Low2 [2010-01-11 - 2010-09-11]: “Gunshot and a number of burglaries.”
- High2 [2009-05-02 - 2010-01-02]: “Intense discussion after an article appeared in the local newspaper asking people to vote for either closing a public library or increasing taxes to cover the expenses.”
- Low2 [2010-10-13 - 2010-12-13]: “Several dog attacks in the neighborhood, problems with the dog owner, and safety”
- Low3 [2009-01-28 - 2009-02-28], High2 [2010-10-05 - 2010-11-05]: “Lawyers posting messages related to legal issues (e.g. legal parking, units renting and soliciting legality).”
- High3 [2009-02-02 - 2009-10-02]: “Water infrastructure problems.”
Across neighborhoods, dissimilarity between segments ranges from 0 (identical) to 4.43; the mean
dissimilarity is 2.19 (SD = 1.09). The mean divergence coefficient between all discussion
topic pairs within communities that are low in poverty is 2.18 (SD = 1.09), ranging from
0.11 to 4.42. The average divergence between all neighborhoods low in poverty is not
significantly different from the average divergence of topics within neighborhoods low in
poverty (M = 2.11, SD = 1.01; one-way ANOVA, p > .05). Topics discussed within low poverty
neighborhoods are similar across all low poverty neighborhoods.
The average divergence coefficient between all topic pairs across all high poverty areas
ranges from 0.09 to 4.20 with a mean of 2.26 (SD = 1.09). Looking within high poverty
neighborhoods, the mean divergence is 2.35 (SD = 1.07), which is not significantly different
from the divergence between topics in similar high poverty areas (M = 2.21, SD = 1.10;
one-way ANOVA, p > 0.05). The variation in topics discussed within high poverty neighborhoods
is consistent across high poverty neighborhoods.
Comparing discussion topics in high and low poverty areas, divergence ranges from 0.20
to 4.43 with a mean divergence of 2.16 (SD = 1.09). There was no significant difference
between divergence within neighborhoods of similar poverty level in comparison to divergence
between neighborhoods of contrasting poverty (one-way ANOVA, p > 0.05). Consistent with
our hypothesis, the variation in topics discussed within advantaged and disadvantaged areas
is not statistically different from the variation in topics between areas of high and low poverty.
The range and nature of topics is the same in high poverty areas as was found in more
advantaged areas.
A flat clustering of segments reveals congruences as well as outliers. Fig. 4.6 depicts some
segments that were clustered together and the topics that contributed to their clustering. Outlier
segments are also shown in the figure. Non-outliers reveal common discussions about
topics.
For example, in Neighborhood 7 [2009-03-01 - 2009-11-01] and Neighborhood 4 [2009-01-01
- 2009-05-01], there were messages discussing the setup of a neighborhood watch meeting
and messages discussing a petition. The petition was for commercial vehicles parking in
Neighborhood 7 and in Neighborhood 4 it was to save a theater. In Neighborhood 5 [2009-
01-01 - 2009-02-01] and Neighborhood 6 [2009-12-04 - 2010-08-04], there were many
messages about elementary and middle school events and issues. On the other hand, outliers
reveal discussions about unusual topics. For example, in Neighborhood 3 [2010-01-11 -
2010-09-11], we found a lot of messages discussing a gunshot and a number of burglaries.
In this segment, a lot of messages discuss how to buy a gun or a dog. Another example is
Neighborhood 4 [2009-05-02 - 2010-01-02], which had an intense discussion after an article
appeared in the local newspaper asking people to vote for either closing a public library or
increasing taxes to cover the expenses. The last example is Neighborhood 3 [2010-10-13 -
2010-12-13], which had many messages discussing several dog attacks in the neighborhood,
problems with the dog owner, and safety.
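The within-group versus between-group comparison reported above can be sketched as follows. This is a hedged illustration, not the analysis code used in the study: the function names, the group labels, and the toy dissimilarity matrix are all hypothetical, and only the F statistic of a one-way ANOVA is computed. Obtaining a p-value requires the F distribution (for example via `scipy.stats.f_oneway`), which is left out to keep the sketch dependency-free.

```python
from itertools import combinations
from statistics import mean

def split_pairs(dist, groups):
    """Split pairwise dissimilarities by whether both segments
    come from neighborhoods with the same poverty label."""
    within, between = [], []
    for i, j in combinations(range(len(groups)), 2):
        target = within if groups[i] == groups[j] else between
        target.append(dist[i][j])
    return within, between

def f_statistic(*samples):
    """One-way ANOVA F statistic:
    between-group mean square divided by within-group mean square."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical toy input: four segments, two per poverty level.
groups = ["low", "low", "high", "high"]
dist = [
    [0.0, 1.0, 2.0, 2.5],
    [1.0, 0.0, 1.5, 2.0],
    [2.0, 1.5, 0.0, 3.0],
    [2.5, 2.0, 3.0, 0.0],
]
within, between = split_pairs(dist, groups)
f_value = f_statistic(within, between)
print(mean(within), mean(between), f_value)
```

In this toy matrix the within-group and between-group means coincide, so the F statistic is zero, which mirrors the qualitative pattern of no difference between the two kinds of pairs.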
Discussion
Here we address the divide in democratic engagement that exists between advantaged and disadvan-
taged communities. We look for evidence that the gap between high and low poverty communities,
in democratic participation and deliberation, is affected by the use of a social media intervention.
Specifically, we have argued that new communication technologies afford civic and civil behaviors
and informal deliberation in high poverty communities, similar to what is experienced in commu-
nities that are low in poverty. Our approach compares the adoption of a new technology across
neighborhoods of high and low poverty. We use a unique algorithm to:
• Detect differences in deliberation activity between neighborhoods with different poverty
levels.
• Detect whether there are more or fewer common discussion topics between communities with
different poverty levels.
We did not find significant differences between high and low poverty neighborhoods in terms
of either the length of discussion periods or the overall topics of discussion. In addition, we found
that the rate of adoption of a communication tool for participatory democracy was much higher than
would be expected based on established theories pertaining to the digital divide and concentrated
inequality. This is not the usual finding in studies of the digital divide, where lower socioeconomic
status populations typically have fewer opportunities to participate in public deliberation.
In the past, structural constraints internal to disadvantaged communities limited opportunities
for deliberation and democratic participation. Social technologies may make communication
possible where it was not before. One possible explanation, as to why social media may be such
an important tool for engagement among this population, may relate to the way these technologies
bring people together. Previous findings, that use of the Internet as an information tool has a
modest positive relationship to engagement for those who are already likely to be engaged [Shah
et al., 2005, Boulianne, 2009, Hargittai and Shaw, 2011], do not extend to the truly disadvantaged.
However, when the Internet is used as a social tool, a means of communication between people who
are “locally” embedded in existing social structures (even if those structures are loosely connected),
it affords social cohesion, discussion, and engagement. Technologies that facilitate communication
among a population that shares geography, or possibly other sources of affiliation, enable contact
that may previously have been desired, but was constrained by physical and structural barriers. It
may not be surprising that, when barriers to contact are reduced, we find that residents of high
poverty areas are as motivated to participate and deliberate about local issues as people of other
communities. If these findings are generalizable, the policy implications are significant. Ensuring
equal access to social media, across socioeconomic divides, has the potential to reduce persistent
inequalities in democratic engagement.
4.2.2 Digging into Historical Newspaper Archives using Dynamic Temporal
Segmentations over Topic Models
This work was done in collaboration with Michelle Seref (Department of English, Virginia Tech),
Tom Ewing (Department of History, Virginia Tech), Laura West (Department of History, Virginia
Tech), Naren Ramakrishnan (Department of Computer Science, Virginia Tech), and Bernice L. Hausman
(Department of English, Virginia Tech).
The 1918 influenza epidemic, which killed as many as 50 million people worldwide, has
long been recognized as one of the most deadly disease outbreaks in modern world history. This
epidemic occurred in the last months of the Great War, which always overshadowed, yet also shaped,
discussion of the threat of illness. Because this outbreak occurred at a time when newspapers pro-
vided extensive local reports while also communicating national and international news, historians
are interested in understanding how newspaper coverage of the influenza epidemic was shaped by
the war-time context, or what we might today call national security threats. For instance, previous
research has identified the American and Canadian public’s commitment to war efforts even in the
face of serious health threats. Crosby [Crosby, 1989] found that Liberty Loan drives continued to
be held in major U.S. cities during October, just as the epidemic was about to take hold. With the
greater availability of digitized newspapers, it is of immense interest to analyze ever increasing
collections of text archives to gain insight into news coverage and capture important periods in the
progression of the epidemic.
One of the key questions that historians would like to answer, as they dig through digital
archives, is: what are the key stages of progression in coverage of an issue or phenomenon? Which
stage occurred before which other, and do they correspond to known externalities or other factors?
Are there critical time points that establish compartmentalization over the full temporal course?
Such information, if automatically extracted using analytic techniques, can complement close
reading traditions familiar to humanists.
We demonstrate a successful application of our algorithm to archives of the Washington
Times. By studying the ebb and flow of ideas in the Fall of 1918 we illustrate how our segmentation
algorithm extracts important qualitative features of news coverage of the pandemic.
How Historians Currently Do Analysis: Perspectives on Rhetorical Research
Rhetorical and historical research are reiterative practices that spiral through a process that begins
with preconceptions, selects data based on these preconceptions, develops new frameworks as a
result of data analysis, and moves back to a rethinking of the initial preconceptions based on newly
developed frameworks and the knowledge that results from them. Preconceptions comprise prior
knowledge, theoretical frameworks, and existing research questions. Prior knowledge on any given
topic includes both what we know and what we think we need to learn before we can address the
research questions.
Here we use specific examples from our case study described later to elaborate general
points. With respect to prior knowledge, for example, we know that the second wave of influenza in
the United States passed through the eastern seaboard in the early fall of 1918. We also know that
this period coincides with the final months of World War I. Based on this prior knowledge and some
initial scanning of various newspapers, we chose to focus on The Washington Times, a daily paper
in Washington, DC, which at this time published an evening edition. We limited our analysis to
September, October, November, and December 1918.
Our research questions concern what is now thought of as “national security interests” but
which at the time would have been understood as the “war-time context.” We are interested to know
how newspaper coverage of the influenza pandemic was shaped by the war-time context, as well as
how coverage of the war was shaped by the influenza threat. These questions send us to the data
looking for the impact of context on the reporting of influenza.
Our theoretical frameworks are based in ideological analysis, semiotics (the study of sign
systems), narrative analysis, rhetorical analysis, and historical analysis. Ideological analysis addresses
politics, power, and social dynamics, including the analysis of gender, race, and class, and pays
attention to vectors of power as they are produced by particular social circumstances. Semiotics
studies sign systems, that is, the use of words and images to signify particular ideas or frameworks.
Semiotics is useful in studying advertising, as well as news journalism more generally. Narrative
analysis attends to recurrent themes, repeated word use, typical story lines or plots, and is useful in
identifying underlying patterns that are not evident at the literal level of textual content. Rhetorical
analysis pays special attention to genre and discourse use in specific situations and contexts. For
example, we have identified a number of forensic terms, such as “victim”, “investigate”, and
“suspected”, used to refer to influenza in addition to appearing in articles about crime on the front
pages of The Washington Times during this period. An understanding of the historical context
provides the basis for all of these language-oriented interpretations.
At this point in our research, we are only working with context analysis in order to determine
the timeline of events which occurred once the flu hit a particular region and became an epidemic.
Doing so allows us to calibrate the “manual analysis” with the algorithmic elements of the research.
An example of context analysis would pay attention to the overall concerns as exhibited on the
front pages of papers during this period. Specifically, in every paper in October in which influenza
appears on the front page, the banner headline is nevertheless about the war. In addition, October
was the last month of the fourth Liberty Loan drive, which was undersubscribed until close to
the end of the war. These concerns are interwoven with those about the influenza epidemic in
Washington, given fears about crowds and contagion.
In order to read and analyze articles from The Washington Times on influenza during this
period, we needed to decide how to select appropriate issues of the newspaper. We did a keyword
search in the Chronicling America database (described in detail later) exclusive to The Washington
Times between August and December 1918, using the terms, “grip”, “grippe” and “influenza.” A
quick scan of the resulting issues determined that most articles of interest were on front pages, so
we made an initial decision to exclude non-front-page articles from the analysis. We found that uses
of these terms that were not on front pages tended to be advertisements or in articles continued from
the front page. We altered our initial decision to include August when we discovered that there was
only one instance of the use of “influenza” in that month, and it was in an advertisement.
Historical and rhetorical analysis depends on close reading of data. When we read, we
look for patterns (i.e. repetition) of word use, topics, and themes. In rhetoric, this practice can be
systematically applied as “coding.” We look for both expected and unexpected patterns of usage.
Our expectations are based on prior knowledge and our theoretical frameworks, which tell us what
we think we will find. Thus when we find such information in our data, we note it. However, we
also pay attention to findings that we do not expect – what seems unusual or contradictory to what
we think we know. Unexpected findings might be words that we didn’t think to search for. An
example is “flu”, which in these articles and titles is always placed within quotation marks.
We are still not sure what to make of this finding. We also look at the placement of articles on the
page. In addition, we often have to conduct new research to make sense of findings whose meaning
is not entirely clear to us. For example, we are currently investigating the extent of newspaper
censorship during this period, given that most of the coverage of the flu in October 1918 seems to
be very local to Washington.
To analyze our findings once they have been determined from the data, we use theory and
prior knowledge as frameworks to narrate our explanations. Analysis must account for both the
expected and the contradictory or new information. Analysis creates new narratives bringing latent
elements from the data to the level of manifest content. Rhetorical analysis pays special attention
to the contexts of discourse and the influence of context on reception, understanding, and use.
How do people use the discourses available to them to make arguments, explain things, or justify
themselves? What is the purpose of specific forms of discourse use and are they successful or not?
How do unintended meanings (ideology) make their way into utterances and written discourse and
what are their modes of circulation and influence? These are the questions we seek to answer using
the segmentation algorithm.
In this work we used the Chronicling America dataset (described in 2.2). We created two projections
(sub-datasets) of this collection and applied the segmentation algorithm to each. The first projection
consists of The Washington Times front pages, and the second consists of influenza paragraphs
extracted from the same newspaper. The focus of this work was on The Washington Times for the
period from September 1918 to December 1918.
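As a rough illustration of how two such projections could be derived, the Python sketch below filters a list of page records. The record layout (`date`, `page_number`, `text`), the paragraph delimiter, and the function names are hypothetical, and reusing the search terms “grip”, “grippe” and “influenza” to select paragraphs is an assumption rather than a documented step of the extraction.

```python
import re

# Search terms quoted in the text; using them for paragraph extraction is an assumption.
TERMS = re.compile(r"\b(grip|grippe|influenza)\b", re.IGNORECASE)

def front_pages(pages):
    """Projection 1: keep only front-page records (page_number == 1)."""
    return [p for p in pages if p["page_number"] == 1]

def influenza_paragraphs(pages):
    """Projection 2: paragraphs mentioning any search term, tagged with the issue date."""
    selected = []
    for page in pages:
        # Assumes paragraphs are separated by blank lines in the page text.
        for paragraph in page["text"].split("\n\n"):
            if TERMS.search(paragraph):
                selected.append({"date": page["date"], "text": paragraph.strip()})
    return selected

# Hypothetical toy records (not actual newspaper text).
pages = [
    {"date": "1918-10-02", "page_number": 1,
     "text": "All public schools closed.\n\nSpanish influenza spreads in the city."},
    {"date": "1918-10-02", "page_number": 2,
     "text": "Advertisement: a remedy for the grip."},
]
front = front_pages(pages)
paragraphs = influenza_paragraphs(front)
print(len(front), paragraphs)
```

Each projection can then be passed to the segmentation algorithm as its own time-stamped corpus.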
To decide whether our segmentation approach reveals important insights, we compared its
output with a manual analysis conducted by a group of three historians. The goal of the study was
to understand the event timeline and obtain a conceptual understanding of the coverage of influenza
in The Washington Times during this period. The manual analysis steps involved identifying the
sequence of events in Washington following the outbreak of flu in late September through the end
of the epidemic in late October. We follow the discussion of influenza, policies to close schools,
theaters, and churches, and other public health decisions in the city. Our goal is to see if topic
modeling and segmentation can provide results that mirror the analysts' traditional manual analysis
of the papers, i.e., actual reading and interpretation. The event timeline created from manual
rhetorical and historical analysis thus far is shown in Table 4.1.
Table 4.1: Event Timeline created from Front Pages of The Washington Times (1918).
September 1918
Sept 11   Reports of influenza in Boston
Sept 19   Believe germs spread by German submarines; hospitals quarantined
Sept 20   First day disease is, “discovered,” in DC; first, “fatal case” reported; believe source is from NY
Sept 21   Believe, “the ailment can be cured”; university student quarantined in Chicago
Sept 24   Boston schools closed, “until disease is stamped out”
Sept 25   Soldiers to wear “anti-grip masks”
Sept 26   Gauze masks given to soldiers in DC
Sept 27   Senator of Mass asks for $1 million, “appropriated to fight the spread”
Sept 28   Speaker of the House and Majority Leader get disease; $1 million joint resolution to, “fight Spanish Influenza epidemic” to be, “rushed,”

October 1918
Oct 1     Plague closes 6 schools in Virginia (Alexandria county)
Oct 2     All public schools closed indefinitely due to epidemic; stores to open at 10am starting Oct. 3; changes to working hours of government employees in order to relieve congestion in public transportation
Oct 3     Private schools asked to close
Oct 4     Churches and playgrounds closed; theaters, motion-picture houses, and dance halls closed; indoor assemblies, “public menace”; congressional and public libraries and Corcoran gallery closed; 175,000 cases of influenza in U.S. (mentions the Spanish influenza sweeping through big cities)
Oct 5     Freight service into Washington crippled and passenger service is threatened with curtailment because railroad workers sick; officials consider closing GW University; churches plan open air meetings
Oct 6     Sudden increase in spread of disease
Oct 8     Continued increasing number of deaths; end of Liberty Loan rallies, religious services, and all meetings of all kinds
Oct 10    25,000 gauze masks to be distributed among government employees the next day
Oct 11    Commissioner orders landlords in DC to furnish heat; all government depts. to close the following day in order that employees may buy Liberty loan subscriptions at banks
Oct 12    US PHS and DHD set up stations in city for influenza sufferers; war workers are barred from entering Washington
Oct 14    96 flu deaths in 24 hours - biggest toll recorded yet; warrants issued for lunch counters where glasses not properly cleaned; plan to rearrange lunch hours of government employees
Oct 15    Inspectors of PHS make circuit of city, directing barbers, dentists, and elevator, “girls and men” to “get gauze masks”; all people urged to wear masks
Oct 16    Mansion opened for “girl war workers” to recuperate once released from hospital
Oct 17    No new war clerks allowed to enter DC; supply of gauze masks exhausted (50,000 given out); instructions for making masks included
Oct 18    Cartoon: “Closed to Prevent Spread of Pan-German Influence Plague”; large increase in number of deaths wipes out hope of epidemic reaching stationary point; new influenza hospital; need 70 more nurses
Oct 19    Health department in Chicago reported to announce vaccination against pneumonia for all citizens; flu deaths show decline
Oct 20    Gas masks help keep nurses from being infected with influenza; epidemic has reached peak
Oct 21    Epidemic receding
Oct 22    Sudden jump in deaths due to influenza, but increase seen as temporary
Oct 23    Epidemic abates among civilians; theaters, schools, and churches to reopen
Oct 24    Churches and theaters to reopen next week; decrease in deaths
Oct 25    New cure for TB (collapsing lung)
Oct 26    Flu postpones murder trial (not enough jurors), a prominent flu victim dies, and a “crazed mother kills her babies” in Connecticut
Oct 27    Pneumonia vaccine saves 10,000 troops
Oct 29    Churches to open Friday, theaters on Monday; public school terms may be extended

November 1918
Nov 3     Army officials believe influenza epidemic “under control”; 290,000 drafted
Nov 7     War is over
Nov 16    200,000 soldiers in camps to be demobilized
Nov 21    German fleet surrenders to US, France, and Britain; Wilson attends peace conference; Col. E. M. House, a US representative at the conference, is suffering from influenza
Nov 30    300,000 soldiers expected to come home each month; former Kaiser reported to be ill with influenza; manufacturing of beer and wine ceases tomorrow

December 1918
Dec 6     Bolshevist Revolution spreading over Germany
Dec 8     Reprise of influenza outbreak in San Francisco, city masked today; ex-Kaiser William of Germany to be placed on trial at Versailles
Dec 9     25,000 cases of influenza reported in Asunción, Paraguay
Dec 10    Washington school officials consider opening schools on Saturdays to make up lost days when closed because of influenza pandemic; martial law declared in Berlin
Dec 12    Occupation of German territory completed
Dec 19    Former Emperor Karl and children ill with influenza
Dec 20    American league umpire in Boston dies of influenza
Findings
We now outline comparisons between the segmentations discovered by the segmentation
algorithm and the manual analysis.
• Modeling front pages of The Washington Times
The topic modeling with segmentation of front pages of The Washington Times from September
through December 1918 demonstrates that the war is the main contextualizing topic throughout
this entire period. Segmentation output is shown in Fig. 4.8. The segment from 9/22 to
10/21 roughly corresponds with the outbreak of the flu in Washington and its most virulent
epidemic period. The period from 10/21 to 11/05 corresponds with the intense negotiations
about the armistice. The war was reported to end on 11/7, but in actuality the full armistice
occurred on 11/11.
Figure 4.7: Segmentation results for The Washington Times Influenza paragraphs from September 1918 to December 1918.
Panel title: Topic Clouds - Segmentation View.
Segment boundaries: 1918-09-16, 1918-09-23, 1918-10-22, 1918-10-30, 1918-11-28.
Segment annotations shown in the figure:
- Some reports on Influenza cases
- The epidemic develops and reaches its peak - schools, churches, and theaters close
- The epidemic is waning - reopening schools, churches, theaters
- Symptoms of influenza discussed
Figure 4.8: Segmentation results for The Washington Times front pages from September 1918 to December 1918.
Panel title: Topic Clouds - Segmentation View.
Segment boundaries: 1918-09-1, 1918-09-22, 1918-10-21, 1918-11-05, 1918-12-04.
Segment annotations shown in the figure:
- Some discussions about war and Influenza cases in Washington
- Outbreak of the flu in Washington - this is the most virulent epidemic period
- Intense negotiations about the armistice
- Reports about the ending of war
• Modeling paragraphs with Influenza
The segmented topic modeling of influenza paragraphs on the front pages of The Washington
Times tracks the influenza epidemic well. Segmentation output is shown in Fig. 4.7. The first
segment, 9/16-9/23, concerns outbreaks in other cities and the first outbreak of influenza in
Washington on 9/20. The second segment, 9/23-10/22, concerns the epidemic as it develops
until its peak and then waning, which occurred 10/22. The following segments (10/22-10/30
and 10/30-11/28) demonstrate the waning epidemic and its aftermath, including reopening
of schools, churches, and theaters. By the last segment, 11/28-12/20, the flu is not really a
viable topic, as those topic clouds that were developed are simplistic, contain a small number
of terms, and are largely the same.
The topic clouds in the first segment name the first victim (Henne), although they do not
include place names (like New York and Chicago) that were mentioned in articles. Four of
the five topic clouds have the word “Spanish” in them, referring to the “Spanish Influenza.”
We see indication of the first reported cases in Washington, DC, in another cloud.
The month-long segment from 9/23 to 10/22 covers the main period of influenza epidemic in
Washington, DC, and includes the following clearly delineated topics: (1) closing of schools,
churches, and other public meeting places, as well as government offices, (2) health care
facilities, personnel, and treatment, (3) economic concerns of the war in conjunction with
influenza, i.e., the Liberty Loan drives, and (4) reporting that describes the number of cases
in the district, usually in a 24-hour period. There is one cloud with an unclear topic: it
seems to include preventive measures but also indicates more global concerns (mention of
Germans and the world). We think that this topic cloud may not really be a unique or separate
topic. Also, in this segment we did not notice a discussion of “masks,” although there is a
discussion of gauze masking in the newspaper. Indeed, the word “mask” does not appear in
any of the clouds.
The next segment is only a week long, from 10/23 to 10/30, and seems to correspond with
the immediate waning of the epidemic and the reopening of closed schools, churches and
theaters. We can detect five separate topics in the clouds, yet it was harder to determine these
than in the previous segment. The newspaper reported the epidemic receding on October
21, with a sudden jump in deaths the following day, which were nevertheless interpreted as
not indicating a change in the reduction of the illness’s spread. Reporting on October 23
indicates that the epidemic is receding and at the very end of the month the public buildings
and meeting places are to be reopened. The topics we identified include (1) time, which we
think references the future reopening of schools, churches, and theaters, (2) sports and leisure,
with some mention of army camps, (3) government and administration, which may have to do
with the return of normal working hours, (4) reporting of cases and the continued reopening
of public venues, and (5) a general discussion of flu in the city.
In the segment from 10/30 to 11/28 it is very difficult to determine separate and unique topics
for the clouds. There does not seem to be either commonality or distinction. We think that
there may not really be five topics, but actually perhaps just one or two. Preventative themes
are prevalent in at least two clouds. There also seems to be mention of things starting up again.
The following segment, from 11/28 to 12/20, really does not seem to have more than one
conversation, if that, as the clouds are minimal and uninformative. The segment following,
12/20-12/30, has only two words, “year” and “influenza,” in the entire grouping of five clouds,
two of which have only one word (year). These diminishing topic modeling cloud groupings
indicate that the flu is truly becoming less of a topic of conversation on the front page of The
Washington Times during this period.
Discussion
Our results demonstrate that using the segmentation algorithm enabled us to clearly follow the rise,
height, and fall of the influenza epidemic in Washington, DC, as reported in The Washington Times
throughout the fall of 1918. The segmentation strategy appears to be most successful in capturing
conversations during the period in which there was the most reporting on the epidemic, i.e., the
period of its greatest virulence and spread in the city, 9/23-10/22. Before and after this period, the
topic modeling clouds are more difficult to interpret, but the segmentation does seem to clearly
follow events.
4.3 Summary
In this chapter, we presented a time series segmentation algorithm that segments time based on shifts
in topics. We applied the algorithm to two different datasets: a historical newspapers dataset and
the i-Neighbors dataset. The algorithm served different purposes in these applications.
In the historical newspaper archives application, the goal was to understand
the progression in coverage of the 1918 influenza in The Washington Times newspaper. In this
application, the algorithm was successful in detecting the ebb and flow of ideas and in extracting
qualitative features of the news coverage of the pandemic.
In the i-Neighbors application, our goal was to study the conversation patterns in advantaged
and disadvantaged communities and how new communication technologies changed behaviors
and informal deliberation across neighborhoods. The algorithm was successful in capturing the
similarities between neighborhoods and the length of time they spent discussing topics.
Through these two applications, we showed that the algorithm was of great assistance to
experts working on these projects. We evaluated the segmentation algorithm qualitatively,
with experts closely examining the algorithm output.
Chapter 5
New Visual Analytic Representations
Data mining algorithms have evolved greatly over the past years, especially for topical text model-
ing [Blei et al., 2003]; however, capturing key breakpoints in topic evolution and defining appropriate
visual representations for such breakpoints is an understudied problem.
On the other hand, one of the most frequently used visualization tools for topics is the tag
cloud. Tag clouds are visualizations for keyword groupings (topics), where the size of each keyword
represents its relative importance within a topic. Unlike ThemeDelta, tag clouds offer no visual
method to track topic evolution and the scattering and gathering of keywords. Others, like
ThemeRiver [Havre et al., 2002], use streamgraphs to visualize topics across time. Streamgraphs do
not show the scattering and gathering of keywords into topics (trends), but they are capable of
conveying the structural features of the corpus. Although much research exists in data mining and
visualization, we posit that existing techniques are insufficient to address the needs of these
emerging applications.
ThemeDelta, a new temporal topic modeling approach, will be presented here. The main
difference between ThemeDelta and existing approaches is that it can automatically identify segments
where significant topic shifts occur. To capture topic shifts, we embed a temporal segmentation
algorithm around a topic modeling algorithm, as discussed in the previous chapter. We use the
width of each trend line to communicate the prominence of its trend in the dataset at a particular
time, and its color to communicate category or overall weight of the trend. A heuristic layout
technique calculates the order of trends for each timestamp while minimizing the number of trend
line crossings. Interaction techniques allow for highlighting individual trend lines, changing the
layout order, and drilling down into the data. Figure 5.1 shows a ThemeDelta visualization for
Barack Obama’s speeches during the U.S. 2012 presidential election campaign. Green lines are
shared terms between Obama and Romney. The figure shows the scattering and gathering of
keywords to form trends.
[Figure 5.1: trendline visualization; keyword labels omitted. Time segments (2011-2012): Nov 07 - Jan 02 (8 weeks), Jan 03 - Feb 28 (8 weeks), Feb 29 - Apr 04 (5 weeks), Apr 05 - May 31 (8 weeks), Jun 01 - Jun 15 (2 weeks), Jun 16 - Aug 11 (8 weeks), Aug 12 - Aug 19 (1 week), Aug 20 - Sept 10 (3 weeks), Sept 11 - Sept 17 (1 week).]
Figure 5.1: ThemeDelta visualization for Barack Obama campaign speeches during the U.S. 2012 presidential election (until September 10, 2012). Green lines are shared terms between Obama and Romney. Data from “The American Presidency Project” at UCSB (http://www.presidency.ucsb.edu/).
5.1 ThemeDelta Overview
ThemeDelta is intended to convey local and global temporal changes in the distribution of evolving
trends. The system detects and visualizes how different trends converge and diverge into groupings
at different points in time, as well as how they appear and disappear during a time period. The
system consists of two major components: a backend data analytics component, and a frontend
visualization component:
• The analytics backend is responsible for accepting a large temporal text corpus and
automatically identifying segments that characterize significant shifts of coverage. The algorithm
responsible for this task was originally developed to detect deliberation in social messages [Gad
et al., 2012] and was presented in the previous chapter.
• The visualization frontend is responsible for graphically representing the discovered trend-
ing of topics. While originally designed for timestamped text collections, we see many
additional applications such as for genealogy (e.g., [Kim et al., 2010]), communication graphs
(e.g., [Elmqvist and Tsigas, 2003]), and general dynamic graphs.
5.1.1 Data Format
The backend accepts a text dataset consisting of timestamped data. The frontend takes the output of
the backend for visualization. The backend output consists of trends, topics (groups of trends at a
specific point in time), and segments (a closed interval of time, modeled as a group of topics).
The exact mapping of these general concepts to a dataset is domain-specific. For example, for
a timestamped document collection, trends could represent the terms extracted from the documents,
and the topics would model how these terms converge into groupings at different points in time.
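The three concepts above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, the `weight` field, and the example weights are hypothetical and are not taken from the ThemeDelta backend, which was written in Java.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Trend:
    """One trend, e.g. a term extracted from the documents."""
    name: str
    weight: float = 1.0  # hypothetical per-segment weight


@dataclass
class Topic:
    """A group of trends at a specific point in time."""
    trends: list[Trend] = field(default_factory=list)


@dataclass
class Segment:
    """A closed interval of time, modeled as a group of topics."""
    start: date
    end: date
    topics: list[Topic] = field(default_factory=list)

    def trend_names(self) -> set[str]:
        """All distinct trend names that occur in this segment."""
        return {t.name for topic in self.topics for t in topic.trends}


# Illustrative segment; the same trend may appear in more than one topic.
segment = Segment(
    start=date(2011, 11, 7),
    end=date(2012, 1, 2),
    topics=[
        Topic([Trend("energy", 0.4), Trend("job", 0.6)]),
        Topic([Trend("tax", 0.7), Trend("energy", 0.3)]),
    ],
)
```

A list of such `Segment` objects, ordered by time, would be one plausible shape for the backend output that the frontend consumes.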
5.1.2 Implementation
ThemeDelta is a web-based application that is built to run in any modern web
browser. The backend of ThemeDelta was built using Java, and this includes all data
preprocessing, clustering, and segmentation.
The frontend was built using JavaScript and SVG. The current implementation is fully
interactive and animated, and is built using the Raphaël¹ toolkit for scalable vector graphics.
Figure 5.2: Basic visual representation used by ThemeDelta.
5.2 ThemeDelta: Visual Representation
ThemeDelta’s visual representation draws on TextFlow [Cui et al., 2011], and uses a basic visual
encoding consisting of sinuous trendlines, each representing a trend in the dataset, stretching
from left to right along a timeline mapped to the horizontal axis (Figure 5.2). The horizontal space
along this axis is divided equally among different time segments (t1, t2, and t3 in Figure 5.2). Topics
for each segment are perceptually conveyed by clustering the trendlines for the grouped trends
next to each other along the vertical axis, leaving a fixed amount of empty vertical space between
adjacent topics. Vertical lines, one for each time segment, mark the horizontal positions of the
segments.
5.2.1 Visual Design
Given this basic design, many design parameters remain open. Below we review the most important
of these and motivate our decisions for the visualization technique. A visualization developer
using the same basic visual representation may make different choices than these depending on the
application.
¹ http://raphaeljs.com/
Shape. To communicate the organic nature of evolving trends, we use splines to yield
smooth curves. The resulting lines are continuous, predictable, and appealing. An alternative
design would have used rectilinear or sharp angles, but curves are likely easier to perceive and more
aesthetically pleasing.
Thickness. Trendline thickness is a free visual variable. While it is possible to use a uniform
thickness for all trendlines, it can also be used to convey scalar data for each time segment. Because
increasing thickness will raise the visual salience of a trendline, we tend to use it to convey the
weight of each keyword calculated by our segmentation algorithm.
Furthermore, our visual representation uses vertical dashed lines to partition time segments
on the visual space. The thickness of these lines is another free variable that can, e.g., be used
to indicate the relative extent of each time segment. This is useful since time segments may be
irregular; some segments are significantly longer than others.
Color. Color is another free parameter in our visual representation, and can convey either
a quantity (using a color scale) or a category (using discrete colors). The choice depends on the
application. For example, we use it both to show the strength of a correlation, as well as to convey
which entity class a particular trendline belongs to.
Discontinuities. Trendlines can begin and end at any time segment, sometimes only to
reappear later in time. We communicate this using a tapered endpoint of the line (see borders in
Figure 5.2). An alternative design could have dashed the trendlines for the periods of time where
there is no associated value, similar to the use of different trend shapes in TextFlow [Cui et al.,
2011]. We chose to avoid this to minimize visual complexity.
Labels. We draw the names of each trendline on the line itself for each time segment. While
this is redundant (one instance of the label is sufficient) and potentially a source of visual clutter, it
prevents the user from having to trace an undulating trendline back to a single label at the far end of
the visualization. We also scale the label size based on the trendline’s thickness, similar to word
scaling in word clouds.
Duplicated Trends. Sometimes a trend may exist in more than one topic for a particular
time segment (see trend A at time t2 in Figure 5.2). To make the visualization consistent, as well as
to convey the fan-out, we are forced to fork the trendline into two or more pieces. Analogously, in a
time segment following a duplicated trend instance, the trendlines should be merged to maintain
consistency. In situations when there is more than one candidate to fork from or merge to, we
choose the two trend instances that are vertically closest to each other (see the layout algorithm
discussed below).
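The fork and merge rule can be illustrated with a short sketch. It assumes that each instance of a duplicated trend is described only by its vertical position in a segment; the function name `closest_pair` and the numeric positions are hypothetical.

```python
from itertools import product


def closest_pair(prev_positions, next_positions):
    """Pick the (previous, next) pair of vertical positions with the smallest gap.

    prev_positions: positions of the trend's instances in one time segment.
    next_positions: positions of its instances in the following segment.
    Ties are resolved in favor of the first pair encountered.
    """
    return min(
        product(prev_positions, next_positions),
        key=lambda pair: abs(pair[0] - pair[1]),
    )


# A trend duplicated at positions 0 and 5 merges into a single instance at 4:
# the instance at 5 is the closest, so that is where the line is joined.
pair = closest_pair([0, 5], [4])
```

The remaining instances would then be attached to the chosen line as additional branches.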
5.2.2 Interaction
Several interaction techniques are meaningful for the ThemeDelta frontend. First of all, geometric
zoom and pan allow the user to magnify a certain part of the visualization to see details.
Furthermore, hovering over a trendline will highlight the line, including all of its branches in other
parts of the visualization (even past a discontinuity). Figure 5.1 shows this interaction, where
the trend lines associated with the keyword energy are highlighted in response to a mouse hover
interaction.
The interface also supports searching for trendlines by name. In addition, we provide a
combined filtering and resorting operation. Clicking on a trendline will add it to a filter box, causing
the layout to be recomputed with the selected trendline at the top of the screen. The new layout
will only include trendlines that are connected to the selected trendline, i.e., which in at least one
time segment belong to the same topic as the selected trendline. Following the example presented
in Figure 5.1, clicking on the trend line for the keyword energy performs the filtering operation,
and the visual layout is changed such that the filter keyword is positioned at the top (Figure 5.3).
Additional trendlines can be added to the filter box, yielding a conjunctive filter (only trendlines
which are connected to all selected trendlines are shown).
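The conjunctive filter can be sketched as a set intersection. This is a minimal illustration, not the frontend's JavaScript code: segments are assumed to be lists of topics, each topic a set of trend names, and keeping the selected trendlines themselves visible is an assumption of this sketch.

```python
def connected(segments, trend):
    """Trends that share a topic with `trend` in at least one time segment."""
    linked = set()
    for topics in segments:
        for topic in topics:
            if trend in topic:
                linked |= set(topic)
    linked.discard(trend)
    return linked


def conjunctive_filter(segments, selected):
    """Trends to display: those connected to all selected trends, plus the selection."""
    if not selected:
        # No filter set: every trend is shown.
        return {t for topics in segments for topic in topics for t in topic}
    shown = set.intersection(*(connected(segments, s) for s in selected))
    return shown | set(selected)


# Two time segments; "job" appears in two topics of the second segment.
segments = [
    [{"energy", "tax"}, {"job", "health"}],
    [{"energy", "job"}, {"tax", "care", "job"}],
]
one = conjunctive_filter(segments, ["energy"])
two = conjunctive_filter(segments, ["energy", "tax"])
```

With one selected keyword the filter keeps every trend that ever shares a topic with it; each additional keyword can only shrink the result.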
[Figure 5.3: filtered trendline visualization; keyword labels omitted. Time segments (2011-2012): Nov 07 - Jan 02 (8 weeks), Jan 03 - Feb 28 (8 weeks), Feb 29 - Apr 04 (5 weeks), Apr 05 - May 31 (8 weeks), Jun 01 - Jun 15 (2 weeks), Jun 16 - Aug 11 (8 weeks), Aug 12 - Aug 19 (1 week), Aug 20 - Sept 10 (3 weeks), Sept 11 - Sept 17 (1 week).]
Figure 5.3: ThemeDelta visualization after performing a filtering operation, based on the keyword “energy”, in the visualization presented in Figure 5.1.
5.2.3 Layout
The ThemeDelta frontend visualization layout divides the available horizontal space equally between
time segments, while vertical space is divided locally between the topics associated with each time
segment. Due to the ever-changing topic groupings over time as well as the dynamic appearance
and disappearance of trends, it is typically not possible to represent trends as straight lines. In fact,
a single trend could appear in a different topic and at a different vertical position with each new
time segment. This is the reason for using smooth splines to convey this organic trend evolution.
Of course, this in turn means that trendlines will frequently cross one another while con-
necting the multiple occurrences of a single term across different time segments. Research in graph
drawing has shown that the ease with which a user can follow an edge depends on the number of
crossings with other lines in its path [DiBattista et al., 1998].
Tanahashi and Ma [Tanahashi and Ma, 2012] discussed a set of layout design principles
for better legibility of storyline visualizations like ThemeDelta. However, the complexity of their
algorithm makes it difficult to achieve real-time layout updates. Liu et al. [Liu et al., 2013]
trade off layout optimality against algorithm performance to achieve real-time
updates. The algorithm used for ThemeDelta is similar to the one proposed by Liu et al. [Liu et al.,
2013]. However, unlike in their setting, we do not have hierarchical relationships in the
underlying data. In addition, to facilitate the identification of individual topics by keeping a
constant, reasonable vertical space between them, the layout algorithm used in this chapter does
not perform the topic alignment step. Moreover, to achieve real-time interactivity, our
implementation minimizes line crossings through a single iteration across the time segments.
In particular, ThemeDelta relies on a deterministic layout algorithm that minimizes trendline
crossings by first sorting the vertical positioning of different topics, followed by sorting the trends
within each topic. While sorting topics at a particular time segment ti, a topic p1 is placed before
another topic p2 if the average vertical position of the trends contained in topic p1 is less than
that of the trends contained in topic p2 at the previous time segment ti−1. Topic position in the
first time segment is determined either randomly or using some attribute of the underlying data.
After sorting topics it is time to sort the trends within each topic. Except for the first time
segment, trends within a topic are sorted such that their relative vertical position remains the same
as it was in the previous time segment. Once all trends are sorted, the trends contained within topics
of the first time segment are sorted such that their vertical position remains the same as in the second
time segment.
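The two sorting passes described above can be sketched as follows. This is a simplified reading of the description rather than the ThemeDelta implementation: segments are assumed to be lists of topics, each topic a list of trend names, and two choices are assumptions of this sketch (a duplicated trend takes the position of its first instance, and trends or topics with no counterpart in the reference segment are placed last).

```python
INF = float("inf")


def positions(segment):
    """Map each trend to its vertical slot; the first instance wins for duplicates."""
    pos, slot = {}, 0
    for topic in segment:
        for trend in topic:
            pos.setdefault(trend, slot)
            slot += 1
    return pos


def sort_segment(segment, ref):
    """Sort topics by mean reference position, then trends within each topic."""
    def topic_key(topic):
        known = [ref[t] for t in topic if t in ref]
        return sum(known) / len(known) if known else INF

    topics = sorted(segment, key=topic_key)  # stable: ties keep input order
    return [sorted(topic, key=lambda t: ref.get(t, INF)) for topic in topics]


def layout(segments):
    """Single pass over the time segments, then fix up the first segment."""
    segments = [[list(topic) for topic in seg] for seg in segments]
    for i in range(1, len(segments)):
        segments[i] = sort_segment(segments[i], positions(segments[i - 1]))
    if len(segments) > 1:
        # First segment: keep its topic order, reorder trends to match segment two.
        ref = positions(segments[1])
        segments[0] = [sorted(topic, key=lambda t: ref.get(t, INF))
                       for topic in segments[0]]
    return segments


t1 = [["A", "B", "C"], ["D", "E", "F"]]
t2 = [["E", "F"], ["C", "A"], ["D", "B"]]
result = layout([t1, t2])
```

In this example the topics of the second segment are reordered by the mean position of their trends in the first segment, after which the trends inside the first segment's topics are reordered to follow the second segment.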
[Figure 5.4: three panels, each showing trends A to F across time segments t1 and t2. (a) Without sorting. (b) Topic sorting only. (c) Topic plus term sorting.]
Figure 5.4: Comparison of different stages of the layout sorting algorithm used for the ThemeDelta technique.
Figure 5.4 shows the progressive decrease in the number of trendline crossings at different
stages of the layout. In Figure 5.4(a) the dataset is visualized without any sorting. This results in
a total of twelve crossings between trendlines, connecting multiple occurrences of terms across
the two time segments t1 and t2. Figure 5.4(b) shows the resulting layout after topic sorting. As
shown in the figure, the topics within time segment t2 are now ordered based on the average vertical
position of their corresponding terms within time segment t1. This ordering of topics has reduced
the number of line crossings from 12 to 6. Finally, Figure 5.4(c) shows the resulting layout after
trend sorting. Here again it is evident that the number of line crossings is reduced even further. All
in all, as a result of the layout algorithm, the number of trendline crossings has been reduced from
12 to 2.
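Crossing counts like those quoted above can be obtained by counting order inversions between two adjacent time segments. The sketch below makes simplifying assumptions: each trend has exactly one vertical position per segment (no duplicated trends), and each trendline moves monotonically between its two positions, so two lines cross exactly when their vertical order flips. The function name and inputs are hypothetical.

```python
from itertools import combinations


def count_crossings(left, right):
    """Count pairs of trendlines whose vertical order flips between two segments.

    left, right: dicts mapping trend name -> vertical position in two adjacent
    time segments. Trends missing from either segment are ignored.
    """
    shared = [t for t in left if t in right]
    return sum(
        1
        for a, b in combinations(shared, 2)
        if (left[a] - left[b]) * (right[a] - right[b]) < 0
    )


left = {"A": 0, "B": 1, "C": 2}
reversed_order = {"A": 2, "B": 1, "C": 0}
one_swap = {"A": 0, "B": 2, "C": 1}
```

Evaluating this count before and after each sorting stage gives a simple way to check that a layout change actually reduces crossings.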
[Figure 5.5: trendline visualization; keyword labels omitted. Time segments (2011-2012): Jul 29 - Sept 23 (7 weeks), Sept 24 - Nov 19 (7 weeks), Nov 20 - Jan 15 (7 weeks), Jan 16 - Feb 20 (4 weeks), Feb 21 - Apr 17 (5 weeks), Apr 18 - May 02 (2 weeks), May 03 - Jun 28 (7 weeks), Jun 29 - Aug 03 (5 weeks), Aug 04 - Aug 14 (1 week).]
Figure 5.5: ThemeDelta visualization for Mitt Romney campaign speeches for the U.S. 2012 presidential election (as of September 10, 2012). Green lines are shared terms between Obama and Romney speeches. Data from the American Presidency Project at UCSB (http://www.presidency.ucsb.edu/).
5.3 Domain Specific Applications
5.3.1 U.S. 2012 Presidential Campaign
Political speeches, especially during an election campaign, are particularly interesting document
collections to analyze because the political discourse tends to change and evolve as different
candidates respond to and challenge each other over the course of the campaign. Visualizing the
speeches of different candidates allows for comparing the trends of each candidate with those of
the others. To study such effects, we used the U.S. 2012 presidential election campaign speeches.
The U.S. presidential election takes place every four years (starting in 1792) in November
(the 2012 election day was November 6), and is an indirect vote on members of the U.S. Electoral
College, who then directly elect the president and vice president. In 2012, the conventions of the
Republican and Democratic parties (the two dominant parties, representing conservative vs. liberal
agendas) were held in the weeks of August 27 and September 3, respectively. The two opposing
candidates were Republican nominee Mitt Romney and Democratic nominee Barack Obama (incumbent
President of the United States). The ThemeDelta visualizations for both candidates are shown in
Figures 5.1 and 5.5.
In collecting data for the United States presidential election, we used campaign speech
transcripts for both candidates, first presented in Section 2.3. For Mitt Romney, we used transcripts
from 46 speeches over a 62-week period: from announcing candidacy on July 29, 2011, to August 14, 2012.
This corpus included speeches from both the Republican primary election (settled on May 14, 2012
as the main competing nominee Ron Paul withdrew) and the presidential campaign that followed. For
Barack Obama, we used transcripts from 40 speeches over a 44-week period: November 7, 2011 to
September 17, 2012.
Visualizations of the two candidates Barack Obama and Mitt Romney are shown in Figure 5.1
and Figure 5.5. Trendlines in both visualizations represent characteristic keywords that each
candidate uses as a theme in his speeches. Democratic trendlines are colored blue, Republican ones
are red, and trendlines for keywords that both candidates share are green.
For the Romney dataset (Figure 5.5), there is a clear impact of time on keywords and topics
that the candidate is using. Romney’s message starts out relatively simple with only two main
topics, but quickly branches out in complexity as time evolves. The effect of main competitor Ron
Paul withdrawing in May is clear: before this date, Romney is trying to win the party nomination,
whereas afterwards, he is going for the presidential seat. As a result, his message becomes
simpler again: both the number of keywords and the number of topics decrease during the last three
segments, presumably to focus on key issues in the Republican election platform.
For the Obama dataset (Figure 5.1), a good portion of the identified keywords are common
with Mitt Romney (i.e., green in color). This could be seen as Obama discussing many of the issues
that have become central to the U.S. presidential race. Furthermore, there is a clear presence of
keywords such as “health,” “insurance,” and “care,” which may refer to the president’s health care
reform from 2010 (informally called Obamacare). This is a controversial issue that still causes
a major divide between voters; a Reuters-Ipsos poll in June 2012 indicated that a full 56% of
Americans were against the law.
Taken as a whole, both datasets have a heavy emphasis on economics keywords. This is
commensurate with the overall theme of the 2012 presidential race, which largely has focused on
the poor economic situation of the United States.
5.3.2 i-Neighbors Social Messages
The Internet facilitates informal deliberation as well as civic and civil engagement. Web-based
applications for informal deliberation (e.g., i-Neighbors [iNe, 2012]) facilitate the collection of
data that we can analyze to provide insight into how neighborhoods with different poverty levels
use ICTs for informal deliberation. Using ThemeDelta, we can characterize differences and detect
common interests in informal deliberation between advantaged and disadvantaged communities.
The goal of this application was to study two basic questions: How long do neighborhoods
with different poverty levels spend discussing topics? And what is the average similarity in
topics discussed between neighborhoods with different poverty levels, and what is the similarity
in topics discussed between neighborhoods with similar poverty levels?
The data for this application was collected through the i-Neighbors system, first presented
in 2.1. When we collected the data in 2010, the i-Neighbors website had over 100,000 users who
had registered more than 15,000 neighborhoods. Over 1,000 neighborhoods were active with more
than 7,000 unique messages contributing to neighborhood discussion forums. We collected data
from six geographically diverse communities located in Georgia, Maryland, New York, and Ohio.
We selected the three groups located in areas with concentrated levels of poverty (a poverty rate of
25% or more, 2009 American Community Survey, US Census Bureau) who exchanged the most
messages, and the three most active groups in more advantaged areas.
Figure 5.6: Result of searching for the word “watch” in low-poverty neighborhood. [Segments shown, 2009-2010: Jan 28 - Feb 28 (1 month), Feb 01 - Nov 01 (9 months), Nov 02 - Jan 02 (2 months), Jan 03 - Sept 03 (8 months).]
We applied our temporal segmentation algorithm to the six selected neighborhoods. Topics
within each segment can be examined using the visualization to find topic similarities between
neighborhoods. Segmentation labels indicating segment sizes can be used to compare the time
spent by different neighborhoods discussing certain topics.
A partial segmentation output is shown for a disadvantaged neighborhood in Figure 5.7 and
for a more advantaged neighborhood in Figure 5.8. In these two examples, the segment sizes are
not very different, which suggests that both the disadvantaged and advantaged neighborhoods
spend similar amounts of time discussing topics.
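As a rough illustration of this comparison, the sketch below averages the segment lengths read off the labels in Figures 5.7 and 5.8. The variable and function names are ours, and the month counts are copied from the figure labels rather than recomputed from the underlying messages.

```python
from statistics import mean

# Segment lengths in months, copied from the segment labels in the figures.
high_poverty_months = [4, 8, 1, 8]   # Figure 5.7
low_poverty_months = [1, 9, 2, 8]    # Figure 5.8


def summarize(months):
    """Return (mean, shortest, longest) segment length in months."""
    return mean(months), min(months), max(months)


high_summary = summarize(high_poverty_months)
low_summary = summarize(low_poverty_months)
print("high-poverty:", high_summary)
print("low-poverty:", low_summary)
```

With only four segments per neighborhood, such summaries are descriptive at best; they support the visual impression but are not a statistical test.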
Examining the word groupings in both neighborhoods can reveal differences
and similarities in their discussions. For example, in the low-poverty neighborhood in segment [Feb
1, 2009 to Nov 1, 2009], there is a topic that has the words “watch” and “neighbor,” which leads us
to conclude that there were some arrangements or discussions about a neighborhood watch. This
topic is not found in the disadvantaged neighborhood visualization. If the user searches for the
word “watch,” the visualization (Figure 5.6) shows only the topics that contain this word and any
other related topics.
Similarly, an example of similarities of topics discussed between neighborhoods can be
Figure 5.7: Partial output from a high-poverty neighborhood. [Segments shown, 2009-2010: Jan 01 - May 01 (4 months), May 02 - Jan 02 (8 months), Jan 03 - Feb 03 (1 month), Feb 04 - Oct 04 (8 months).]
shown by examining the segment [Jan 03, 2010 to Sept 3, 2010] in the advantaged neighborhood
(Figure 5.8) and segment [Feb 4, 2010 to Oct 4, 2010] in the disadvantaged neighborhood (Fig-
ure 5.7). In both segments, there exist two topics in which both communities discuss a park-related
project.
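The second study question asks about topic similarity between neighborhoods. The text does not state which similarity measure was used, so the sketch below assumes a simple Jaccard overlap between keyword sets. The two sets are small illustrative subsets of the keywords visible in Figures 5.7 and 5.8, not the full topics.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Illustrative subsets of keywords read off Figures 5.7 and 5.8 (assumption:
# these are hand-picked examples, not complete topics from the model output).
high_poverty = {"park", "project", "school", "budget", "neighbor", "police"}
low_poverty = {"park", "neighbor", "watch", "hoa", "clubhouse", "email"}

shared = sorted(high_poverty & low_poverty)
score = jaccard(high_poverty, low_poverty)
print(shared, round(score, 2))
```

Averaging such scores over pairs of neighborhoods with the same or different poverty levels would give one possible reading of "average similarity".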
5.3.3 Historical U.S. Newspapers
Newspaper stories are precisely the type of ongoing, evolving trend datasets for which ThemeDelta
was designed. Below we review the source, segmentation, and visualization for a dataset consisting
of historical U.S. newspaper stories from 1918.
Our data source was a historical newspapers database, first presented in Section 2.2. Some of the
Figure 5.8: Partial output from a low-poverty neighborhood. [Segments shown, 2009-2010: Jan 28 - Feb 28 (1 month), Feb 01 - Nov 01 (9 months), Nov 02 - Jan 02 (2 months), Jan 03 - Sept 03 (8 months).]
newspapers included in this example are: The Washington Times (Washington, DC), Evening Public
Ledger (Philadelphia, PA), The Evening Missourian (Columbia, MO), El Paso Herald (El Paso,
TX), and The Holt County Sentinel (Oregon, MO). We gathered data from them, restricting the
time to the period September 1918 through December 1918. From this dataset, we extracted only
paragraphs that mention the word “influenza” resulting in 2,944 paragraphs. This corresponds to the
1918 flu pandemic (also known as the “Spanish flu”) which spread around the world from January
1918 to December 1920, resulting in some 50 million deaths.
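The paragraph filter described above can be sketched as follows. This is a minimal illustration assuming the paragraphs are already available as plain strings; the sample paragraphs are invented for the example and the case-insensitive whole-word match is our assumption, since the text only says paragraphs had to mention the word “influenza.”

```python
import re

# Whole-word, case-insensitive match (an assumption of this sketch).
INFLUENZA = re.compile(r"\binfluenza\b", re.IGNORECASE)


def influenza_paragraphs(paragraphs):
    """Keep only the paragraphs that mention the word 'influenza'."""
    return [p for p in paragraphs if INFLUENZA.search(p)]


# Invented sample paragraphs, for illustration only.
sample = [
    "Spanish Influenza cases were reported at the camp.",
    "The liberty loan parade was held on Monday.",
    "Schools closed on account of the influenza epidemic.",
]
kept = influenza_paragraphs(sample)
print(len(kept), "of", len(sample), "paragraphs kept")
```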
Applying ThemeDelta to this dataset with a weekly segment granularity yields four discrete
time segments over the four-month time period. Figure 5.9 shows a visualization of the result, where
the transparency of each trendline is mapped to the global ranking of the keyword
corresponding to the trendline. The thickness of the trendline conveys the ranking of each keyword
within a particular time segment, as calculated by our segmentation algorithm.
Figure 5.9: ThemeDelta visualization for newspaper paragraphs during the period September to December in 1918. Color transparency for different trendlines signifies the global frequency for that keyword. [Segments shown, 1918: Sept 09 - Oct 09 (4 weeks), Oct 10 - Dec 05 (8 weeks), Dec 06 - Dec 13 (1 week), Dec 14 - Dec 28 (2 weeks).]
Figure 5.9 offers several observations that summarize the qualitative nature of trends exposed
by ThemeDelta. The output shows many events in the data that were related to the 1918 pandemic.
For example, in the first time segment, September 9 until October 9, there is a topic
that contains the terms “mask” and “German.” This corresponds to advisories and guidelines
recommending that people use masks to protect themselves from the ongoing influenza pandemic
during World War I. In the same segment, the words “liberty,” “loan,” and “campaign” appeared in
one of the topics, and continued appearing in the following segment, October 10 until December 5,
because a liberty loan campaign was launched to support the army during World War I. Also, in the
October 10 to December 5 segment, the army men left the camps to go back home from service and
stay with their families; this explains the topic with the words “family,” “home,” “service,” “spent,”
“wife,” and “son.” This topic appeared along with the topic with the words “case,” “disease,” “mask,”
“cross,” and “red” because the returning soldiers were exposed to the disease and some of them were
sick. As a result, families were advised to take protective measures.
World War I ended on November 11, 1918, which explains the disappearance of the word
“German,” but the country continued suffering from the disease. The word “mask” reappeared
along with “epidemic,” “hospital,” and “disease” in the December 14 until December 28 segment,
which aligns with the second influenza wave. Again, during this time people were advised to wear
masks to slow down the spread of the disease. The Red Cross was frequently mentioned in the
last three segments, which is indicative of the second, deadlier wave of the pandemic that began
in October. In both the December 6 to December 13 and December 14 to December 28 segments,
the terms “people,” “ban,” and “meet” appeared because people were banned from meeting each
other as a precautionary measure to limit the spread of the disease. The term “president,”
along with “service,” appeared initially in the first segment and then returned with
significant strength in the last segment, illustrating the seriousness accorded to the national scale of
the pandemic.
5.4 Qualitative User Study
To validate the utility of the ThemeDelta system, including both its temporal segmentation algo-
rithm as well as its visual representation, we conducted a qualitative user study involving expert
participants. The purpose was to study the suitability of the approach for in-depth expert analysis of
dynamic text corpora. Because of our existing collaboration with historians (the sixth author of this
work is a historian), we opted to use the historical U.S. newspaper dataset and engage experts from
the history department at one of our home universities.
For our qualitative evaluation, we used historical data from five U.S. newspapers from three
different areas: New York, Washington, D.C., and Philadelphia. The data was collected from the
Chronicling America website (http://chroniclingamerica.loc.gov/) and focused on the 1918 influenza
epidemic, which killed as many
as 50 million people worldwide and has long been recognized as one of the deadliest disease
outbreaks in modern world history. Historians are interested in reconstructing the timeline of events,
with a view to understanding previously concealed or neglected connections between public opinion,
health alerts, and prevailing medical knowledge.
5.4.1 Method
We recruited three graduate students as participants: one from the history department and two from
the English department at our university. The participants were all required to have prior knowledge
of America around the Great War/First World War period. Two participants were Ph.D. students
and one was a Masters student. We required no particular technical skill prior to participation.
While the number of study participants may appear to be low, we want to emphasize that these
participants represent a highly expert population and that our study protocol is focused more on an
expert review [Tory and Möller, 2005] rather than a comparative or performance-based user study.
The total study time was an hour. The procedure was as follows: Participants were first
asked to fill out a background questionnaire. Then the study moderator explained the tool and its
features, followed by the task the participants were asked to perform using the tool. After that, the
participants were asked to solve several high-level tasks (reviewed below) using the tool. Finally,
they were asked to complete a post-session questionnaire to collect feedback on the tool.
The task that we asked the participants to accomplish with the help of our system was
to answer some questions on the 1918 influenza pandemic. Participants were encouraged to refer
to the visualization in their answers by mentioning segment names, giving examples, or taking
screen captures from the visualization. Tasks were divided into change and connection questions, to
allow us to determine whether the visualization and algorithmic choices we made were helpful or
not. The change-focused questions were:
• How did the newspapers describe the spread of influenza?
• How does the description of the pandemic change over time?
• Are there different times when the influenza pandemic becomes less important? What are
those time periods?
Questions that were focused on connections were:
• What are the categories that appear to be associated with influenza in different newspapers?
• Was there a specific feeling that surrounded the influenza reporting in the newspapers?
5.4.2 Results
All three participants were successful in accomplishing the task using ThemeDelta. We determined
this by comparing their answers to the task questions with model answers provided by the history
faculty collaborator (reviewed in Section 5.3.3). They correctly reported the sentiments that
surrounded the influenza from the five newspapers. They also successfully described the change in
reporting of the influenza spread. Finally, they all succeeded in discovering the connection between
influenza and other categories (e.g., schools, war, and hospitals).
The subjective results of the study were overall positive and the participants all vouched
for the helpfulness of the system and the need for such systems in their research. None of the
participants had previous experience using any visual analytics systems. This implies that the
participants found ThemeDelta to be understandable and easy to use.
All three participants finished the tasks within the allocated time. They also uniformly
reported that the same type of task, if done manually as part of their own research, would normally
take several days if not weeks. This highlights an additional strength of our system: minimizing the
time spent on manual analysis of large amounts of text, allowing the analyst to focus on collecting
insight instead.
In the post-session questionnaire, participants were asked to give their feedback on specific
ThemeDelta features. The features that were reported as very useful were labels, line thickness,
duplicate trends, and discontinuations. Participant ratings for other features ranged from very useful
to not useful at all, the latter typically because they did not use that particular feature. Some of the
identified weaknesses of the tool included not being able to see full phrases or word combinations,
managing keyword filtering, controlling the dynamic layout, and high complexity for large datasets.
5.5 Summary
We presented ThemeDelta, a visual analytics system we built to help detect the scatter and gather
of trends in text corpora. We used the system in three different scenarios, each with its own dataset:
a historical newspaper dataset, a presidential campaign dataset, and an
i-Neighbors dataset.
The first scenario was historical U.S. newspaper coverage of the Spanish flu pandemic. Here we
focused on how newspapers in 1918 discussed the second wave of the pandemic and how
these topics evolved over time. The second scenario was the Barack Obama and Mitt Romney 2012 U.S.
presidential campaigns. In this scenario, our focus was to identify the similarities between the two
candidates and how the topics they discussed in their campaigns evolved over time. The third and last
scenario was social messages exchanged between virtual communities via i-Neighbors. The
focus here was on comparing advantaged and disadvantaged neighborhoods in terms of the topics
discussed and the time spent on those topics.
The system showed great success in identifying trends and their temporal evolution in the
three scenarios. We qualitatively evaluated the system by running an expert user study. The study
results showed how successful the system was in helping experts reach conclusions and identify key
trends.
Chapter 6
Dynamic Spatial Topic Model
The main goal of this chapter is to extend the basic topic model to accommodate location and
temporal distinctions in large document sets. In this chapter, we present a new dynamic spatial
topic model (DSTM), a true spatio-temporal model. DSTM can model relationships between
locations, topics, documents, and terms in a dynamic fashion. The model enables summarizing and
navigating unstructured time stamped text documents while capturing the evolution of topics along
with location distribution over these topics.
Previous work on temporal topic models [Blei and Lafferty, 2006, Wang and McCallum,
2006, AlSumait et al., 2008, Gohr et al., 2009, Zhang et al., 2010, Hoffman et al., 2010, Hong et al.,
2011] and on spatial topic models [Pan and Mitra, 2011, Wang et al., 2009] does not model the
decomposition of topic models into specific topics for specific locations over time. Tracking the
evolution of topics and their locations over time is a critical step toward understanding major events
such as an epidemic or unrest.
The DSTM model assumes words in a document depend on both topic distributions and
location distributions. Unlike LDA, this model yields topic distributions over the vocabulary and
location distributions across all topics, and the evolution of topics and their locations is captured
over time. Our model inherits some features from both the Author-Topic Model previously proposed
by [Rosen-Zvi et al., 2004] and the Dynamic Topic Model previously proposed by [Blei and Lafferty,
2006]. One of the advantages of our model over these two models is that it combines the power of
both. We applied the algorithm to multiple newspapers from the Chronicling America repository
introduced in Section 2.2 to understand the differences between those papers in the coverage of the flu as it
spread.
6.1 Proposed Model
Here we propose a dynamic spatial topic model (DSTM) that incorporates reporting locations into
the process of inferring topics. Fig. 6.1 presents our proposed model for modeling time-stamped
data. A (Dirichlet) distribution over topics is first organized and, concomitantly, a (Dirichlet)
distribution over locations is organized. Next, a (multinomial) topic distribution and a (multinomial)
location distribution are picked. The first to incorporate information about a document in the topic
inference was [Rosen-Zvi et al., 2004]. Finally, we select a word from the topic distribution and
location from the location distribution. Specific model notation is given in Table 6.1.
In order to capture the evolution of topics and locations over time, we assume that φt and
λt are Dirichlet distributions that evolve by adding white (Gaussian) noise at each time step to the
distributions resulting from the previous time slice as in [Blei and Lafferty, 2006]. This is done by
chaining φt and λt :
\[
\phi_{t,k} \mid \phi_{t-1,k} \sim \mathrm{Dir}(\phi_{t-1}) + \mathcal{N}(\mu, \delta^2)
\]
where $\mathcal{N}(\mu, \delta^2)$ reflects the added Gaussian noise.
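One way to read this chaining step in code is sketched below. The formula does not say how the noisy vector is kept on the probability simplex, so the concentration `scale`, the `floor`, and the clip-and-renormalize step are assumptions of this sketch rather than part of the model definition. All function names are ours.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def evolve(prev, rng, mu=0.0, sigma=0.01, scale=100.0, floor=1e-6):
    """One chaining step: a Dirichlet draw centred on prev plus Gaussian noise.

    scale, floor, and the clip-and-renormalize step are assumptions of this
    sketch; they keep the result a valid probability vector.
    """
    alpha = [max(p * scale, floor) for p in prev]
    base = sample_dirichlet(alpha, rng)
    noisy = [max(b + rng.gauss(mu, sigma), floor) for b in base]
    total = sum(noisy)
    return [x / total for x in noisy]


rng = random.Random(0)
phi_prev = [0.5, 0.3, 0.2]         # toy word distribution for one topic at t-1
phi_next = evolve(phi_prev, rng)   # the same topic at time slice t
print(phi_next)
```

The same step would apply to the location distributions λ, with their own noise variance.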
The generative process for time slice t of a chronologically ordered, time-stamped corpus of
documents is as follows:
1. Randomly draw K multinomial distributions from φt, where $\phi_{t,k} \mid \phi_{t-1,k} \sim \mathrm{Dir}(\phi_{t-1}) + \mathcal{N}(\mu, \beta^2)$.
Table 6.1: DSTM notation
N    number of words in a document.
D    number of documents in a corpus.
O    number of locations in a corpus.
K    number of topics (constant across time slices).
L    list of locations in a document (observed).
l    location assignment for topic j.
z    topic assignment for word i.
λ    distribution of locations over topics.
φ    distribution of topics over the terms.
β    Dirichlet prior (hyperparameter) for φ.
δ    Dirichlet prior (hyperparameter) for λ.
w    word (observed).
t    time.
T    length of time represented by the model.
Figure 6.1: Graphical model representation of the DSTM for three consecutive time slices. [Plate diagram: each time slice contains nodes L, l, z, and w, the distributions Φ and λ with priors β and δ, and plates over N, D, K, and O.]
2. Randomly draw Ot multinomial distributions from λt, where $\lambda_{t,k} \mid \lambda_{t-1,k} \sim \mathrm{Dir}(\lambda_{t-1}) + \mathcal{N}(\mu, \delta^2)$.
3. For each document d, then for each word w in the document:
(a) Draw location l and topic z.
(b) Draw word w from topic z.
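Step 3 of the process above can be sketched for a single document as follows. The sketch assumes that a location is drawn uniformly from the document's observed location list L (matching the 1/L_m term used later in the derivation), that λ gives each location a distribution over topics, and that φ gives each topic a distribution over the vocabulary. The toy distributions and names are invented for illustration.

```python
import random


def generate_document(n_words, doc_locations, lam, phi, vocab, rng):
    """Generate (location, topic, word) triples for one document.

    lam[loc] is a list of K topic weights for that location;
    phi[k] is a list of word weights over vocab for topic k.
    """
    triples = []
    for _ in range(n_words):
        loc = rng.choice(doc_locations)                             # l ~ uniform over L
        topic = rng.choices(range(len(phi)), weights=lam[loc])[0]   # z ~ lambda_l
        word = rng.choices(vocab, weights=phi[topic])[0]            # w ~ phi_z
        triples.append((loc, topic, word))
    return triples


# Toy, deterministic distributions (invented) so the outcome is easy to check.
vocab = ["mask", "loan"]
phi = [[1.0, 0.0], [0.0, 1.0]]
lam = {"NY": [1.0, 0.0], "MO": [0.0, 1.0]}

doc = generate_document(6, ["NY", "MO"], lam, phi, vocab, random.Random(1))
print(doc)
```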
Here φt, λt, zt, and lt are hidden variables, and wt and Lt are the only observed variables. β
and δ are considered fixed here, as recommended in the literature for simplicity. The generative process
for DSTM yields the distribution $p(\phi_t, \lambda_t, z_t, w_t, l_t \mid L_t, \beta, \delta)$, which can be decomposed according to
the chain rule as follows:
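\[
p(\phi_t, \lambda_t, z_t, w_t, l_t \mid L_t, \beta, \delta) = \prod_{i=1}^{N_t} p(z_{t,i} \mid \lambda_t, l_t)\, p(w_{t,i} \mid z_t, \phi_t)\, p(l_{t,i} \mid L_t) \prod_{j=1}^{K} p(\phi_{t,j} \mid \beta) \prod_{y=1}^{O_t} p(\lambda_{t,y} \mid \delta) \tag{6.1}
\]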
The main inferential problem we are trying to solve is computing the posterior distribution
of the hidden variables. To derive their posterior distribution from the joint distribution (Eqn. 6.1)
we use Bayes’ rule:
\[
p(\phi_t, \lambda_t, z_t, l_t \mid w_t, L_t, \beta, \delta) = \frac{p(\phi_t, \lambda_t, z_t, w_t, l_t \mid L_t, \beta, \delta)}{p(w_t, l_t \mid L_t, \beta, \delta)} \tag{6.2}
\]
This distribution is very hard to calculate. To solve this problem we use an approximation
technique.
6.2 Parameter Approximation
Given that we need to sample multiple parameters at once, Gibbs Sampling is an appropri-
ate choice for approximating the hidden parameters. We can infer φt and λt by first infer-
ring the topic and location assignment (zt , lt) pairs per word conditioned on all other variables
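$p(z_t, l_t \mid w_t, z_{t,-i}, l_{t,-i}, w_{t,-i}, L_t, \beta, \delta)$. By applying Bayes' rule we can obtain the $(z_t, l_t)$ assignments as
follows:
\[
p(z_{t,i}, l_{t,i} \mid w_{t,i}, z_{t,-i}, l_{t,-i}, w_{t,-i}, \beta, \delta, L_t) = \frac{p(z_t, l_t, w_t \mid \beta, \delta, L_t)}{p(z_{t,-i}, l_{t,-i}, w_{t,-i} \mid \beta, \delta, L_t)}
\]
This yields:
\[
p(z_{t,i}, l_{t,i} \mid w_{t,i}, z_{t,-i}, l_{t,-i}, w_{t,-i}, \beta, \delta, L_t) \propto p(z_t, l_t, w_t \mid \beta, \delta, L_t) \tag{6.3}
\]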
In the above equations, zt,−i denotes the assignment of all topics except the current instance and
lt,−i denotes the assignment of all locations except the current instance. Now we can obtain the
conditional probability of (zt,i, lt,i) based on zt,−i, lt,−i and wt from equation 6.3 by integrating over
the continuous distributions (Dirichlet distributions) φt and λt:
\[
p(z_t, l_t, w_t \mid \beta, \delta, L_t) = \int_{\phi_t} \int_{\lambda_t} p(z_t, l_t, w_t, \phi_t, \lambda_t \mid \beta, \delta, L_t)\, d\phi_t\, d\lambda_t \tag{6.4}
\]
We can expand 6.4 using the joint probability distribution (Eqn. 6.1) and grouping terms by
their dependent variables:
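\[
p(z, l, w \mid \beta, \delta, L) = \int_{\phi} p(w \mid z, \phi)\, p(\phi \mid \beta)\, d\phi \int_{\lambda} p(z \mid \lambda, l)\, p(l \mid L)\, p(\lambda \mid \delta)\, d\lambda
\]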
These two integrations represent a multinomial distribution multiplied by a Dirichlet prior.
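\[
p(z, l, w \mid \beta, \delta, L) = \int_{\phi} \Big( \prod_{i=1}^{N} p(w_i \mid \phi_{z_i}) \Big) p(\phi \mid \beta)\, d\phi \int_{\lambda} \Big( \prod_{i=1}^{N} p(z_i \mid \lambda_{y_i}) \Big) \Big( \prod_{m=1}^{D} \prod_{i=1}^{N} p(l_i \mid L_m) \Big) p(\lambda \mid \delta)\, d\lambda \tag{6.5}
\]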
Working with the first term in Eqn. 6.5
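\begin{align*}
\int_{\phi} \Big( \prod_{i=1}^{N} p(w_i \mid \phi_{z_i}) \Big) p(\phi \mid \beta)\, d\phi
&= \int_{\phi} \Big( \prod_{j=1}^{K} \prod_{n=1}^{V} \phi_{nj}^{C^{VK}_{nj}} \Big) \Big( \prod_{j=1}^{K} \Big( \frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}} \prod_{n=1}^{V} \phi_{nj}^{\beta-1} \Big) \Big) d\phi \\
&= \int_{\phi} \Big( \frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}} \Big)^{K} \Big( \prod_{j=1}^{K} \prod_{n=1}^{V} \phi_{nj}^{C^{VK}_{nj}+\beta-1} \Big) d\phi \\
&= \mathrm{CONST}_1 \int_{\phi} \Big( \prod_{j=1}^{K} \prod_{n=1}^{V} \phi_{nj}^{C^{VK}_{nj}+\beta-1} \Big) d\phi
\end{align*}
where $\mathrm{CONST}_1 = \big( \Gamma(V\beta) / \Gamma(\beta)^{V} \big)^{K}$ and $C^{VK}_{nj}$ is the number of times word n in vocabulary V was assigned
to topic j (among K topics).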
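\[
= \mathrm{CONST}_1 \prod_{j=1}^{K} \int_{\phi} \Big( \prod_{n=1}^{V} \phi_{nj}^{C^{VK}_{nj}+\beta-1} \Big) d\phi
\]
Given that the term φ is a Dirichlet distribution, the resulting integral will be as follows:
\[
= \mathrm{CONST}_1 \prod_{j=1}^{K} \frac{\prod_{n=1}^{V} \Gamma(C^{VK}_{nj}+\beta)}{\Gamma\big(\sum_{n'} C^{VK}_{n'j} + V\beta\big)} \tag{6.6}
\]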
where the Dirichlet integrals are obtained by applying the following rule:
\[
\int \prod_{x=1}^{X} \big[ a_x^{k_x - 1}\, da_x \big] = \frac{\prod_{x=1}^{X} \Gamma(k_x)}{\Gamma\big(\sum_{x=1}^{X} k_x\big)}
\]
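For two components the Dirichlet integral reduces to the Beta integral, ∫₀¹ a^(k1−1) (1−a)^(k2−1) da = Γ(k1)Γ(k2)/Γ(k1+k2), which has the same Gamma-ratio form as Eqn. 6.6. The short check below compares this closed form against a midpoint-rule approximation; the exponents are arbitrary test values and the function names are ours.

```python
import math


def beta_integral_numeric(k1, k2, steps=100_000):
    """Midpoint-rule estimate of the integral of a^(k1-1) * (1-a)^(k2-1) on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        a = (i + 0.5) * h
        total += a ** (k1 - 1) * (1.0 - a) ** (k2 - 1)
    return total * h


def beta_integral_closed(k1, k2):
    """Closed form: Gamma(k1) * Gamma(k2) / Gamma(k1 + k2)."""
    return math.gamma(k1) * math.gamma(k2) / math.gamma(k1 + k2)


numeric = beta_integral_numeric(3.0, 2.5)
closed = beta_integral_closed(3.0, 2.5)
print(numeric, closed)
```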
We can use the same machinery for the second term in Eqn. 6.5:
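\begin{align*}
\int_{\lambda} \Big( \prod_{i=1}^{N} p(z_i \mid \lambda_{y_i}) \Big) \Big( \prod_{m=1}^{D} \prod_{i=1}^{N} p(l_i \mid L_m) \Big) p(\lambda \mid \delta)\, d\lambda
&= \int_{\lambda} \Big( \prod_{i=1}^{N} \lambda_{y_i} \Big) \Big( \prod_{m=1}^{D} \prod_{i=1}^{N} \frac{1}{L_m} \Big) \Big( \prod_{y=1}^{O} \Big( \frac{\Gamma(K\delta)}{\Gamma(\delta)^{K}} \prod_{j=1}^{K} \lambda_{jy}^{\delta-1} \Big) \Big) d\lambda \\
&= \int_{\lambda} \Big( \frac{\Gamma(K\delta)}{\Gamma(\delta)^{K}} \Big)^{O} \Big( \prod_{m=1}^{D} \frac{1}{L_m^{N_m}} \Big) \Big( \prod_{y=1}^{O} \prod_{j=1}^{K} \lambda_{jy}^{C^{KO}_{jy}+\delta-1} \Big) d\lambda \\
&= \mathrm{CONST}_2 \int_{\lambda} \Big( \prod_{y=1}^{O} \prod_{j=1}^{K} \lambda_{jy}^{C^{KO}_{jy}+\delta-1} \Big) d\lambda
\end{align*}
where $\mathrm{CONST}_2 = \big( \Gamma(K\delta) / \Gamma(\delta)^{K} \big)^{O} \big( \prod_{m=1}^{D} 1 / L_m^{N_m} \big)$ and $C^{KO}_{jy}$ is the number of times location y was assigned
to topic j (among O locations and K topics).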
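\begin{align*}
&= \mathrm{CONST}_2 \prod_{y=1}^{O} \int_{\lambda} \Big( \prod_{j=1}^{K} \lambda_{jy}^{C^{KO}_{jy}+\delta-1} \Big) d\lambda \\
&= \mathrm{CONST}_2 \prod_{y=1}^{O} \frac{\prod_{j=1}^{K} \Gamma(C^{KO}_{jy}+\delta)}{\Gamma\big(\sum_{j'} C^{KO}_{j'y} + K\delta\big)} \tag{6.7}
\end{align*}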
Substituting equations 6.6 and 6.7 in equation 6.5 we can obtain the following equation for
p(w,z, l|β,δ,L):
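\[
p(z, l, w \mid \beta, \delta, L) = \mathrm{CONST} \Big( \prod_{j=1}^{K} \frac{\prod_{n=1}^{V} \Gamma(C^{VK}_{nj}+\beta)}{\Gamma\big(\sum_{n'} C^{VK}_{n'j} + V\beta\big)} \Big) \Big( \prod_{y=1}^{O} \frac{\prod_{j=1}^{K} \Gamma(C^{KO}_{jy}+\delta)}{\Gamma\big(\sum_{j'} C^{KO}_{j'y} + K\delta\big)} \Big) \tag{6.8}
\]
where
\[
\mathrm{CONST} = \Big( \frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}} \Big)^{K} \Big( \frac{\Gamma(K\delta)}{\Gamma(\delta)^{K}} \Big)^{O} \Big( \prod_{m=1}^{D} \frac{1}{L_m^{N_m}} \Big)
\]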
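Finally, by substituting equation 6.8 in 6.2 and using the identity Γ(K + 1) = KΓ(K), we
obtain the Gibbs sampling update:
\[
p(z_{t,i}, l_{t,i} \mid w_{t,i}, z_{t,-i}, l_{t,-i}, w_{t,-i}, \beta, \delta, L_t) \propto \frac{C^{V_t K}_{t,ij} + \beta}{\sum_{i'} C^{V_t K}_{t,i'j} + V_t \beta} \cdot \frac{C^{O_t K}_{t,yj} + \delta}{\sum_{j'} C^{O_t K}_{t,yj'} + K\delta}
\]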
$C^{V_t K}_{t,ij}$ is the number of times word i from vocabulary V is assigned to topic j among the K topics. $C^{O_t K}_{t,yj}$ is
the number of times location y from the O locations is assigned to topic j among the K topics. The primes on j
and i denote all instances except the current one, and t denotes the counts at time slice t.
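The sampling update can be sketched for a single token as follows. This is an illustrative reading of the formula, not the implementation used in the experiments: the count tables are plain nested lists, the current token is assumed to have been removed from the counts already, and restricting candidate locations to the document's observed list L is our assumption. All names and toy counts are invented.

```python
import random


def pair_weights(word, doc_locations, c_wt, c_lt, beta, delta):
    """Unnormalized weight of every (topic j, location y) pair for one token.

    c_wt[n][j]: times word n is assigned to topic j (current token excluded).
    c_lt[y][j]: times location y is assigned to topic j (current token excluded).
    """
    V = len(c_wt)
    K = len(c_wt[0])
    topic_totals = [sum(c_wt[n][j] for n in range(V)) for j in range(K)]
    weights = {}
    for y in doc_locations:
        loc_total = sum(c_lt[y])
        for j in range(K):
            word_term = (c_wt[word][j] + beta) / (topic_totals[j] + V * beta)
            loc_term = (c_lt[y][j] + delta) / (loc_total + K * delta)
            weights[(j, y)] = word_term * loc_term
    return weights


def sample_pair(weights, rng):
    """Draw one (topic, location) pair with probability proportional to its weight."""
    pairs = list(weights)
    return rng.choices(pairs, weights=[weights[p] for p in pairs])[0]


# Toy counts (invented): 2 words, 2 topics, 2 locations.
c_wt = [[3, 0], [1, 2]]
c_lt = [[4, 1], [0, 2]]
w = pair_weights(word=0, doc_locations=[0, 1], c_wt=c_wt, c_lt=c_lt,
                 beta=0.5, delta=0.01)
choice = sample_pair(w, random.Random(0))
print(choice)
```

A full sampler would loop over all tokens, decrement the counts for the token's current assignment, call these two functions, and increment the counts for the new assignment.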
6.3 Model Evaluation
In order to quantitatively evaluate our proposed model, we compare its predictive power against
other models. Our model inherits its dynamic nature from the Dynamic Topic Model (DTM) proposed
in [Blei and Lafferty, 2006]. DTM does not take into account any extra information in the
process of inferring topics, and this gives our model an advantage over it. Another parent of our
model is the Author Topic Model (ATM), a non-dynamic model proposed in [Rosen-Zvi
et al., 2004]. Our model inherits the ability to integrate extra information into the topic inference
from ATM. In our case, the extra information is the location. ATM is non-dynamic, which also
gives our model an advantage over it.
We compare our model against ATM, with LDA as our baseline model. For the purpose of
comparing the three models, we ran each model separately on the same dataset (described in Section 2.2)
while fixing the hyperparameters of the model. The hyperparameters are fixed as follows: β is
calculated as 50/K and δ is fixed to 0.01, where K is the number of topics. The comparison was
based on calculating the perplexity for each model while varying some model parameters. The
perplexity score for an unseen document conditioned on observed locations is calculated as follows:
\[
\mathrm{Perplexity}(w \mid l) = \exp\left( -\frac{\log p(w \mid l)}{N} \right) \tag{6.9}
\]
where p(w|l) is the probability that the word w appears in the unseen document conditioned
on the location observed in the document and the pre-trained model. N is the total number of
words in the unseen document. To calculate the perplexity of the testing set, we average over the
documents as follows:
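\[
\mathrm{Perplexity}(D_{\mathrm{Test}}) = \frac{\sum_{d=1}^{D} \mathrm{Perplexity}(w_d \mid l_d)}{D} \tag{6.10}
\]
where D is the number of documents in the test set and $w_d$, $l_d$ are the words and locations of test document d.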
Comparing our model against the LDA model with respect to vocabulary size (the number of
unique words in the dataset) revealed an improvement in perplexity in our model's favor as the
dataset grows. Fig. 6.3 reports perplexity with respect to vocabulary size. Both models
show similar performance until the dataset gets larger.
When varying the number of topics, our model shows better performance
than the LDA model with fewer topics (Fig. 6.2). The performance of our model
decreases with a higher number of topics. The results show that the optimal number of topics is
under 10.
6.4 Model Applications
Here we take a qualitative approach to evaluating our model by exploring the applicability of our
model on historic newspapers. In this section, we present two applications. The first application
is concerned with news coverage of three newspapers from the east, midwest, and west. The
model output was used to understand the differences in reporting between these newspapers. The
Figure 6.2: Perplexity as a function of number of topics. [Plot: perplexity (y-axis, 500 to 2000) against number of topics (x-axis: 5, 10, 20) for LDA, ATM, and DSTM.]
Figure 6.3: Perplexity as a function of vocabulary size. [Plot: perplexity (y-axis, 0 to 8000) against vocabulary size (x-axis, 5000 to 55000) for LDA and DSTM.]
second application focused on understanding the tone usage in relationship to discovered topics
and locations. In this application, the dataset was divided based on tone resulting in four different
datasets. Each dataset was fitted to the model and the output was analyzed and compared.
6.4.1 East, west, midwest 1918-1919 news coverage
In this application, we ran the model on three individual historic newspapers. We chose the newspapers
randomly from three different areas of the United States: east, mid-west, and west. The newspapers
used in this study were the New York Tribune (NY) from the east, The Evening Missourian (MO) from the
mid-west, and the Bisbee Daily Review (AZ) from the west. We analyzed the topics and locations discovered
from these newspapers and then mapped them to historical facts reported in published reports by the
Navy Department Library [Rep, c], the United States Department of Health and Human Services [Rep,
b], and the National Archives and Records Administration [Rep, a]. We extracted influenza paragraphs
from the three newspapers and then divided them by month. In this application, we discarded
European locations in the location detection phase because the main focus of this research is the
United States influenza pandemic. We also discarded paragraphs that mention no location and
focused on explicitly mentioned locations. We visualized topics and locations from September 1918
through January 1919 for the output analysis. To highlight locations from the three different US
areas, we color coded locations as follows: green denotes cities in and around the west, black
denotes cities from the mid-west, and blue denotes cities from the east.
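The preparation steps in this paragraph (dropping European locations, discarding paragraphs left with no location, and dividing by month) can be sketched as follows. The record layout, the small set of European place names, and the sample records are all assumptions made for illustration; the text does not describe how European locations were identified.

```python
from collections import defaultdict
from datetime import date

# Illustrative only: a real filter would need a proper gazetteer.
EUROPEAN = {"England", "France", "Germany"}


def bucket_by_month(paragraphs):
    """Group paragraphs by (year, month), keeping only non-European locations."""
    buckets = defaultdict(list)
    for par in paragraphs:
        us_locations = [loc for loc in par["locations"] if loc not in EUROPEAN]
        if not us_locations:
            continue  # no explicit location left: discard the paragraph
        key = (par["date"].year, par["date"].month)
        buckets[key].append({**par, "locations": us_locations})
    return dict(buckets)


# Invented sample records.
sample = [
    {"date": date(1918, 9, 20), "locations": ["Boston", "England"], "text": "..."},
    {"date": date(1918, 9, 25), "locations": [], "text": "..."},
    {"date": date(1918, 10, 3), "locations": ["France"], "text": "..."},
    {"date": date(1918, 10, 5), "locations": ["Columbia"], "text": "..."},
]
buckets = bucket_by_month(sample)
print(sorted(buckets))
```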
Examining the topics from the New York Tribune September 1918 segment, the first topic and its
locations indicate that cases of pneumonia were reported by the health department in Manhattan, NY
and Somerville, NJ. Officials in Boston and Washington reported people ill with influenza and death
cases. They also reported their concerns about the spread of Spanish influenza. There were
reports on the spread of Spanish influenza in the German army, and then to England. There were also
reports from New York, NY on establishing a quarantine to stop the spread of the disease. The second topic in the same
segment included the words Copeland, commission, steamer, symptom, nasal, germ, and isolate
because Copeland, the city health commissioner, reported on the isolation of people who came to the
States (from France) by steamer. Copeland also said symptoms of Spanish influenza include nasal
discharges and that the germ is carried in the nose and mouth. Again, the appearance of the
locations Minnesota, Iowa, California, Connecticut, and Camp Lee, VA was due to reports on the
number of cases, concerns, and deaths.
In the first topic of the January 1919 segment, the words pathetic and wald emerged because there
were some press releases by Miss Wald, founder of the Henry Street Settlement in New York, on the
difficulties faced because of the influenza situation. In the second topic, some words related to
advertisements showed up, such as color, main, and floor. An example of an advertisement:
“Snug Bath Robes - Such a comfort to lounge in, such protection, too, when severe cold seems such
an epidemic. $4.89 to $19.74 Macy’s Main Floor, 35th Street.” The words authority, roosevelt,
association, loss, and mead are due to reports from authorities from Bulgaria on losses due to
influenza around the same time Franklin D. Roosevelt, assistant secretary of the navy, contracted
influenza on his trip to Europe. In January 1919, there were reports on the death of Theodore
Roosevelt, who served as the 26th President of the United States. S.C. Mead, the secretary of the
merchants’ association, commented on his death.
From The Evening Missourian results, we closely examined the following
months: September, October, and December of 1918. During September 1918, the
topics and top assigned locations were the result of major reporting on a liberty loan parade
happening in Columbia, MO. Schools were kept out of the parade to minimize the spread of the
disease. Around the same month, Ferguson tablets were advised to be taken for cold and grippe.
The president of the Columbia board of health announced news on Spanish influenza cases. There were
reports on an influenza outbreak and deaths at the Great Lakes naval training school
and station. The word enemy and the location New York emerge in the second topic due to reports on
an attempted bombing of New York by the enemy, which was Germany at this time. Since this topic had
some war-related words, this explains the emergence of the location Springfield, MA, where an army
camp was located, among the topic's top five locations. The rest of the locations had reported influenza
cases or related deaths.
In October's topics and locations, influenza cases were reported in New York, West Virginia,
and at the Parker Memorial Hospital. Troops were supposed to go to Camp Bowie in October, but
they did not because of the epidemic. The Board of Health in San Francisco, California required
people to wear masks. In the second topic, the words nurse, school, and ill appeared as an
indication of reports on influenza cases in the nurses' school. The words student and house
appeared in this topic because students were advised to stay in their houses and not go to school.
There were also reports on students of the school of medicine nursing influenza cases at hospitals.
Announcements from the army were made through the Arlington naval radio station, which explains
the emergence of the location Arlington in the second topic. There were many reports on influenza
cases in the army and some reports on secure arrivals to the US by sea.
In the December 1918 topics, topic one contains the words red, cross, receive, Christmas,
service, influenza, soldiers, and epidemic. Those words indicate the involvement of the Red Cross
in helping the soldiers during Christmas time. In the second topic, Kansas City rises into the top
five locations. The second topic contains the words vaccine, university, louis, student, and die
because there were some deaths. These locations and words agree with the reports, during
December 1918, on the availability of the vaccine in Kansas City and St. Louis. The University of
Missouri developed the vaccine and served as its main source. There were demands
to release doctors from the army in Illinois to meet the needs of the community. Des Moines,
Spokane, and Philadelphia also appeared in the top locations due to reports on people ill with or
dying from influenza.
In the Bisbee Daily Review, the terms and locations of the September 1918 first topic
confirmed that the epidemic of Spanish influenza and pneumonia had appeared in army camps and was
causing many deaths. As a result, Camp Meade was placed under general quarantine. Reports from
Massachusetts (mass. in the topic terms) covered the outbreak and the number of new cases at Camp
Devens, MA. At this time, Massachusetts was believed to be the center of the epidemic in the east.
There were reports from Richmond, VA health officers on their measures to prevent the spread of
Spanish influenza, and reports on the state of influenza in Connecticut and New Hampshire. In
the second topic and its locations, the terms pointed to many facts. In Norfolk, VA, cases of
Spanish influenza developed among enlisted men at the Hampton Roads naval base. Influenza spread
over the country and reached Boston in September. In this month, several cities and towns within a
twenty-five-mile radius of Boston reported deaths from influenza and pneumonia. The commissioner
of Boston gave some press releases to calm public panic. One of his statements was: “fear
would lower the vitality of those exposed”. Meanwhile, epidemic cases were reported at the army
technical school at the University of Colorado. It was reported by the Review leased wire in
Washington that El Paso, Phoenix, and Jerome closed all public places. There were also reports on
deaths in Montana and the Warren district. The words enemy, french, british, attach, and german
emerged in the same topic, which confirms major war-related reporting during this month. There was
reporting on American troops attacking west of the Verdun region in co-operation with the French.
Around the same time, British troops made a powerful attack against the German lines.
In January 1919, the second peak of influenza was striking. The words in the first topic are
related to news about Campbell, Arizona’s governor. His sons were seriously ill, and he was
leaving the office. The word attack appeared because Spanish influenza was described as attacking
people and causing their death. There were many reports on deaths in New England. A number
of locations, such as Seattle, Arkansas, and Maricopa, were assigned to this topic because during
this time a former mayor of Seattle, Washington died from influenza. Chandler William Clemens, who
was born in Arkansas, also died after an attack of influenza. There were reports on officials
visiting the county of Maricopa and on the influenza epidemic situation there. During this period,
authorities believed and announced that Spanish influenza was just grip camouflaged under a new
name, which explains the appearance of the word grip in the same topic. In the second topic, it is
again obvious that influenza attacks were happening. There was reporting on the attacks in Douglas,
a city in Arizona. Reports during this month read as follows: “a fresh outbreak of influenza is
shown in phoenix by the report of 117 new cases during the last 72 hours.” Placards were being
posted on all the houses in the Warren district where cases of influenza were found. There was some
weather-related reporting from Holbrook during this period. The words war and court appeared in
this topic due to reporting on the collection of war profits and the involvement of the Supreme
Court in the process. The Supreme Court was also mentioned in reports related to freedom of speech.
Neither set of Supreme Court reports was directly related to influenza, but both appeared around
the same time.
A great number of reports on the 1918-1919 influenza epidemic were published by the Navy
Department Library [Rep, c], United States Department of Health and Human Services [Rep, b],
and National Archives and Records Administration [Rep, a]. These reports summarized the history
of the epidemic by identifying its peaks and the path of its spread over the country. The reports
were developed based on close reading of newspapers. They can help in assessing the usefulness of
the discovered topics and their locations by mapping the facts stated in these reports to the
findings from the model output.
Because it reports clearly and names the specific areas where the pandemic appeared,
we closely followed the report published by the Navy Department Library under the title "The
pandemic of influenza in 1918-1919". The report makes two major statements:
1. First statement: “The peak of the epidemic was reached in September in Navy personnel and
about the middle of October in the Army.”
2. Second statement: “In September, it appeared in rapid succession in other Army camps and
the civilian population along the Atlantic seaboard and the Gulf of Mexico and spread rapidly
westward over the country.”
These two statements describe the peak of the pandemic, where it started, and in which
directions it spread. We examined the topics and locations from September 1918 through
January 1919 in the three newspapers, shown in Fig. 6.4, 6.5, and 6.6.
In the September and October topics of the three newspapers, navy- and war-related terms such as
war, army, attack, and naval emerged in the top 15 terms. Other epidemic-related terms such as
report, death, pneumonia, spread, and disease also emerged. The two sets of terms emerging in the
September and October time slices are consistent with the first statement in the report.
Examining the top 5 locations in each time slice from the three newspapers, we found that camp
locations rose to the top locations. Also, locations from the midwest and west of the country
started appearing in later time slices, which is consistent with the second statement. Examining
the output, we found that all three newspapers mainly reported on local and surrounding areas. Few
national locations (any location outside the newspaper's publishing state) emerged in the top five
locations. The national locations that did emerge in the three newspapers are consistent with the
spread of the disease over the country. Locations from the east emerged in the September and
October time slices in the midwest and west newspapers. On the other hand, in the east newspaper
there were no national locations, which indicates the intensity of local reporting given that the
epidemic at this time was concentrated in this area. Similarly, locations from the west emerged in
January 1919 in the east and midwest newspapers, which is consistent with the epidemic spreading
toward the west during this time.
[Figure 6.4: New York Tribune, NY DSTM Output. For each monthly time slice from September 1918 to January 1919, the figure shows two topics (top 15 terms each) and their top five locations.]
The analysis of the topics and locations discovered from the New York Tribune, NY (east),
The Evening Missourian, MO (midwest), and the Bisbee Daily Review, AZ (west) agreed
with major events described in reports on the epidemic. The timeline of influenza peaks and spread
was also confirmed.
6.4.2 1918-1919 Influenza related tones, topics, and locations
In this application, our main goal is to understand the shifts of topics based on the tones used by
newspapers from different parts of the United States. Newspapers here were also divided into three
parts: east, midwest, and west. Details on these newspapers were previously summarized in Table 2.2.
We present a supervised learning approach to tone detection. Finally, we explain the added insight
offered by detecting the tone prior to applying our model.
[Figure 6.5: The Evening Missourian, MO DSTM Output. For each monthly time slice from September 1918 to January 1919, the figure shows two topics (top 15 terms each) and their top five locations.]
[Figure 6.6: Bisbee Daily Review, AZ DSTM Output. For each monthly time slice from September 1918 to January 1919, the figure shows two topics (top 15 terms each) and their top five locations.]
Tones used in this application were identified by four domain experts: one historian, one
librarian, and two rhetoricians. We focused on four main tones: alarmist, explanatory, reassuring,
and warning. The experts first identified tones from advertisements and then applied them to a
broader sample of texts. Here is a brief description of the tones:
1. Alarmist: uses fear or urgency, often mentioning number of sick or dead; induces a sense of
panic.
2. Reassuring: comforting; implies threat is diminishing; addresses fears with soothing sensibility;
typically conveys the idea that if one takes a recommended action; motivates action with
a sense of hopefulness, improvement, or possibility of avoidance of disease; involves sense
that an action will lead to betterment.
3. Explanatory: discourse as a source of information; lacks a distinctive affect.
4. Warning: serious but not urgent; cautioning; advises the reader what to do; mentions measures
being taken, but conveys no sense that the threat is diminishing.
For the classifier, we used a Multinomial Naive Bayes classifier. It is first trained using the
features extracted from labeled data. For feature extraction, we used TF-IDF weights over a 1- to
2-gram language model. Using the labeled data, we trained the classifier to detect the four tones
(alarmist, explanatory, reassuring, and warning), using approximately 300 cleaned sentences from
newspapers and four coders, who attained a moderate level of agreement in their classifications
(Kappa=0.47). Kappa is a measure used to assess the degree of agreement among raters.
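The classification step described above can be sketched as follows. This is a minimal, standard-library-only illustration and not the implementation used in this work: it assumes whitespace tokenization, a smoothed IDF weighting, and Laplace smoothing (alpha), and the function names (ngrams, train, predict) and the toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, n_max=2):
    """Return the 1- to n_max-grams of a whitespace-tokenized sentence."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def train(sentences, labels, alpha=1.0):
    """Fit IDF weights and per-tone multinomial parameters on labeled sentences."""
    docs = [Counter(ngrams(s)) for s in sentences]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption; other TF-IDF variants would also work).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    mass = defaultdict(Counter)  # tone -> accumulated TF-IDF mass per term
    for doc, label in zip(docs, labels):
        for term, tf in doc.items():
            mass[label][term] += tf * idf[term]
    model = {"idf": idf, "log_prior": {}, "log_lik": {}}
    for label, terms in mass.items():
        total = sum(terms.values()) + alpha * len(idf)
        model["log_prior"][label] = math.log(labels.count(label) / n_docs)
        model["log_lik"][label] = {t: math.log((terms[t] + alpha) / total) for t in idf}
    return model

def predict(model, sentence):
    """Return the tone with the highest posterior score for one sentence."""
    doc = Counter(ngrams(sentence))
    scores = {}
    for label, log_prior in model["log_prior"].items():
        log_lik = model["log_lik"][label]
        scores[label] = log_prior + sum(
            tf * model["idf"][t] * log_lik[t]
            for t, tf in doc.items() if t in model["idf"])  # skip unseen n-grams
    return max(scores, key=scores.get)

# Hypothetical toy training data (two of the four tones, for brevity).
sentences = [
    "thousands dead as epidemic rages",
    "deaths increase and panic spreads",
    "quinine tablets relieve cold and grip",
    "gargle relieves sore throat quickly",
]
labels = ["alarmist", "alarmist", "reassuring", "reassuring"]
model = train(sentences, labels)
print(predict(model, "epidemic deaths increase"))
```

In practice, a library implementation of TF-IDF and Multinomial Naive Bayes would normally replace these hand-written functions; the sketch only makes the two stages (feature weighting, then classification) explicit.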
The dataset used here is the same data described in the previous section. After running
the tone classifier, we use the results to divide the data into four datasets, one for each tone.
We then apply our model to each dataset separately and compare the results. Figure 6.7 shows the
distribution of the different tones over influenza reporting from January 1918 until December
1919. The explanatory tone peaks when influenza peaks. In the west and midwest, the peaks of the
explanatory tone stand further above the other tones than they do in the east. The east coast
reporting seems to exhibit a variety of tones, but the explanatory tone is still dominant. The
alarmist tone is the least prominent tone in the three parts. The explanatory tone is the most
prominent tone in the east, midwest, and west. This is because the dataset to which we apply the
tone classifier is a newspaper dataset, and newspapers tend to be neutral in their reporting.
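The partitioning step can be illustrated with a short sketch. It assumes the classifier output is available as (month, tone, text) records; the records and variable names below are hypothetical. It splits the data into one dataset per tone and counts tone assignments per month, which is the quantity plotted in Figure 6.7.

```python
from collections import Counter, defaultdict

# Hypothetical classifier output: (publication month, predicted tone, text).
classified = [
    ("1918-09", "explanatory", "influenza cases reported at the naval base"),
    ("1918-10", "explanatory", "schools closed by order of the board of health"),
    ("1918-10", "alarmist", "thousands dead as epidemic rages"),
    ("1918-10", "warning", "avoid crowded places, doctors advise"),
    ("1919-01", "reassuring", "quinine tablets relieve cold and grip"),
]

by_tone = defaultdict(list)   # one dataset per tone, each modeled separately
monthly_counts = Counter()    # (month, tone) -> number of tone assignments
for month, tone, text in classified:
    by_tone[tone].append((month, text))
    monthly_counts[(month, tone)] += 1

for tone, items in sorted(by_tone.items()):
    print(tone, len(items))
```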
Applying the DSTM model resulted in topics and locations for each time slice in each tone.
Here we show the output for east coast newspapers. Figure 6.8 is a visualization of the model output
for each tone starting September 1918 until January 1919.
[Figure 6.7: Tones distribution over Influenza reporting. Panels: (a) West coast, (b) Midwest, (c) East coast, each titled "Tone Distribution over the years 1918 and 1919". x-axis: month, Jan-18 to Dec-19; y-axis: Tone Assignments Count; series: Alarmist, Explanatory, Reassuring, Warning.]
[Figure 6.8: East coast newspapers discovered topics and locations grouped by tones. Rows: Warning, Reassuring, Explanatory, Alarmist; columns: monthly time slices from September 1918 to January 1919, each showing topics and locations; two cells are marked "No Data".]
As expected for east coast newspapers, east locations dominated the maps in every month
for every tone. The interesting pattern to notice is the different locations appearing in
different tones and times. More locations from the west and midwest appeared in the explanatory
and reassuring tones. This is because east coast newspapers still reported on the situation in the
west and midwest, but in an explanatory and/or reassuring tone. Given that the epidemic reached
the west around January 1919, west coast locations started to appear in the alarmist, warning, and
reassuring tones, but the situation did not change for the explanatory tone.
In the alarmist output, the words suffering, rage, and thousands appeared in topics. These
words did not appear in any of the other tones. The first peak of the influenza epidemic occurred
around September 1918, which explains the appearance of the words epidemic,
influenza, death, and increase. At this time, influenza was described as "the dim and ineffectual
wraith of the mediaeval black death". It was reported by all authorities that the 1510 bubonic
plague (historically called the black death) was influenza. There were also arguments that the
plague that devastated the Greek armies at the siege of Troy was influenza, which explains the
grouping of the words disease, wraith, mediaeval, ineffectual, and death in one topic. Top
locations were only in the east, which was expected because it was the first peak of the epidemic
and it started on the east side of the country. In December 1918, topics resulted from reports of
an influenza outbreak in some west coast locations, to which medical supplies were shipped. This
explains the emergence of west locations among the top locations. Around the same time, there were
reports on the arrest of a post office employee who extracted money from envelopes. December is
Christmas time, when families send money to each other as gifts. The man was caught by post office
inspectors.
Examining the reassuring output, we found words like preventive, laxative, tablet, sore,
nausea, tea, germ, and cough. This set of words shows that there was a great deal of reporting
on influenza and grip symptoms and on some preventive measures. For example, in November 1918
there were reports on laxative bromo quinine tablets that were advised as a preventive
measure against grip. Liberty catarrhal cream was also advised to kill germs in the nose,
throat, and intestines, also as a preventive against influenza and other infectious diseases.
People were advised never to let a cough or cold or tease of grippe get serious. A spray called
“Tonsiline” was also advised as a gargle to relieve sore throat upon its first appearance. None of
the other tones has this number of words related to preventive measures. The goal behind all these
preventive measures was to reassure the public that there was hope that they could escape the
disease or at least ease its symptoms.
From the warning output, the word spread appeared in most time slices. For example, in
September 1918 words like influenza, spread, spanish, and serious appeared together in one topic,
which indicates the size of the problem. As a result, there were reports citing doctors such as
Dr. Phelps and Dr. Fowler, who warned people against being in crowded places and advised them to
stay at home. There were also reports that an available grip medicine prescribed by doctors does
not cure the disease. The word drink emerged in the same topic as the words fowler and patient
because there were reports from Dr. Fowler on whether hot drinks are good for influenza patients
or not.
During January 1919, there were reports in the form of warnings. The words weak, cough,
and vitality appeared in one of the topics as a result of extensive reporting on influenza
symptoms. There were statements in Washington on an available remedy called "hypo-cod". Again,
January was around the time when the epidemic reached the west.
Detecting tone and dividing the dataset based on it gave a different perspective to discovered
topics and their locations. Topics discovered from the same time slice in one tone were different
from others discovered from another tone. Adding the tone aspect gave us a chance to see the data
from a different angle and derive different conclusions.
6.5 Summary
Here we introduced a probabilistic model that can model relationships between locations, topics,
documents, and terms in a dynamic fashion. The model enables summarizing and navigating
unstructured time stamped documents while capturing the evolution of topics along with location
distribution over these topics.
We presented two different applications of the DSTM. The first application focused on
understanding the differences in news coverage between the east, midwest, and west parts of the
U.S. in 1918 and 1919. The second application focused on discovering the differences in news
reporting between the three parts of the U.S. from the perspective of reporting tone.
We evaluated the DSTM qualitatively and quantitatively. The quantitative evaluation was
done by comparing our model to existing models using perplexity. The model showed better
performance over the basic topic model (LDA) and slightly better performance than the Author-
Topic model (ATM). We qualitatively evaluated the model by closely examining the output from
the two applications, described above, and mapping results to published reports on the influenza
epidemic peaks and spread pattern. Overall the model was successful in identifying the trends
and their locations which helped in studying the difference between news reporting in the two
applications.
Chapter 7
Predictive Analysis
In this chapter, we present the fourth and final part of this dissertation, which enables the
Dynamic Spatial Topic Model (DSTM) for predictive analysis. The motivation behind this work
is to describe a powerful model that predicts a major event, and where this event will happen, from
unseen documents. Documents can be in the form of social messages, newspaper articles, blogs, or
any other form of textual data. The main research question we are trying to answer is: how can we
predict what major event will happen and where? To predict future topics and their locations from
unseen tweets, we adapted the work of [Wang et al., 2012], who proposed training a basic
topic model (LDA) using data from seven days to calculate a transition parameter from the
discovered topics. This transition parameter is then used to predict the distribution of topics in
unseen tweets from the 8th day. The transition parameter needs to be updated every time new data
is streamed.
There are two major drawbacks of the Wang et al. work: it is based on basic LDA (a
non-dynamic and non-spatial topic model), and updating the transition parameter is computationally
intensive. In this part of our work, we overcome those drawbacks by training the model using
DSTM (our model) instead of LDA. Using DSTM enables topic and location discovery from data
collections, and there is no need to update the transition parameter because our model is dynamic:
topics discovered at time t evolve from topics discovered at time t − 1. Although the resulting
framework is broadly applicable, we apply it primarily to our Latin American tweet collection,
previously described in Section 2.4.
7.1 Prediction Approach
The prediction approach we present here is for streaming data. We train the model on
seven days of data by applying DSTM to the available data, and then calculate the transition
parameter. Using the transition parameter, we can predict topics from unseen tweets from the 8th
day. Figure 7.1 shows three sample runs, indicating the data included in the transition parameter
calculation phase and the data used in the prediction phase for each run. In the approach of Wang
et al., the transition parameter would be updated in the second run using a computationally
intensive method, discussed in [Wang et al., 2012].
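The run structure can be sketched as a sliding window over an ordered list of days. This is an illustration only: the text does not state how far the window advances between runs, so the step of one day used below is an assumption, and make_runs and the day labels are hypothetical names.

```python
def make_runs(days, window=7, step=1):
    """Return (training_days, prediction_day) pairs over an ordered list of days.

    window: number of days used to calculate the transition parameter.
    step:   how far the window advances between runs (assumed to be one day).
    """
    runs = []
    for start in range(0, len(days) - window, step):
        training_days = days[start:start + window]
        prediction_day = days[start + window]  # the "8th day" of this run
        runs.append((training_days, prediction_day))
    return runs

# Ten hypothetical days of streamed data give three runs.
days = [f"day{i}" for i in range(1, 11)]
runs = make_runs(days)
for training_days, prediction_day in runs:
    print(training_days[0], "to", training_days[-1], "->", prediction_day)
```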
Each run consists of a document collection D of seven days' worth of data. D consists of
m+1 documents. We apply DSTM to the documents to generate the topic-term distribution and
the location-topic distribution. Here a document refers to a location-topic distribution, where
each row represents a topic distribution over the locations.
The training data D, used to calculate the transition parameter, is divided into $D_{old}$ and
$D_{new}$. $D_{old}$ is the collection of documents used for training and $D_{new}$ is the
collection of documents used for testing. In more detail:
$$D_{old} = D_{t(1,m)} \tag{7.1}$$
$$D_{new} = D_{t(2,m+1)} \tag{7.2}$$
[Figure 7.1: Experimental setup for predicting topics and their locations from streaming data. Three sample runs are shown; each run calculates the transition parameter from 7 days of data and predicts the 8th day. In Runs 2 and 3, the topics discovered from the 8th day evolve from the topics discovered from the 7th day of the previous run.]
As in any prediction problem, we need to define the prediction error and attempt to
minimize it. Some researchers use iterative methods (e.g., gradient descent), but we minimize the
prediction error with a direct method that gives the optimal solution, previously presented in
[Wang et al., 2012]. We calculate the prediction error as follows:
$$error_{Prediction} = \min \lVert D_{predicted} - D_{actual} \rVert_F^2$$
where $\lVert \cdot \rVert_F^2$ is the squared Frobenius matrix norm, $D_{predicted}$ is the
predicted location-topic distribution, and $D_{actual}$ is the actual location-topic distribution.
The predicted location-topic distribution is calculated as follows:
$$D_{predicted} = D_{old} \, TP$$
where $D_{old}$ is the old location-topic distribution, such that each row in this matrix
represents a location distribution over topics, and $TP$ is the transition parameter.
The transition parameter $TP$ is a matrix of size $K \times K$, where $K$ is the number of topics.
The number of topics in $D_{old}$ and $D_{new}$ must be the same, and its value depends on the
application. We calculate the transition parameter as follows:
$$TP = (D_{old}' D_{old})^{-1} D_{old}' D_{new}$$
Here $D_{old}'$ denotes the transpose of $D_{old}$, and the inverse $(\cdot)^{-1}$ is computed
using a Cholesky factorization. Unlike the transition parameter calculation approach presented in
[Wang et al., 2012], here we do not need to update the transition parameter because DSTM is a
dynamic topic model. The same TP equation is applied to every seven days' worth of data as it
becomes available.
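The calculation of the transition parameter can be sketched in plain Python as a least-squares fit. The sketch rests on stated assumptions: the actual distribution used during fitting is taken to be D_new, the Cholesky factorization is applied to D_old' D_old and used to solve the normal equations rather than to form an explicit inverse, and D_old has full column rank. All function names and the toy matrix are hypothetical.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def cholesky_solve(low, b):
    """Solve (L L') X = b column by column via forward and back substitution."""
    n = len(low)
    cols = []
    for col in transpose(b):
        y = [0.0] * n
        for i in range(n):  # forward substitution: L y = col
            y[i] = (col[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
        x = [0.0] * n
        for i in reversed(range(n)):  # back substitution: L' x = y
            x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
        cols.append(x)
    return transpose(cols)

def transition_parameter(d_old, d_new):
    """TP = (D_old' D_old)^{-1} D_old' D_new, a K x K matrix."""
    d_old_t = transpose(d_old)
    gram = matmul(d_old_t, d_old)
    return cholesky_solve(cholesky(gram), matmul(d_old_t, d_new))

def frobenius_sq(a, b):
    """Squared Frobenius norm of a - b (the prediction error)."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical D with m + 1 = 4 rows and K = 2 topics.
d = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8], [0.5, 0.5]]
d_old, d_new = d[:-1], d[1:]          # rows 1..m and rows 2..m+1 (Eq. 7.1, 7.2)
tp = transition_parameter(d_old, d_new)
d_predicted = matmul(d_old, tp)       # D_predicted = D_old TP
print(tp, frobenius_sq(d_predicted, d_new))
```

Solving the normal equations through the Cholesky factor avoids forming the inverse explicitly, which is the usual reason for choosing a factorization in a direct method.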
To predict the topic distribution over the unseen documents, we use documents from the 8th
day. Documents here are actual documents, unlike the analogy we used previously. We divide the
8th day document set into a previous document set and a future (unseen) document set. The
prediction process steps are:
• For each document in the previous document set:
– We infer the document-topic distribution using the DSTM.
– We then multiply the resulting location-topic distribution from the previous tweet by the
transition parameter to get the predicted topic distribution of the future (unseen) tweet.
In the inference phase, our main goal is to calculate the conditional probability of a topic
given a document, p(topic|document). We can calculate this probability by inferring the
document-topic assignment as follows:
\[ p(z \mid d, l) = \sum_{w} p(z \mid w, l) \, p(w \mid d) \]
where $d$ represents a document, $w$ represents a word in the document, $z$ represents a topic, and
$l$ represents a location. $p(w \mid d)$ is the normalized frequency of word $w$ within document $d$.
We can calculate $p(z \mid w, l)$ given the trained model as follows:
\[ p(z \mid w, l) = \phi_{wz} \, \theta_{lz} \]
where $\phi_{wz}$ is the probability of word $w$ appearing in topic $z$, retrieved from the topic-terms
distribution $\phi$, and $\theta_{lz}$ is the probability of topic $z$ appearing in location $l$,
retrieved from the location-topic distribution $\theta$.
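The inference step above can be illustrated with a short NumPy sketch. The function name, the array
layout (a V × K matrix for φ and a length-K vector for the row of θ belonging to one location), and
the toy numbers are assumptions made for illustration. The optional renormalization is also an
assumption; it is not part of the equations above.

```python
import numpy as np


def infer_topic_distribution(word_counts, phi, theta_l, normalize=True):
    """Compute p(z | d, l) = sum_w p(z | w, l) p(w | d) for one document.

    word_counts: length-V vector of word counts for document d
    phi:         V x K matrix, phi[w, z] = probability of word w in topic z
    theta_l:     length-K vector, theta_l[z] = probability of topic z at location l
    """
    word_counts = np.asarray(word_counts, dtype=float)
    phi = np.asarray(phi, dtype=float)
    theta_l = np.asarray(theta_l, dtype=float)
    p_w_given_d = word_counts / word_counts.sum()  # normalized word frequencies
    p_z_given_w_l = phi * theta_l                  # phi[w, z] * theta_l[z], shape V x K
    scores = p_w_given_d @ p_z_given_w_l           # sum over words, shape K
    if normalize:
        scores = scores / scores.sum()             # assumption: rescale to sum to one
    return scores


# Toy model with V = 3 words and K = 2 topics (hypothetical numbers).
phi = [[0.5, 0.1],
       [0.3, 0.2],
       [0.2, 0.7]]
theta_l = [0.6, 0.4]
word_counts = [2, 0, 2]

raw = infer_topic_distribution(word_counts, phi, theta_l, normalize=False)
dist = infer_topic_distribution(word_counts, phi, theta_l)
print(raw, dist)
```

Without the renormalization the scores keep their relative order, so the choice of the most probable
topic is unaffected.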
The result of the previous process is the predicted topic distribution for the unseen documents.
We then examine each predicted topic distribution to find the topic with the highest probability and
assign it to the predicted (unseen) document. Table 7.1 shows a sample topic assignment for the
unseen documents, mapped to the actual documents. In this example, each document is a preprocessed
tweet, with a partial set of words shown for each tweet. These tweets were not observed at prediction
time; they are shown here for illustrative purposes.
After assigning a topic to each unseen document, we count the assignments for each topic and
rank the topics from most to least frequently assigned. Table 7.2 gives an example of the topic
assignment counts. In this table, topic one has the highest count, which indicates that it is the
most probable topic to appear in the unseen documents on June 8th, 2013. The terms and locations for
a topic can be retrieved from the topic-terms distribution and the location-topic distribution
discovered by the trained model.
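The assignment and counting steps described above amount to an argmax over each predicted
distribution followed by a frequency count. The sketch below uses hypothetical function names and a
made-up 4 × 3 matrix of predicted distributions; it only illustrates the procedure.

```python
from collections import Counter

import numpy as np


def assign_topics(predicted):
    """Assign each document (row) the index of its highest-scoring topic."""
    return np.argmax(np.asarray(predicted, dtype=float), axis=1)


def rank_topics(assignments):
    """Return (topic, count) pairs ordered from most to least frequently assigned."""
    counts = Counter(int(z) for z in assignments)
    return counts.most_common()


# Hypothetical predicted topic distributions for four unseen documents.
predicted = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])

assignments = assign_topics(predicted)
ranking = rank_topics(assignments)
print(assignments, ranking)
```

Topics that are never assigned do not appear in the ranking; a table such as Table 7.2 would list
them with a count of zero.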
In the following section, we discuss a use case of the proposed framework and present a detailed
discussion of the results to illustrate how to interpret the output.
7.2 Latin America Unrest Prediction
In this application, the main goal is to predict civil unrest events happening in Latin America.
The focus here was on Twitter data collected from Colombia, Mexico, El Salvador, Costa Rica,
Guatemala, Chile, Paraguay, Argentina, Venezuela, and Ecuador. This data was provided by
Early Model-Based Event Recognition with Surrogates (EMBERS), a Virginia Tech based project.
The dataset used here was first introduced in Section 2.4. We filter tweets from these countries
using a set of civil unrest related keywords.
Table 7.1: Sample topic-document (tweet) assignment
Date | Country | Tweet Terms | Topic
6/8/13 | Colombia | consigue - mundo - punto - empate - colombia - arquero - argentina | 4
6/8/13 | Argentina | música - escuchar - pide - loco - tomatela - celular | 1
6/8/13 | Argentina | salvo - juega - cancha - james - seguir - argentina | 2
6/8/13 | Colombia | objetivo - logro - tricolor - listo - vamoscolombia - sisepuede - fcfseleccioncol - sumar | 2
6/8/13 | Venezuela | leña - olor - quedas - aquel - campo | 2
6/8/13 | Chile | igualar - clasificación - colombia - posterga - mundial - argentina | 1
6/8/13 | Ecuador | messibelievers - ecuargentinos - quieren - matar | 2
6/8/13 | Colombia | mejores - ganamos - perdimos - colombia | 4
6/8/13 | Chile | wazees - muuuuuuucho - acabo - reportar - gauss - miguel - vehículo - social - san - congestionamiento - gps | 4
6/8/13 | Ecuador | eliminatorias - partido - argentina - termina - colombia | 2
6/8/13 | Colombia | hpta - asistencias - payaso - colombia - messi - pasao - quie - enano - triple - jugo - ajajajajaja - monda - malparido - goles | 1
In this application, the location is assumed to be the location from which the tweet was sent. As a
result, there is only one location assigned to each tweet. The number of topics was fixed at five.
Results from our framework are represented as the top two topics assigned to the unseen tweets.
Reference events for specific dates are retrieved from the reference event dataset previously
discussed in Section 2.4. The actual reference unrest events and their locations were also provided
by EMBERS. The top two topics are manually examined to validate their similarity to the events in the
reference event dataset.
Two examples of the framework output are explored to illustrate how we interpret the output.
The first example focuses on the first week of June 2013. The first seven days are used for
calculating the transition parameter, and the prediction was done for the unseen tweets from
June 8th, 2013. Table 7.2 shows the resulting counts of topic assignments to the unseen tweets.
Table 7.2: Predicted topic assignment counts for June 8th, 2013.
Topic | Count
Topic 0 | 4,126
Topic 1 | 24,932
Topic 2 | 14,888
Topic 3 | 17,006
Topic 4 | 23,945
From this table, the top assigned topics are topic one followed by topic four. Figure 7.2 shows the
top terms (translated into English) and the top locations for each topic. Examining the top terms of
topic one and topic four and their locations, we found that they agree with the reference events
from the same day. Here are some notable observations:
Topic one includes the terms freedom, expression, people, and pathway. These terms are a
clear indication of unrest and of a protest taking place. Other terms such as water and environment
were present along with the previous words, which agrees with a reference event about a group
of people protesting the lack of water and environmental damage on June 8th, 2013. This event
happened in Chile, which is one of the top locations of topic one. Work, money, and power also
appeared in topic one, which agrees with a protest by street traders who were not allowed to
earn a living. This protest is one of the reference events of the same day, and the actual event
location, Mexico, also appeared in the top locations of the topic.
In topic four, the terms hate, evil, and some inappropriate terms are an indication of public
anger. After examining the reference events, we determined that on the same day there was
a rally by the members of the council of the Sexual Diversity Mexico State and Lesbian Gay
community seeking to legalize marriage between same-sex individuals and to criminalize hate
crimes and homophobia. We posit that the anger-related words might be a public reaction to this
protest. This protest happened in Mexico, which is among the top locations of topic four. The term
school appeared in the same topic, which coincides with a protest event related to the minimum wage
of school teachers. That protest happened in Venezuela, which also appeared in the top locations of
the topic.
Another example is the topics predicted for the unseen tweets from June 29th, 2013. After predicting
the topic assignment of the unseen tweets, the counts can be calculated; they are shown in Table 7.3.
[Figure 7.2 panels: Topic 1 and Topic 4; rows: top terms translated into English, top locations.]
Topic 1, top terms translated into English: Mexico party work freedom pathway people week state
expression mexico greetings power national viola the water city what days love my years really if
road out people environment change money
Topic 1, top locations: Brazil - Honduras - Colombia - Mexico - El Salvador - Costa Rica -
Guatemala - Chile - Paraguay - Argentina
Topic 4, top terms translated into English: Argentina chili party bitch Brazil Paraguay hatred go
face eu re school shit happy cute people na vinotinto wrong messi sos owner mother years old video
leave journalist crazy so mamma
Topic 4, top locations: Honduras - Colombia - Mexico - El Salvador - Costa Rica -
Guatemala - Chile - Argentina - Venezuela - Ecuador
Figure 7.2: Predicted topics and their locations from the 8th day of June 2013
Topic three and topic two appear to be the most prominent. Figure 7.3 shows the top two topics
assigned to unseen tweets from this day.
[Figure 7.3 panels: Topic 3 and Topic 2; rows: top terms translated into English, top locations.]
Topic 3, top terms translated into English: Argentina brazil eu bitch hatred face go re people na
independent wrong final do Spain sos das leave school cute video ah shit meu poor to best um
Uruguay love
Topic 3, top locations: Brazil - Chile - Colombia - Mexico - Peru - Argentina - Venezuela
Topic 2, top terms translated into English: Colombia people god via savior love Paraguay national
team what world penalty large party president happy group world family people waiting glory social
leave power wrong faith road watching 20
Topic 2, top locations: Brazil - Peru - Mexico - Chile - Colombia - Argentina - Venezuela
Figure 7.3: Predicted topics and their locations from June 29th, 2013.
Table 7.3: Predicted topic assignment counts for June 29th, 2013.
Topic | Count
Topic 0 | 15,365
Topic 1 | 13,060
Topic 2 | 18,830
Topic 3 | 18,919
Topic 4 | 16,731
Manually examining the topic terms and their locations along with the reference events from
the same day, we found the following. In topic three, the terms school, poor, and wrong appeared
together because on this day school teachers were protesting for a living wage. The appearance of
the term school along with the words people and best also agrees with another march, by high school
students supporting progress in education. There were also some protests by teachers about
secondary education reform laws. This event happened in Mexico, which appeared in the top
locations of topic three. The terms sos, independent, and hatred are an indication of the feelings
that surrounded these events.
In topic two, the terms god, faith, savior, and glory are a good indication of a religious march
reference event called “the March for Jesus” that happened during the fourth week of June. This
event took place in Brazil, which is one of the top locations for topic two. Also, the appearance of
the terms wrong and penalty in the same topic may indicate the feelings some people had toward the
protests and marches by members of the gay, lesbian, bisexual, transgender and intersex (GLBTI)
community. Both marches happened on the same day. One of these protests happened in Mexico (one of
the top locations of topic two). Other protests happened in Ecuador and El Salvador, but these two
countries did not appear in the top locations.
7.3 Summary
Here we presented an approach for predicting topics from unseen documents. We applied
the approach to a dataset of Latin American tweets with the goal of predicting civil unrest events
and their locations. The application served as a qualitative evaluation of the prediction approach,
in which we compared the output to a reference event dataset. This event dataset consisted of actual
unrest events and their reported times and locations. Our prediction approach was successful in
predicting major events and their locations from unseen tweets.
Chapter 8
Conclusion
The main goal behind the work presented in this dissertation was to improve the basic topic model
(LDA) to help extract more information from data and improve the utility of the text mining
process. Each part of this work was motivated by answering a specific research question. The research
questions resulted from formalizing real problems faced by social scientists and humanists
dealing with the ever-growing availability of digital archives and data from social media tools. To
realize this goal, we presented and evaluated the following:
1. Dynamic Temporal Segmentations over Topic Models, a time series segmentation algorithm
that segments time based on shifts in topics.
2. ThemeDelta, a visual analytics system for discovering and representing the evolution of trend
keywords into ever-changing topic aggregations over time.
3. Dynamic Spatial Topic Models (DSTM), a new model that incorporates the reporting locations of
inferred topics and captures their evolution over time.
4. A prediction approach for predicting topics and their locations from unseen documents.
All four parts were successful in assisting humanists and social scientists in answering
questions and drawing conclusions. They improve on the basic topic model (LDA) specifically and on
the text mining process generally. While these solutions are successful, they also have
imperfections and considerable room for improvement. In the following sections, we address
each part in detail, discuss its advantages and disadvantages, and suggest future modifications and
extensions.
8.1 Dynamic Temporal Segmentations over Topic Models
This algorithm was successful in extracting important qualitative features from a text corpus, in
the form of topics and the durations over which these topics were present. The main idea behind this
algorithm is to seamlessly wrap a time series segmentation algorithm around a topic modeling
algorithm. This flexibility allows the freedom to choose the topic model appropriate to the
problem at hand.
This algorithm has limitations stemming from non-adaptive window sizing and the fixed
number of discovered topics. The minimum and maximum window sizes have to be specified
before the segmentation algorithm is run. This introduces the problem of forcing a segmentation
point when the maximum window size is reached. To overcome this, we increase the maximum
window size, which can result in a slower-running algorithm. The fixed number of topics discovered
from each segment can introduce redundant topics.
Future work for this part can focus on extending the segmentation algorithm to
capture not just topic differences but also sentiment evolution. For example, capturing the sentiment
evolution will enable us to measure differences in public perception and attitudes between
advantaged and disadvantaged neighborhoods. Another direction of future work is to address the
pre-specified window sizes and the fixed number of topics.
8.2 New Visual Analytics Representations
ThemeDelta excels at discovering trending topics and visualizing not just the discovered topics,
but also their evolution over time. We have demonstrated the utility of our analytics component by
applying it to several types of text corpora. However, while ThemeDelta has many strengths, these
are balanced by several weaknesses and areas for future improvement.
For the visualization component, limitations appear in the presence of many trends, long
time periods, and high visual complexity. While existing techniques such as TextFlow [Cui et al.,
2011] take a macroscopic approach to summarizing massive text corpora using high-level overviews,
ThemeDelta uses a trend-level design that does not scale as well when the number of trends or
time segments increases. While we have not derived a formal limit, even many of the examples
presented in the New Visual Analytics chapter skirt the boundary of the utility of the technique. In
practice, large datasets (in either trends or time, or both) yield high visual complexity, particularly
in the number of trendline crossings as well as incident trendlines. Such effects make perceiving the
visualized data more difficult. Several possible strategies can solve this problem, such as filtering,
sampling, or aggregation.
We opted to not run a controlled quantitative experiment using ThemeDelta, opting instead
for a qualitative expert review [Tory and Möller, 2005]. One reason for this choice is that we found
no suitable technique to use as a baseline comparison for such an experiment. While techniques
such as TextFlow [Cui et al., 2011] and TIARA [Wei et al., 2010] do provide insight on trends
evolving over time, they cluster keywords together and focus on providing overview instead of
detail at the level of individual keywords. In this sense, parallel tag clouds (PTCs) [Collins et al.,
2009b] are perhaps the closest technique to ThemeDelta in that they visualize individual keywords,
yet PTCs do not show the clustering of trends into topics over time. This makes direct comparison
difficult. While our qualitative review did not compare ThemeDelta to other techniques, it did give
rise to much more qualitative and generally useful results.
Our future work will study aggregation methods (time-based and keyword-based alike) for
ThemeDelta that would increase the scalability of the system. Another focus will be to enable the
automatic detection and visualization of the diffusion of ideas in scientific communities. From an
algorithmic perspective, we will look to extend the segmentation algorithm to work
with nonparametric Bayesian models such as the Hierarchical Dirichlet Process (HDP) proposed by
[Teh et al., 2006]. Unlike in LDA, the number of topics in the HDP model is automatically inferred.
Currently, our algorithm only supports parametric models such as Latent Dirichlet Allocation, given
that they are the most commonly used in practice and research.
8.3 Dynamic Spatial Topic Model
DSTM succeeded in discovering the relationships between locations, topics, documents, and terms
in a dynamic fashion. By applying the model to two text corpora, we demonstrated the model’s ability
to summarize and navigate unstructured, time-stamped text documents while capturing the evolution
of topics along with the location distribution over these topics. We also quantitatively evaluated
our model by comparing it to LDA and the Author-Topic Model. One of the advantages of our model over
these two models is that it combines the power of both. Another advantage is that our model
performs better than LDA and slightly better than the Author-Topic Model.
This model has two major limitations. The first limitation is the fixed number of topics, which
also appeared in our segmentation algorithm. This limitation exists here because this model is an
extension of the basic LDA, which does not automatically infer the number of topics. This introduces
the problem of redundant topics. Adopting a nonparametric Bayesian model such as the Hierarchical
Dirichlet Process (HDP) proposed by [Teh et al., 2006] is a possible direction toward solving this
problem.
Another limitation lies in reading the results of this model. Users need to spend some effort
understanding how to read the model results and how to get the most benefit from them.
A well-designed interactive visualization is crucial here to give users the freedom to
explore the results in the way they please and to help them answer their questions.
Our future work can take different directions. First, enabling the model to automatically infer
the number of topics is a possible future direction for this work. Another direction is designing and
developing an interactive visualization. This visualization will transform this powerful model into
a strong visual analytics tool, and it is a critical future work direction for maximizing the benefit
and the readability of the model results. Another direction for this work is to enable the DSTM to
predict topics and their locations from unseen documents. This is a direction we explored in the last
part of this dissertation.
8.4 Predictive Analysis
The fourth and last part of this dissertation focused on enabling our DSTM for predictive analysis.
A great advantage of this approach is that it overcomes limitations of the approach in
[Wang et al., 2012] by enabling their model to predict topic locations and by avoiding the
computationally intensive approach they used to update the transition parameter, a parameter
calculated from the training data and used for prediction.
A natural direction for future work is examining the applicability of the prediction approach to
different domain-specific datasets. Some of the applications we will explore are predicting future
research directions from publication archives and predicting an epidemic outbreak from social
media datasets, e.g., Twitter, Facebook, and blogs. Another direction is integrating sentiment into
this framework. This integration will give valuable insights into the predicted events. The result
of this modification will be topics, their locations, and the sentiment surrounding the predicted
topics.
Bibliography
[Rep, a] The deadly virus: the influenza epidemic of 1918, by: National archives and records
administration.
[Rep, b] The great pandemic: the united states in 1918-1919, by: United states department of health
and human services.
[Rep, c] The pandemic of influenza in 1918-1919, by: Navy department library.
[iNe, 2012] (2012). ineighbor website. http://www.i-neighbors.org/.
[cen, 2012] (2012). Poverty.
http://www.census.gov/hhes/www/poverty/methods/definitions.html.
[Abello and van Ham, 2004] Abello, J. and van Ham, F. (2004). MatrixZoom: A visual interface
to semi-external graphs. In Proceedings of the IEEE Symposium on Information Visualization,
pages 183–190.
[Adar, 2006] Adar, E. (2006). GUESS: a language and interface for graph exploration. In Proceedings
of the ACM 2006 Conference on Human Factors in Computing Systems, pages 791–800.
[AlSumait et al., 2008] AlSumait, L., Barbará, D., and Domeniconi, C. (2008). On-line LDA:
adaptive topic models for mining text streams with applications to topic detection and tracking.
In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM’08,
pages 3–12.
[Amar et al., 2005] Amar, R., Eagan, J., and Stasko, J. (2005). Low-level components of analytic
activity in information visualization. In Proceedings of the IEEE Symposium on Information
Visualization, pages 111–117.
[Appert and Fekete, 2006] Appert, C. and Fekete, J.-D. (2006). OrthoZoom scroller: 1D multi-scale
navigation. In Proceedings of the ACM Conference on Human Factors in Computing
Systems, pages 21–30.
[Archambault et al., 2007] Archambault, D., Munzner, T., and Auber, D. (2007). Topolayout:
Multilevel graph layout by topological features. volume 13, pages 305–317.
[Auber, 2003] Auber, D. (2003). Tulip : A huge graph visualisation framework. In Graph Drawing
Software, pages 105–126. Springer-Verlag.
[Auber et al., 2003] Auber, D., Chiricota, Y., Jourdan, F., and Melancon, G. (2003). Multiscale
visualization of small world networks. In Proceedings of the IEEE Symposium on Information
Visualization, pages 75–81.
[Axis Maps, ] Axis Maps. Typographic maps. http://www.axismaps.com/
typographic.php.
[Baeza-Yates and Ribeiro-Neto, 1999a] Baeza-Yates, R. and Ribeiro-Neto, B. (1999a). Modern
Information Retrieval. Addison-Wesley.
[Baeza-Yates and Ribeiro-Neto, 1999b] Baeza-Yates, R. and Ribeiro-Neto, B. (1999b). Modern
Information Retrieval. Addison-Wesley.
[Baldonado et al., 2000] Baldonado, M. Q. W., Woodruff, A., and Kuchinsky, A. (2000). Guidelines
for using multiple views in information visualization. In Proceedings of the ACM Conference on
Advanced Visual Interfaces, pages 110–119.
[Bateman et al., 2008] Bateman, S., Gutwin, C., and Nacenta, M. A. (2008). Seeing things in
the clouds: the effect of visual features on tag cloud selections. In Proceedings of the ACM
Conference on Hypertext and Hypermedia, pages 193–202.
[Baudisch et al., 2003] Baudisch, P., Cutrell, E., Czerwinski, M., Robbins, D. C., Tandler, P.,
Bederson, B. B., and Zierlinger, A. (2003). Drag-and-pop and drag-and-pick: Techniques for
accessing remote screen content on touch- and pen-operated systems. In IFIP International
Conference on Human-Computer Interaction, pages 57–64.
[Baudisch and Rosenholtz, 2003] Baudisch, P. and Rosenholtz, R. (2003). Halo: a technique for
visualizing off-screen objects. In Proceedings of the ACM Conference on Human Factors in
Computing Systems, pages 481–488.
[Beaudouin-Lafon, 2000] Beaudouin-Lafon, M. (2000). Instrumental interaction: an interaction
model for designing post-WIMP user interfaces. In Proceedings of the ACM Conference on
Human Factors in Computing Systems, pages 446–453.
[Bederson et al., 1996] Bederson, B. B., Hollan, J. D., Perlin, K., Meyer, J., Bacon, D., and Furnas,
G. W. (1996). Pad++: A zoomable graphical sketchpad for exploring alternate interface physics.
volume 7, pages 3–32.
[Bederson et al., 2000] Bederson, B. B., Meyer, J., and Good, L. (2000). Jazz: An extensible
zoomable user interface graphics toolkit in Java. In Proceedings of the ACM Symposium on User
Interface Software and Technology, pages 171–180.
[Berry and Linoff, 1997] Berry, M. J. A. and Linoff, G. (1997). Data Mining Techniques. Wiley.
[Bertin, 1967] Bertin, J. (1967). Sémiologie graphique: Les diagrammes - Les réseaux - Les cartes.
Editions de l’Ecole des Hautes Etudes en Sciences, Paris, France, les réimpressions edition.
[Bertin, 1983] Bertin, J. (1983). Semiology of graphics. University of Wisconsin Press.
[Bezerianos and Balakrishnan, 2005] Bezerianos, A. and Balakrishnan, R. (2005). The vacuum:
facilitating the manipulation of distant objects. In Proceedings of the ACM Conference on Human
Factors in Computing Systems, pages 361–370.
[Bier et al., 1993] Bier, E. A., Stone, M. C., Pier, K., Buxton, W., and DeRose, T. (1993). Toolglass
and Magic Lenses: The see-through interface. In Computer Graphics (ACM SIGGRAPH
Proceedings), volume 27, pages 73–80.
[Billard and Diday, 2007] Billard, L. and Diday, E. (2007). Symbolic Data Analysis: Conceptual
Statistics and Data Mining. Wiley Series in Computational Statistics. Wiley.
[Blei and Lafferty, 2006] Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In
Proceedings of the International Conference on Machine Learning, pages 113–120.
[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation.
volume 3, pages 993–1022.
[Börner et al., 2003] Börner, K., Chen, C., and Boyack, K. W. (2003). Visualizing knowledge
domains. volume 37, pages 139–255.
[Boulianne, 2009] Boulianne, S. (2009). Does internet use affect engagement? a meta-analysis of
research. In political communication, volume 26, pages 193–211.
[Bourgeois and Guiard, 2002] Bourgeois, F. and Guiard, Y. (2002). Multiscale pointing: facilitating
pan-zoom coordination. In Extended Abstracts of the ACM Conference on Human Factors in
Computing Systems, pages 758–759.
[Boyd-Graber and Blei, 2008] Boyd-Graber, J. and Blei, D. (2008). Syntactic topic models. In
Proceedings of the neural information processing systems conference, NIPS’08.
[Byrne, 2002] Byrne, M. D. (2002). Reading vertical text: Rotated vs marquee. In Proceedings of
the Human Factors and Ergonomics Society 46th Annual Meeting, volume 10, pages 1633–1635.
[Byron and Wattenberg, 2008] Byron, L. and Wattenberg, M. (2008). Stacked graphs — geometry
& aesthetics. volume 14, pages 1245–1252.
[Card and Mackinlay, 1997] Card, S. K. and Mackinlay, J. (1997). The structure of the information
visualization design space. In Proceedings of the IEEE Symposium on Information Visualization,
pages 92–99.
[Card et al., 1999] Card, S. K., Mackinlay, J. D., and Shneiderman, B., editors (1999). Readings in
information visualization: Using vision to think. Morgan Kaufmann Publishers, San Francisco.
[Card and Nation, 2002] Card, S. K. and Nation, D. (2002). Degree-of-interest trees: A component
of an attention-reactive user interface. In Proceedings of the ACM Conference on Advanced
Visual Interfaces.
[Carpendale and Montagnese, 2001] Carpendale, M. S. T. and Montagnese, C. (2001). A framework
for unifying presentation space. In Proceedings of the ACM Symposium on User Interface
Software and Technology, pages 61–70.
[Carpini and Keeter, 1996] Carpini, M. X. D. and Keeter, S. (1996). What americans know about
politics and why it matters. Yale University Press.
[Chau et al., 2002] Chau, M., Xu, J. J., and Chen, H. (2002). Extracting meaningful entities from
police narrative reports. In National Conference on Digital Government Research.
[Chen et al., 2003] Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., and Schroeder, J. (2003).
COPLINK: managing law enforcement data and knowledge. volume 46, pages 28–34.
[Chevalier and Diamond, 2010] Chevalier, F. and Diamond, S. (2010). The use of real data in
fine arts for insight and discovery: Case studies in text analysis. In IEEE VisWeek Discovery
Exhibition.
[Clark, 2008] Clark, J. (2008). Clustered word clouds. http://neoformix.com/2008/
ClusteredWordClouds.html.
[Cleveland and McGill, 1984] Cleveland, W. S. and McGill, R. (1984). Graphical perception:
Theory, experimentation and application to the development of graphical methods. volume 79,
pages 531–554. The American Statistical Association.
[Collins et al., 2009a] Collins, C., Carpendale, M. S. T., and Penn, G. (2009a). DocuBurst:
Visualizing document content using language structure. volume 28, pages 1039–1046.
[Collins et al., 2009b] Collins, C., Viégas, F. B., and Wattenberg, M. (2009b). Parallel tag clouds
to explore faceted text corpora. In Proceedings of the IEEE Symposium on Visual Analytics
Science and Technology, pages 91–98.
[Crosby, 1989] Crosby, A. (1989). America’s forgotten pandemic: The influenza of 1918.
Cambridge UP.
[Cui et al., 2011] Cui, W., Liu, S., Tan, L., Shi, C., Song, Y., Gao, Z., Qu, H., and Tong, X. (2011).
TextFlow: Towards better understanding of evolving topics in text. volume 17, pages 2412–2421.
[Darken and Sibert, 1996] Darken, R. P. and Sibert, J. L. (1996). Wayfinding strategies and
behaviors in large virtual worlds. In Proceedings of the ACM Conference on Human Factors in
Computing Systems, pages 142–149.
[Dewey, 1927] Dewey, J. (1927). The public and its problems. Swallow press, 1 edition.
[DiBattista et al., 1998] DiBattista, G., Eades, P., Tamassia, R., and Tollis, I. G. (1998). Graph
Drawing: Algorithms for the Visualization of Graphs. Prentice Hall PTR.
[Don et al., 2007] Don, A., Zheleva, E., Gregory, M., Tarkan, S., Auvil, L., Clement, T.,
Shneiderman, B., and Plaisant, C. (2007). Discovering interesting usage patterns in text collections:
integrating text mining with visualization. In Proceedings of the ACM Conference on Information
and Knowledge Management, pages 213–222.
[Dou et al., 2011] Dou, W., Wang, X., Chang, R., and Ribarsky, W. (2011). ParallelTopics: a
probabilistic approach to exploring document collections. In Proceedings of the IEEE Conference
on Visual Analytics Science and Technology, pages 231–240.
[Dragicevic, 2004] Dragicevic, P. (2004). Combining crossing-based and paper-based interaction
paradigms for dragging and dropping between overlapping windows. In Proceedings of the ACM
Symposium on User Interface Software and Technology, pages 193–196.
[Dykes et al., 2010] Dykes, J., Wood, J., and Slingsby, A. (2010). Rethinking map legends with
visualization. volume 16, pages 890–899.
[Eades, 1984] Eades, P. (1984). A heuristic for graph drawing. volume 42, pages 149–160.
[Ebert and Rheingans, 2000] Ebert, D. S. and Rheingans, P. (2000). Volume illustration:
Non-photorealistic rendering of volume models. In Proceedings of the IEEE Conference on
Visualization, pages 195–202.
[Eccles et al., 2008] Eccles, R., Kapler, T., Harper, R., and Wright, W. (2008). Stories in GeoTime.
volume 7, pages 3–17.
[Edelsbrunner and Waupotitsch, 1995] Edelsbrunner, H. and Waupotitsch, R. (1995). A combinatorial
approach to cartograms. In Proceedings of the Annual Symposium on Computational
Geometry, pages 98–108.
[Ellis and Dix, 2007] Ellis, G. and Dix, A. J. (2007). A taxonomy of clutter reduction for
information visualisation. volume 13, pages 1216–1223.
[Elmqvist et al., 2008] Elmqvist, N., Henry, N., Riche, Y., and Fekete, J.-D. (2008). Mélange:
Space folding for multi-focus interaction. In Proceedings of ACM Conference on Human Factors
in Computing Systems, pages 1333–1342.
[Elmqvist and Tsigas, 2003] Elmqvist, N. and Tsigas, P. (2003). Causality visualization using
animated growing polygons. In Proceedings of the IEEE Symposium on Information Visualization,
pages 189–196.
[Fekete, 2004] Fekete, J.-D. (2004). The InfoVis Toolkit. In Proceedings of the IEEE Symposium
on Information Visualization, pages 167–174.
[Fekete and Plaisant, 1999] Fekete, J.-D. and Plaisant, C. (1999). Excentric labeling: Dynamic
neighborhood labeling for data visualization. In Proceedings of the ACM Conference on Human
Factors in Computer Systems, pages 512–519.
[Fellbaum, 1998] Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.
[Fishkin and Stone, 1995] Fishkin, K. and Stone, M. C. (1995). Enhanced dynamic queries via
movable filters. In Proceedings of the ACM Conference on Human Factors in Computing Systems,
pages 415–420.
[Fitts, 1954] Fitts, P. M. (1954). The information capacity of the human motor system in controlling
the amplitude of movement. volume 47, pages 381–391.
[Forlines and Balakrishnan, 2009] Forlines, C. and Balakrishnan, R. (2009). Improving visual
search with image segmentation. In Proceedings of the ACM Conference on Human Factors in
Computing Systems, pages 1093–1102.
[Fruchterman and Reingold, 1991] Fruchterman, T. M. J. and Reingold, E. M. (1991). Graph drawing by
force-directed placement. volume 21.
[Furnas, 1986] Furnas, G. W. (1986). Generalized fisheye views. In Proceedings of the ACM
Conference on Human Factors in Computing Systems, pages 16–23.
[Furnas, 1997] Furnas, G. W. (1997). Effective view navigation. In Proceedings of the ACM
Conference on Human Factors in Computing Systems, pages 367–374.
[Furnas, 2006] Furnas, G. W. (2006). A fisheye follow-up: further reflections on focus + context. In
Proceedings of the ACM Conference on Human Factors in Computing Systems, pages 999–1008.
[Furnas and Bederson, 1995] Furnas, G. W. and Bederson, B. B. (1995). Space-scale diagrams:
Understanding multiscale interfaces. In Proceedings of the ACM Conference on Human Factors
in Computing Systems, pages 234–241.
[Gad et al., 2012] Gad, S., Ramakrishnan, N., Hampton, K. N., and Kavanaugh, A. (2012). Bridg-
ing the divide in democratic engagement: Studying conversation patterns in advantaged and
disadvantaged communities. In Proceedings of the IEEE Conference on Social Informatics.
[Gansner et al., 2005] Gansner, E. R., Koren, Y., and North, S. C. (2005). Topological fisheye
views for visualizing large graphs. volume 11, pages 457–468.
[Gao et al., 2011] Gao, Z., Song, Y., Liu, S., Wang, H., Wei, H., Chen, Y., and Cui, W. (2011).
Tracking and connecting topics via incremental hierarchical dirichlet processes. In Data Mining
(ICDM), 2011 IEEE 11th International Conference on, pages 1056–1061.
[Ghoniem et al., 2007] Ghoniem, M., Luo, D., Yang, J., and Ribarsky, W. (2007). NewsLab:
Exploratory broadcast news video analysis. In Proceedings of the IEEE Symposium on Visual
Analytics Science & Technology, pages 123–130.
[Gohr et al., 2009] Gohr, A., Hinneburg, A., Schult, R., and Spiliopoulou, M. (2009). Topic
evolution in a stream of documents. In SIAM International Conference on Data Mining, SDM’09,
pages 859–872.
[Grems, 1962] Grems, M. (1962). A survey of languages and systems for information retrieval.
volume 5, pages 43–46.
[Grossman, 2012] Grossman, J. (2012). Big data: an opportunity for historians? In Perspectives
on history issue in American historical association.
[Guo and Ramakrishnan, 2010] Guo, S. and Ramakrishnan, N. (2010). Finding the storyteller:
automatic spoiler tagging using linguistic cues. In Proceedings of the 23rd International
Conference on Computational Linguistics, COLING’10, pages 412–420. ACL.
[Gustafson and Irani, 2007] Gustafson, S. G. and Irani, P. P. (2007). Comparing visualizations for
tracking off-screen moving targets. In Extended Abstracts of the ACM Conference on Human
Factors in Computing Systems, pages 2399–2404.
[Gutwin and Skopik, 2003] Gutwin, C. and Skopik, A. (2003). Fisheyes are good for large steering
tasks. In Proceedings of the ACM Conference on Human Factors in Computing Systems, pages
201–208.
[Halverson, 2006] Halverson, T. (2006). Integrating models of human-computer visual interaction.
In Extended Abstracts of the ACM Conference on Human Factors in Computing Systems, pages
1747–1750.
[Halverson and Hornof, 2004] Halverson, T. and Hornof, A. J. (2004). Link colors guide a search.
In Extended Abstracts of the ACM Conference on Human Factors in Computing Systems, pages
1367–1370.
[Halverson and Hornof, 2007] Halverson, T. and Hornof, A. J. (2007). A minimal model for
predicting visual search in human-computer interaction. In Proceedings of the ACM Conference
on Human Factors in Computing Systems, pages 431–434.
[Halvey and Kean, 2007] Halvey, M. and Keane, M. T. (2007). An assessment of tag presentation
techniques. In Proceedings of the ACM Conference on World Wide Web, pages 1313–1314.
[Hampton, 2007] Hampton, K. N. (2007). Neighborhoods in the network society: the e-neighbors
study. In information, communication and society, volume 10, pages 714–748.
[Hampton, 2010] Hampton, K. N. (2010). Internet use and the concentration of disadvantage:
globalization and the urban underclass. In american behavioral scientist, volume 53, pages
1111–1132. SAGE publications.
[Hampton et al., 2011] Hampton, K. N., Goulet, L. S., Rainie, L., and Purcell, K. S. (2011). Social
networking sites and our lives: how people’s trust, personal relationships, and civic and political
involvement are connected to their use of social networking sites and other technologies. In
Public sociology: research, action, and change. Pew research center.
[Hampton et al., 2009] Hampton, K. N., Sessions, L., Her, E. J., and Rainie, L. (2009). Social
isolation and new technology: how the internet and mobile phones impact Americans’ social
networks. In Pew research center.
[Hampton and Wellman, 2003] Hampton, K. N. and Wellman, B. (2003). Neighboring in netville:
how the internet supports community and social capital in a wired suburb. In city and community,
volume 2, pages 277–311.
[Hargittai and Shaw, 2011] Hargittai, E. and Shaw, A. (2011). The internet, young adults and
political engagement around the 2008 presidential election. Presented at the berkman center for
internet and society at Harvard university.
[Harrower and Brewer, 2003] Harrower, M. A. and Brewer, C. A. (2003). ColorBrewer.org: An
online tool for selecting color schemes for maps. volume 40, pages 27–37.
[Hassan-Montero. and Herrero-Solana, 2006] Hassan-Montero., Y. and Herrero-Solana, V. (2006).
Improving tag-clouds as visual information retrieval interfaces. In Proceedings of the Interna-
tional Conference on Multidisciplinary Information Sciences and Technologies, pages 25–28.
[Hassan-Montero and Herrero-Solana, 2006] Hassan-Montero, Y. and Herrero-Solana, V. (2006).
Improving tag-clouds as visual information retrieval interfaces. In Proceedings of the Interna-
tional Conference on Multidisciplinary Information Sciences and Technologies.
[Havre et al., 2002] Havre, S., Hetzler, E., Whitney, P., and Nowell, L. (2002). ThemeRiver:
Visualizing thematic changes in large document collections. volume 8, pages 9–20.
[Healey, 1996] Healey, C. G. (1996). Choosing effective colours for data visualization. In Proceed-
ings of the IEEE Conference on Visualization, pages 263–270.
[Hearst, 2009] Hearst, M. (2009). Search user interfaces. Cambridge University Press.
[Hearst and Rosner, 2008] Hearst, M. A. and Rosner, D. K. (2008). Tag clouds: Data analysis tool
or social signaller? In Proceedings of the Hawaii International Conference on System Sciences,
pages 160–160.
[Helliwell and Putnam, 2004] Helliwell, J. and Putnam, R. (2004). The social context of well-being.
In philosophical transactions of the royal society of London, volume 359, pages 1435–1446.
Springer.
[Henry and Fekete, 2006] Henry, N. and Fekete, J.-D. (2006). MatrixExplorer: a dual-
representation system to explore social networks. volume 12, pages 677–684.
[Henry and Fekete, 2007] Henry, N. and Fekete, J.-D. (2007). MatLink: Enhanced matrix visu-
alization for analyzing social networks. In Human-Computer Interaction — Proceedings of
INTERACT, volume 4663 of LNCS, pages 288–302.
[Heyman, 2011] Heyman, D. (2011). Axis Maps, personal communication.
[Hinckley et al., 2002] Hinckley, K., Cutrell, E., Bathiche, S., and Muss, T. (2002). Quantitative
analysis of scrolling techniques. In Proceedings of the ACM Conference on Human Factors in
Computing Systems, pages 65–72.
[Hoffman et al., 2010] Hoffman, M., Blei, D. M., and Bach, F. (2010). Online learning for latent
dirichlet allocation. In advances in neural information processing systems 23, pages 856–864.
[Hofmann, 1999a] Hofmann, T. (1999a). Probabilistic latent semantic analysis. In Proceedings of
the Conference on Uncertainty in Artificial Intelligence, pages 289–296.
[Hofmann, 1999b] Hofmann, T. (1999b). Probabilistic latent semantic analysis. In Proceedings
of Uncertainty in Artificial Intelligence, UAI’99, pages 289–296.
[Hong and Davison, 2010] Hong, L. and Davison, B. D. (2010). Empirical study of topic modeling
in twitter. In Proceedings of the First Workshop on Social Media Analytics, SOMA ’10, pages
80–88. ACM.
[Hong et al., 2011] Hong, L., Dom, B., Gurumurthy, S., and Tsioutsiouliklis, K. (2011). A time-
dependent topic model for multiple text streams. In proceedings of the 17th ACM SIGKDD
international conference on knowledge discovery and data mining, pages 832–840.
[Hossain et al., 2010] Hossain, M. S., Tadepalli, S., Watson, L. T., Davidson, I., Helm, R. F., and
Ramakrishnan, N. (2010). Unifying dependent clustering and disparate clustering for non-
homogeneous data. In Proceedings of the ACM Conference on Knowledge Discovery and Data
Mining, pages 593–602.
[Hotho et al., 2005] Hotho, A., Nürnberger, A., and Paass, G. (2005). A brief survey of text mining.
volume 20, pages 19–62.
[Igarashi and Hinckley, 2000] Igarashi, T. and Hinckley, K. (2000). Speed-dependent automatic
zooming for browsing large documents. In Proceedings of the ACM Symposium on User Interface
Software and Technology, pages 139–148.
[Imhof, 1975] Imhof, E. (1975). Positioning names on maps. volume 2, pages 128–144.
[Inselberg, 1985] Inselberg, A. (1985). The plane with parallel coordinates. volume 1, pages
69–91.
[Interrante and Grosch, 1997] Interrante, V. and Grosch, C. (1997). Strategies for effectively
visualizing 3D flow with volume LIC. In Proceedings of the IEEE Conference on Visualization,
pages 421–424.
[Irani et al., 2006] Irani, P., Gutwin, C., and Yang, X. D. (2006). Improving selection of off-screen
targets with hopping. In Proceedings of ACM Conference on Human Factors in Computing
Systems, pages 299–308.
[Isenberg et al., 2008] Isenberg, P., Tang, A., and Carpendale, S. (2008). An exploratory study
of visual information analysis. In Proceeding of the ACM Conference on Human Factors in
Computing Systems, pages 1217–1226.
[Iwata et al., 2010] Iwata, T., Yamada, T., Sakurai, Y., and Ueda, N. (2010). Online multiscale
dynamic topic models. In Proceedings of the ACM Conference on Knowledge Discovery and
Data Mining, pages 663–672.
[Jackson and Moulinier, 2002] Jackson, P. and Moulinier, I. (2002). Natural Language Processing
for Online Applications: Text Retrieval, Extraction & Categorization. John Benjamins.
[Jänicke and Chen, 2010] Jänicke, H. and Chen, M. (2010). A salience-based quality metric for
visualization. volume 29, pages 1183–1192.
[Jones, 1972] Jones, S. K. (1972). A statistical interpretation of term specificity and its application
in retrieval. volume 28, pages 11–21.
[Jul and Furnas, 1998] Jul, S. and Furnas, G. W. (1998). Critical zones in desert fog: Aids to
multiscale navigation. In Proceedings of the ACM Symposium on User Interface Software and
Technology, pages 97–106.
[Kadivar et al., 2009] Kadivar, N., Chen, V., Dunsmuir, D., Lee, E., Qian, C., Dill, J., Shaw, C.,
and Woodbury, R. (2009). Capturing and supporting the analysis process. In Proceedings of the
IEEE Symposium on Visual Analytics Science & Technology, pages 131–138.
[Kageura and Umino, 1996] Kageura, K. and Umino, B. (1996). Methods of automatic term
recognition: a review. volume 3, pages 259–289.
[Kakoulis and Ioannis G. Tollis, 1998] Kakoulis, K. G. and Tollis, I. G. (1998). A unified
approach to labeling graphical features. In Proceedings of the ACM Symposium on Computational
Geometry, pages 347–356.
[Kameda and Imai, 2003] Kameda, T. and Imai, K. (2003). Map label placement for points and
curves. volume E86–A, pages 835–840.
[Kapler and Wright, 2005] Kapler, T. and Wright, W. (2005). GeoTime information visualization.
volume 4, pages 136–146.
[Kavanaugh, 2013] Kavanaugh, A. (2013). The arc of social computing: interaction in web versus
physical communities (to appear). springer.
[Kavanaugh et al., 2000] Kavanaugh, A., Cohill, A., and Patterson, S. (2000). The use and impact
of the blacksburg electronic village, in community networks: lessons from blacksburg, virginia.
Artech house.
[Kavanaugh et al., 2008] Kavanaugh, A., Kim, B., Schmitz, J., and Pérez-Quiñones, M. (2008). Net
gains in political participation: secondary effects of the internet on community. In information,
communication and society, volume 11, pages 933–963.
[Kavanaugh et al., 2007] Kavanaugh, A., Zin, T., Rosson, M., Carroll, J., Schmitz, J., and Kim,
B. (2007). Local groups online: political learning and participation. In computer supported
cooperative work, volume 16, pages 375–395.
[Keim et al., 2006] Keim, D. A., Mansmann, F., Schneidewind, J., and Ziegler, H. (2006). Chal-
lenges in visual data analysis. In Proceedings of the International Conference on Information
Visualization, pages 9–16.
[Keim et al., 2002] Keim, D. A., North, S. C., Panse, C., and Schneidewind, J. (2002). Efficient
cartogram generation: A comparison. In Proceedings of the IEEE Symposium on Information
Visualization, pages 33–36.
[Kennings and Vorwerk, 2006] Kennings, A. A. and Vorwerk, K. (2006). Force-directed methods
for generic placement. volume 25, pages 2076–2087.
[Kim et al., 2011] Kim, K. T., Ko, S., Elmqvist, N., and Ebert, D. S. (2011). WordBridge: using
composite tag clouds in node-link diagrams for visualizing content and relations in text corpora.
In Proceedings of the Hawaiian International Conference on System Sciences.
[Kim et al., 2010] Kim, N. W., Card, S. K., and Heer, J. (2010). Tracing genealogical data with
timenets. In Proceedings of the ACM Conference on Advanced Visual Interfaces, pages 241–248.
[Kirkpatrick et al., 1983] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by
simulated annealing. volume 220, pages 671–680.
[Klinenberg, 2002] Klinenberg, E. (2002). Heat wave. The university of Chicago press.
[Kneebone and Nadeau, 2011] Kneebone, E., Nadeau, C., and Berube, A. (2011). The
re-emergence of concentrated poverty: metropolitan trends in the 2000s. Metropolitan opportunity
series. Brookings.
[Koh et al., 2010] Koh, K., Lee, B., Kim, B. H., and Seo, J. (2010). ManiWordle: Providing flexible
control over Wordle. volume 16, pages 1190–1197.
[Kuo et al., 2007] Kuo, B. Y.-L., Hentrich, T., Good, B. M., and Wilkinson, M. D. (2007). Tag
clouds for summarizing web search results. In Proceedings of the ACM Conference on World
Wide Web, pages 1203–1204.
[Kwon et al., 2012] Kwon, B., Javed, W., Ghani, S., Elmqvist, N., Yi, J. S., and Ebert, D. (2012).
Evaluating the role of time in investigative analysis of document collections. volume 18, pages
1992–2004.
[Lakkaraju and Ahn, 2012] Lakkaraju, H. and Ahn, H. (2012). Tem: a novel perspective to model-
ing content on microblogs. In Proceedings of the 21st international conference companion on
World Wide Web, WWW’12, pages 563–564. ACM.
[Lamping and Rao, 1996] Lamping, J. and Rao, R. (1996). The Hyperbolic Browser: A focus +
context technique for visualizing large hierarchies. volume 7, pages 33–35.
[Lau and Moere, 2007] Lau, A. and Moere, A. V. (2007). Towards a model of information aesthetics
in information visualization. In Proceedings of the International Conference on Information
Visualization, pages 87–92.
[Laurini and Thompson, 1992] Laurini, R. and Thompson, D. (1992). Fundamentals of Spatial
Information Systems. Academic Press, New York.
[Lawler, 1995] Lawler, G. F. (1995). Introduction to stochastic processes. Chapman & Hall/CRC.
[Lee et al., 2006a] Lee, B., Parr, C. S., Plaisant, C., Bederson, B. B., Veksler, V. D., Gray, W. D.,
and Kotfila, C. (2006a). TreePlus: interactive exploration of networks with enhanced tree layouts.
volume 12, pages 1414–1426.
[Lee et al., 2006b] Lee, B., Plaisant, C., Parr, C. S., Fekete, J.-D., and Henry, N. (2006b). Task
taxonomy for graph visualization. In Proceedings of BEyond time and errors: novel evaLuation
methods for Information Visualization, pages 82–86.
[Lee et al., 2010] Lee, B., Riche, N. H., Karlson, A. K., and Carpendale, M. S. T. (2010). Spark-
Clouds: Visualizing trends in tag clouds. volume 16, pages 1182–1189.
[Leskovec et al., 2009] Leskovec, J., Backstrom, L., and Kleinberg, J. (2009). Meme-tracking
and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 497–506. ACM.
[Leung and Apperley, 1994] Leung, Y. K. and Apperley, M. D. (1994). A review and taxonomy of
distortion-oriented presentation techniques. volume 1, pages 126–160.
[Levkowitz and Herman, 1992] Levkowitz, H. and Herman, G. T. (1992). Color scales for image
data. volume 12, pages 72–80.
[Levkowitz et al., 1992] Levkowitz, H., Holub, R. A., Meyer, G. W., and Robertson, P. K. (1992).
Color versus black and white in visualization. volume 12, pages 20–22.
[Lin and He, 2009] Lin, C. and He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In
Proceedings of the 18th ACM conference on Information and knowledge management, CIKM’09,
pages 375–384. ACM.
[Liu, 2010] Liu, B. (2010). Sentiment analysis and opinion mining. Morgan and claypool publish-
ers.
[Liu et al., 2013] Liu, S., Wu, Y., Wei, E., Liu, M., and Liu, Y. (2013). Storyflow: Tracking the
evolution of stories. volume 19, pages 2436–2445.
[Lohmann et al., 2009] Lohmann, S., Ziegler, J., and Tetzlaff, L. (2009). Comparison of tag cloud
layouts: Task-related performance and visual exploration. In Proceedings of INTERACT, volume
5726 of Lecture Notes in Computer Science, pages 392–404. Springer.
[Lu and Zhai, 2008] Lu, Y. and Zhai, C. (2008). Opinion integration through semi-supervised topic
modeling. In Proceedings of the 17th international conference on World Wide Web, WWW’08,
pages 121–130. ACM.
[M. Shahriar Hossain, 2013] Hossain, M. S., Ramakrishnan, N., Davidson, I., and Watson, L. T. (2013).
How to alternatize a clustering algorithm. In Data Mining and Knowledge Discovery, volume 27,
pages 193–224. Springer.
[MacEachren, 1995] MacEachren, A. M. (1995). How Maps Work: Representation, Visualization
and Design. Guilford Press, New York.
[Mackinlay, 1986] Mackinlay, J. (1986). Automating the design of graphical presentations of
relational information. volume 5, pages 110–141.
[Mackinlay et al., 1990] Mackinlay, J. D., Card, S. K., and Robertson, G. G. (1990). Rapid
controlled movement through a virtual 3D workspace. In Computer Graphics (ACM SIGGRAPH
Proceedings), volume 24, pages 171–176.
[Mackinlay et al., 1995] Mackinlay, J. D., Rao, R., and Card, S. K. (1995). An organic user
interface for searching citation links. In Proceedings of the ACM Conference on Human Factors
in Computing Systems, pages 67–73.
[Mackinlay et al., 1991] Mackinlay, J. D., Robertson, G. G., and Card, S. K. (1991). The Perspec-
tive Wall: Detail and context smoothly integrated. In Proceedings of the ACM Conference on
Human Factors in Computing Systems, pages 173–179.
[Maharik et al., 2011] Maharik, R., Bessmeltsev, M., Sheffer, A., Shamir, A., and Carr, N. (2011).
Digital micrography. volume 30, pages 100:1–100:12.
[Manning and Schütze, 1999] Manning, C. and Schütze, H. (1999). Foundations of Statistical
Natural Language Processing. MIT Press, Cambridge, MA.
[McDonnel and Elmqvist, 2009] McDonnel, B. and Elmqvist, N. (2009). Towards utilizing GPUs
in information visualization: A model and implementation of image-space operations. volume 15,
pages 1105–1112.
[Mei et al., 2007] Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C. (2007). Topic sentiment
mixture: modeling facets and opinions in weblogs. In Proceedings of the 16th international
conference on World Wide Web, WWW’07, pages 171–180. ACM.
[Merton, 1968] Merton, R. (1968). The matthew effect in science. In Science, volume 159, pages
56–63.
[Moere and Purchase, 2011] Moere, A. V. and Purchase, H. C. (2011). On the role of design in
information visualization. volume 10, pages 356–371.
[Munzner et al., 2003] Munzner, T., Guimbretière, F., Tasiran, S., Zhang, L., and Zhou, Y. (2003).
TreeJuxtaposer: scalable tree comparison using focus+context with guaranteed visibility. In
Computer Graphics (ACM SIGGRAPH Proceedings), pages 453–462.
[Nekrasovski et al., 2006] Nekrasovski, D., Bodnar, A., McGrenere, J., Guimbretière, F., and
Munzner, T. (2006). An evaluation of pan & zoom and rubber sheet navigation with and without
an overview. In Proceedings of the ACM Conference on Human Factors in Computing Systems,
pages 11–20.
[Noack, 2005] Noack, A. (2005). Energy-based clustering of graphs with nonuniform degrees. In
Proceedings of the 13th International Symposium on Graph Drawing, pages 309–320.
[OpenCalais, 2010] OpenCalais (2010). OpenCalais. http://www.opencalais.com.
accessed March 2010.
[OSMF, 2012] OSMF (2012). OpenStreetMap. http://www.openstreetmap.org/.
accessed March 2012.
[Pan and Mitra, 2011] Pan, C. and Mitra, P. (2011). Event detection with spatial latent dirichlet
allocation. In Proceedings of the 11th annual international ACM/IEEE joint conference on
Digital libraries, JCDL ’11, pages 349–358. ACM.
[Perlin and Fox, 1993] Perlin, K. and Fox, D. (1993). Pad: An alternative approach to the computer
interface. In Computer Graphics (ACM SIGGRAPH ’93 Proceedings), pages 57–64.
[Pietriga and Appert, 2008] Pietriga, E. and Appert, C. (2008). Sigma Lenses: Focus-context
transitions combining space, time and translucence. In Proceedings of the ACM Conference on
Human Factors in Computing Systems, pages 1343–1352.
[Plaisant et al., 1995] Plaisant, C., Carr, D., and Shneiderman, B. (1995). Image browsers: Taxon-
omy and guidelines for developers. volume 12, pages 21–32.
[Plaisant et al., 2002] Plaisant, C., Grosjean, J., and Bederson, B. B. (2002). SpaceTree: Supporting
exploration in large node link tree, design evolution and empirical evaluation. In Proceedings of
the IEEE Symposium on Information Visualization, pages 57–64.
[Plaisant et al., 1996] Plaisant, C., Milash, B., Rose, A., Widoff, S., and Shneiderman, B. (1996).
LifeLines: Visualizing personal histories. In Proceedings of the ACM Conference on Human
Factors in Computing Systems, pages 221–227.
[Proulx et al., 2007] Proulx, P., Chien, L., Harper, R., Schroh, D., Kapler, T., Jonker, D., and
Wright, W. (2007). nSpace and GeoTime: A VAST 2006 case study. volume 27, pages 46–56.
[Putnam, 2000] Putnam, R. (2000). Bowling alone. Simon and Schuster.
[Raghavan, 1997] Raghavan, P. (1997). Information retrieval algorithms: A survey. In Proceedings
of ACM-SIAM Symposium on Discrete Algorithms, pages 11–18.
[Rao and Card, 1994] Rao, R. and Card, S. K. (1994). The Table Lens: Merging graphical and
symbolic representations in an interactive focus+context visualization for tabular information. In
Proceedings of the ACM Conference on Human Factors in Computing Systems, pages 318–322.
[Rheingans, 1992] Rheingans, P. (1992). Color, change and control for quantitative data display.
In Proceedings of the IEEE Conference on Visualization, pages 252–259.
[Rheingans, 2002] Rheingans, P. (2002). Are we there yet? Exploring with dynamic visualization.
volume 22, pages 6–10.
[Rivadeneira et al., 2007] Rivadeneira, A. W., Gruen, D. M., Muller, M. J., and Millen, D. R.
(2007). Getting our head in the clouds: toward evaluation studies of tagclouds. In Proceedings
of the ACM Conference on Human Factors in Computing Systems, pages 995–998.
[Robertson and Mackinlay, 1993a] Robertson, G. G. and Mackinlay, J. D. (1993a). The document
lens. In Proceedings of the ACM Symposium on User Interface Software and Technology, pages
101–108.
[Robertson and Mackinlay, 1993b] Robertson, G. G. and Mackinlay, J. D. (1993b). The Document
Lens. In Proceedings of the ACM Symposium on User Interface Software and Technology, pages
101–108.
[Robertson and Jones, 1976] Robertson, S. E. and Jones, K. S. (1976). Relevance weighting of
search terms. volume 27, pages 129–146.
[Robinson et al., 1995] Robinson, A. H., Morrison, J. L., Muehrcke, P. C., Kimerling, A. J., and
Guptill, S. C. (1995). Elements of Cartography. John Wiley & Sons.
[Roscover, 2010] Roscover, D. (2010). Burdened. In Time Magazine.
[Roscover, 2013] Roscover, D. (2013). Steven Paul Jobs.
http://www.gorosco.com/#438803/Steven-Paul-Jobs.
[Rosen-Zvi et al., 2004] Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). The
author-topic model for authors and documents. In Proceedings of the 20th conference on
uncertainty in artificial intelligence, pages 487–494. UAI ’04.
[Saito et al., 2005] Saito, T., Miyamura, H. N., Yamamoto, M., Saito, H., Hoshiya, Y., and Kaseda,
T. (2005). Two-tone pseudo coloring: Compact visualization for one-dimensional data. In
Proceedings of the IEEE Symposium of Information Visualization, pages 173–180.
[Saito and Takahashi, 1990] Saito, T. and Takahashi, T. (1990). Comprehensible rendering of 3-D
shapes. volume 24, pages 197–206.
[Salton and Buckley, 1988] Salton, G. and Buckley, C. (1988). Term-weighting approaches in
automatic text retrieval. volume 24, pages 513–523.
[Salton and McGill, 1983] Salton, G. and McGill, M. J. (1983). Introduction to Modern Informa-
tion Retrieval. McGraw Hill.
[Sampson, 2011] Sampson, R. J. (2011). Great american city: chicago and the enduring neighbor-
hood effect. The university of chicago press.
[Sarkar et al., 1993] Sarkar, M., Snibbe, S. S., Tversky, O. J., and Reiss, S. P. (1993). Stretching
the rubber sheet: A metaphor for visualizing large layouts on small screens. In Proceedings of
the ACM Symposium on User Interface Software and Technology, pages 81–91.
[Schulz, 2011] Schulz, K. (2011). The mechanical muse: what is distant reading. In New York
Times.
[Shah et al., 2005] Shah, D. V., Cho, J., Eveland, W. P., Jr., and Kwak, N. (2005). Information
and expression in a digital age. In communication research, volume 32, pages 531–565.
[Shen and Ma, 2007] Shen, Z. and Ma, K.-L. (2007). Path visualization for adjacency matrices. In
Proceedings of the Eurographics/IEEE-VGTC Symposium on Visualization, pages 83–90.
[Shi et al., 2010] Shi, L., Wei, F., Liu, S., Tan, L., Lian, X., and Zhou, M. (2010). Understanding
text corpora with multiple facets. In Proceedings of the IEEE Conference on Visual Analytics
Science and Technology, pages 99–106.
[Shneiderman, 1983] Shneiderman, B. (1983). Direct manipulation: A step beyond programming
languages. volume 16, pages 57–69.
[Shneiderman, 1992] Shneiderman, B. (1992). Tree visualization with tree-maps: A 2-D space-
filling approach. volume 11, pages 92–99.
[Shneiderman, 1994] Shneiderman, B. (1994). Dynamic queries for visual information seeking.
volume 11, pages 70–77.
[Shneiderman, 1996] Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for
information visualizations. In Proceedings of the IEEE Symposium on Visual Languages, pages
336–343.
[Shoemaker and Gutwin, 2007] Shoemaker, G. and Gutwin, C. (2007). Supporting multi-point
interaction in visual workspaces. In Proceedings of the ACM Conference on Human Factors in
Computing Systems, pages 999–1008.
[Sinclair and Cardew-Hall, 2008] Sinclair, J. and Cardew-Hall, M. (2008). The folksonomy tag
cloud: when is it useful? volume 34, pages 15–29.
[Skupin, 2000] Skupin, A. (2000). From metaphor to method: Cartographic perspectives on
information visualization. In Proceedings of the IEEE Symposium on Information Visualization,
pages 91–98.
[Slack et al., 2006] Slack, J., Hildebrand, K., and Munzner, T. (2006). PRISAD: A partitioned
rendering infrastructure for scalable accordion drawing (extended version). volume 5, pages
137–151.
[Slocum et al., 2004] Slocum, T. A., McMaster, R. B., Kessler, F. C., and Howard, H. H. (2004).
Thematic Cartography and Geographic Visualization. Pearson Education, 2nd edition.
[Slocum et al., 2009] Slocum, T. A., McMaster, R. B., Kessler, F. C., and Howard, H. H. (2009).
Thematic Cartography and Geovisualization. Prentice Hall, third edition.
[Smith and Mark, 1998] Smith, B. and Mark, D. M. (1998). Ontology and geographic kinds. In
Proceedings of the International Symposium on Spatial Data Handling, pages 308–320.
[Smith and Taivalsaari, 1999] Smith, R. B. and Taivalsaari, A. (1999). Generalized and stationary
scrolling. In Proceedings of the ACM Symposium on User Interface Software and Technology,
pages 1–9.
[Stasko et al., 2008a] Stasko, J. T., Görg, C., and Liu, Z. (2008a). Jigsaw: supporting investigative
analysis through interactive visualization. volume 7, pages 118–132.
[Stasko et al., 2008b] Stasko, J. T., Görg, C., and Liu, Z. (2008b). Jigsaw: supporting investigative
analysis through interactive visualization. In Proceedings of the IEEE Symposium on Visual
Analytics Science and Technology, pages 187–194.
[Stone et al., 1994] Stone, M. C., Fishkin, K., and Bier, E. A. (1994). The movable filter as a user
interface tool. In Proceedings of the ACM Conference on Human Factors in Computing Systems,
pages 306–312.
[Strobelt et al., 2009] Strobelt, H., Oelke, D., Rohrdantz, C., Stoffel, A., Keim, D. A., and Deussen,
O. (2009). Document cards: A top trumps visualization for documents. volume 15, pages
1145–1152.
[Svakhine and Ebert, 2003] Svakhine, N. A. and Ebert, D. S. (2003). Interactive volume illustra-
tion and feature halos. In Proceedings of the Pacific Conference on Computer Graphics and
Applications, pages 347–354.
[Tadepalli et al., 2009] Tadepalli, S., Ramakrishnan, N., Watson, L. T., Mishra, B., and Helm, R. F.
(2009). Simultaneously segmenting multiple gene expression time courses by analyzing cluster
dynamics. volume 7, pages 339–356.
[Tanahashi and Ma, 2012] Tanahashi, Y. and Ma, K.-L. (2012). Design considerations for optimiz-
ing storyline visualizations. volume 18, pages 2679–2688.
[Teh et al., 2004] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2004). Hierarchical
dirichlet processes. In Journal of the american statistical association, volume 101.
[Teh et al., 2006] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical
dirichlet processes. In Journal of the american statistical association, volume 101, pages
1566–1581.
[Tobler, 1970] Tobler, W. (1970). A computer model simulating urban growth in the Detroit region.
volume 46, pages 234–240.
[Tobler, 1976] Tobler, W. (1976). Cartograms and cartosplines. In Proceedings of the Workshop on
Automated Cartography and Epidemiology, pages 53–58.
[Tory and Möller, 2005] Tory, M. and Möller, T. (2005). Evaluating visualizations: Do expert
reviews work? volume 25, pages 8–11.
[Tufte, 1983] Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press,
Cheshire, Connecticut.
[Tufte, 1990] Tufte, E. R. (1990). Envisioning Information. Graphics Press, Cheshire, Connecticut.
[Tufte, 1997] Tufte, E. R. (1997). Visual Explanations: images and quantities, evidence and
narrative. Graphics Press, Cheshire, Connecticut.
[Turkay et al., 2011] Turkay, C., Parulek, J., Reuter, N., and Hauser, H. (2011). Interactive visual
analysis of temporal cluster structures. volume 30, pages 711–720.
[Uslaner and Brown, 2005] Uslaner, E. M. and Brown, M. (2005). Inequality, trust, and civic
engagement. pages 868–894.
[van Ham, 2003] van Ham, F. (2003). Using multilevel call matrices in large software projects. In
Proceedings of the IEEE Symposium on Information Visualization, pages 227–232.
[van Ham et al., 2009] van Ham, F., Wattenberg, M., and Viégas, F. B. (2009). Mapping text with
phrase nets. volume 15, pages 1169–1176.
[van Wijk and Nuij, 2003] van Wijk, J. J. and Nuij, W. A. A. (2003). Smooth and efficient zooming
and panning. In Proceedings of the IEEE Symposium on Information Visualization, pages 15–22.
[Venolia and Neustaedter, 2003] Venolia, G. D. and Neustaedter, C. (2003). Understanding se-
quence and reply relationships within email conversations: a mixed-model visualization. In
Proceedings of the ACM Conference on Human Factors in Computing Systems, pages 361–368.
[Viégas et al., 2006] Viégas, F. B., Golder, S., and Donath, J. (2006). Visualizing email content:
portraying relationships from conversational histories. In Proceedings of the ACM Conference
on Human Factors in Computing Systems, pages 979–988.
[Viégas and Wattenberg, 2008] Viégas, F. B. and Wattenberg, M. (2008). Tag clouds and the case
for vernacular visualization. volume 15, pages 49–52.
[Viégas et al., 2009] Viégas, F. B., Wattenberg, M., and Feinberg, J. (2009). Participatory
visualization with Wordle. volume 15, pages 1137–1144.
[Vinson, 1999] Vinson, N. G. (1999). Design guidelines for landmarks to support navigation in
virtual environments. In Proceedings of the ACM Conference on Human Factors in Computing
Systems, pages 278–285.
[Vuillemot et al., 2009] Vuillemot, R., Clement, T., Plaisant, C., and Kumar, A. (2009). What’s
being said near ’Martha’? Exploring name entities in literary text collections. In Proceedings of
the IEEE Symposium on Visual Analytics Science and Technology, pages 107–114.
[W3C, 2012] W3C (2012). Scalable Vector Graphics (SVG). http://www.w3.org/SVG/.
accessed March 2012.
[Wallach, 2008] Wallach, H. M. (2008). Structured Topic Models for Language. Doctoral
dissertation, University of Cambridge.
[Walsh, 1992] Walsh, K. C. (1992). Talking about politics: informal groups and social identity in
American life. The University of Chicago Press.
[Wang et al., 2008] Wang, C., Blei, D., and Heckerman, D. (2008). Continuous time dynamic
topic models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages
579–586.
[Wang et al., 2013] Wang, X., Liu, S., Song, Y., and Guo, B. (2013). Mining evolutionary
multi-branch trees from text streams. In Proceedings of the 19th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 722–730.
[Wang and McCallum, 2006] Wang, X. and McCallum, A. (2006). Topics over time: a non-Markov
continuous-time model of topical trends. In Proceedings of the ACM Conference on Knowledge
Discovery and Data Mining, pages 424–433.
[Wang et al., 2007] Wang, X., Zhai, C., Hu, X., and Sproat, R. (2007). Mining correlated bursty
topic patterns from coordinated text streams. In Proceedings of the 13th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 784–793.
ACM.
[Wang et al., 2009] Wang, X., Zhang, K., Jin, X., and Shen, D. (2009). Mining common topics
from multiple asynchronous text streams. In Proceedings of the Second ACM International
Conference on Web Search and Data Mining, WSDM ’09, pages 192–201. ACM.
[Wang et al., 2012] Wang, Y., Agichtein, E., and Benzi, M. (2012). TM-LDA: efficient online
modeling of latent topic transitions in social media. In Proceedings of the 18th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 123–131.
ACM.
[Ware, 2004] Ware, C. (2004). Information Visualization: Perception for Design. Morgan
Kaufmann Publishers, second edition.
[Wattenberg, 2002] Wattenberg, M. (2002). Arc diagrams: Visualizing structure in strings. In
Proceedings of the IEEE Symposium on Information Visualization, pages 110–116.
[Wattenberg, 2006] Wattenberg, M. (2006). Visual exploration of multivariate graphs. In
Proceedings of the ACM 2006 Conference on Human Factors in Computing Systems, pages 811–819.
[Wattenberg and Viégas, 2008] Wattenberg, M. and Viégas, F. B. (2008). The word tree, an
interactive visual concordance. volume 14, pages 1221–1228.
[Wei et al., 2010] Wei, F., Liu, S., Song, Y., Pan, S., Zhou, M. X., Qian, W., Shi, L., Tan, L., and
Zhang, Q. (2010). TIARA: a visual exploratory text analytic system. In Proceedings of the ACM
Conference on Knowledge Discovery and Data Mining, pages 153–162.
[Wise et al., 1995] Wise, J. A., Thomas, J. J., Pennock, K., Lantrip, D., Pottier, M., Schur, A., and
Crow, V. (1995). Visualizing the non-visual: Spatial analysis and interaction with information
from text documents. In Proceedings of the IEEE Symposium on Information Visualization,
pages 51–58.
[Wolfe, 1998] Wolfe, J. M. (1998). Visual search. In Pashler, H., editor, Attention. Psychology
Press.
[Wong et al., 2003] Wong, N., Carpendale, M. S. T., and Greenberg, S. (2003). EdgeLens: An
interactive method for managing edge congestion in graphs. In Proceedings of the IEEE
Symposium on Information Visualization, pages 51–58.
[Wong et al., 2005] Wong, P. C., Mackey, P., Perrine, K., Eagan, J., Foote, H., and Thomas, J.
(2005). Dynamic visualization of graphs with extended labels. In Proceedings of the IEEE
Symposium on Information Visualization, pages 73–80.
[Wood and Dykes, 2008] Wood, J. and Dykes, J. (2008). Spatially ordered treemaps. volume 14,
pages 1348–1355.
[Wright et al., 2006] Wright, W., Schroh, D., Proulx, P., Skaburskis, A., and Cort, B. (2006). The
sandbox for analysis: concepts and evaluation. In Proceedings of the ACM Conference on Human
Factors in Computing Systems, pages 801–810.
[Wyszecki and Stiles, 2000] Wyszecki, G. and Stiles, W. S. (2000). Color Science: Concepts and
Methods, Quantitative Data and Formulae. Wiley, 2nd edition.
[Xu and Chen, 2005] Xu, J. and Chen, H. (2005). Criminal network analysis and visualization.
volume 48, pages 100–107.
[Yahoo, 2012] Yahoo (2012). TagMaps. http://tagmaps.research.yahoo.com/.
accessed March 2012.
[Yi et al., 2007] Yi, J. S., ah Kang, Y., Stasko, J. T., and Jacko, J. A. (2007). Toward a deeper
understanding of the role of interaction in information visualization. volume 13.
[Yu et al., 2010] Yu, D., Park, H., Gerold, D., and Legge, G. E. (2010). Comparing reading speed
for horizontal and vertical English text. volume 10, pages 1–17.
[Zellweger et al., 2003] Zellweger, P. T., Mackinlay, J. D., Good, L., Stefik, M., and Baudisch, P.
(2003). City lights: contextual views in minimal space. In Extended Abstracts of the ACM
Conference on Human Factors in Computing Systems, pages 838–839.
[Zhang et al., 2010] Zhang, J., Song, Y., Zhang, C., and Liu, S. (2010). Evolutionary hierarchical
Dirichlet processes for multiple correlated time-varying corpora. In Proceedings of the 16th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1079–1088.