Web Archives and Data Challenges - Archives Unleashed
Transcript of Web Archives and Data Challenges - Archives Unleashed
IBM Survey Status Update
Put Hacks to Work: Archives in Research
Credit: Flickr @ilovecology
Can we use what we make?
Emperor penguins huddling together for survival... Population... Interacting in a large ecosystem with other animals.
Who is the audience?
What matters?
WhiteHouse.gov press release from May 1, 2003, archived on May 6, 2003
WhiteHouse.gov press release from May 1, 2003, archived on October 1, 2003
Filtering to what matters
Source | Destination | Date | Frequency | Content Type | Bytes | Content
Link Data:
http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football
Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag
http://gawker.com
2012-10-22
July 14, 2006
July 14, 2006
February 25, 2011
News Media on the Web (Weber, Ognyanova, Kosterich & Nguyen, 2015)
Correlations between outgoing link vectors to show profile similarities
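One way to read "correlations between outgoing link vectors" is as a Pearson correlation between the per-target outlink counts of two sites. The sketch below is only an illustration under that assumption; the site names, target order, and counts are hypothetical and do not come from the slides or the cited study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant vector has no defined correlation; treat as 0
    return cov / (spread_x * spread_y)


# Hypothetical data: each vector counts a site's links to the same
# ordered list of four target domains.
outlinks = {
    "site_one": [10, 0, 5, 1],
    "site_two": [20, 0, 10, 2],
    "site_three": [0, 8, 1, 7],
}

# Pairwise similarity of linking profiles.
names = sorted(outlinks)
similarity = {
    (a, b): pearson(outlinks[a], outlinks[b])
    for a in names
    for b in names
    if a < b
}
```

Sites whose links are spread across targets in the same proportions (here `site_one` and `site_two`) score close to 1, while sites that favor different targets score low or negative.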
NJ Local News: 2007 - 2012
Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003-2012 | 1,694,236 | 663,740
Superstorm Sandy | (as above) | 2003-2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th-112th Congresses | 26,965,770 | 8,674,397
US House | (as above) | (as above) | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010-2012 | 247,928,272 | 11,3259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008-2012 | 1,315,132,555 | 539,184,823
20th Century Collection = 9TB of metadata
Media Seed List = 4,891
For instance, researchers have proposed focusing archival efforts on capturing data that changes the most frequently, in order to capture the majority of new content [36]. Elsewhere, researchers have suggested that crawling strategies should prioritize archival efforts based on the size and relative position of websites within their larger ecosystems [37].
What about reliability?
For instance, Driscoll and Walker (2014) compared Twitter data collected via a public API with data collected from a firehose provided by GNIP PowerTrack and found significant differences between the two datasets. In most cases the PowerTrack data proved to be more powerful,
Validity?
[Figure: count of URLs over time t; potential vs. actual, and the difference between them]
[Figure: count of URLs over time t for OWS, House, Senate, and Katrina; existing vs. potential]
b = set a unit of time for analysis, choosing n periods across a total time T
3-month windows of time
In the ideal case, it would be possible to create a factor that corrects for data degradation: $e^{bt}$
How does this help?
Each of the illustrated cases fits against an exponential function $e^{bt}$, with b approximately:
Senate: 0.13
House: 0.13
Katrina: 0.02
OWS: 0.10
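The slides do not say how the exponential fits were obtained. As a minimal sketch, assuming counts per time window follow y = a * e^(bt), the rate b can be estimated with an ordinary least-squares line through log(y). The data below are synthetic, generated with b = 0.13 purely to echo the Senate and House values; the function name and the starting count of 1000 are hypothetical.

```python
from math import exp, log


def fit_exponential(ts, ys):
    """Fit y = a * exp(b * t) by least squares on log(y); returns (a, b).

    Assumes every y is strictly positive.
    """
    logs = [log(y) for y in ys]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_log = sum(logs) / n
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    s_tl = sum((t - mean_t) * (l - mean_log) for t, l in zip(ts, logs))
    b = s_tl / s_tt                    # slope of the log-linear fit
    a = exp(mean_log - b * mean_t)     # intercept, mapped back from log space
    return a, b


# Synthetic illustration: eight consecutive time windows, noise-free.
periods = list(range(8))
counts = [1000 * exp(0.13 * t) for t in periods]

a, b = fit_exponential(periods, counts)
```

On real capture counts the fit would be noisy, and a log-linear fit weights small counts more heavily than a direct nonlinear fit would, so the estimate of b should be treated as approximate.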
Challenges are not unique to these data
Courtesy of Marc Smith, NodeXL
Research support from:
NSF Award #1244727; Additional support from the NetSCI Lab @ Rutgers