Web Archives and Data Challenges - Archives Unleashed

Transcript of Web Archives and Data Challenges - Archives Unleashed

Put Hacks to Work: Archives in Research

Credit: Flickr @ilovecology

Can we use what we make?

Emperor penguins huddling together for survival... Population... Interacting in a large ecosystem with other animals.

Who is the audience?

What matters?

WhiteHouse.gov press release from May 1, 2003, archived on May 6, 2003

WhiteHouse.gov press release from May 1, 2003, archived on October 1, 2003

Filtering to what matters

Source | Destination | Date | Frequency | Content Type | Bytes | Content

Link Data:
http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football
Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag
http://gawker.com
2012-10-22
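The seven fields named on this slide can be read as one record per hyperlink. The following is a minimal sketch of that idea, not the project's actual pipeline: the class name `LinkRecord`, the helper names, the pipe-delimited input format, the meaning assigned to each field, and the sample rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass
class LinkRecord:
    """One hyperlink observation (field meanings are assumed)."""
    source: str        # URL of the page containing the link
    destination: str   # URL the link points to
    record_date: date  # date attached to the record
    frequency: int     # how often the link was observed
    content_type: str  # MIME type, e.g. "text/html"
    size_bytes: int
    content: str       # free text, e.g. a headline


def parse_link_record(line: str) -> LinkRecord:
    # Split on the first six separators only, so the free-text
    # Content field may itself contain "|".
    parts = [p.strip() for p in line.split("|", 6)]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    source, destination, day, freq, ctype, size, content = parts
    return LinkRecord(source, destination, date.fromisoformat(day),
                      int(freq), ctype, int(size), content)


def links_into(records, hosts):
    """Keep only records whose destination host is in `hosts`."""
    return [r for r in records if urlparse(r.destination).netloc in hosts]


# Made-up sample rows using placeholder domains.
sample_lines = [
    "http://example.org/a | http://example.com/story | 2012-10-22 | 3 | text/html | 2048 | Example headline",
    "http://example.org/b | http://example.net/other | 2012-10-22 | 1 | text/html | 512 | Other | text",
]
records = [parse_link_record(line) for line in sample_lines]
kept = links_into(records, {"example.com"})
```

Filtering by destination host is only one possible reading of "what matters"; the same list comprehension could filter on date range or content type instead.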

July 14, 2006

February 25, 2011

News Media on the Web (Weber, Ognyanova, Kosterich & Nguyen, 2015)

Correlations between outgoing link vectors to show profile similarities
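As an illustration of what correlating outgoing link vectors can look like, here is a minimal sketch. It assumes each site is represented by counts of links to a shared list of target domains and uses the Pearson correlation; the slide does not say which correlation measure was used, and the site names and counts below are invented.

```python
import math


def link_vector(outlinks, targets):
    """Turn a {target: count} mapping into a fixed-order vector."""
    return [outlinks.get(t, 0) for t in targets]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical outgoing-link counts for three news sites.
targets = ["t1.example", "t2.example", "t3.example"]
site_a = link_vector({"t1.example": 10, "t2.example": 5}, targets)
site_b = link_vector({"t1.example": 20, "t2.example": 10}, targets)
site_c = link_vector({"t2.example": 5, "t3.example": 10}, targets)

sim_ab = pearson(site_a, site_b)  # same linking profile, different volume
sim_ac = pearson(site_a, site_c)  # opposite linking profile
```

Because the correlation ignores overall volume, two sites that link to the same targets in the same proportions score as similar even when one publishes far more links than the other.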

NJ Local News: 2007 - 2012

Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003-2012 | 1,694,236 | 663,740
Superstorm Sandy | | 2003-2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th-112th Congresses | 26,965,770 | 8,674,397
US House | | | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010-2012 | 247,928,272 | 11,3259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008-2012 | 1,315,132,555 | 539,184,823

20th Century Collection = 9TB of metadata

Media Seed List = 4,891

For instance, researchers have proposed focusing archival efforts on capturing data that changes the most frequently, in order to capture the majority of new content [36]. Elsewhere, researchers have suggested that crawling strategies should prioritize archival efforts based on the size and relative position of websites within their larger ecosystems [37].

What about reliability?

For instance, Driscoll and Walker (2014) compared Twitter data collected via a public API with data collected from a firehose provided by GNIP PowerTrack and found significant differences between the two datasets. In most cases the PowerTrack data proved to be more powerful,

Validity?

[Figure: count of URLs over time t; series: Potential, Actual, Difference]

[Figure: count of URLs over time t for OWS, House, Senate and Katrina; series: existing, potential]

b = set a unit of time for analysis, choosing n periods across a total time T

3-month windows of time
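To make the windowing step concrete, the sketch below buckets captures into consecutive 3-month windows and counts the distinct URLs seen in each. It is an illustration under stated assumptions: captures are taken to be (date, URL) pairs, windows are aligned to calendar months starting from a chosen start date, and the function names and sample captures are hypothetical.

```python
from datetime import date


def window_index(day, start, months_per_window=3):
    """Index of the window containing `day`, counted from `start`."""
    months = (day.year - start.year) * 12 + (day.month - start.month)
    if months < 0:
        raise ValueError("capture date is before the start of the analysis")
    return months // months_per_window


def count_urls_per_window(captures, start, months_per_window=3):
    """Map window index -> number of distinct URLs captured in it."""
    urls_by_window = {}
    for day, url in captures:
        index = window_index(day, start, months_per_window)
        urls_by_window.setdefault(index, set()).add(url)
    return {w: len(urls) for w, urls in sorted(urls_by_window.items())}


# Made-up captures; the same URL captured twice in a window counts once.
captures = [
    (date(2010, 1, 15), "http://example.org/a"),
    (date(2010, 2, 1), "http://example.org/a"),
    (date(2010, 3, 20), "http://example.org/b"),
    (date(2010, 4, 2), "http://example.org/c"),
    (date(2010, 9, 30), "http://example.org/a"),
]
counts = count_urls_per_window(captures, start=date(2010, 1, 1))
```

Windows with no captures are simply absent from the result, which matters when the per-window series is later compared against a potential count.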

In the ideal case, it would be possible to create a factor that corrects for data degradation: $e^{bt}$

How does this help?

Each of the illustrated cases fits against an exponential function $e^{bt}$, with b approximately:

Senate: 0.13
House: 0.13
Katrina: 0.02
OWS: 0.10
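One common way to estimate a rate b for an exponential of this form is a least-squares fit on the logarithm of the series. The sketch below shows that technique only; the slides do not state how the reported values were fitted or exactly which quantity was fitted. Here it is assumed, purely for illustration, that the series is a positive ratio growing as a * e^(bt), and the data are synthetic.

```python
import math


def fit_exponential(times, values):
    """Fit values ~ a * exp(b * t) by least squares on ln(values).

    Returns (a, b). All values must be positive.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (t, value) pairs")
    logs = [math.log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_log = sum(logs) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_ty = sum((t - mean_t) * (y - mean_log) for t, y in zip(times, logs))
    b = s_ty / s_tt
    a = math.exp(mean_log - b * mean_t)
    return a, b


def correction_factor(b, t):
    """The factor e^(bt) for period t, given a fitted rate b."""
    return math.exp(b * t)


# Synthetic series generated with a known rate, to check the fit.
periods = list(range(8))
series = [math.exp(0.13 * t) for t in periods]
a_hat, b_hat = fit_exponential(periods, series)
```

On real counts the fit will not be exact, so it is worth inspecting the residuals before applying `correction_factor` to any period.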

Challenges are not unique to these data

Courtesy of Marc Smith, NodeXL

Research support from:

NSF Award #1244727; Additional support from the NetSCI Lab @ Rutgers