Advancing Library Cyberinfrastructure for Big Data Sharing ... · Developing Library...

Advancing Library Cyberinfrastructure for Big Data Sharing and Reuse 2017 NFAIS Annual Conference, Feb 27, 2017 Zhiwu Xie


Page 1: Advancing Library Cyberinfrastructure for Big Data Sharing ... · Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse • A 2-year IMLS National Leadership

Advancing Library Cyberinfrastructure

for Big Data Sharing and Reuse

2017 NFAIS Annual Conference, Feb 27, 2017

Zhiwu Xie

Page 2:

Big Data: How Big?

• Moving yardstick

• No longer unique to “big” science

• 1000 Genomes project: 200TB in 4 years

• Sloan Digital Sky Survey Phase I and II: 130TB in 8 years

• Today, a small lab can produce as much data in a shorter period of time
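The two totals above can be turned into average yearly rates for easier comparison. The sketch below only divides the slide's figures by their durations; the function and dictionary names are made up for illustration, and the averages say nothing about how the data actually accumulated over time.

```python
# Average yearly data rates implied by the totals quoted on this slide.
# Names here are illustrative only; the figures come from the slide.
PROJECTS = {
    "1000 Genomes": (200, 4),                      # (total TB, years)
    "Sloan Digital Sky Survey I and II": (130, 8),
}


def tb_per_year(total_tb, years):
    """Return the average data volume per year in TB."""
    return total_tb / years


def average_rates(projects=PROJECTS):
    """Map each project name to its average TB per year."""
    return {name: tb_per_year(tb, yrs) for name, (tb, yrs) in projects.items()}


if __name__ == "__main__":
    for name, rate in average_rates().items():
        print(f"{name}: {rate:.2f} TB/year")
```

On these figures, 1000 Genomes averages 50 TB per year and the Sloan survey about 16 TB per year.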

Page 3:

Library Big Data: Examples

• Library of Congress Twitter Archive

• Digital Preservation Network (DPN)

• HathiTrust Research Center (HTRC)

• Digital Public Library of America (DPLA)

• SHARE

Page 4:

Towards Use And Reuse Driven Big Data Management

Zhiwu Xie1, Yinlin Chen1, Julie Speer1, Tyler Walters1, Pablo A Tarazaga2, and Mary Kasarda2

1University Libraries and 2Department of Mechanical Engineering

Virginia Polytechnic Institute and State University

Blacksburg, USA

June 23, 2015, JCDL 2015, Knoxville, TN

Page 5:

“…running water is never stale and a door hinge never gets worm-eaten…”

-- Lü's Annals, c. 239 BCE

Page 6:

Research Data Management

• What are the roles of the academic and research library?

Page 7:

Research Data Management

• What are the roles of the academic and research library?

• How can we help?

U.S. National Archives’ Local Identifier: 102-LH-1494

Chris 73 / Wikimedia Commons

Page 8:

Big Data: Institutional Context

Page 9:

Data Projects @ VT Libraries

• Inter- and cross-disciplinary

• Grow out of our capacity, beyond IR building

• Focus on reuse

• Require deep engagements

Page 10:

Goodwin Hall Living Lab

• A new 160,000-sf building wired with >240 different sensors

• Sensor mounts were welded directly to the structural steel during the building construction

• Sensors are strategically positioned and sufficiently sensitive to detect human movements

• Will be the most instrumented building for vibration

Page 11:

Goodwin Hall Living Lab

• Designed as a multi-purpose living laboratory

• Opportunities for multi- and cross-disciplinary exploration and discovery

• > 40 researchers and educators in various disciplines and institutes expressed interest in using the data

• VT Libraries is tasked with building the digital libraries to manage the data and support these activities

• Data volume: > 30TB per year

Page 12:

VT Event Digital Library & Archive

• Track and analyze live events such as earthquakes, political events, community activities, violence, and crime prevention

• Potentially used by researchers from many diverse disciplines

• Currently runs on the lab’s own 20-node Hadoop cluster

• 1 billion tweets & 11TB of webpages

• Through an MOU, the library invested in the data storage and became a partner

Page 13:

SHARE Notify

• Free, open data set about research and scholarly activities gathered from various sources

• Linking publications to grants, receiving real-time event notifications on mobile devices, etc.

• 149 aggregated sources, ~20 million events as of Feb 2017

Page 14:

Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse

• A 2-year IMLS National Leadership for Libraries grant, starting from June 2016

• Incentivized by the above 3 projects

• A collaboration between VT Libraries, Mechanical Engineering, Computer Science, and UNT.

• Emphasis is on

• Leveraging shared infrastructure

• Widely applicable strategy

• Equip libraries with solid knowledge and techniques to balance their desires, needs, and constraints with a clear understanding of the tradeoffs

Page 15:

Key Research Questions

• What are the key technical challenges?

• What are the monetary and non-monetary (time, skill set, administrative, etc.) costs? Are there any cost patterns or correlations to the CI options?

• What are the knowledge and skill requirements for librarians?

• What are the key service and performance characteristics?

• How to consolidate the answers to the above questions to form an easy-to-adapt and effective library CI strategy?

Page 16:

Cyberinfrastructure Options

• Institutional high-performance computing (HPC), high-throughput computing (HTC) and storage facilities

• National HPC, HTC, and storage facilities, e.g., XSEDE resources

• National research clouds, e.g., Chameleon Cloud, CloudLab, Open Science Data Cloud

• Commercial clouds, e.g., Amazon Web Services (AWS), Rackspace

• No unified CI framework or strategy to pick CI for different library big data sharing and reuse situations

Page 17:

Library Big Data Reuse Patterns

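[Figure: diagram of Compute and Storage arranged in three patterns labeled Bridge, Network, and Hub, with the projects Goodwin Hall, Event DL, and SHARE Notify listed beneath them in that order]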

Page 18:

Progress So Far

• Identified the network bandwidth as a key bottleneck in the bridge pattern

• Analyzing data loading, its acceleration techniques, and tradeoffs in the network pattern

• Participated in building VT’s mass storage facility

• Participated in building VT’s 10G campus network
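To give a feel for why bandwidth matters when storage and compute sit on opposite ends of a link, here is a back-of-envelope transfer-time estimate. It is a sketch, not a measurement from the project: it pairs the > 30TB per year Goodwin Hall volume with a 10 Gb/s link purely for illustration, assumes decimal units (1 TB = 10^12 bytes), and uses a hypothetical `efficiency` factor to stand in for protocol overhead and contention.

```python
def transfer_hours(size_tb, link_gbps, efficiency=1.0):
    """Estimate hours needed to move size_tb terabytes over a link.

    size_tb    -- data volume in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of the nominal speed actually achieved
                  (hypothetical parameter; 1.0 means ideal line rate)
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for gbps in (1, 10):
        hours = transfer_hours(30, gbps)
        print(f"30 TB over {gbps} Gb/s: {hours:.1f} hours at ideal line rate")
```

At ideal line rate, 30 TB takes roughly 6.7 hours over 10 Gb/s and ten times as long over 1 Gb/s; real transfers would be slower by whatever the achieved efficiency turns out to be.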

Page 19:

Questions?