Cognitive Engine: Boosting Scientific Discovery

Scalable Software Systems Laboratory, Department of Electrical and Computer Engineering
CognitiveEngine: Boosting Scientific Discovery
Xiaolin Andy Li, http://www.andyli.ece.ufl.edu


Information Technology

Timeline: 1939, 1946, 1970, 1980, 1990, New Age

• ABC: John Atanasoff (BSEE@UF, 1925)
• ENIAC
• ARPANET, The Internet: Vint Cerf, Bob Kahn
• Fiber Optics: Charles Kuen Kao
• WWW: Tim Berners-Lee
• Mosaic Web Browser: Marc Andreessen and Eric Bina
• Martin Cooper, 1973; Steve Jobs, 2007
• 1G, 1980s; 2G, 1990s; 3G, 2000s; 4G, 2010s

Cloud Computing
• SaaS: Software as a Service
  • Salesforce, 1999
• StaaS: Storage as a Service
  • Amazon S3, 2006; Dropbox, 2008
• PaaS: Platform as a Service
  • Google App Engine, 2008; Microsoft Azure, 2010
  • Docker, 2013; IBM BlueMix, 2014
• IaaS: Infrastructure as a Service
  • Amazon AWS, 2002; Eucalyptus, 2008
  • Rackspace/NASA OpenStack, 2010; Google Compute Engine, 2012

2000


SDN: Software-Defined Networking

Google's OpenFlow WAN

Nick McKeown

Scott Shenker

Martin Casado

2009


Internet2 Innovation Platform 2013


Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng, Demis Hassabis

2013


2D IT Booming Cycles: 1970 → 1990 → 2010 → 2030 →
IT Boom V1, IT Boom V2, IT Boom V3, IT Boom V4

3D Computing Platform Cycles: 1950 → 1980 → 2010 → 2040
1st Platform, 2nd Platform, 3rd Platform, 4th Platform

Towards Intelligent Platform

Time for Change Current Unified Big Systems

[Figure labels: Hadoop, OpenStack, Torque, Pig, Dryad, Pregel, Percolator, CIEL; Container, Virtual Machine, Bare Metal]

GatorCloud - Towards Software-Defined Ecosystems

[Figure: GatorCloud software-defined stack. Software-Defined Computing: SDC Apps, Runtime, Big Data, HPC, Program Models, PBS/Torq, Nova Controller, Virtual Machine, Container. Software-Defined Networking: SDN Apps, Low Latency, High Throughput, SDN Hypervisor, SDN Controller, OVS, OF-Config, OpenFlow, GENI.]

GatorCloud Network Topology

[Figure: GatorCloud network topology. Sites: UF Data Center, Physics CMS/OSG, SSRB CNS Lab, NEB S3Lab, CISE Lab, Larsen HCS Lab, ECDC HPC Center - ES, Physics HPC Center - Phy, Larsen HPC Center - Eng, SSRB Campus Datacenter. Components: GatorVisor, Golfer, Apps Controller, Nets Controller, Hybrid Controller, SDN Switch, SDN Control Plane, Data Cloud, VM Cloud, Cloud Portal, Cloud Orange, Cloud Green. Links: 10G, 40G, and 100G; external connectivity of 2*10Gb/s upgraded to 2*100Gb/s to National Lambda Rail, Internet2, GENI (via Jacksonville), and FLR. Phase 1 SDN: 40G/10G. Phase 2 SDN: 100G.]

Deployed in 2012, one of the first 100Gbps SDN Campus Research Networks in USA

HiPerGator Supercomputer

Ranking from top500 supercomputer list
• # 4 among public universities in US
• # 8 among universities in US
• # 115 among all machines listed

Major Data Centers at UF
• HiPerGator Supercomputer
• CMS/OSG Physics HPC Centers
• ICBR: Interdisciplinary Center for Biotech Research
• CTSI: Clinical and Translational Science Institute
• ACIS/CAC Data Center
• CHREC Data Center (Novo-G)
• NEB Data Center

What Changed?

Source: Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 1, 4-Jan-16

• Year 2010: NEC-UIUC [Lin CVPR 2011]. Dense grid descriptor (HOG, LBP) → Coding (local coordinate, super-vector) → Pooling, SPM → Linear SVM
• Year 2012: SuperVision [Krizhevsky NIPS 2012]
• Year 2014: GoogLeNet [Szegedy arxiv 2014], VGG [Simonyan arxiv 2014]
• Year 2015: MSRA

Layer legend: Convolution, Pooling, Softmax, Other

Revolution of Depth

PASCAL VOC 2007 Object Detection mAP (%)
• HOG, DPM (shallow): 34
• AlexNet (RCNN), 8 layers: 58
• VGG (RCNN), 16 layers: 66
• ResNet (Faster RCNN)*, 101 layers: 86
*w/ other improvements & more data

ImageNet Classification top-5 error (%)
• ILSVRC'10 (shallow): 28.2
• ILSVRC'11 (shallow): 25.8
• ILSVRC'12 AlexNet (8 layers): 16.4
• ILSVRC'13 (8 layers): 11.7
• ILSVRC'14 VGG (19 layers): 7.3
• ILSVRC'14 GoogleNet (22 layers): 6.7
• ILSVRC'15 ResNet (152 layers): 3.57

Engines of visual recognition

Beyond Human

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

CognitiveEngine: Beyond Hadoop and Spark
• Bulk Synchronous Parallel (BSP)
  • Both a blessing and a curse
  • Easy to schedule and arrange dependencies
  • All synchronized

[Figure: Map and Reduce; a sequence of Stages]
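The "all synchronized" property of the bulk-synchronous model can be illustrated with a minimal sketch. It assumes plain Python threads and a `threading.Barrier`; the function name `run_bsp` and the placeholder work are hypothetical and are not part of Hadoop, Spark, or CognitiveEngine.

```python
import threading

def run_bsp(num_workers, num_stages):
    """Run num_stages supersteps; no worker starts stage s+1 until all finish stage s."""
    barrier = threading.Barrier(num_workers)
    log = []                      # (stage, worker) events in completion order
    lock = threading.Lock()

    def worker(rank):
        for stage in range(num_stages):
            # Local computation for this stage (placeholder work).
            with lock:
                log.append((stage, rank))
            # Global synchronization: block until every worker reaches this point.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run_bsp(num_workers=4, num_stages=3)
stages_in_order = [stage for stage, _ in log]
# Every stage-s event precedes every stage-(s+1) event.
print(stages_in_order == sorted(stages_in_order))
```

The barrier makes scheduling and dependencies simple, but the slowest worker sets the pace of every stage, which is the "curse" side of the tradeoff.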


ADD Design Choices
• Asynchronous Distributed Datasets (ADD)
  • Inherits the easy-to-use programming interface
  • Differentiates static data (samples) from the iteratively updated data (parameters)
  • Automatic asynchronous updates, with a user-specified bound
  • Asynchronous-aware scheduling
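One way to read "asynchronous updates, with user specified bound" is as a bounded-staleness rule, sketched below. This is an assumption about the semantics, not the ADD implementation; the class `StalenessClock` and its methods are hypothetical names.

```python
class StalenessClock:
    """Tracks per-worker iteration counts and enforces a user-specified bound."""

    def __init__(self, num_workers, bound):
        self.clocks = [0] * num_workers
        self.bound = bound

    def can_advance(self, worker):
        # A worker may begin a new iteration only while it has completed at most
        # `bound` more iterations than the slowest worker. bound=0 gives lockstep.
        return self.clocks[worker] - min(self.clocks) <= self.bound

    def tick(self, worker):
        """Try to complete one iteration; return False if the worker must wait."""
        if not self.can_advance(worker):
            return False
        self.clocks[worker] += 1
        return True

clock = StalenessClock(num_workers=3, bound=2)
# Worker 0 races ahead alone until the bound stops it.
fast_steps = sum(clock.tick(0) for _ in range(10))
print(fast_steps)
# Once the slow workers finish an iteration, worker 0 may continue.
clock.tick(1)
clock.tick(2)
print(clock.tick(0))
```

A larger bound allows more overlap between fast and slow workers at the cost of staler parameters; a bound of zero reduces to the fully synchronized case.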


ADD System

[Figure: ADD Servers and ADD Clients. Each ADD Client has Training samples and an ADD local copy, runs Feed Forward + Back Propagation, and communicates with the ADD Servers through Async push and Async pull.]

ADD features
• Async push and pull of model updates
• Users are allowed to specify the condition for returning from pull/push, so that they don’t have to wait
• Adaptive model update method: all-to-one, tree aggregation, or P2P approximate update
• User-controllable tradeoff between asynchrony and convergence rate
• Model snapshot and sharing
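To make two of the named update methods concrete, the sketch below contrasts all-to-one aggregation with tree aggregation on plain Python lists. It is illustrative only: the function names are hypothetical, updates are assumed to be summed, and no networking or asynchrony is modeled.

```python
def all_to_one(updates):
    """Every client sends its update to one server, which sums them in turn."""
    total = [0.0] * len(updates[0])
    for u in updates:
        total = [a + b for a, b in zip(total, u)]
    return total

def tree_aggregate(updates):
    """Combine updates pairwise, level by level; return (sum, number of levels)."""
    level = [list(u) for u in updates]
    depth = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append([a + b for a, b in zip(level[i], level[i + 1])])
        if len(level) % 2 == 1:
            nxt.append(level[-1])      # odd one out moves up unchanged
        level = nxt
        depth += 1
    return level[0], depth

updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
tree_sum, levels = tree_aggregate(updates)
print(all_to_one(updates), tree_sum, levels)
```

Both methods produce the same sum here; the difference is that the tree needs only about log2(n) sequential combining steps, while all-to-one funnels n updates through a single node.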


Execution

[Figure: An ADD Partition holds Static Data, Dynamic Data, Handler Function, and State. ADD Tasks are scheduled by Locality, Iteration, etc., and each task runs Fetch, Compute, Update, and Bookkeeping.]
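The Fetch, Compute, Update, Bookkeeping cycle can be sketched as a single-process loop. Everything here is a hypothetical stand-in: the `Partition` class, the `run_task` function, and the toy least-squares handler are assumptions chosen for illustration, not the ADD API.

```python
class Partition:
    """Hypothetical container: static samples plus a handler and bookkeeping state."""

    def __init__(self, samples, handler):
        self.static_data = samples       # never changes across iterations
        self.handler = handler           # user function: (w, samples) -> gradient
        self.state = {"iteration": 0}    # bookkeeping

def run_task(partition, params, lr):
    w = params["w"]                                       # Fetch dynamic data
    grad = partition.handler(w, partition.static_data)    # Compute
    params["w"] = w - lr * grad                           # Update
    partition.state["iteration"] += 1                     # Bookkeeping

def squared_error_grad(w, samples):
    # Gradient of the mean of (w*x - y)^2 with respect to w.
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

params = {"w": 0.0}                      # dynamic data (the model parameter)
part = Partition([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], squared_error_grad)
for _ in range(50):
    run_task(part, params, lr=0.05)
print(round(params["w"], 3), part.state["iteration"])
```

Keeping the samples separate from the parameter is what lets a scheduler place tasks near their static data while only the small dynamic part moves between iterations.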


Advantages
• Asynchronous update
• IO / CPU overlap
• Fault tolerant
• Derives from and lives with a state-of-the-art system
  • Spark
• Sharing among jobs and users
• Maximizing parallelism of GPUs


DeepApps
• DeepScience
• DeepSky
• DeepDefense
• DeepHealth
• DeepBipolar
• DeepVital
• DeepGuard
• DeepCancer
• DeepBot/Dingding
• DeepDrug


DeepSky: Sloan Digital Sky Survey

With Jian Ge


Kepler Data

The animation shows how Kepler detects planets. As the planet passes between the host star and the spacecraft, the observed star brightness decreases slightly, signaling the potential detection of a planet. Kepler looked at over 150,000 stars continuously for four years in the constellations Cygnus and Lyra, seeking to record the slight periodic brightness changes in stars that could reveal the presence of planets.

Kepler detects planets by taking a photometric measurement of the stars in its field of view every 30 minutes. A planet transit will show as a small periodic dip in the “light curve” of a star over time.

Goal: Detect planet(s) currently missed by the Kepler Team’s automatic search programs, likely “super-Earths” with long periods
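The idea of finding a small periodic dip in a light curve can be sketched with a simple fold-and-search over trial periods. This is not the Kepler pipeline or the DeepSky method: the light curve is synthetic, and the period, depth, noise level, and function names are all assumptions chosen for illustration. Only the 30-minute cadence comes from the text.

```python
import random

SAMPLES_PER_DAY = 48   # one photometric measurement every 30 minutes

def synthetic_light_curve(n, period, start, duration, depth, noise, seed=0):
    """Flat brightness of 1.0 with a periodic box-shaped dip plus Gaussian noise."""
    rng = random.Random(seed)
    flux = []
    for i in range(n):
        in_transit = (i - start) % period < duration
        flux.append(1.0 - (depth if in_transit else 0.0) + rng.gauss(0.0, noise))
    return flux

def folded_dip(flux, period, duration):
    """Fold at a trial period; return the deepest mean dip over `duration` phases."""
    sums = [0.0] * period
    counts = [0] * period
    for i, f in enumerate(flux):
        sums[i % period] += f
        counts[i % period] += 1
    best = 0.0
    for a in range(period):
        idx = [(a + j) % period for j in range(duration)]
        mean = sum(sums[p] for p in idx) / sum(counts[p] for p in idx)
        best = max(best, 1.0 - mean)
    return best

def search_period(flux, trial_periods, duration):
    """Return the trial period (in samples) whose folded curve shows the deepest dip."""
    return max(trial_periods, key=lambda p: folded_dip(flux, p, duration))

# Assumed scenario: 90 days of data, a 10-day period, a 3-hour transit, 0.1% depth.
flux = synthetic_light_curve(n=90 * SAMPLES_PER_DAY, period=480, start=100,
                             duration=6, depth=0.001, noise=0.0001)
best = search_period(flux, range(400, 561), duration=6)
print(best / SAMPLES_PER_DAY, "days")
```

Folding at the correct period stacks every transit onto the same phases, so the dip stands out from the noise; at a wrong period the transits smear across phases and the dip shrinks. Long-period planets are harder because fewer transits fall inside the observation window.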


Quasar Spectra Pair Method

The identification of the 2175 Å bump is based on the Mg II absorber catalog, with limitations:
• We can only identify the 2175 Å bump in the redshift range from 0.7 to 2.5.
• The method is based on the Mg II absorber catalog. If the Mg II absorber catalog is not complete, the 2175 Å bump sample may not be complete.


Analysis of the Effects

(a) Input data with bumps

(b) Filters of the first convolutional layer

(c) Feature map of last convolutional layer


Reconstruction of Bumps

(d) Reconstructed input image with bump

(e) Reconstructed input image without bump


DeepDefense: DDoS Detection


DeepDefense Architecture

[Figure: DeepDefense architecture. Inputs DataSequence1000, DataSequence2000, DataSequence3000, and DataSequence4000 pass through CNN layers and stacked LSTM layers with CTC. Labels: Spatial, Temporal, Recurrent, Cascading; BPTT, BPTS; Feature Analysis, Ensemble Analysis, Knowledge Fusion, Performance Evaluation; Searchable Outputs.]

BPTT: Backpropagation Through Time
BPTS: Backpropagation Through Space
CNN: Convolutional Neural Network
LSTM: Long Short-Term Memory
CTC: Connectionist Temporal Classification

Data-Driven DeepHealth

With Azra Bihorac, Lizi Wu, Parisa Rashidi, etc.


Bipolar Disorder & Challenge Objectives
• Bipolar disorder is a brain disease that causes unusual mood shifts
• An estimated 51% of the affected population goes untreated in a given year
• Detection is not straightforward: symptoms and test metrics are not too dissimilar from those of other brain diseases
• Recent studies indicate heritability and genetic factors as causes, opening a new area of detection using genome data
• The CAGI challenge is to predict bipolar disorder using exomes
• Exome sequencing data of 1000 samples, with 500 for training and 500 for the prediction challenge

Image source: http://www.nimh.nih.gov/health/statistics/prevalence/bipolar-disorder-among-adults.shtml


Data Pre-Processing
• Extracted genotype information from the exomes
• The genotypes were 0/0, 0/1, 1/1, and ./.
• One-hot encoding transformation on the genotypes, i.e., 0/0 encoded as 0100, 0/1 encoded as 0010, etc.
• One-hot encoding treats all categories as equidistant
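The encoding step can be sketched as follows. The codes for 0/0 and 0/1 are the ones given above; the codes assigned to 1/1 and ./. are assumptions made only to complete the example, and the function names are hypothetical.

```python
# 0/0 -> 0100 and 0/1 -> 0010 come from the slide.
# The codes for 1/1 and ./. are ASSUMED here for illustration.
GENOTYPE_CODES = {
    "./.": (1, 0, 0, 0),   # assumed
    "0/0": (0, 1, 0, 0),   # from the slide
    "0/1": (0, 0, 1, 0),   # from the slide
    "1/1": (0, 0, 0, 1),   # assumed
}

def one_hot_encode(genotypes):
    """Flatten a sequence of genotype calls into one 0/1 feature vector."""
    encoded = []
    for g in genotypes:
        encoded.extend(GENOTYPE_CODES[g])
    return encoded

def hamming(a, b):
    """Number of positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

print(one_hot_encode(["0/0", "0/1", "./."]))

# Equidistance: any two distinct genotype codes differ in exactly two positions.
codes = list(GENOTYPE_CODES.values())
distances = {hamming(a, b) for a in codes for b in codes if a != b}
print(distances)
```

Because every pair of codes is the same distance apart, the encoding does not impose an ordering such as 0/0 < 0/1 < 1/1, which an integer encoding would.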


DeepBipolar V1: Convolutional DNN

Genotype data (input): 2008 * 1000 * 1

• 32 kernels, kernel size: 4*4*1, stride: (1,4); output 502 * 997 * 32
• 32 kernels, kernel size: 3*3*32, stride: (1,1); output 500 * 995 * 32
• Max Pooling: pool size (3,3), stride (3,2); output 249 * 331 * 32
• 2 x 64 kernels, size: 3*3*32, stride: (1,1); outputs 247 * 329 * 64, then 245 * 327 * 64
• Max Pooling: size (1,3), stride (3,3); output 81 * 109 * 64
• 128 kernels, kernel size: 3*3*64, stride: (1,1); output 79 * 107 * 128
• 128 kernels, kernel size: 3*3*128, stride: (1,1); output 77 * 105 * 128
• Max Pooling: size (2,2), stride (2,2); output 38 * 52 * 128
• 128 kernels, size: 3*3*128, stride: (1,1); output 36 * 50 * 128
• Max Pooling: size (3,3), stride (2,2); output 17 * 24 * 128
• 1 kernel, size: 1*1, stride: (1,1); output 17 * 24 * 1
• Fully Connected Layer, 64 neurons
• Sigmoid: Probability Output Layer
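The feature-map sizes in this network follow from the usual unpadded window arithmetic, output = floor((n - k) / s) + 1. The sketch below recomputes them under two assumptions that the slide does not state explicitly: no padding is used, and each (a, b) size or stride pair applies to the 1000-long axis first and the 2008-long axis second. The helper names are hypothetical.

```python
def out_size(n, k, s):
    """Output length of an unpadded convolution or pooling window."""
    return (n - k) // s + 1

# (name, window size, stride); each pair is (1000-axis, 2008-axis).
LAYERS = [
    ("conv 4*4, 32 kernels",  (4, 4), (1, 4)),
    ("conv 3*3, 32 kernels",  (3, 3), (1, 1)),
    ("max pool (3,3)",        (3, 3), (3, 2)),
    ("conv 3*3, 64 kernels",  (3, 3), (1, 1)),
    ("conv 3*3, 64 kernels",  (3, 3), (1, 1)),
    ("max pool (1,3)",        (1, 3), (3, 3)),
    ("conv 3*3, 128 kernels", (3, 3), (1, 1)),
    ("conv 3*3, 128 kernels", (3, 3), (1, 1)),
    ("max pool (2,2)",        (2, 2), (2, 2)),
    ("conv 3*3, 128 kernels", (3, 3), (1, 1)),
    ("max pool (3,3)",        (3, 3), (2, 2)),
    ("conv 1*1, 1 kernel",    (1, 1), (1, 1)),
]

def trace_shapes(shape, layers):
    """Apply out_size along both axes for each layer and collect the shapes."""
    shapes = []
    for _, size, stride in layers:
        shape = tuple(out_size(n, k, s) for n, k, s in zip(shape, size, stride))
        shapes.append(shape)
    return shapes

shapes = trace_shapes((1000, 2008), LAYERS)
for (name, _, _), shape in zip(LAYERS, shapes):
    print(f"{name:24s} -> {shape[0]} x {shape[1]}")
```

Under these assumptions the computed sizes match the numbers on the slide, from 997 x 502 after the first convolution down to 24 x 17 before the fully connected layer.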


DeepBipolar V2: Convolutional AutoEncoder

Genotype data (input): 2008 * 1000 * 1

Encoder
• 32 kernels, kernel size: 4*4*1, stride: (1,4); output 502 * 997 * 32
• 32 kernels, kernel size: 3*3*32, stride: (1,1); output 500 * 995 * 32
• Max Pooling (MP): size (3,3), stride (3,2); output 249 * 331 * 32
• 64 kernels, kernel size: 3*3*32, stride: (1,1); output 247 * 329 * 64
• 64 kernels, kernel size: 3*3*64, stride: (1,1); output 245 * 327 * 64
• Max Pooling: pool size (1,3), stride (3,3); output 81 * 109 * 64
• 128 kernels, kernel size: 3*3*64, stride: (1,1); output 79 * 107 * 128

Decoder
• Up Sampling: size (1,3), stride (3,3)
• 2 x 64 kernels, size: 3*3*64, Deconvolution
• Up Sampling: size (3,3), stride (3,2)
• 2 x 32 kernels, size: 3*3*64, Deconvolution
• 1*1 Convolution layer; output 2008 * 1000 * 1, the size of the input data

Decoder feature-map sizes shown in the figure: 81 * 109, 245 * 327, 247 * 329, 249 * 331, 500 * 995, 2008 * 1000

DeepCloud: Towards Composable Intelligent Platform

[Figure: DeepCloud and GatorCloud: SDN-enabled Campus Cloud. Control components: SDE Controller, SDDC Hypervisor, SDE App Store, Golfer, GolfVisor, GolfStore, Cloud Dashboard. Infrastructure: Gator, GENI, and Testbed Racks; Internet2/NLR at 100G. Services: IaaS, PaaS, SaaS, StaaS, NaaS, CPSaaS, HPCaaS, iBDaaS. Apps: GENI Apps, Security Apps, Network Apps, BigData Apps, HPC Apps, Self-Protection. People: Users, Researchers, Scientists, Developers, Engineers, Admins. Major Data Centers at UF: HiPerGator Supercomputer, CMS/OSG Physics HPC Centers, ICBR (Interdisciplinary Center for Biotech Research), CTSI (Clinical and Translational Science Institute), ACIS Data Center, NEB Data Center.]

S3Lab Research Highlights
• Finest: Smartphone Indoor Location Ecosystem
• First: SDN-enabled Campus Cloud, GatorCloud
• Fastest: Campus Research Network, 100G
• Fourth: DeepCloud Intelligent Platform

IMPACT

NSF I/UCR Center for Big Learning (Pending)

Deep Learning

Big Systems

Big Data

Intelligence

Member Benefits

• Leveraging world-class talent (about 40 professors and 200 graduate students) in the era of big learning, big data, and big systems.

• Realizing a 10:1 return on investment.

• Discovering top students in top universities.

• Joining peer members from high-profile companies and research units.

CBL Consortium: University of Florida (UF, South), Carnegie Mellon University (CMU, East), University of Missouri at Kansas City (UMKC, Central), University of Notre Dame (ND, North), University of Oregon (UO, West), and a large number of industrial partners.


Thank You!