Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

30
Ongoing and Future Work: Part II DeepDive & Caffe con Troll: Knowledge Base Construction from Text and Beyond Ce Zhang Stanford University

Transcript of Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Page 1: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Ongoing and Future Work: Part IIDeepDive & Caffe con Troll:

Knowledge Base Construction from Text and Beyond

Ce ZhangStanford University

Page 2: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

http://deepdive.stanford.edu

DeepDive

Page 3: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

DeepDive

Unstructured Inputs

Structured Outputs

Goal: High Quality

DeepDive: Applications to Knowledge Base Construction

Caffe con Troll: A Deep Learning Engine

DeepDive with Caffe con Troll: Ongoing Work

Page 4: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Many pressing scientific questionsare macroscopic.

Page 5: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

KBC Applications

Science is built up with facts, as a house is with stones.

- Jules Henri Poincaré

Example: Paleontology

Taxon Rock

Age Location

Scientific FactsBiodiversity

Macroscopic ViewInsights & Knowledge

Impact of climate change to bio-diversity?

Page 6: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

KBC Applications

Example: Paleontology

Taxon Rock

Age Location

Scientific FactsBiodiversity

Macroscopic ViewInsights & Knowledge

Impact of climate change to bio-diversity?

Page 7: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

KBC Applications

Example: Paleontology

Taxon Rock

Age Location

Scientific FactsBiodiversity

Macroscopic ViewInsights & Knowledge

Impact of climate change to bio-diversity?

1570 1670 1770 1870 1970 2015

Input SourcesKB C

onst

ruct

ion Knowledg

e Base (KB)

Page 8: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

KBC ApplicationsPaleontolog

y Genomics

Taxon Rock

Age Location

Knowledge BaseGene Drug

Disease

Knowledge BaseDark Web

Server Service

Price Location

Knowledge Base

Climate & Biodiversity Social GoodHealth & Medicine

Page 9: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Challenge:Can we just do KBC manually?

Page 10: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Challenge of Manual KBCPaleontolog

y

Taxon Rock

Age Location

Knowledge BaseEffort on Manual KBC

Sepkoski (1982) manually compiled a compendium of 3300 animal families with 396 references in his monograph.300 professional volunteers (1998-present) spent 8 continuo-us human years to compile PaleoDB with 55,479 references.

2010 2011 2012 201380

90

100

110

# N

ew P

a-le

o Re

fer-

ence

s (K

) 100K new references per year! 16 continuous human

years every year just to keep up-to-date!

Page 11: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Can we build a machine to read for us?

Page 12: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Automatic KBC

Input Sources

Machine

Knowledge Base

Page 13: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Case Study - PaleoDeepDive

The GoalExtract paleobiological facts to build higher coverage

fossil record.

T. Rex are found dating to the upper Cretaceous.

Appears(“T. Rex”, “Cretaceous”)

DeepDive

Page 14: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Case Study - PaleoDeepDive

55K documents

329 geoscientists8 years

126K fossil mentions

2000 machine cores46 machine years

1M relations

300K documents3M fossil mentions2.1M relations

PaleoDB PaleoDeepDiveHuman-created Paleobiology database!

Machine-created Paleobiology database!(>90% Precision)

Biodiversity Curve

On the same relation, PaleoDeepDive achieves equal (or sometimes better) precision as professional

human volunteers.

10x…..

Page 15: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Validation on Real Applications

Paleontology

Geology

Pharmacogenomics

Genomics

Wikipedia-like Relations

Dark Web

“It's a little scary, the machines are getting that good.”

Recall: 2-10x more extractions than humanPrecision: 92%-97% (Human ~84%-92%)

Highest score out of 18 teams and 65 submissions (2nd highest is also DeepDive).

Applied Physics

Goal: Enables easy engineering to build high-quality KBC Systems by

thinking about features not algorithms.

Page 16: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Can we support more sophisticatedimage processing in DeepDive?

Page 17: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Go Beyond Text-Processing

What kindof dinosaur is this?

Does this patient haveshort finger?

Is this sea star found in 2014 sick?

What’s theClinical out-come of this patient?

Images are important to many scientific questions.

[User] Can I run Deep Learning on my datasets with DeepDive?

Page 18: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Just before we start the run…

On which machine should we run? CPU or GPU?I have a GPU Cluster

I have 5000 CPU cores

I have $100K to spend on the cloud

EC2: c4.4xlarge 8 [email protected]

EC2: g2.2xlarge1.5K cores@800MHz

0.7TFlops 1.2TFlops

Not a 10x gap? Can we close this gap?

Page 19: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Caffe con Troll

http://github.com/HazyResearch/CaffeConTroll

A prototype system to study theCPU/GPU tradeoff. Same-input-same-output as Caffe.

Page 20: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

What we found…

c4.4x_large

($0.68/h)

c4.4x_large

($0.68/h)

g2.2x_large

($0.47/h)

c4.8x_large

($1.37/h)

c4.8x_large

($1.37/h)

Rela

tive

Spee

d

Caffe CPU

CcT CPU Caffe GPU

Caffe CPU

CcT CPU0

0.20.40.60.8

11.2

Page 21: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

What we found…

c4.4x_large

($0.68/h)

c4.4x_large

($0.68/h)

g2.2x_large

($0.47/h)

c4.8x_large

($1.37/h)

c4.8x_large

($1.37/h)

Rela

tive

Spee

d

Caffe CPU

CcT CPU Caffe GPU

Caffe CPU

CcT CPU0

0.20.40.60.8

11.2

Page 22: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

What we found…

c4.4x_large

($0.68/h)

c4.4x_large

($0.68/h)

g2.2x_large

($0.47/h)

c4.8x_large

($1.37/h)

c4.8x_large

($1.37/h)

Rela

tive

Spee

d

Caffe CPU

CcT CPU Caffe GPU

Caffe CPU

CcT CPU0

0.20.40.60.8

11.2

Page 23: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

What we found…

c4.4x_large

($0.68/h)

c4.4x_large

($0.68/h)

g2.2x_large

($0.47/h)

c4.8x_large

($1.37/h)

c4.8x_large

($1.37/h)

Rela

tive

Spee

d

Caffe CPU

CcT CPU Caffe GPU

Caffe CPU

CcT CPU0

0.20.40.60.8

11.2

Proportional to FLOPs!

Page 24: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Four Shallow Ideas Described in Four Pages…

arXiv:1504.04343

Page 25: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

One of the four shallow ideas…

3 CPU Cores3 Images Strategy 1 Strategy 2

If the amount of data is too small for each core, the process might not be CPU bound.

For AlexNet over Haswell CPUs, Strategy 2 is 3-4x faster.

Page 26: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Caffe con Troll + DeepDive(Ongoing Work)

Page 27: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Application 1: Paleontology

Images without high-quality human labels also contain valuable information.

What can we learn from these images without human labels?

Name of Fossil

Fossil Image

Page 28: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Application 1: Paleontology

We apply Distant Supervision!

Porifera Brachiopoda ClassifierDocument

Can we build a system that automatically “reads” a Paleontology textbook and learn the difference between sponges and shells?

Page 29: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Application 1: Paleontology29

Fig. 387,1a-c. *B. rara, Serpukhovian, Kazakhstan, Dzhezgazgan district; a,b, holotype, viewed ventrally, laterally, MGU 31/342, XI (Litvinovich, 1967);

Figure Name Mention Taxon MentionDeepDive Extractions

Fig. 387Figures

Provide Labels

Train CNN

Test with Human Labels3K Brachiopoda Images

2K Porifera ImagesAccuracy = 94%

Page 30: Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15

Thank You

deepdive.stanford.edu

github.com/HazyResearch/CaffeConTroll

Ce Zhang: [email protected] Group: [email protected]