Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15
Ongoing and Future Work: Part II
DeepDive & Caffe con Troll:
Knowledge Base Construction from Text and Beyond
Ce Zhang, Stanford University
http://deepdive.stanford.edu
DeepDive
Unstructured Inputs
Structured Outputs
Goal: High Quality
DeepDive: Applications to Knowledge Base Construction
Caffe con Troll: A Deep Learning Engine
DeepDive with Caffe con Troll: Ongoing Work
Many pressing scientific questions are macroscopic.
KBC Applications
Science is built up with facts, as a house is with stones.
- Jules Henri Poincaré
Example: Paleontology
Scientific Facts (Taxon, Rock, Age, Location) → Biodiversity → Macroscopic View (Insights & Knowledge)
Impact of climate change on biodiversity?
Input Sources → KBC (KB Construction) → Knowledge Base (KB)
KBC Applications
Paleontology: Knowledge Base of (Taxon, Rock, Age, Location), for Climate & Biodiversity
Genomics: Knowledge Base of (Gene, Drug, Disease), for Health & Medicine
Dark Web: Knowledge Base of (Server, Service, Price, Location), for Social Good
Challenge: Can we just do KBC manually?
Challenge of Manual KBC
Example: Paleontology, Knowledge Base of (Taxon, Rock, Age, Location)
Effort on Manual KBC: Sepkoski (1982) manually compiled a compendium of 3,300 animal families from 396 references in his monograph. 300 professional volunteers (1998-present) spent 8 continuous human years to compile PaleoDB with 55,479 references.
[Chart: # new paleo references (K) per year, 2010-2013, rising from ~80K to ~110K]
100K new references per year! 16 continuous human years every year just to keep up to date!
Can we build a machine to read for us?
Automatic KBC
Input Sources
Machine
Knowledge Base
Case Study - PaleoDeepDive
The Goal: Extract paleobiological facts to build a higher-coverage fossil record.
T. Rex are found dating to the upper Cretaceous.
Appears(“T. Rex”, “Cretaceous”)
DeepDive
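A minimal sketch of the kind of candidate extraction behind the Appears("T. Rex", "Cretaceous") example above. The dictionaries and matching logic here are hypothetical; a real DeepDive application expresses such rules declaratively and scores candidates with a learned probabilistic model rather than asserting them outright.

```python
# Hypothetical taxon and geologic-interval dictionaries, for illustration only.
TAXA = {"T. Rex", "Brachiopoda", "Porifera"}
INTERVALS = {"Cretaceous", "Serpukhovian", "Jurassic"}

def extract_appears(sentence):
    """Emit Appears(taxon, interval) candidates for co-occurring mentions."""
    taxa = [t for t in TAXA if t in sentence]
    intervals = [i for i in INTERVALS if i in sentence]
    return [("Appears", t, i) for t in taxa for i in intervals]

print(extract_appears("T. Rex are found dating to the upper Cretaceous."))
# → [('Appears', 'T. Rex', 'Cretaceous')]
```

In DeepDive these candidates would then be weighted by inference over many features, which is why the system can reach high precision despite noisy candidate generation.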
Case Study - PaleoDeepDive
PaleoDB (human-created Paleobiology database): 55K documents, 329 geoscientists over 8 years, 126K fossil mentions, 1M relations.
PaleoDeepDive (machine-created Paleobiology database, >90% precision): 300K documents, 2000 machine cores over 46 machine years, 3M fossil mentions, 2.1M relations.
Biodiversity Curve
On the same relation, PaleoDeepDive achieves precision equal to (or sometimes better than) professional human volunteers, at roughly 10x the scale.
Validation on Real Applications
Paleontology
Geology
Pharmacogenomics
Genomics
Wikipedia-like Relations
Dark Web
“It's a little scary, the machines are getting that good.”
Recall: 2-10x more extractions than humans. Precision: 92%-97% (human: ~84%-92%).
Highest score out of 18 teams and 65 submissions (2nd highest is also DeepDive).
Applied Physics
Goal: Enable easy engineering of high-quality KBC systems by thinking about features, not algorithms.
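To make "thinking about features, not algorithms" concrete, here is a hedged sketch of the division of labor: the user writes feature functions over candidate mentions, and the system owns learning and inference. The function name and feature strings are illustrative, not DeepDive's actual API.

```python
def features(sentence, taxon, interval):
    """User-written feature function for an (taxon, interval) candidate.
    Returns string-valued features; the system learns their weights."""
    feats = []
    # Lexical cue: does a trigger word appear in the sentence?
    if "found" in sentence.split():
        feats.append("HAS_WORD_found")
    # Proximity cue: character distance between the mentions, bucketed.
    d = abs(sentence.find(taxon) - sentence.find(interval))
    feats.append(f"DIST_BUCKET_{d // 10}")
    return feats

print(features("T. Rex are found dating to the upper Cretaceous.",
               "T. Rex", "Cretaceous"))
```

The key design point is that none of these features carries a hand-set weight; improving the system means adding or refining features, while the algorithmic machinery stays fixed.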
Can we support more sophisticated image processing in DeepDive?
Go Beyond Text-Processing
What kind of dinosaur is this?
Does this patient have short fingers?
Is this sea star found in 2014 sick?
What's the clinical outcome of this patient?
Images are important to many scientific questions.
[User] Can I run Deep Learning on my datasets with DeepDive?
Just before we start the run…
On which machine should we run? CPU or GPU?
"I have a GPU cluster." "I have 5000 CPU cores." "I have $100K to spend on the cloud."
EC2 c4.4xlarge: 8 cores @ 2.9GHz, 0.7 TFlops
EC2 g2.2xlarge: 1.5K cores @ 800MHz, 1.2 TFlops
Not a 10x gap? Can we close this gap?
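The "not a 10x gap" point follows from a quick back-of-envelope calculation. The TFlops figures and hourly prices are taken from the slides; the per-dollar comparison is my arithmetic, not a measurement.

```python
# Instance figures as stated on the slides.
cpu = {"name": "c4.4xlarge", "tflops": 0.7, "price_per_h": 0.68}
gpu = {"name": "g2.2xlarge", "tflops": 1.2, "price_per_h": 0.47}

# Raw throughput gap: far from the 10x people often assume.
raw_gap = gpu["tflops"] / cpu["tflops"]

# Price-performance: TFlops per dollar-hour on each instance.
cpu_value = cpu["tflops"] / cpu["price_per_h"]
gpu_value = gpu["tflops"] / gpu["price_per_h"]

print(f"raw speed gap: {raw_gap:.2f}x")                  # ~1.71x
print(f"TFlops per $/h: CPU {cpu_value:.2f}, GPU {gpu_value:.2f}")
```

So on peak numbers alone the GPU instance is under 2x the CPU instance, which is what motivates trying to close the remaining gap in software.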
Caffe con Troll
http://github.com/HazyResearch/CaffeConTroll
A prototype system to study the CPU/GPU tradeoff. Same input, same output as Caffe.
What we found…
[Chart: relative speed (0 to 1.2) of Caffe CPU vs. CcT CPU vs. Caffe GPU, on EC2 c4.4xlarge ($0.68/h), g2.2xlarge ($0.47/h), and c4.8xlarge ($1.37/h)]
Proportional to FLOPs!
Four Shallow Ideas Described in Four Pages…
arXiv:1504.04343
One of the four shallow ideas…
3 CPU Cores, 3 Images: Strategy 1 vs. Strategy 2
If the amount of data is too small for each core, the process might not be CPU-bound.
For AlexNet on Haswell CPUs, Strategy 2 is 3-4x faster.
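The slide does not spell the two strategies out, so the following is one plausible reading, sketched in plain Python: Strategy 1 hands each core one small image's worth of matrix work, while Strategy 2 stacks the images' rows into one large matrix and splits that into bigger per-core chunks, so chunk size is decoupled from image size and no core starves for work.

```python
def matmul(a, b):
    """Tiny dense matrix multiply over lists of rows (illustration only)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

images = [[[1.0] * 4 for _ in range(2)] for _ in range(3)]  # 3 small 2x4 "images"
kernel = [[0.5] * 3 for _ in range(4)]                       # shared 4x3 kernel

# Strategy 1: one small multiply per core (one image each).
out1 = [row for img in images for row in matmul(img, kernel)]

# Strategy 2: stack all rows, then split into larger per-core chunks.
stacked = [row for img in images for row in img]             # 6x4 matrix
chunks = [stacked[i:i + 3] for i in range(0, len(stacked), 3)]
out2 = [row for c in chunks for row in matmul(c, kernel)]

assert out1 == out2  # identical output; only the work granularity differs
```

The point of the sketch is that both strategies compute the same result; the speedup comes purely from giving each core enough contiguous work to stay compute-bound.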
Caffe con Troll + DeepDive (Ongoing Work)
Application 1: Paleontology
Images without high-quality human labels also contain valuable information.
What can we learn from these images without human labels?
Name of Fossil
Fossil Image
Application 1: Paleontology
We apply Distant Supervision!
Document → Classifier (Porifera vs. Brachiopoda)
Can we build a system that automatically "reads" a paleontology textbook and learns the difference between sponges and shells?
Application 1: Paleontology
Fig. 387,1a-c. *B. rara, Serpukhovian, Kazakhstan, Dzhezgazgan district; a,b, holotype, viewed ventrally, laterally, MGU 31/342, XI (Litvinovich, 1967);
DeepDive Extractions: figure name mention ("Fig. 387") and taxon mention.
Pipeline: figures provide labels → train CNN → test with human labels.
3K Brachiopoda images, 2K Porifera images. Accuracy = 94%.
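A hedged sketch of the distant-supervision step the pipeline above describes: taxon mentions that DeepDive extracts from figure captions become noisy class labels for the corresponding figure images. All identifiers and file names below are illustrative, not from the actual system.

```python
# (figure id, taxon mention from caption, class inferred via a taxonomy KB)
caption_extractions = [
    ("Fig. 387", "B. rara", "Brachiopoda"),
    ("Fig. 12", "Demospongiae", "Porifera"),
]

def build_training_set(extractions, figure_images):
    """Pair each figure's image with the class inferred from its caption."""
    return [(figure_images[fig], label)
            for fig, _mention, label in extractions
            if fig in figure_images]

figure_images = {"Fig. 387": "img_387.png", "Fig. 12": "img_12.png"}
train = build_training_set(caption_extractions, figure_images)
print(train)
# The labeled pairs are then used to train a CNN; accuracy is measured
# against a held-out set of human labels.
```

This is what makes the approach attractive: no human ever labels the training images directly, yet the held-out evaluation against human labels reaches 94% accuracy.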
Thank You
deepdive.stanford.edu
github.com/HazyResearch/CaffeConTroll
Ce Zhang: [email protected]
Group: [email protected]