DeepDive: Extracting Databases from Dark Data


Transcript of DeepDive: Extracting Databases from Dark Data

Page 1: forum.stanford.edu/events/posterslides/DeepDiveExtractingDatabasesfromDarkData.pdf

DeepDive: Extracting Databases from Dark Data

Hazy Research Group, Infolab, Computer Science Department, Stanford University

Jaeho Shin, Ce Zhang, Feiran Wang, Sen Wu, Alex Ratner, Theodoros Rekatsinas, Thomas Palomares, Youssef Ahres, Dan Iter, Ivan Gozali, Henry Ehrenberg, Zifei Shan, Feng Niu, Raphael Hoffman, Michael Cafarella, Christopher Ré

Debugging DeepDive apps

Running DeepDive apps

Writing DeepDive apps

- Interpretable probabilities, no arbitrary scores
- Calibration plots show if system output makes sense
- Algorithm independent evaluation method
- Iterative error analysis process helps users achieve high quality rapidly by focusing on the top error mode

- Interactive tools for evaluating and exploring data
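
The calibration idea in the bullets above can be illustrated with a small sketch. It assumes a held-out set of (predicted probability, was the prediction correct) pairs; the function name `calibration_table`, the bucket count, and the sample data are all hypothetical and are not taken from DeepDive's own tooling. If probabilities are interpretable, the accuracy in each bucket should roughly match the bucket's probability range.

```python
from collections import defaultdict


def calibration_table(predictions, num_buckets=10):
    """Group (probability, is_correct) pairs into equal-width buckets.

    Returns (bucket_low, bucket_high, count, accuracy) rows for the
    non-empty buckets, ordered from low to high probability.
    """
    buckets = defaultdict(list)
    for prob, is_correct in predictions:
        # Clamp so that a probability of exactly 1.0 lands in the last bucket.
        index = min(int(prob * num_buckets), num_buckets - 1)
        buckets[index].append(is_correct)
    rows = []
    for index in sorted(buckets):
        outcomes = buckets[index]
        accuracy = sum(outcomes) / len(outcomes)
        rows.append((index / num_buckets, (index + 1) / num_buckets,
                     len(outcomes), accuracy))
    return rows


# Hypothetical held-out predictions: (predicted probability, was it correct?)
held_out = [(0.95, True), (0.92, True), (0.91, False), (0.55, True),
            (0.52, False), (0.12, False), (0.08, False), (0.05, True)]

for low, high, count, accuracy in calibration_table(held_out, num_buckets=5):
    print(f"[{low:.1f}, {high:.1f}) n={count} accuracy={accuracy:.2f}")
```

A bucket whose accuracy sits far from its probability range is a natural place to start error analysis.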

- Declarative, end-to-end specification (DDlog)
- Statistical inference model
- Data processing

- Standard tools for data processing
- SQL, Python UDFs, CoreNLP
- Extractors in familiar languages

- Automatic features (DDlib)
- Generic patterns in data (e.g., NLP tags, dependency)

- Data programming: DeepDive's future
- Programmatic supervision: noisy but cost-effective
- Apps built entirely with rules using patterns in data
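
As a rough illustration of programmatic supervision, the sketch below lets several noisy rules vote on a candidate and combines the votes by simple majority. Majority voting is a deliberately simple stand-in chosen for this example, not the method the poster describes. The rules, keywords, and function names are hypothetical.

```python
def majority_label(votes):
    """Combine rule votes (+1 positive, -1 negative, 0 abstain).

    Returns +1 or -1 for a clear majority, and None on a tie or when
    every rule abstains, so undecided candidates stay unlabeled.
    """
    total = sum(votes)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return None


def votes_for(lemmas):
    """Apply three hypothetical noisy rules to a sentence's lemmas."""
    words = set(lemmas)
    return [
        1 if words & {"wife", "husband"} else 0,     # spouse keyword: vote positive
        -1 if words & {"brother", "sister"} else 0,  # sibling keyword: vote negative
        1 if "marry" in words else 0,                # marriage verb: vote positive
    ]


print(majority_label(votes_for(["he", "marry", "his", "wife"])))
print(majority_label(votes_for(["his", "sister", "and", "her", "husband"])))
```

Each rule is cheap to write and individually unreliable; the point of the sketch is only that labels come from code rather than from hand annotation.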

DeepDive creating societal impact

Input: Unstructured information e.g., text, tables (plan to add images, diagrams)

Output: Structured data with probabilities

Factor Graph

- Leveraging database & operating systems technologies
- PostgreSQL, Greenplum
- Parallel/distributed data processing

- Incremental maintenance of inference results
- Run small changes to model up to 100x faster
- VLDB 2015, SIGMOD Research Highlight Award 2015

- DimmWitted: Fast Inference Engine
- NUMA-aware high-speed Gibbs sampler
- SIGMOD 2014 Best Paper Award

- Noise-aware learning technique
- Noisy rules reinforcing/fixing-up other rules
- Handled jointly in a unified statistical inference model
- Higher quality possible despite noisy individual rules
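
To make the Gibbs sampling mentioned above concrete, here is a minimal single-threaded sketch on a toy factor graph of three Boolean variables. It is not DimmWitted and has none of its NUMA-aware optimizations. The factors, weights, and function names are assumptions made up for the example. The sampler's estimates are checked against marginals computed by brute-force enumeration, which is only feasible because the graph is tiny.

```python
import itertools
import math
import random

# Each factor: (weight, indices of the variables it touches, Boolean function).
# These three factors are hypothetical and exist only for this example.
FACTORS = [
    (1.0, (0,), lambda a: a),              # variable 0 tends to be true
    (1.0, (0, 1), lambda a, b: a == b),    # variables 0 and 1 tend to agree
    (1.0, (1, 2), lambda b, c: b == c),    # variables 1 and 2 tend to agree
]
NUM_VARS = 3


def score(assignment):
    """Sum of the weights of satisfied factors (log of the unnormalized probability)."""
    return sum(w * bool(f(*(assignment[i] for i in idx))) for w, idx, f in FACTORS)


def exact_marginals():
    """Brute-force P(x_i = True) by enumerating all 2^n assignments."""
    totals = [0.0] * NUM_VARS
    z = 0.0
    for assignment in itertools.product([False, True], repeat=NUM_VARS):
        p = math.exp(score(assignment))
        z += p
        for i, value in enumerate(assignment):
            if value:
                totals[i] += p
    return [t / z for t in totals]


def gibbs_marginals(num_sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(x_i = True) by resampling one variable at a time."""
    rng = random.Random(seed)
    state = [False] * NUM_VARS
    counts = [0] * NUM_VARS
    for sweep in range(num_sweeps + burn_in):
        for i in range(NUM_VARS):
            # Conditional of x_i given the rest depends only on the score gap.
            # For simplicity this rescans every factor; a real sampler would
            # only look at the factors adjacent to variable i.
            state[i] = True
            score_true = score(state)
            state[i] = False
            score_false = score(state)
            p_true = 1.0 / (1.0 + math.exp(score_false - score_true))
            state[i] = rng.random() < p_true
        if sweep >= burn_in:
            for i in range(NUM_VARS):
                counts[i] += state[i]
    return [c / num_sweeps for c in counts]


print("exact:", [round(p, 3) for p in exact_marginals()])
print("gibbs:", [round(p, 3) for p in gibbs_marginals()])
```

The two printed lists should be close to each other; the sampled estimates vary slightly with the seed and the number of sweeps.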

1.8M documents (TAC-KBP 2014)

2.4M facts

- Powers DARPA MEMEX to fight human trafficking, actively deployed in many law enforcement agencies

- Accelerates biodiversity research, featured in Nature
- Award-winning knowledge base construction systems

DeepDive extracting and fusing knowledge

deepdive.stanford.edu

hasSpouse(p1, p2, 1, "from_dbpedia") :-
    spouseCandidate(p1, name1, p2, name2),
    spousesInDBpedia(n1, n2),
    [ lower(n1) = lower(name1), lower(n2) = lower(name2)
    ; lower(n2) = lower(name1), lower(n1) = lower(name2)
    ].

hasSpouse(p1, p2, 1, "wife/husband between") :-
    spouseCandidate(p1, name1, p2, name2),
    personMention(p1, begin1, end1, s),
    personMention(p2, begin2, end2, s),
    sentences(s, tokens, lemmas),
    lemmas[end1..begin2] && ["wife", "husband"].
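
For readers less familiar with Datalog-style syntax, the two rules above can be read roughly as the Python sketch below. This is an interpretation, not DeepDive's semantics. It assumes that `&&` means "the two arrays share at least one element" and that `lemmas[end1..begin2]` covers the lemmas between the two mentions, expressed here as a Python half-open slice. The function names and the toy sentence are hypothetical.

```python
def label_from_dbpedia(name1, name2, known_spouses):
    """Rule 1: positive if the name pair is a known spouse pair, in either order.

    Names are compared case-insensitively, mirroring the lower(...) calls.
    """
    known = {(a.lower(), b.lower()) for a, b in known_spouses}
    pair = (name1.lower(), name2.lower())
    return pair in known or pair[::-1] in known


def label_from_lemmas_between(lemmas, end1, begin2):
    """Rule 2: positive if "wife" or "husband" occurs between the two mentions.

    Assumes end1 is the index just past the first mention and begin2 is the
    index where the second mention starts.
    """
    between = lemmas[end1:begin2]
    return any(lemma in ("wife", "husband") for lemma in between)


# Hypothetical sentence and known-spouse table, for illustration only.
lemmas = ["Barack", "Obama", "and", "his", "wife", "Michelle", "Obama", "arrive"]
known_spouses = [("Barack Obama", "Michelle Obama")]

print(label_from_dbpedia("michelle obama", "BARACK OBAMA", known_spouses))
print(label_from_lemmas_between(lemmas, end1=2, begin2=5))
```

Both rules emit only positive labels (the `1` in the rule head), each tagged with the name of the rule that produced it.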

Where are the human-traffickers on the dark Web?

Location  Price
NYC       $…

National Distribution

Which gene mutation causes glaucoma?

Gene   Phenotype
LTBP2  Glaucoma

What is the effect of a drug?

Gene1  Gene2
BRCA1  ARNT

Protein Interaction

Biomed Literature

Online Ads/Reviews

Gene-Pheno Interaction

Genomics Literature

Mutations altering the coding sequence of KRT16 cause Pachyonychia Congenita …

Recessive mutations in LTBP2 (…) have been identified as a cause of early-onset glaucoma …

Mutations in QARS, encoding Glutaminyl-tRNA synthetase, cause progressive microcephaly …

incremental runs < 0.5 hour [VLDB15]

full run 6 hours

Think about features, not algorithms

Think about training, not features nor algorithms

- Open source engine for building machine learning systems that extract databases from dark data

- Helps answer macroscopic questions that require high-quality structured data at massive scale

- Enables rapid development with tools & abstractions

error analysis

debug

re-run

Paleobiology

Anti-Human Trafficking

TAC-KBP 2014 Winner

Genomics

Drug Repurposing