

DeepDive: Extracting Databases from Dark Data

Hazy Research Group, Infolab, Computer Science Department, Stanford University

Jaeho Shin, Ce Zhang, Feiran Wang, Sen Wu, Alex Ratner, Theodoros Rekatsinas, Thomas Palomares, Youssef Ahres, Dan Iter, Ivan Gozali, Henry Ehrenberg, Zifei Shan, Feng Niu, Raphael Hoffman, Michael Cafarella, Christopher Ré

Debugging DeepDive apps

Running DeepDive apps

Writing DeepDive apps

- Interpretable probabilities, no arbitrary scores
- Calibration plots show if system output makes sense
- Algorithm-independent evaluation method
- Iterative error analysis process helps users achieve high quality rapidly by focusing on the top error mode
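The idea behind a calibration plot can be sketched in a few lines of Python. This is a minimal illustration, not DeepDive's own tooling: the function name `calibration_table`, the bucket count, and the `held_out` sample data are all assumptions made for the example. It groups predictions by their output probability and reports the fraction that are actually true in each bucket; for a well-calibrated system the two numbers roughly agree.

```python
from collections import defaultdict


def calibration_table(predictions, num_buckets=10):
    """Bucket (probability, is_true) pairs and report empirical accuracy per bucket.

    `predictions` is a hypothetical held-out set of labeled outputs.
    Returns rows of (bucket_low, bucket_high, count, fraction_true).
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket index -> [num_true, num_total]
    for prob, is_true in predictions:
        # A probability of exactly 1.0 falls into the last bucket.
        idx = min(int(prob * num_buckets), num_buckets - 1)
        buckets[idx][0] += int(is_true)
        buckets[idx][1] += 1

    rows = []
    for idx in sorted(buckets):
        num_true, total = buckets[idx]
        rows.append((idx / num_buckets, (idx + 1) / num_buckets, total, num_true / total))
    return rows


# Made-up held-out predictions, for illustration only.
held_out = [
    (0.95, True), (0.92, True), (0.91, False),
    (0.55, True), (0.52, False),
    (0.15, False), (0.12, False),
]

for low, high, count, accuracy in calibration_table(held_out):
    print(f"[{low:.1f}, {high:.1f}): n={count} accuracy={accuracy:.2f}")
```

Buckets whose accuracy sits far from their probability range point at the error modes worth inspecting first.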

- Interactive tools for evaluating and exploring data

- Declarative, end-to-end specification (DDlog)
- Statistical inference model
- Data processing

- Standard tools for data processing
- SQL, Python UDFs, CoreNLP
- Extractors in familiar languages

- Automatic features (DDlib)
- Generic patterns in data (e.g., NLP tags, dependency)

- Data programming: DeepDive's future
- Programmatic supervision: noisy but cost-effective
- Apps built entirely with rules using patterns in data

DeepDive creating societal impact

Input: Unstructured information, e.g., text, tables (plan to add images, diagrams)

Output: Structured data with probabilities

Factor Graph

- Leveraging database & operating systems technologies
- PostgreSQL, Greenplum
- Parallel/distributed data processing

- Incremental maintenance of inference results
- Run small changes to model up to 100x faster
- VLDB 2015, SIGMOD Research Highlight Award 2015

- DimmWitted: Fast Inference Engine
- NUMA-aware high-speed Gibbs sampler
- SIGMOD 2014 Best Paper Award
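To make the Gibbs sampling step concrete, here is a small single-threaded sketch over a toy factor graph with Boolean variables. It is not DimmWitted and has none of its NUMA-aware optimizations; the function `gibbs_marginals`, the factor encoding, and the two example factors with their weights are assumptions chosen for illustration. Each sweep resamples one variable at a time from its conditional distribution given the factors that touch it, and the marginal probability of each variable is estimated from the sample counts.

```python
import math
import random


def gibbs_marginals(num_vars, factors, num_samples=20000, burn_in=1000, seed=0):
    """Estimate P(variable = True) for each Boolean variable by Gibbs sampling.

    `factors` is a list of (weight, variable_indices, fn) triples, where
    fn(values) returns 0 or 1 and the unnormalized log-probability of a state
    is the sum of weight * fn(values) over all factors.
    """
    rng = random.Random(seed)
    state = [rng.random() < 0.5 for _ in range(num_vars)]
    # Only the factors touching a variable matter when resampling it.
    touching = [[f for f in factors if v in f[1]] for v in range(num_vars)]
    counts = [0] * num_vars

    def local_score(v):
        return sum(w * fn([state[i] for i in idxs]) for w, idxs, fn in touching[v])

    for step in range(burn_in + num_samples):
        for v in range(num_vars):
            state[v] = True
            score_true = local_score(v)
            state[v] = False
            score_false = local_score(v)
            # Conditional probability of True given all other variables.
            p_true = 1.0 / (1.0 + math.exp(score_false - score_true))
            state[v] = rng.random() < p_true
        if step >= burn_in:
            for v in range(num_vars):
                counts[v] += state[v]
    return [c / num_samples for c in counts]


# Toy graph: a prior factor favoring variable 0, plus a factor rewarding agreement.
factors = [
    (1.0, (0,), lambda vals: int(vals[0])),
    (2.0, (0, 1), lambda vals: int(vals[0] == vals[1])),
]
marginals = gibbs_marginals(2, factors)
print([round(m, 2) for m in marginals])
```

For a graph this small the marginals can also be computed exactly by enumerating all four states, which is a useful sanity check on the sampler.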

- Noise-aware learning technique
- Noisy rules reinforcing/fixing up other rules
- Handled jointly in a unified statistical inference model
- Higher quality possible despite noisy individual rules
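A rough intuition for why several noisy rules can together beat any single one is sketched below. This is a deliberately simplified stand-in, not DeepDive's learning procedure: it assumes each rule's accuracy is already known and that rules err independently (a naive Bayes combination), whereas the poster describes the rules being handled jointly inside the statistical inference model. The function `combine_votes` and the accuracy values are hypothetical.

```python
import math


def combine_votes(votes, accuracies, prior=0.5):
    """Combine rule votes into a probability that the candidate is true.

    votes:      +1 (rule says true), -1 (rule says false), 0 (rule abstains)
    accuracies: assumed probability that each rule is right when it votes
    Assumes rules are conditionally independent given the true label.
    """
    log_odds = math.log(prior / (1 - prior))
    for vote, acc in zip(votes, accuracies):
        log_odds += vote * math.log(acc / (1 - acc))
    return 1 / (1 + math.exp(-log_odds))


# Three rules that are each right only 70% of the time, all agreeing.
agree = combine_votes([1, 1, 1], [0.7, 0.7, 0.7])
# A strong rule overruled neither by nor against two weak rules that cancel out.
mixed = combine_votes([1, -1, 1], [0.9, 0.6, 0.6])
print(f"agree={agree:.3f} mixed={mixed:.3f}")
```

When the three 70%-accurate rules agree, the combined confidence is higher than any one of them alone, which is the sense in which noisy rules can reinforce each other.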

1.8M documents (TAC-KBP 2014)

2.4M facts

- Powers DARPA MEMEX to fight human trafficking, actively deployed in many law enforcement agencies

- Accelerates biodiversity research, featured in Nature
- Award-winning knowledge base construction systems

DeepDive extracting and fusing knowledge

deepdive.stanford.edu

hasSpouse(p1, p2, 1, "from_dbpedia") :-
    spouseCandidate(p1, name1, p2, name2),
    spousesInDBpedia(n1, n2),
    [ lower(n1) = lower(name1), lower(n2) = lower(name2)
    ; lower(n2) = lower(name1), lower(n1) = lower(name2) ].

hasSpouse(p1, p2, 1, "wife/husband between") :-
    spouseCandidate(p1, name1, p2, name2),
    personMention(p1, begin1, end1, s),
    personMention(p2, begin2, end2, s),
    sentences(s, tokens, lemmas),
    lemmas[end1..begin2] && ["wife", "husband"].
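The logic of the two supervision rules above can be mirrored in plain Python for readers unfamiliar with DDlog. This is a sketch under stated assumptions rather than what DeepDive executes: the dictionary layout, the function names, the mention IDs, and the sample sentence are hypothetical; both mentions are assumed to sit in the same sentence; and the span `lemmas[end1..begin2]` is read as the tokens strictly between the two mentions, with `end1` taken as the index just past the first mention.

```python
def label_from_dbpedia(candidate, known_spouses):
    """Label a candidate positive if its two names match a known pair in either order."""
    a, b = candidate["name1"].lower(), candidate["name2"].lower()
    known = {(n1.lower(), n2.lower()) for n1, n2 in known_spouses}
    if (a, b) in known or (b, a) in known:
        return (candidate["p1"], candidate["p2"], 1, "from_dbpedia")
    return None


def label_from_wife_husband_between(candidate, mentions, lemmas):
    """Label a candidate positive if 'wife' or 'husband' appears between the mentions."""
    _, end1 = mentions[candidate["p1"]]
    begin2, _ = mentions[candidate["p2"]]
    between = lemmas[end1:begin2]  # assumed half-open token span
    if {"wife", "husband"} & set(between):
        return (candidate["p1"], candidate["p2"], 1, "wife/husband between")
    return None


# Hypothetical single-sentence example.
lemmas = ["Barack", "Obama", "and", "his", "wife", "Michelle", "Obama", "attend"]
mentions = {"m1": (0, 2), "m2": (5, 7)}  # mention id -> (begin, end) token offsets
candidate = {"p1": "m1", "name1": "Barack Obama", "p2": "m2", "name2": "Michelle Obama"}
known_spouses = {("Michelle Obama", "Barack Obama")}

labels = [
    label_from_dbpedia(candidate, known_spouses),
    label_from_wife_husband_between(candidate, mentions, lemmas),
]
print(labels)
```

Each rule emits a label tagged with the rule's identifier, so that later error analysis can trace a noisy label back to the rule that produced it.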

Where are the human traffickers on the dark Web?

Location Price

NYC $…

National Distribution

Which gene mutation causes glaucoma?

Gene Phenotype

LTBP2 Glaucoma

What is the effect of a drug?

Gene1 Gene2

BRCA1 ARNT

Protein Interaction

Biomed Literature, Online Ads/Reviews

Gene-Pheno Interaction

Genomics Literature

Mutations altering the coding sequence of KRT16 cause Pachyonychia Congenita …

Recessive mutations in LTBP2 (…) have been identified as a cause of early-onset glaucoma …

Mutations in QARS, encoding Glutaminyl-tRNA synthetase, cause progressive microcephaly …

incremental runs < 0.5 hour [VLDB15]

full run 6 hours

Think about features, not algorithms

Think about training, not features nor algorithms

- Open source engine for building machine learning systems that extract databases from dark data

- Helps answer macroscopic questions that require high-quality structured data at massive scale

- Enables rapid development with tools & abstractions

error analysis, debug

re-run

Paleobiology, Anti-Human Trafficking

TAC-KBP 2014 Winner

Genomics Drug Repurposing