PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is...

PROJECT PROPOSALS: COMMUNITY DETECTION AND ENTITY RESOLUTION Donatella Firmani [email protected]



PROJECT 1: COMMUNITY DETECTION


What is Community Detection?

• What is Social Network Analysis?

• Community detection: discovering groups in a network where individuals' group memberships are not explicitly given

Network Analysis is the keyword for the 21st century.

Researchers, politicians, and people in general talk about social networks.


Subjectivity of Community Definition

Each connected component is a community

A densely-connected community

Definition of a community can be subjective.


Node-Centric Community Detection

• Node-centric community: each node in a group satisfies certain properties

• Sample properties:
  • Complete mutuality
    • cliques
  • Reachability of members
    • k-clique, k-clan, k-club
  • Nodal degrees
    • k-plex, k-core
  • Relative frequency of within-outside ties


Complete Mutuality: Cliques

• Clique: a maximal complete subgraph, i.e. one in which all nodes are adjacent to each other

• NP-hard to find the maximum clique in a network
• Straightforward implementation to find cliques is very expensive in time complexity

Nodes 5, 6, 7 and 8 form a clique


Enumerating all Maximal Cliques [CDMPT16]

[Figure: example graph on nodes A, D, E, F, G, H, J, S, U, W, Y]
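As a point of reference, maximal cliques can be enumerated with the classic Bron–Kerbosch recursion. The sketch below is a textbook version without pivoting, not the algorithm of [CDMPT16]; the toy graph (nodes 4 to 8) is a hypothetical example chosen so that nodes 5, 6, 7 and 8 form a clique.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each node to the set of its neighbours (no self-loops).
    """
    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already explored
        if not p and not x:
            yield set(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Hypothetical toy graph: 5, 6, 7, 8 are pairwise adjacent; 4 is attached to 5.
edges = [(4, 5), (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cliques = sorted(sorted(c) for c in maximal_cliques(adj))
print(cliques)
```

The number of maximal cliques can be exponential in the number of nodes, so on large networks even this output-sensitive enumeration may be costly.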


Clique is Very Strict

• Clique is a very strict definition
  • Vertices of a clique are at distance 1 from each other
  • Diameter of the induced subgraph is 1
  • Min-degree of the induced subgraph is s-1 (s = clique size)

• Normally use relaxations of cliques as definition for communities

• Clique relaxations include:
  • k-clique: vertices with distance* no greater than k from each other
  • k-club / k-clan: subgraphs of diameter no greater than k
  • k-plex: subgraphs of min-degree no less than s-k

• 1-clique = 1-club = 1-clan = 1-plex = clique

• (*) distance is computed on the input graph and can contain "external" edges
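The relaxations above can be checked directly on small graphs. The following is a minimal sketch, assuming an undirected graph stored as a dict of neighbour sets; the function names and the toy graph are illustrative, and the k-plex test uses the "at least s-k neighbours inside the set" reading of the definition.

```python
from collections import deque
from itertools import combinations


def bfs_dist(adj, src, allowed=None):
    """Hop distances from src; if allowed is given, never leave that node set."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist and (allowed is None or w in allowed):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_k_clique(adj, s, k):
    """Pairwise distance <= k, measured on the whole input graph."""
    return all(bfs_dist(adj, u).get(v, float("inf")) <= k
               for u, v in combinations(s, 2))


def is_k_club(adj, s, k):
    """Diameter of the subgraph induced by s is <= k."""
    return all(bfs_dist(adj, u, allowed=s).get(v, float("inf")) <= k
               for u, v in combinations(s, 2))


def is_k_plex(adj, s, k):
    """Every node of s has at least |s| - k neighbours inside s."""
    return all(len(adj[u] & s) >= len(s) - k for u in s)


# Hypothetical toy graph.
edges = [(1, 2), (1, 4), (4, 3), (2, 5), (5, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(is_k_clique(adj, {1, 2, 3}, 2))  # paths may pass through external nodes 4 and 5
print(is_k_club(adj, {1, 2, 3}, 2))    # node 3 is isolated in the induced subgraph
print(is_k_plex(adj, {1, 2, 4}, 2))
```

The set {1, 2, 3} shows the role of the footnote: it is a 2-clique only because the shortest paths use "external" edges, and it is not a 2-club.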


Enumerating large k-plexes [CFMPT17]



Graph Databases

• Store data as nodes and relationships
• Database full of linked nodes


Sample Graph DB

• AllegroGraph
• Bitsy
• Cayley
• GraphBase
• Graphd
• HyperGraphDB
• IBM System G
• imGraph
• InfiniteGraph
• InfoGrid
• Neo4j
• Sparksee/DEX
• Trinity
• TurboGraph


Sample Graph DB queries

• Pattern matching query
  • Nodes with first name "James"

• Adjacency query
  • Nodes that James knows directly
  • I.e., are adjacent to James in the knows relationship

• Reachability query
  • Nodes that James knows
  • I.e., are reachable from James in the knows relationship

• Graph analytical query
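To make the difference between the first three query types concrete, here is a small in-memory sketch in plain Python (not a graph database query language). The people and the "knows" edges are made up for illustration.

```python
from collections import deque

# Hypothetical directed "knows" relationship.
knows = {
    "James": ["Anna", "Bob"],
    "Anna": ["Carla"],
    "Bob": [],
    "Carla": ["Dave"],
    "Dave": [],
    "Eve": ["James"],
}
# Hypothetical node property: here the first name is simply the node id.
first_name = {person: person for person in knows}


def pattern_match(name):
    """Pattern matching: nodes whose first name equals `name`."""
    return [n for n, fn in first_name.items() if fn == name]


def adjacent(node):
    """Adjacency: nodes one 'knows' hop away."""
    return set(knows[node])


def reachable(node):
    """Reachability: nodes connected by a path of one or more 'knows' hops."""
    seen = set()
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in knows[u]:
            if v not in seen and v != node:
                seen.add(v)
                queue.append(v)
    return seen


print(pattern_match("James"))
print(sorted(adjacent("James")))
print(sorted(reachable("James")))
```

Adjacency touches only the neighbours of one node, while reachability needs a traversal whose cost depends on how much of the graph is connected to the start node.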


Single SQL query for # connected components (for FUN)

http://stackoverflow.com/questions/33465859/a-number-of-connected-components-of-a-graph-in-sql
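For comparison with the SQL approach linked above, the same count takes a few lines outside the database. This is a minimal union-find sketch on a made-up edge list, not a translation of the linked SQL answer.

```python
def count_components(nodes, edges):
    """Number of connected components of an undirected graph."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = len(parent)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv   # merge the two components
            components -= 1
    return components


# Hypothetical graph: {1, 2, 3}, {4, 5}, and two isolated nodes 6 and 7.
n_components = count_components(range(1, 8), [(1, 2), (2, 3), (4, 5)])
print(n_components)
```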


Neo4j query for # connected components

• http://172.17.0.21:7474/service/mazerunner/analysis/connected_components/FOLLOWS

• Via Mazerunner
  • REST API
  • https://github.com/neo4j-contrib/neo4j-mazerunner
  • Integrates Apache Spark, GraphX and Neo4j for big scale graph analysis

• GraphX: Apache Spark's API for graphs and graph-parallel computation


Performance


Summary and open problems

• Network Analysis is the keyword for the 21st century
  • Researchers, politicians, and people in general talk about social networks.

• Problems:
  • Communities
  • Analysis of Structure & Social Space

• Technologies:
  • Graph DB
  • Big Data technologies


PROJECT 2: ENTITY RESOLUTION


What is Entity Resolution (ER)?

• Input data: modeled as a graph.
  • Graph node = data record.
  • Graph edge label = probability that record pair represents the same entity.

• Output: a set of clusters, each of which corresponds to an entity.
  • 2 nodes in a cluster iff records represent the same entity.

• Traditional problems [EIV07, GM12].
  • Pairwise match: what is the probability that two records match?
  • Clustering: how to partition records into an unknown # of entities?
  • Blocking: how to perform ER in sub-quadratic time?
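The blocking idea can be illustrated with a toy example: records are grouped by a cheap key, and only pairs inside the same block are compared. The records and the blocking key below are hypothetical, and a poorly chosen key can still leave blocks of quadratic size or separate true matches.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: id -> name.
records = {
    1: "Jon Smith",
    2: "John Smith",
    3: "Jane Smyth",
    4: "Mary Jones",
    5: "Maria Jones",
}


def blocking_key(name):
    """Assumed key: first letter of the last token of the name."""
    return name.split()[-1][0].lower()


blocks = defaultdict(list)
for rid, name in records.items():
    blocks[blocking_key(name)].append(rid)

# Only pairs that share a block become candidates for pairwise matching.
candidate_pairs = sorted(
    pair for ids in blocks.values() for pair in combinations(sorted(ids), 2)
)
all_pairs = len(records) * (len(records) - 1) // 2

print(candidate_pairs)
print(len(candidate_pairs), "of", all_pairs, "pairs compared")
```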


What is ER Using an Oracle?

• Input data: modeled as a graph.

• Output: a set of clusters = entities.

• Formal problem [WL+13, VBD14, FSS16]:
  • Given an oracle that can correctly answer if a record pair is a match, what is an optimal strategy to ask oracle queries so as to minimize the number of queries for resolving the entire graph?

• Motivation: reduce crowdsourcing ER cost for dataset.


What is Online ER Using an Oracle?

• Formal problem [FSS16]:
  – Given an oracle that can correctly answer if a record pair is a match, what is an optimal strategy to ask oracle queries so as to maximize progressive recall with respect to the sequence of oracle queries?
  – Progressive recall = area under the "recall vs query sequence" curve.

• Motivation: limited resolution time, early user termination.
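Progressive recall can be made concrete with a small simulation. The sketch below assumes a perfect oracle backed by a known ground truth, counts a matching pair as found once it follows from positive answers by transitivity, and normalises the area by the number of queries; these modelling choices and all names are assumptions for illustration, not the exact definition in [FSS16].

```python
from itertools import combinations


def progressive_recall(queries, entity_of):
    """Mean recall over the query sequence (normalised area under the curve).

    queries   -- ordered list of record pairs sent to the oracle
    entity_of -- ground truth, record -> entity id (plays the perfect oracle)
    Assumes at least one query and at least one matching pair.
    """
    total = sum(1 for u, v in combinations(entity_of, 2)
                if entity_of[u] == entity_of[v])
    cluster = {r: {r} for r in entity_of}  # record -> its current cluster
    found = 0
    curve = []
    for u, v in queries:
        if entity_of[u] == entity_of[v] and cluster[u] is not cluster[v]:
            # Merging two clusters reveals every cross pair by transitivity.
            found += len(cluster[u]) * len(cluster[v])
            merged = cluster[u] | cluster[v]
            for r in merged:
                cluster[r] = merged
        curve.append(found / total)
    return sum(curve) / len(curve), curve


# Hypothetical ground truth: two entities, A and B.
entity_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
matches_first = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("a1", "b1")]
matches_last = [("a1", "b1"), ("b1", "b2"), ("a1", "a2"), ("a2", "a3")]

good_pr, good_curve = progressive_recall(matches_first, entity_of)
bad_pr, bad_curve = progressive_recall(matches_last, entity_of)
print(good_pr, good_curve)
print(bad_pr, bad_curve)
```

Both sequences ask the same four queries and end at recall 1.0, but the one that asks likely matches first has the larger area, which is what matters if the user stops early.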


Example: DB of Handwritten Characters

• Data from the Vatican Secret Archives
  – Registri Vaticani: papal letters from throughout the 13th century.

• Linkage problem: entities = characters.


Example: DB of Handwritten Characters

[Figure: handwritten character samples with "?" marks]


Strategy 1: Edge Ordering [WL+13]

• Optimal strategy needs to ask N - K + (K choose 2) oracle queries.
  – Takes advantage of (matching and non-matching) transitivity.

• EO: ask oracle queries in decreasing order of edge probability.
  – Can grow multiple clusters and sub-clusters in parallel.
  – Worst-case approximation ratio of O(N) [VBD14].
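A minimal sketch of edge ordering with a perfect oracle follows, reading N as the number of records and K as the number of entities (an interpretation, since the slide does not define them). Queries whose answer already follows from matching or non-matching transitivity are skipped. The data and function names are illustrative, and this is a simplified reading of the strategy rather than the implementation of [WL+13].

```python
from itertools import combinations
from math import comb


def edge_ordering(prob, oracle):
    """Query edges in decreasing probability, skipping inferable answers.

    prob   -- dict {(u, v): probability that u and v match}
    oracle -- function (u, v) -> bool, assumed always correct
    Returns (set of clusters, number of oracle queries).
    """
    cluster = {}
    for u, v in prob:
        cluster.setdefault(u, frozenset([u]))
        cluster.setdefault(v, frozenset([v]))
    non_match = set()  # unordered pairs of clusters known to be different
    queries = 0
    for (u, v), _ in sorted(prob.items(), key=lambda kv: -kv[1]):
        cu, cv = cluster[u], cluster[v]
        if cu == cv or frozenset([cu, cv]) in non_match:
            continue  # answer follows from transitivity, no query needed
        queries += 1
        if oracle(u, v):
            merged = cu | cv
            # Non-match facts about the old clusters now hold for the merged one.
            non_match = {frozenset(merged if c in (cu, cv) else c for c in pair)
                         for pair in non_match}
            for r in merged:
                cluster[r] = merged
        else:
            non_match.add(frozenset([cu, cv]))
    return set(cluster.values()), queries


# Hypothetical data: 6 records, 3 entities, well-separated probabilities.
entity_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "c1": "C"}
prob = {(u, v): 0.9 if entity_of[u] == entity_of[v] else 0.1
        for u, v in combinations(entity_of, 2)}

clusters, queries = edge_ordering(prob, lambda u, v: entity_of[u] == entity_of[v])
n, k = len(entity_of), len(set(entity_of.values()))
print(queries, n - k + comb(k, 2))
```

With probabilities this clean, all matching edges are asked first and the query count meets the N - K + (K choose 2) bound; when non-matching edges rank high, the strategy can ask many more queries.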


Strategy 2: Node Ordering [VBD14]

• Optimal strategy needs to ask N - K + (K choose 2) oracle queries.
  – Takes advantage of (matching and non-matching) transitivity.

• NO: process nodes in decreasing order of their expected cluster sizes.
  – Ask oracle queries in decreasing order of edge probability to processed nodes.
  – Can grow similar-sized clusters (but not sub-clusters) in parallel.
  – Worst-case approximation ratio of O(K) [VBD14].
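Node ordering can be sketched in the same setting. The version below assumes that the expected cluster size of a node is 1 plus the sum of its edge probabilities, and that one query per existing cluster suffices because the oracle is perfect; both are simplifying assumptions, and the code is an illustration rather than the algorithm of [VBD14].

```python
from itertools import combinations


def node_ordering(nodes, prob, oracle):
    """Insert nodes one at a time, largest expected cluster first.

    Returns (list of clusters, number of oracle queries).
    """
    def p(u, v):
        return prob.get((u, v), prob.get((v, u), 0.0))

    # Assumption: expected cluster size = 1 + sum of edge probabilities.
    expected_size = {u: 1 + sum(p(u, v) for v in nodes if v != u) for u in nodes}
    order = sorted(nodes, key=lambda u: -expected_size[u])

    clusters = []
    queries = 0
    for u in order:
        # Rank existing clusters by their most probable edge to u.
        candidates = sorted(
            ((max(p(u, v) for v in c), c) for c in clusters),
            key=lambda t: -t[0],
        )
        for _, c in candidates:
            queries += 1
            if oracle(u, c[0]):  # any member answers for the whole cluster
                c.append(u)
                break
        else:
            clusters.append([u])  # u matched no existing cluster
    return clusters, queries


# Hypothetical data: 6 records, 3 entities.
entity_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "c1": "C"}
prob = {(u, v): 0.9 if entity_of[u] == entity_of[v] else 0.1
        for u, v in combinations(entity_of, 2)}

no_clusters, no_queries = node_ordering(
    list(entity_of), prob, lambda u, v: entity_of[u] == entity_of[v]
)
print(no_clusters, no_queries)
```

Each node is compared with at most one representative per existing cluster, which is where the dependence on K rather than N comes from.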


Oracle Strategy for Progressive Recall

• Edge ordering: use benefit metric instead of edge probability.
  – Iteratively query oracle with (u,v) having highest value of b_e(u,v).
  – Initially, edge with highest value of p(u,v) is queried.
  – Subsequently, can query lower probability, higher benefit edge.


Strategy 3: Hybrid Ordering [FSS16]

• Hybrid ordering: use node ordering, then edge ordering.
  – Iteratively: select node u with highest value of b_n(u), then query oracle with (u,v), v ∈ C, in decreasing order of b_n(u,C).
  – Heuristic: use a threshold on benefit b_n(u,C).
  – Finally, process non-inferable edges (u,v) in decreasing order of b_e(u,v).


Errors in Oracle Answers

• Input data: modeled as a graph.

• Output: a set of noisy clusters.

• Formal problem:
  • Given an oracle that answers whether a record pair is a match with some error probability, what is an optimal strategy to ask oracle queries so as to minimize the number of queries for resolving the entire graph while maximizing precision?


Example: DB of Handwritten Characters

[Figure: handwritten character samples with "?" marks]


Errors and Graph Cuts

• Cut: a partition of the nodes (vertices) of a graph into two disjoint subsets.

• Cut-set: the set of edges that have one endpoint in each subset of the partition.

• What would you trust more?


Errors and Graph Cuts

• Cut: a partition of the nodes (vertices) of a graph into two disjoint subsets.

• Cut-set: the set of edges that have one endpoint in each subset of the partition.

• Formal problem:
  • Build graphs with large cuts using as few edges as possible
  • So-called expander graphs

• Technical contribution: prove that the output graph consists of expanders
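The cut-set definition can be computed directly, and on tiny graphs the smallest cut relative to subset size can be found by brute force. The sketch below uses the standard edge expansion measure (minimum of |cut-set(S)| / |S| over non-empty S with at most half the nodes); treating this as the quantity behind "large cuts" is an assumption, and the three toy graphs are made up.

```python
from itertools import combinations


def cut_set(edges, side):
    """Edges with exactly one endpoint in `side`; the other part is the rest."""
    return [(u, v) for u, v in edges if (u in side) != (v in side)]


def edge_expansion(nodes, edges):
    """min |cut_set(S)| / |S| over non-empty S with |S| <= n/2 (brute force)."""
    nodes = list(nodes)
    best = float("inf")
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            best = min(best, len(cut_set(edges, set(subset))) / size)
    return best


nodes = [1, 2, 3, 4]
path = [(1, 2), (2, 3), (3, 4)]                               # 3 edges
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]                      # 4 edges
complete = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]   # 6 edges

print(cut_set(path, {1, 2}))
for name, edges in [("path", path), ("cycle", cycle), ("complete", complete)]:
    print(name, len(edges), edge_expansion(nodes, edges))
```

The path can be split by removing a single edge, so one wrong oracle answer could separate a cluster; adding one edge to close the cycle doubles the expansion. The brute force is exponential and only meant for illustration.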


Strategy 4: Hybrid Ordering with Expanders

• Hybrid ordering with expanders: use node ordering, assigning a node to a cluster only if more than K answers are positive, then edge ordering.
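The assignment rule can be illustrated in isolation. The sketch below reads K literally as a vote threshold (named `k` in the code) and uses a scripted oracle that errs on two fixed pairs; the data, the error pattern and the function name are hypothetical, and the full strategy, including how the queried members are chosen, is not shown here.

```python
def joins_cluster(u, cluster, oracle, k):
    """True if more than k members of `cluster` are reported as matching u."""
    positives = 0
    for v in cluster:
        if oracle(u, v):
            positives += 1
            if positives > k:
                return True  # enough positive answers, stop asking
    return False


# Hypothetical ground truth and a scripted noisy oracle.
truth = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "a5": "A", "x": "X"}
wrong = {("x", "a1"), ("a5", "a2")}  # pairs on which the oracle answers wrongly


def noisy_oracle(u, v):
    answer = truth[u] == truth[v]
    return (not answer) if (u, v) in wrong else answer


cluster = ["a1", "a2", "a3", "a4"]
x_joins = joins_cluster("x", cluster, noisy_oracle, k=1)
a5_joins = joins_cluster("a5", cluster, noisy_oracle, k=1)
print(x_joins, a5_joins)
```

A single false positive no longer pulls "x" into the cluster, and a single false negative does not keep "a5" out. Under this literal reading, a cluster with k or fewer members can never accept a new node, so small clusters would need separate handling.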


Summary and open problems

• Formal study of maximizing progressive recall in online ER.
  • Problem is NP-complete.

• Formal study of maximizing progressive recall and precision in presence of errors in oracle answers.

• Open problems:
  • Design robust, online strategies for errors in oracle answers.
  • Design a more powerful interface for queries than pairwise.
  • Scalability (e.g. blocking)
