PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is...
PROJECT PROPOSALS: COMMUNITY DETECTION AND ENTITY RESOLUTION
PROJECT 1: COMMUNITY DETECTION
What is Community Detection?
• What is Social Network Analysis?
• Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given
Network Analysis is the keyword for the 21st century
Researchers, politicians, and the general public talk about social networks.
Subjectivity of Community Definition
Each connected component is a community
A densely-connected community
Definition of a community can be subjective.
Node-Centric Community Detection
• Node-centric community: each node in a group satisfies certain properties
• Sample properties:
  • Complete mutuality
    • cliques
  • Reachability of members
    • k-clique, k-clan, k-club
  • Nodal degrees
    • k-plex, k-core
  • Relative frequency of within-outside ties
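The nodal-degree criterion can be illustrated with the k-core: the largest subgraph in which every node has at least k neighbours inside the subgraph. The following is a minimal sketch, assuming the graph is stored as a dict mapping each node to a set of neighbours; the function name `k_core` and the toy graph are illustrative and do not come from the slides.

```python
def k_core(adj, k):
    """Return the node set of the k-core by repeatedly peeling low-degree nodes."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Degree is counted only towards nodes that are still alive.
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive


# Toy graph (hypothetical): a triangle 1-2-3 with a pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
core = k_core(toy, 2)
```

In this toy graph the pendant node is peeled away and only the triangle remains in the 2-core.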
Complete Mutuality: Cliques
• Clique: a maximal complete subgraph, i.e., a subgraph in which all nodes are adjacent to each other
• NP-hard to find the maximum clique in a network
• Straightforward implementation to find cliques is very expensive in time complexity
Nodes 5, 6, 7 and 8 form a clique
Enumerating all Maximal Cliques [CDMPT16]
[Figure: example graph with nodes labeled A, D, E, F, G, H, J, S, U, W, Y]
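To make the enumeration task concrete, here is a sketch of the classical Bron-Kerbosch algorithm with pivoting. It is a textbook in-memory method and is not claimed to be the approach of [CDMPT16]; the adjacency-dict representation and the toy graph are assumptions made for illustration.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph as a frozenset."""

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already-explored nodes.
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Toy graph (hypothetical): triangle 1-2-3, bridge 3-4, triangle 4-5-6.
toy = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
cliques = set(maximal_cliques(toy))
```

The bridge 3-4 is itself a maximal clique of size 2, since neither endpoint's triangle contains the other endpoint.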
Clique is Very Strict
• Clique is a very strict definition
  • Vertices of a clique are at distance 1 from each other
  • Diameter of the induced subgraph is 1
  • Min-degree of the induced subgraph is s-1 (for clique size s)
• Normally use relaxations of cliques as definition for communities
• Clique relaxations include:
  • k-clique: vertices with distance* no greater than k from each other
  • k-club / k-clan: subgraphs of diameter no greater than k
  • k-plex: subgraphs of min-degree no less than s-k
• 1-clique = 1-club = 1-clan = 1-plex = clique
• (*) distance is computed on the input graph, so shortest paths can use “external” edges
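The difference between these relaxations is easiest to see by checking them on a small graph. The sketch below assumes an adjacency dict of sets; the helper names are illustrative. It treats a k-clique as a node set whose pairwise distances in the whole input graph are at most k, a k-club as a node set whose induced subgraph has diameter at most k, and a k-plex as a node set of size s in which every member has at least s-k neighbours inside the set.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source, allowed=None):
    """Hop distances from source; if allowed is given, paths stay inside it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist and (allowed is None or w in allowed):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_k_clique(adj, nodes, k):
    # Distances are measured in the input graph, so external nodes may be used.
    nodes = set(nodes)
    return all(bfs_distances(adj, u).get(v, float("inf")) <= k
               for u, v in combinations(nodes, 2))


def is_k_club(adj, nodes, k):
    # Distances are measured inside the induced subgraph only.
    nodes = set(nodes)
    return all(bfs_distances(adj, u, allowed=nodes).get(v, float("inf")) <= k
               for u, v in combinations(nodes, 2))


def is_k_plex(adj, nodes, k):
    nodes = set(nodes)
    return all(len(adj[v] & nodes) >= len(nodes) - k for v in nodes)


# Hypothetical toy graphs: a star with centre 0, and a 4-cycle.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
```

In the star, the three leaves are pairwise at distance 2 through the centre, so they form a 2-clique, but they are not a 2-club because the centre is outside the set. The 4-cycle is a 2-plex but not a 1-plex, i.e., not a clique.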
Enumerating large k-plexes [CFMPT17]
Graph Databases
• Store data as nodes and relationships
• Database full of linked nodes
Sample Graph DB
• AllegroGraph
• Bitsy
• Cayley
• GraphBase
• Graphd
• HyperGraphDB
• IBM System G
• imGraph
• InfiniteGraph
• InfoGrid
• Neo4j
• Sparksee/DEX
• Trinity
• TurboGraph
Sample Graph DB queries
• Pattern matching query
  • Nodes with first name “James”
• Adjacency query
  • Nodes that James knows directly
  • I.e., are adjacent to James in the knows relationship
• Reachability query
  • Nodes that James knows, directly or indirectly
  • I.e., are reachable from James in the knows relationship
• Graph analytical query
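The first three query types can be mimicked on a tiny in-memory graph to show how they differ. This is only a sketch in plain Python, not the query language of any of the databases listed; the people, the property name `first_name` and the function names are all hypothetical.

```python
from collections import deque

# Hypothetical data: node properties and a directed "knows" relationship.
people = {
    1: {"first_name": "James"},
    2: {"first_name": "Mary"},
    3: {"first_name": "Li"},
    4: {"first_name": "Omar"},
}
knows = {1: {2}, 2: {3}, 3: set(), 4: {1}}


def match_first_name(name):
    """Pattern matching: nodes whose first name equals the given value."""
    return {n for n, props in people.items() if props["first_name"] == name}


def adjacent(node):
    """Adjacency: nodes directly connected through the knows relationship."""
    return set(knows[node])


def reachable(start):
    """Reachability: nodes connected to start through one or more knows edges."""
    seen = set()
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in knows[u]:
            if w not in seen and w != start:
                seen.add(w)
                queue.append(w)
    return seen
```

Here James (node 1) knows Mary directly, and reaches Li only through Mary, so the adjacency and reachability answers differ.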
Single SQL query for # connected components (for FUN)
http://stackoverflow.com/questions/33465859/a-number-of-connected-components-of-a-graph-in-sql
Neo4j query for # connected components
• http://172.17.0.21:7474/service/mazerunner/analysis/connected_components/FOLLOWS
• Via Mazerunner
  • REST API
  • https://github.com/neo4j-contrib/neo4j-mazerunner
  • Integrates Apache Spark, GraphX and Neo4j for large-scale graph analysis
• GraphX: Apache Spark's API for graphs and graph-parallel computation
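For comparison with the SQL and Neo4j routes, the number of connected components can also be computed in memory with a union-find structure. This is a minimal single-machine sketch, assuming the graph fits in memory and is given as a node list plus an undirected edge list; it does not reproduce what Mazerunner or GraphX do internally.

```python
def count_connected_components(nodes, edges):
    """Count connected components of an undirected graph with union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = len(parent)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merging two components reduces the count by one
            components -= 1
    return components


# Hypothetical example: components {0, 1, 2}, {3, 4} and the isolated node 5.
n_components = count_connected_components(range(6), [(0, 1), (1, 2), (3, 4)])
```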
Performance
Summary and open problems
• Network Analysis is the keyword for the 21st century
• Researchers, politicians, and the general public talk about social networks.
• Problems:
  • Communities
  • Analysis of Structure & Social Space
• Technologies:
  • Graph DB
  • Big Data technologies
PROJECT 2: ENTITY RESOLUTION
What is Entity Resolution (ER)?
• Input data: modeled as a graph.
  • Graph node = data record.
  • Graph edge label = probability that record pair represents the same entity.
• Output: a set of clusters, each of which corresponds to an entity.
  • 2 nodes in a cluster iff records represent the same entity.
• Traditional problems [EIV07, GM12].
  • Pairwise match: what is the probability that two records match?
  • Clustering: how to partition records into an unknown # of entities?
  • Blocking: how to perform ER in sub-quadratic time?
What is ER Using an Oracle?
• Input data: modeled as a graph.
• Output: a set of clusters = entities.
• Formal problem [WL+13, VBD14, FSS16]:
  • Given an oracle that can correctly answer if a record pair is a match, what is an optimal strategy to ask oracle queries so as to minimize the number of queries for resolving the entire graph?
• Motivation: reduce crowdsourcing ER cost for dataset.
What is Online ER Using an Oracle?
• Formal problem [FSS16]:
  – Given an oracle that can correctly answer if a record pair is a match, what is an optimal strategy to ask oracle queries so as to maximize progressive recall wrt the sequence of oracle queries?
  – Progressive recall = area under “recall vs query sequence” curve.
• Motivation: limited resolution time, early user termination.
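Progressive recall can be illustrated by replaying a query sequence against known ground truth. The sketch below makes several assumptions that are not spelled out in the slides: recall is the fraction of truly matching record pairs that are known after each query (counting pairs inferred by transitivity), and the area under the curve is approximated by the mean recall over the sequence. All names are illustrative.

```python
from collections import Counter


def recall_curve(entity, queries):
    """Recall after each oracle query, counting pairs inferred by transitivity.

    entity maps each record to its true entity id; queries is a list of pairs.
    """
    parent = {r: r for r in entity}
    size = {r: 1 for r in entity}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Total number of truly matching pairs (assumed to be non-zero).
    total = sum(c * (c - 1) // 2 for c in Counter(entity.values()).values())
    found = 0
    curve = []
    for u, v in queries:
        ru, rv = find(u), find(v)
        if entity[u] == entity[v] and ru != rv:
            # Merging two sub-clusters reveals every cross pair at once.
            found += size[ru] * size[rv]
            parent[ru] = rv
            size[rv] += size[ru]
        curve.append(found / total)
    return curve


def progressive_recall(curve):
    """Normalised area under the recall curve (mean recall per query)."""
    return sum(curve) / len(curve)


# Hypothetical ground truth: one entity with 3 records, one with 2.
truth = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
big_first = recall_curve(truth, [("a", "b"), ("b", "c"), ("d", "e")])
small_first = recall_curve(truth, [("d", "e"), ("a", "b"), ("b", "c")])
```

Both orders ask the same three queries and end at full recall, but growing the larger cluster first yields the higher progressive recall, which is the effect the online formulation rewards.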
Example: DB of Handwritten Characters
• Data from the Vatican Secret Archives
  – Registri Vaticani: Pope letters throughout the 13th century.
• Linkage problem: entities = characters.
Strategy 1: Edge Ordering [WL+13]
• Optimal strategy needs to ask N - K + (K choose 2) oracle queries.
  – Takes advantage of (matching and non-matching) transitivity.
• EO: ask oracle queries in decreasing order of edge probability.
  – Can grow multiple clusters and sub-clusters in parallel.
  – Worst-case approximation ratio of O(N) [VBD14].
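A simplified sketch of edge ordering with transitivity follows. It is not the implementation from [WL+13]: it assumes a perfect oracle, probabilities for every record pair, and in-memory bookkeeping with union-find for matches plus a set of cluster pairs known to be distinct. It also assumes N denotes the number of records and K the number of entities. All identifiers and probabilities are hypothetical.

```python
def edge_ordering(nodes, prob, oracle):
    """Ask oracle queries in decreasing probability, skipping inferable pairs."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    non_match = set()  # pairs of cluster roots known to be different entities
    queries = 0
    for u, v in sorted(prob, key=prob.get, reverse=True):
        ru, rv = find(u), find(v)
        if ru == rv or frozenset((ru, rv)) in non_match:
            continue  # answer follows from matching / non-matching transitivity
        queries += 1
        if oracle(u, v):
            parent[rv] = ru
            # Re-point non-match knowledge about rv to the merged root ru.
            non_match = {frozenset(ru if r == rv else r for r in pair)
                         for pair in non_match}
        else:
            non_match.add(frozenset((ru, rv)))
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return {frozenset(c) for c in clusters.values()}, queries


# Hypothetical ground truth and pairwise probabilities.
truth = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
prob = {
    ("a", "b"): 0.9, ("d", "e"): 0.85, ("b", "c"): 0.8, ("a", "c"): 0.7,
    ("a", "d"): 0.3, ("b", "d"): 0.25, ("a", "e"): 0.2, ("c", "d"): 0.15,
    ("b", "e"): 0.1, ("c", "e"): 0.05,
}
clusters, queries = edge_ordering(list(truth), prob, lambda u, v: truth[u] == truth[v])
```

With N = 5 and K = 2, the bound N - K + (K choose 2) evaluates to 4, and because the probabilities here rank all true matches above all non-matches, the sketch resolves the graph with exactly 4 queries.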
Strategy 2: Node Ordering [VBD14]
• Optimal strategy needs to ask N - K + (K choose 2) oracle queries.
  – Takes advantage of (matching and non-matching) transitivity.
• NO: process nodes in decreasing order of their expected cluster sizes.
  – Ask oracle queries in decreasing order of edge probability to processed nodes.
  – Can grow similar-sized clusters (but not sub-clusters) in parallel.
  – Worst-case approximation ratio of O(K) [VBD14].
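One possible reading of node ordering is sketched below; it is a simplification and not the algorithm as specified in [VBD14]. It assumes a perfect oracle, estimates a node's expected cluster size as 1 plus the sum of its incident edge probabilities, and asks one query per existing cluster, using the most probable member as representative. All names and numbers are hypothetical.

```python
def node_ordering(nodes, prob, oracle):
    """Process nodes by decreasing expected cluster size, one query per cluster."""

    def p(u, v):
        return prob.get((u, v), prob.get((v, u), 0.0))

    # Assumed estimate: the node itself plus the probability mass of its edges.
    expected_size = {u: 1 + sum(p(u, v) for v in nodes if v != u) for u in nodes}

    clusters = []
    queries = 0
    for u in sorted(nodes, key=expected_size.get, reverse=True):
        # Try already-built clusters, most promising first.
        ranked = sorted(clusters, key=lambda c: max(p(u, v) for v in c), reverse=True)
        for c in ranked:
            representative = max(c, key=lambda v: p(u, v))
            queries += 1
            if oracle(u, representative):
                c.append(u)  # transitivity: u matches every member of c
                break
        else:
            clusters.append([u])  # no existing cluster matched: u starts a new one
    return clusters, queries


# Hypothetical ground truth and pairwise probabilities.
truth = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
prob = {
    ("a", "b"): 0.9, ("d", "e"): 0.85, ("b", "c"): 0.8, ("a", "c"): 0.7,
    ("a", "d"): 0.3, ("b", "d"): 0.25, ("a", "e"): 0.2, ("c", "d"): 0.15,
    ("b", "e"): 0.1, ("c", "e"): 0.05,
}
clusters, queries = node_ordering(list(truth), prob, lambda u, v: truth[u] == truth[v])
```

Because each node is placed completely before the next one is considered, clusters grow one node at a time and sub-clusters are never merged later, in contrast to edge ordering.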
Oracle Strategy for Progressive Recall
• Edge ordering: use benefit metric instead of edge probability.
  – Iteratively query oracle with (u, v) having highest value of b_e(u, v).
  – Initially, edge with highest value of p(u, v) is queried.
  – Subsequently, can query lower probability, higher benefit edge.
Strategy 3: Hybrid Ordering [FSS16]
• Hybrid ordering: use node ordering, then edge ordering.
  – Iteratively: select node u with highest value of b_n(u), then query oracle with (u, v), v ∈ C, in decreasing order of b_n(u, C).
  – Heuristic: use a threshold on benefit b_n(u, C).
  – Finally, process non-inferable edges (u, v) in decreasing order of b_e(u, v).
Errors in Oracle Answers
• Input data: modeled as a graph.
• Output: a set of noisy clusters.
• Formal problem:
  • Given an oracle that can answer if a record pair is a match with some error probability, what is an optimal strategy to ask oracle queries so as to minimize the number of queries for resolving the entire graph while maximizing precision?
Errors and Graph Cuts
• Vertex cut: a partition of the nodes (vertices) of a graph into two disjoint subsets.
• Cut-set: the set of edges that have one endpoint in each subset of the partition.
• What would you trust more?
• Formal problem:
  • Build graphs with large cuts using as few edges as possible
  • So-called expander graphs
• Technical contribution: prove that the output graph consists of expanders
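The cut-set definition can be made concrete with a few lines of code. This sketch assumes an undirected edge list and represents the partition by one of its two sides; the function name and the toy graph are illustrative.

```python
def cut_set(edges, side):
    """Edges with exactly one endpoint in the given side of the partition."""
    side = set(side)
    return [(u, v) for u, v in edges if (u in side) != (v in side)]


# Hypothetical graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]
bridge_cut = cut_set(edges, {1, 2, 3})
wide_cut = cut_set(edges, {1, 2, 3, 4})
```

A small cut-set means that the two sides are held together by few answers, so a single wrong oracle answer on such an edge can merge or split clusters incorrectly; larger cuts give more redundancy against errors.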
Strategy 4: Hybrid Ordering with Expanders
• Hybrid ordering with expanders: use node ordering by assigning a node to a cluster only if more than K answers are positive, then edge ordering.
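The assignment rule can be sketched as a simple vote count. This is only one plausible reading of the slide: it assumes that K here is a vote threshold, and that a node is compared against every member of a candidate cluster; how the strategy actually chooses which pairs to ask is not specified in the source. The function name and the recorded answers are hypothetical.

```python
def supported_by_votes(u, cluster, oracle, k):
    """Accept u into cluster only if more than k oracle answers are positive."""
    positives = sum(1 for v in cluster if oracle(u, v))
    return positives > k


# Hypothetical noisy answers: ("x", "a") is a false positive and
# ("y", "c") is a false negative with respect to an assumed ground truth.
answers = {
    ("x", "a"): True, ("x", "b"): False, ("x", "c"): False,
    ("y", "a"): True, ("y", "b"): True, ("y", "c"): False,
}


def noisy_oracle(u, v):
    return answers[(u, v)]


x_joins = supported_by_votes("x", ["a", "b", "c"], noisy_oracle, 1)
y_joins = supported_by_votes("y", ["a", "b", "c"], noisy_oracle, 1)
```

With a threshold of 1, the single erroneous positive for x is not enough to place it in the cluster, while y is still accepted despite one erroneous negative.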
Summary and open problems
• Formal study of maximizing progressive recall in online ER.
  • Problem is NP-complete.
• Formal study of maximizing progressive recall and precision in presence of errors in oracle answers.
• Open problems:
  • Design robust, online strategies for errors in oracle answers.
  • Design a more powerful interface for queries than pairwise.
  • Scalability (e.g. blocking)