Clustering and community detection · Comment: Finding dense subgraphs is hard in general •...

Clustering and community detection

Social and Technological Networks

Rik Sarkar

University of Edinburgh, 2018.

Community detection

• Given a network
• What are the “communities”
  – Closely connected groups of nodes
  – Relatively few edges to outside the community
• Similar to clustering in data sets
  – Group together points that are closer or more similar to each other than to other points

Community detection by clustering

• First, define a metric between nodes
  – Either compute intrinsic metrics like all-pairs shortest paths [Floyd-Warshall algorithm, O(n^3)]
  – Or embed the nodes in a Euclidean space, and use the metric there
    • We will later study embedding methods
• Apply a clustering algorithm with the metric
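
The all-pairs shortest path metric mentioned above can be sketched as follows. This is a minimal illustration of Floyd-Warshall, assuming nodes are labelled 0..n-1 and edges are given as (u, v, weight) tuples; the function name and input format are choices made for this example.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an undirected graph.

    n: number of nodes, labelled 0..n-1 (an assumption of this sketch).
    edges: iterable of (u, v, weight); use weight 1 for hop counts.
    Returns an n x n list of lists; math.inf marks unreachable pairs.
    """
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Allow paths through intermediate node k, one k at a time: O(n^3) in total.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Path graph 0 - 1 - 2 - 3 with unit edge weights.
dist = floyd_warshall(4, [(0, 1, 1), (1, 2, 1), (2, 3, 1)])
```

The resulting matrix can be handed to any clustering method that only needs pairwise distances, such as k-medoids or single linkage.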

Clustering

• A core problem of machine learning:
  – Which items are in the same group?
• Identifies items that are similar relative to the rest of the data
• Simplifies information by grouping similar items
  – Helps in all types of other problems

Clustering

• Outline approach:
• Given a set of items
  – Define a distance between them
    • E.g. Euclidean distance between points in a plane; Euclidean distance between other attributes; non-Euclidean distances; path lengths in a network; tie strengths in a network…
  – Determine a grouping (partitioning) that optimises some function (prefers ‘close’ items in the same group).
• Reference for clustering:
  – Charu Aggarwal: Data Mining: The Textbook, Springer
    • Free on Springer site (from university network)
  – Blum et al. Foundations of Data Science (free online)

K-means clustering

• Find k clusters
  – With centers
  – That minimize the sum of squared distances of nodes to their cluster centers (called the k-means cost)

K-means clustering: Lloyd’s algorithm

• There are n items
• Select k ‘centers’
  – May be random k locations in space
  – May be the locations of k of the items selected randomly
  – May be chosen according to some method
• Iterate till convergence:
  – Assign each item to the cluster for its closest center
  – Recompute the location of each center as the mean location of all elements in the cluster (their centroid)
  – Repeat
• Warning: Lloyd’s algorithm is a heuristic. It does not guarantee that the k-means cost is minimised
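
The iteration above can be sketched in plain Python. This is a minimal illustration, assuming points are tuples of numbers and the initial centers are passed in explicitly; the names `lloyd`, `kmeans_cost` and `squared_distance` are choices made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, initial_centers, max_iter=100):
    """Lloyd's heuristic: alternate assignment and centroid updates."""
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each item goes to its closest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each center to the centroid of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels

def kmeans_cost(points, centers, labels):
    """Sum of squared distances of items to their assigned centers."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Start from two of the items, both taken from the same group on purpose.
centers, labels = lloyd(points, [(0, 0), (0, 1)])
cost = kmeans_cost(points, centers, labels)
```

On this well-separated toy data the heuristic recovers the two groups even from a poor start; on harder data it may stop at a local optimum, as the warning above says.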

K-means

• Visualisations
• http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
• http://shabal.in/visuals/kmeans/1.html

K-means

• Ward’s algorithm (also a heuristic)
  – Start with each node as its own cluster
  – At each round, find the two clusters such that merging them increases the k-means cost the least
  – Merge these two clusters
  – Repeat until there are k clusters

K-means: discussion

• Tries to minimise the sum of squared distances of items to cluster centers
  – NP-hard. Computationally intractable
  – Algorithm gives a local optimum
• Depends on initialisation (starting set of centers)
  – Can give poor results
  – Submodular optimisation can help
• The right ‘k’ may be unknown
  – Possible strategy: try different possibilities and take the best
• Can be improved by heuristics like choosing centers carefully
  – E.g. choosing centers to be as far apart as possible: choose one, choose the point farthest from it, choose the point farthest from both (maximise the min distance to the existing set), etc.
  – Try multiple times and take the best result
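
The "as far apart as possible" heuristic can be sketched as a farthest-first traversal. This is a minimal illustration, assuming Euclidean points given as tuples; the function name and the `first` parameter (index of the starting item) are choices made for this example.

```python
import math

def farthest_first_centers(points, k, first=0):
    """Pick k initial centers: start from one item, then repeatedly add
    the item whose distance to its nearest chosen center is largest."""
    centers = [points[first]]
    # min_dist[i] = distance from points[i] to its nearest chosen center
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: min_dist[j])
        centers.append(points[i])
        min_dist = [min(d, math.dist(p, points[i])) for d, p in zip(min_dist, points)]
    return centers

points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 8)]
centers = farthest_first_centers(points, 3)
```

The chosen items can then be used as the starting centers for Lloyd's algorithm.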

K-medoids

• Similar, but now each center must be one of the given items
  – In each cluster, find the item that is the best ‘center’ and repeat
• Useful when there is no ambient space (extrinsic metric)
  – E.g. a distance can be computed between nodes, but they are not in any particular Euclidean space, so the ‘centroid’ in Lloyd’s algorithm is not a meaningful point

Other center-based methods

• K-center: Minimise the maximum distance from an item to its center
• K-median: Minimise the sum of distances from items to their centers
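
As a sketch, these objectives can be written side by side with the k-means cost. The symbols are assumptions made for this illustration: X is the set of items, C a set of k centers, and d the chosen distance.

```latex
\begin{align*}
\text{k-center:} \quad & \min_{C,\ |C| = k} \ \max_{x \in X} \ \min_{c \in C} d(x, c) \\
\text{k-median:} \quad & \min_{C,\ |C| = k} \ \sum_{x \in X} \ \min_{c \in C} d(x, c) \\
\text{k-means:}  \quad & \min_{C,\ |C| = k} \ \sum_{x \in X} \ \min_{c \in C} d(x, c)^2
\end{align*}
```

The only difference between the three is how the per-item distances to the nearest center are aggregated: maximum, sum, or sum of squares.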

Hierarchical clustering

• Hierarchically group items
• Using some standard clustering method

Hierarchical clustering

• Top down (divisive):
  – Start with everything in 1 cluster
  – Make the best division, and repeat in each subcluster
• Bottom up (agglomerative):
  – Start with n different clusters
  – Merge two at a time by finding pairs that give the best improvement

Hierarchical clustering

• Gives many options for a flat clustering
• Problem: what is a good ‘cut’ of the dendrogram?

Density based clustering

• Group dense regions together
• Better at non-linear separations
• Works with unknown number of clusters

DBSCAN

• Density at a data point:
  – Number of data points within radius Eps
• A core point:
  – Point with density at least τ
• Border point
  – Density less than τ, but at least one core point within radius Eps
• Noise point
  – Neither core nor border. Far from dense regions

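Algorithm

• Construct the UDG (unit disk graph) of the core points: connect two core points if they are within distance Eps
• Connected components of the graph give the clusters
• Assign border points to suitable clusters (e.g. to the cluster to which it has most edges)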
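
The three steps above can be sketched as follows. This is a minimal quadratic-time illustration, assuming Euclidean points given as tuples and assuming that a point counts itself when its density is measured; the function name and the use of None for noise are choices made for this example.

```python
import math
from collections import Counter

def dbscan(points, eps, tau):
    """Return one label per point: a cluster index, or None for noise."""
    n = len(points)
    # Neighbours within radius eps (each point is its own neighbour here).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(neighbours[i]) >= tau for i in range(n)]
    labels = [None] * n
    cluster = 0
    # Connected components of the graph on core points give the clusters.
    for s in range(n):
        if not core[s] or labels[s] is not None:
            continue
        labels[s] = cluster
        stack = [s]
        while stack:
            u = stack.pop()
            for v in neighbours[u]:
                if core[v] and labels[v] is None:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    # Border points join the cluster they have most edges to.
    for i in range(n):
        if core[i]:
            continue
        votes = Counter(labels[j] for j in neighbours[i] if core[j])
        if votes:
            labels[i] = votes.most_common(1)[0][0]
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=1.1, tau=3)
```

In this toy data, (2, 0) is a border point attached to the first cluster and (50, 50) is noise.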

DBSCAN: Discussions

• Requires knowledge of suitable radius and density parameters (Eps and τ)
• Does not allow for the possibility that different clusters may have different densities

DBSCAN

• Useful in cases where it is clear which objects can be considered similar but the number of clusters is not known
• Known to perform very well in real problems
• Worst case complexity: O(n^2)
• Current research: making it faster in special cases, approximations, distributed algorithms.

Other density based clustering

• Single linkage (same as Kruskal’s MST algorithm)
  – Start with n clusters
  – Merge the two clusters with the shortest bridging link
  – Repeat until k clusters
• Other, more robust methods exist
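
Single linkage can be sketched as Kruskal's algorithm stopped early. This is a minimal illustration, assuming Euclidean points given as tuples and a simple union-find structure; the function name is a choice made for this example.

```python
import math
from itertools import combinations

def single_linkage(points, k):
    """Merge the two closest clusters until k remain; return cluster labels."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider all pairs in order of increasing distance, as Kruskal does.
    pairs = sorted(
        combinations(range(n), 2),
        key=lambda ij: math.dist(points[ij[0]], points[ij[1]]),
    )
    clusters = n
    for i, j in pairs:
        if clusters <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # shortest link bridging two different clusters
            parent[ri] = rj
            clusters -= 1
    # Relabel clusters 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

labels = single_linkage([(0,), (1,), (2,), (10,), (11,), (20,)], k=3)
```

Running it all the way to k = 1 would build a minimum spanning tree; stopping at k clusters is the same as deleting the k-1 longest edges of that tree.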

Communities

• Groups of friends
• Colleagues/collaborators
• Web pages on similar topics
• Biological reaction groups
• Similar customers/users…

Other applications

• A coarser representation of networks
• One or more meta-nodes for each community
• Identify bridges/weak links
• Structural holes

Community detection in networks

• A simple strategy:
  – Choose a suitable distance measure based on available data
    • E.g. path lengths; distance based on inverse tie strengths; size of largest enclosing group or common attribute; distance in a spectral (eigenvector) embedding; etc.
  – Apply a standard clustering algorithm

Clustering is not always suitable in networks

• Small world networks have small diameter
  – And sometimes integer distances
  – A distance based method does not have many options to represent similarities/dissimilarities
• High degree nodes are common
  – Connect different communities
  – Hard to separate communities
• Edge densities vary across the network
  – Same threshold does not work well everywhere

Definitions of communities

• Varies. Depending on application
• General idea: Dense subgraphs: More links within community, few links outside
• Some types and considerations:
  – Partitions: Each node in exactly one community
  – Overlapping: Each node can be in multiple communities

Comment: Finding dense subgraphs is hard in general

• Finding the largest clique
  – NP-hard
  – Computationally intractable
• Decision version: Does a clique of size k exist?
  – Also NP-complete
  – Computationally intractable
  – Polynomial time (efficient) algorithms unlikely to exist
• We will look for approximations

Dense subgraphs: A few preliminary definitions

• For S, T subsets of V
• e(S, T): Set of edges from S to T
  – e(S) = e(S, S): Edges within S
• d_S(v): number of edges from v to S
• Edge density of S: |e(S)| / |S|
  – Largest for complete graphs or cliques

Dense subgraph problem

• Find the subgraph with largest edge density
• There also exists a decision version:
  – Is there a subgraph with edge density > α?
• Can be solved using Max Flow algorithms
  – O(n^2 m): inefficient in large datasets
  – Finds the one densest subgraph
• Variant: Find the densest S containing a given subset X
• Other versions: Find subgraphs of size k or less
  – NP-hard

Efficient approximation for finding dense S containing X

• Gives a 1/2 approximation
• Edge density of the output set S is at least half that of the optimal set S*
• (Proof in Kempe 2018: http://www-bcf.usc.edu/~dkempe/teaching/structure-dynamics.pdf ).
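
The slide does not spell out the algorithm. One standard greedy "peeling" scheme with a guarantee of this kind is sketched below, as an assumption about what is meant: start from all of V, repeatedly delete the node outside X with the fewest edges into the current set, and return the densest set seen along the way. The graph is assumed to be a dict mapping each node to the set of its neighbours (undirected, no self-loops); the function name is a choice made for this example.

```python
def greedy_dense_subgraph(adj, X=()):
    """Greedy peeling: returns (best_set, best_density) with X inside best_set."""
    X = set(X)
    S = set(adj)
    degree = {v: len(adj[v]) for v in adj}  # d_S(v) for the current S
    edges = sum(degree.values()) // 2       # |e(S)|
    best_density, best_set = edges / len(S), set(S)
    while len(S) > max(len(X), 1):
        # Remove the removable node with the fewest edges into S.
        v = min(S - X, key=lambda u: degree[u])
        S.remove(v)
        edges -= degree[v]
        for u in adj[v]:
            if u in S:
                degree[u] -= 1
        density = edges / len(S)
        if density > best_density:
            best_density, best_set = density, set(S)
    return best_set, best_density

# A 4-clique {0, 1, 2, 3} with a tail 3 - 4 - 5 attached.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
clique, clique_density = greedy_dense_subgraph(adj)
with_tail, tail_density = greedy_dense_subgraph(adj, X={5})
```

Without a seed set the tail is peeled away and the clique remains. Requiring node 5 to stay makes the whole graph the densest set the peeling order encounters.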

Betweenness & graph partitioning

• We want to split the network into tightly knit groups (communities etc.)
• Idea: Identify the edges connecting different communities and remove them
• These edges are “central” to the network
  – They lie on shortest paths
• Betweenness of an edge (e) (can also be considered for a vertex (v)):
  – We send 1 unit of traffic between every pair of nodes in the network
  – Measure what fraction passes through e, assuming the flow is split equally among all shortest paths.

Computing betweenness

• Computing all shortest paths separately is inefficient
• A more efficient way:
• From each node:
  – Step 1: Compute the BFS tree
  – Step 2: Find the number of shortest paths to each node
  – Step 3: Find the flow through each edge
  – See Kleinberg-Easley for the detailed algorithm

Partitioning (Girvan-Newman)

Repeat:
• Find the edge e of highest betweenness
• Remove e

• Produces a hierarchic partitioning structure as the graph decomposes into smaller components
• Network version of hierarchic clustering
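
The BFS-based betweenness computation and the removal loop can be sketched together. This is a minimal illustration for unweighted, undirected graphs, assuming the graph is a dict mapping each node to the set of its neighbours; the function names are choices made for this example, and each unordered pair of nodes is counted once.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness via one BFS per node; keys are frozenset({u, v})."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Steps 1 and 2: BFS from s, counting shortest paths to each node.
        dist, sigma, order = {s: 0}, {v: 0 for v in adj}, []
        sigma[s] = 1
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        # Step 3: push one unit per reached node back towards s,
        # splitting it over parents in proportion to their path counts.
        flow = {v: 1.0 for v in order}
        for v in reversed(order):
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    share = flow[v] * sigma[u] / sigma[v]
                    bet[frozenset((u, v))] += share
                    flow[u] += share
    # Every unordered pair was handled from both of its endpoints.
    return {e: b / 2 for e, b in bet.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u] - comp:
                comp.add(v)
                stack.append(v)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break  # no edges left to remove
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the bridge 2 - 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bet = edge_betweenness(adj)
parts = girvan_newman_split(adj)
```

Calling `girvan_newman_split` again on each resulting component would continue the hierarchy described above.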

Modularity

• What is the right “cut” in a hierarchic clustering that represents good communities?
• Clustering a graph
• Problem: What is the right clustering?
• Idea: Maximize a quantity called modularity

Modularity of subset S

• Given graph G
• Consider a random graph G' with the same node degrees (remember the configuration model)
  – Number of edges in S in G: |e(S)|_G
  – Expected number of edges in S in G': E[|e(S)|_{G'}]
  – Modularity of S: |e(S)|_G − E[|e(S)|_{G'}]
  – More coherent communities have more edges inside than would be expected in a random graph with the same degrees
  – Note: modularity can be negative

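Modularity of a clustering

• Take a partition (clustering) P = {S_1, …, S_k} of V
• Write d(S_i) for the sum of degrees of all nodes in S_i
• It can be shown that E[|e(S_i)|_{G'}] ≈ d(S_i)^2 / (4m), where m is the number of edges in G
• Definition: Sum over the partition:

  q(P) = \frac{1}{m} \sum_i \left( |e(S_i)|_G - \frac{1}{4m} d(S_i)^2 \right)

• Can be used as a stopping criterion (or for finding the right level of partitioning) in other methods
  – E.g. Girvan-Newman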
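
The modularity of a partition can be computed directly from the definition. This is a minimal sketch, assuming an undirected graph stored as a dict mapping each node to the set of its neighbours, with m the number of edges; the function name is a choice made for this example.

```python
def modularity(adj, partition):
    """q(P) = (1/m) * sum_i ( |e(S_i)| - d(S_i)^2 / (4m) )."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for S in partition:
        S = set(S)
        # Each edge inside S is seen from both endpoints, hence the / 2.
        inside = sum(1 for u in S for v in adj[u] if v in S) / 2
        d = sum(len(adj[u]) for u in S)
        q += inside - d * d / (4 * m)
    return q / m

# Two triangles joined by the bridge 2 - 3 (7 edges in total).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_split = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
q_whole = modularity(adj, [{0, 1, 2, 3, 4, 5}])
```

Here the two-triangle split scores 2.5/7 (about 0.36), while keeping everything in one cluster scores 0, so a method such as Girvan-Newman would prefer to stop at the split.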

Modularity based clustering

• Modularity is meant for use more as a measure of quality, not so much as a clustering method
• Finding the clustering with highest modularity is NP-hard
• Heuristic: Louvain method:
  – Place each node in its own community
  – For each community, consider merging with a neighbor.
    • Make the greedy choice: make the merge that maximizes modularity
    • Or do not merge if none increases modularity
  – Repeat
• Note: Modularity is a relative measure for comparing community structure.
• Not entirely clear in which cases it may or may not give good results
• A threshold of 0.3 or more is sometimes considered to give good clustering

Karate club hierarchic clustering

• Shape of nodes gives the actual split in the club due to internal conflicts
  – Newman 2003

Correlation clustering

• Some edges are known to be similar/friends/trusted
  • marked “+”
• Some edges are known to be dissimilar/enemies/distrusted
  • marked “-”
• Maximize the number of + edges inside clusters and
• Minimize the number of - edges inside clusters

Applications

• Community detection based on similar people/users
• Document clustering based on known similarity or dissimilarity between documents
• Use of sentiments and/or other divisive attributes

Features

• Clustering without the need to know the number of clusters
  – k-means, k-median, k-center etc. need to know the number of clusters or other parameters like a threshold
  – Number of clusters depends on network structure
• Actually, does not need any parameter
• NP-hard
• Note that the graph may be complete or not complete
  – In some applications with unlabeled edges, it may be reasonable to change edges to “+” edges and non-edges to “-” edges

Approximation

• Naive 1/2 approximation:
  – If there are more + edges
    • Put them all in 1 cluster
  – If there are more - edges
    • Put nodes in n different clusters
• (not very useful)!
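
The naive rule can be sketched in a few lines. This is a minimal illustration, assuming signed edges are given as (u, v, sign) tuples with sign "+" or "-"; the function names, and breaking a tie in favour of one cluster, are choices made for this example.

```python
def naive_correlation_clustering(nodes, signed_edges):
    """One big cluster if + edges are the majority, else all singletons."""
    plus = sum(1 for _, _, sign in signed_edges if sign == "+")
    minus = len(signed_edges) - plus
    if plus >= minus:
        return [set(nodes)]         # every + edge is inside a cluster
    return [{v} for v in nodes]     # every - edge is between clusters

def agreements(clusters, signed_edges):
    """Count + edges inside clusters plus - edges between clusters."""
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    return sum(
        1
        for u, v, sign in signed_edges
        if (sign == "+") == (cluster_of[u] == cluster_of[v])
    )

nodes = [0, 1, 2, 3]
edges = [(0, 1, "+"), (1, 2, "+"), (2, 3, "-")]
clusters = naive_correlation_clustering(nodes, edges)
score = agreements(clusters, edges)
```

Either way at least half of the edges agree with the output, which is where the factor 1/2 comes from; here 2 of the 3 edges agree.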

Better approximations

• 2 ways of looking at it:
  – Maximize agreement or minimize disagreement
  – Similar idea, but we know different approximation algorithms
• Nikhil Bansal et al. develop a PTAS (polynomial time approximation scheme) for maximizing agreement:
  – (1-ε) approximation, with running time that depends on ε
• Min-disagree:
  – 4-approximation

Local detection of communities

• Suppose there is a huge graph, like the WWW or the Facebook network
• We often want to find the community that contains a particular node or group
  – E.g. to make recommendations: “your friends have watched this movie…”
  – To infer preferences and attributes
• Running a full scale community detection is computationally impractical
• We do not know the number of communities
• A “local method” like DBSCAN can help

Conductance: measure of edges inside community vs outside

• Given subsets S, T in V
• e(S, T): set of edges between S and T
• Volume of edges: vol(S) = \sum_{v \in S} d(v), the sum of the degrees of the nodes in S
• Conductance of S is defined as:
• Communities are likely to have low conductance
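
A minimal sketch of the computation, assuming the commonly used definition: the number of edges leaving S divided by min(vol(S), vol(V \ S)). Some texts normalise by vol(S) alone, so the choice of denominator here is an assumption. The graph is a dict mapping each node to the set of its neighbours, and the function names are choices made for this example.

```python
def volume(adj, S):
    """vol(S): sum of the degrees of the nodes in S."""
    return sum(len(adj[v]) for v in S)

def conductance(adj, S):
    """|e(S, V \\ S)| / min(vol(S), vol(V \\ S)), under the assumed definition."""
    S = set(S)
    rest = set(adj) - S
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    denom = min(volume(adj, S), volume(adj, rest))
    if denom == 0:
        raise ValueError("conductance is undefined when one side has volume 0")
    return cut / denom

# Two triangles joined by the bridge 2 - 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_triangle = conductance(adj, {0, 1, 2})
phi_single = conductance(adj, {0})
```

The triangle has conductance 1/7, while the single node 0 has conductance 1, matching the idea that communities have low conductance.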

Personalised pagerank

• Given a seed set X
• Find the community S that contains X
• Pagerank style: Use random walks
• Algorithm
  – Set a limit k to the number of steps in random walks
  – Repeat:
    • Select at random a start point from X
    • Take k random steps in the graph
  – Count how frequently each node occurs: this is its pagerank
  – Nodes in the community have high pagerank
• Alternative algorithm
  – Set a probability ε to reset the random walk
  – Repeat:
    • Select at random a start point from X
    • With probability 1 − ε move to a random neighbor
    • With probability ε move to a random node in X
    • Count how frequently each node occurs: this is its pagerank
  – Nodes in the community have high pagerank

• Communities have low conductance
• Therefore, short random walks will leave the community only rarely
• Therefore, nodes in the community of X will have high pagerank compared to those outside
• It can be proved that if X is in a low conductance community, nodes outside this community will occur infrequently.
  – We will omit this proof
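
The reset-based variant can be simulated as follows. This is a minimal sketch, assuming the graph is a dict mapping each node to the set of its neighbours; the function name, the parameter names (`eps`, `steps`, `seed`) and their default values are choices made for this example.

```python
import random
from collections import Counter

def personalised_pagerank(adj, X, eps=0.2, steps=20000, seed=0):
    """Estimate personalised pagerank by one long random walk with resets."""
    rng = random.Random(seed)
    X = list(X)
    visits = Counter()
    node = rng.choice(X)
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < eps or not adj[node]:
            node = rng.choice(X)                   # reset to the seed set
        else:
            node = rng.choice(sorted(adj[node]))   # step to a random neighbour
    return {v: visits[v] / steps for v in adj}

# Two 4-cliques, {0, 1, 2, 3} and {4, 5, 6, 7}, joined by the edge 3 - 4.
adj = {v: set() for v in range(8)}
for group in (range(0, 4), range(4, 8)):
    for u in group:
        for v in group:
            if u != v:
                adj[u].add(v)
adj[3].add(4)
adj[4].add(3)

scores = personalised_pagerank(adj, X=[0])
inside = sum(scores[v] for v in range(4))
```

With the seed in the first clique, most of the visit frequency stays inside it, so ranking nodes by score and cutting where the scores drop gives a candidate community for X.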
