The Lazy Traveling Salesman Memory Management for Large-Scale Link Discovery


Axel-Cyrille Ngonga Ngomo and Mofeed Hassan

University of Leipzig
Institute for Applied Informatics

May 28th, 2016
Crete, Greece

Ngonga Ngomo and Hassan (InfAI), GNOME, November 10, 2016

Why Link Discovery?

1. Fourth principle
2. Links are central for
   - Cross-ontology QA
   - Data Integration
   - Reasoning
   - Federated Queries
   - ...
3. Current topology of the LOD Cloud
   - 31+ billion triples
   - ≈ 0.5 billion links
   - owl:sameAs in most cases

Why is it difficult?

Definition (Link Discovery)
- Given sets S and T of resources and a relation R
- Find M = {(s, t) ∈ S × T : R(s, t)}
- Common approach: Find M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ}

Example: R = :sameModel

:s770fm rdfs:label "S770FM"@en
:s770fm rdf:type :SABER
:s770fm :model :770
:s770fm :top :FlamedMaple
:s770fm :producer :Ibanez

:s770fm rdfs:label "S770BEM"@en
:s770fm rdf:type :SABER
:s770fm :model :770
:s770fm :top :BirdEyeMaple
:s770fm :producer :Ibanez
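Read literally, the common approach compares every source resource with every target resource. The sketch below illustrates this with a hypothetical similarity function σ (a character-level ratio from Python's `difflib`) applied to labels. The resource identifiers, the extra label "RG550" and the threshold are illustrative assumptions, not part of the slides.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Hypothetical sigma: character-level similarity of two labels in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def discover_links(source, target, theta):
    """Return M' = {(s, t) : sigma(s, t) >= theta} by comparing all |S| * |T| pairs."""
    return {
        (s, t)
        for s, s_label in source.items()
        for t, t_label in target.items()
        if similarity(s_label, t_label) >= theta
    }


# Illustrative data: identifiers and the second target label are made up.
source = {"s:guitar1": "S770FM"}
target = {"t:guitar1": "S770BEM", "t:guitar2": "RG550"}
links = discover_links(source, target, theta=0.7)
```

The nested comprehension makes the quadratic number of comparisons explicit.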

1. Time complexity
   - Large number of triples
   - Quadratic a-priori runtime
   - 69 days for mapping cities from DBpedia to Geonames
   - Solutions usually in-memory
   - Insufficient memory on commodity hardware
2. Complexity of specifications
   - Combination of several attributes required for high precision
   - Tedious discovery of most adequate mapping
   - Dataset-dependent similarity functions


Problem Statement

Assumptions
- Constant memory C
- |S| + |T| > |C|

Goal
- Devise time-efficient approach to compute M′
- Ensure completeness of results

Solution: Gnome

Divide and Merge

Time-Efficient Link Discovery
- Insight: Most approaches rely on divide-and-merge paradigm
- Example: HR3

σ(s, t) ≥ θ ⇔ δ(s, t) ≤ ∆

Divide and Merge
1. Define 𝒮 = {S_1, …, S_n} with S_i ⊆ S ∧ ⋃_i S_i = S
2. Define 𝒯 = {T_1, …, T_m} with T_j ⊆ T ∧ ⋃_j T_j = T
3. Find mapping function µ : 𝒮 → 2^𝒯 such that
   - elements of each S_i must only be compared with elements of sets in µ(S_i)
   - the union of the results over all S_i ∈ 𝒮 is exactly M′
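The three steps can be sketched on a toy problem. The example below is not HR3 itself. It assumes one-dimensional integer values, the distance δ(s, t) = |s − t|, and blocks of width ∆, so that µ maps each source block to the target blocks with the same or a neighbouring index. All function names are hypothetical.

```python
from collections import defaultdict


def divide(values, width):
    """Split values into blocks of the given width, keyed by block index."""
    blocks = defaultdict(list)
    for value in values:
        blocks[value // width].append(value)
    return blocks


def divide_and_merge(S, T, delta):
    """Compute {(s, t) : |s - t| <= delta} by comparing only neighbouring blocks."""
    s_blocks = divide(S, delta)
    t_blocks = divide(T, delta)
    result = set()
    for i, s_block in s_blocks.items():
        # mu(S_i): any t within delta of s lies in block i - 1, i or i + 1.
        for j in (i - 1, i, i + 1):
            for s in s_block:
                for t in t_blocks.get(j, ()):
                    if abs(s - t) <= delta:
                        result.add((s, t))
    return result


S = [1, 5, 9, 14, 20]
T = [2, 6, 13, 21, 30]
links = divide_and_merge(S, T, delta=2)
```

Because every pair within distance ∆ falls into the same or adjacent blocks, the union over all blocks equals the brute-force result while most pairs are never compared.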


Task Graph

Definition
- A task e_ij stands for comparing S_i with T_j ∈ µ(S_i)
- Task graph G = (V, E, w_v, w_e), with
  - V = 𝒮 ∪ 𝒯
  - w_v(v) = |v|
  - w_e(e_ij) = |S_i| |T_j|
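A minimal sketch of how such a task graph could be held in memory, assuming that the weight of a node is the size of the set it represents and that µ is given as a plain dictionary. The block names and contents are illustrative.

```python
def build_task_graph(s_blocks, t_blocks, mu):
    """Return node weights (set sizes) and edge weights (number of comparisons)."""
    node_weight = {
        name: len(block) for name, block in {**s_blocks, **t_blocks}.items()
    }
    edge_weight = {
        (si, tj): len(s_blocks[si]) * len(t_blocks[tj])
        for si, targets in mu.items()
        for tj in targets
    }
    return node_weight, edge_weight


# Illustrative blocks and mapping function.
s_blocks = {"S1": ["a", "b"], "S2": ["c"]}
t_blocks = {"T1": ["x", "y", "z"], "T2": ["u"]}
mu = {"S1": ["T1"], "S2": ["T1", "T2"]}
node_weight, edge_weight = build_task_graph(s_blocks, t_blocks, mu)
```

Node weights describe how much memory a set occupies, edge weights how many comparisons a task requires.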


Problem

Locality maximization
- Question: What if V does not fit in memory?
- Insight: Main bottleneck is access to hard drive.
- Solution:
  1. Find groups of nodes that fit in memory and
  2. Compute sequence of groups that minimizes hard drive access

Clustering: Naïve Approach

Approach
- Cluster by S_i

Example: |C| = 7

Greedy Approach
- Start with the largest task
- Add connected largest tasks until none fits in C

Example: |C| = 7
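Both clustering strategies can be sketched as follows. This is one reading of the slide, not the authors' implementation: a cluster is a list of tasks, its memory footprint is assumed to be the summed size of the nodes it touches, and every single task is assumed to fit in memory. The task list and sizes are illustrative.

```python
def naive_clustering(tasks):
    """One cluster per source block S_i, containing all tasks of that block."""
    clusters = {}
    for s, t in tasks:
        clusters.setdefault(s, []).append((s, t))
    return list(clusters.values())


def greedy_clustering(tasks, size, capacity):
    """Seed with the largest task, then add connected tasks while they fit."""
    def weight(task):
        return size[task[0]] * size[task[1]]

    remaining = sorted(tasks, key=weight, reverse=True)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster, nodes = [seed], set(seed)
        grown = True
        while grown:
            grown = False
            for task in remaining:  # largest remaining tasks first
                new_nodes = set(task) - nodes
                used = sum(size[v] for v in nodes)
                extra = sum(size[v] for v in new_nodes)
                if nodes & set(task) and used + extra <= capacity:
                    cluster.append(task)
                    nodes |= new_nodes
                    remaining.remove(task)
                    grown = True
                    break  # restart the scan after changing the list
        clusters.append(cluster)
    return clusters


size = {"S1": 2, "S2": 1, "T1": 3, "T2": 1}
tasks = [("S1", "T1"), ("S2", "T1"), ("S2", "T2")]
naive = naive_clustering(tasks)
greedy = greedy_clustering(tasks, size, capacity=7)
```

In this toy case the naive approach yields two clusters that both need T1, while the greedy approach packs all three tasks into one cluster of total size 7.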

Scheduling

- Output of clustering is a sequence G_1, …, G_N of clusters
- Intuition: Consecutive clusters should share data
- Overlap of two clusters: o(G_i, G_j) = Σ_{v ∈ V(G_i) ∩ V(G_j)} w_v(v)
- Overlap of a sequence: o(G_1, …, G_N) = Σ_{i=1}^{N−1} o(G_i, G_{i+1})

Goal
- Maximize overlap of generated sequence
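A small sketch of the two overlap measures, assuming that the overlap of two clusters is the total weight of the nodes they share and that clusters are represented by their node sets. Node names and sizes are illustrative.

```python
def overlap(cluster_a, cluster_b, size):
    """Total weight of the nodes shared by two clusters."""
    return sum(size[v] for v in cluster_a & cluster_b)


def sequence_overlap(clusters, size):
    """Sum of the overlaps of all consecutive cluster pairs."""
    return sum(overlap(a, b, size) for a, b in zip(clusters, clusters[1:]))


size = {"S1": 2, "S2": 1, "T1": 3, "T2": 1}
g1 = {"S1", "T1"}
g2 = {"S2", "T1", "T2"}
g3 = {"S2", "T2"}

good_order = sequence_overlap([g1, g2, g3], size)  # shares T1, then S2 and T2
poor_order = sequence_overlap([g2, g1, g3], size)  # g1 and g3 share nothing
```

Data shared by consecutive clusters can stay in memory, so an order with a larger sequence overlap needs fewer reads from the hard drive.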

Scheduling

Best-Effort
- Select random pair of clusters
- If permutation improves overlap, then permute
- Relies on local knowledge for scalability

Trick:

∆(G_i, G_j) = (o(G_{i−1}, G_i) + o(G_i, G_{i+1}) + o(G_{j−1}, G_j) + o(G_j, G_{j+1}))
            − (o(G_{i−1}, G_j) + o(G_j, G_{i+1}) + o(G_{j−1}, G_i) + o(G_i, G_{j+1}))

Example:
- G3 G1 G2 G4 (overlaps of consecutive clusters: 0, 3, 2)
- G4 G1 G2 G3 (overlaps of consecutive clusters: 4, 3, 2)
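The trick can be sketched with the pairwise overlaps that can be read off the slide examples. Two assumptions are made here: o(G3, G4) is not shown on the slides and is set to 1 for illustration, and overlaps beyond either end of the sequence count as 0. The formula is only applied to non-adjacent positions, where a negative ∆ means that the permutation increases the overlap.

```python
import random

OVERLAP = {
    frozenset({"G1", "G2"}): 3,
    frozenset({"G1", "G3"}): 0,
    frozenset({"G1", "G4"}): 4,
    frozenset({"G2", "G3"}): 2,
    frozenset({"G2", "G4"}): 2,
    frozenset({"G3", "G4"}): 1,  # assumption: not given on the slides
}


def o(seq, a, b):
    """Overlap of the clusters at positions a and b, 0 outside the sequence."""
    if min(a, b) < 0 or max(a, b) >= len(seq):
        return 0
    return OVERLAP[frozenset({seq[a], seq[b]})]


def total_overlap(seq):
    return sum(o(seq, k, k + 1) for k in range(len(seq) - 1))


def delta(seq, i, j):
    """Overlap before minus overlap after swapping positions i < j (j - i >= 2)."""
    old = o(seq, i - 1, i) + o(seq, i, i + 1) + o(seq, j - 1, j) + o(seq, j, j + 1)
    new = o(seq, i - 1, j) + o(seq, j, i + 1) + o(seq, j - 1, i) + o(seq, i, j + 1)
    return old - new


def best_effort(seq, rounds, rng):
    """Swap random non-adjacent pairs whenever the swap increases the overlap."""
    seq = list(seq)
    for _ in range(rounds):
        i, j = sorted(rng.sample(range(len(seq)), 2))
        if j - i >= 2 and delta(seq, i, j) < 0:
            seq[i], seq[j] = seq[j], seq[i]
    return seq


start = ["G3", "G1", "G2", "G4"]
improved = best_effort(start, rounds=20, rng=random.Random(0))
```

Only the eight overlaps around the two positions are evaluated per candidate swap, which is the local knowledge the slide refers to.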


Scheduling

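Greedy
- Start with random cluster
- Choose next cluster with largest overlap
- Global knowledge needed

Example:
- G3 G1 G2 G4 (overlaps of consecutive clusters: 0, 3, 2)
- G3 G2 G1 G4 (overlaps of consecutive clusters: 2, 3, 4)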
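A sketch of the greedy scheduler on the same four clusters. The overlap values are read off the slide examples, except o(G3, G4), which is an assumed value. The start cluster is passed in explicitly instead of being drawn at random so that the run is reproducible.

```python
OVERLAP = {
    frozenset({"G1", "G2"}): 3,
    frozenset({"G1", "G3"}): 0,
    frozenset({"G1", "G4"}): 4,
    frozenset({"G2", "G3"}): 2,
    frozenset({"G2", "G4"}): 2,
    frozenset({"G3", "G4"}): 1,  # assumption: not given on the slides
}


def greedy_schedule(clusters, overlap, start):
    """Repeatedly append the unscheduled cluster that overlaps most with the last one."""
    sequence = [start]
    remaining = [c for c in clusters if c != start]
    while remaining:
        current = sequence[-1]
        best = max(remaining, key=lambda c: overlap[frozenset({current, c})])
        sequence.append(best)
        remaining.remove(best)
    return sequence


schedule = greedy_schedule(["G1", "G2", "G3", "G4"], OVERLAP, start="G3")
```

Each step scans all unscheduled clusters, which is why this variant needs global knowledge of the pairwise overlaps.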


Experimental Setup

Datasets
1. DBP: 1 million labels from DBpedia version 04-2015
2. LGD: 0.8 million places from LinkedGeoData

Hardware
1. Intel Xeon E5-2650 v3 processors (2.30GHz)
2. Ubuntu 14.04.3 LTS
3. 10GB RAM

Measures
1. Total runtime
2. Hit ratio

Evaluation of Clustering

- Only results on LGD are shown
- Results on DBP lead to similar insights

         Runtimes (ms)            Hit ratio
|C|      Naive      Greedy        Naive   Greedy
100        568.0        646.3     0.57    0.77
200        518.3        594.0     0.66    0.80
400        532.0        593.3     0.67    0.80

1,000    5,974.0    118,454.7     0.51    0.64
2,000    6,168.0    115,450.0     0.51    0.63
4,000    7,118.3    121,901.7     0.50    0.63

Conclusion
1. Naïve approach is more efficient
2. Greedy approach is more effective
3. Select naïve approach for clustering


Evaluation of Scheduling

         Runtimes (ms)               Hit ratio
|C|      Best-Effort  Greedy         Best-Effort  Greedy
100        571.3        1,599.3      0.56         0.68
200        565.7        1,448.3      0.66         0.85
400        581.0        1,379.3      0.67         0.88

1,000    5,666.0      814,271.7      0.51         0.86
2,000    6,268.0      810,855.0      0.51         0.86
4,000    6,675.7      814,041.7      0.50         0.86

Conclusion
1. Best-effort approach is more time-efficient
2. Best-effort approach is to be used for scheduling


GNOME vs. Caching Approaches

Runtimes (ms)
|C|      Gnome     FIFO       F2         LFU        LRU        SLRU
1,000    5,974.0   37,161.0   42,090.3   45,906.7   54,194.3   56,904.3
2,000    6,168.0   31,977.0   39,071.3   39,872.0   45,473.0   46,795.0
4,000    7,118.3   21,337.0   40,860.0   28,028.3   26,816.7   27,200.0

Hit ratio
|C|      Gnome     FIFO       F2         LFU        LRU        SLRU
1,000    0.51      0.17       0.16       0.19       0.17       0.17
2,000    0.51      0.29       0.30       0.32       0.30       0.30
4,000    0.51      0.54       0.55       0.59       0.55       0.56

Conclusion
1. Gnome is more time-efficient
2. Leads to higher hit rates in most cases
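To illustrate what the hit ratio of a caching baseline measures, the sketch below simulates a plain LRU cache. It assumes that every block occupies one cache slot and that the hit ratio is the number of hits divided by the number of accesses. The slides do not spell out either detail, and the access sequences are made up.

```python
from collections import OrderedDict


def lru_hit_ratio(accesses, capacity):
    """Fraction of accesses served from an LRU cache with the given capacity."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[key] = True
    return hits / len(accesses)


# The same six block accesses in an interleaved and in a grouped order.
interleaved = ["T1", "T2", "T1", "T3", "T2", "T1"]
grouped = ["T1", "T1", "T1", "T2", "T2", "T3"]
ratio_interleaved = lru_hit_ratio(interleaved, capacity=2)
ratio_grouped = lru_hit_ratio(grouped, capacity=2)
```

The grouped order reaches a higher hit ratio with the same cache, which is the effect that ordering the comparisons in advance aims for.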

Scalability

       100k        200k          400k          800k
LGD    362,141.3   1,452,922.0   5,934,038.7   20,001,965.7
DBP    434,630.7   1,790,350.7   6,677,923.0   12,653,403.3

Conclusion
- Sub-quadratic growth of runtime
- Runtime grows linearly with number of mappings
- For LGD, 360–370 mappings/s
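The growth of the runtimes can be checked with a few lines of arithmetic, assuming that the column headers denote the dataset size. From 100k to 800k the input grows by a factor of 8, so purely quadratic growth would multiply the runtime by 64.

```python
runtimes = {
    "LGD": [362_141.3, 1_452_922.0, 5_934_038.7, 20_001_965.7],
    "DBP": [434_630.7, 1_790_350.7, 6_677_923.0, 12_653_403.3],
}


def growth_factors(values):
    """Runtime ratio between consecutive dataset sizes (each step doubles the size)."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


overall = {name: values[-1] / values[0] for name, values in runtimes.items()}

for name, values in runtimes.items():
    steps = [round(factor, 2) for factor in growth_factors(values)]
    print(name, "per doubling:", steps, "overall:", round(overall[name], 1))
```

Both overall factors stay below 64, which is consistent with the sub-quadratic growth stated in the conclusion.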

Conclusion and Future Work

Presented Gnome
- Two-step approach for link discovery
- Relies on divide-and-merge paradigm
- Ensures link discovery on datasets of arbitrary size
- Compared with state-of-the-art caching

Future Work
- Parallel implementation
- Combination with blocking

That’s all Folks!

Axel Ngonga
AKSW Research Group

Augustusplatz 10, Room P905
04109 Leipzig, Germany

ngonga@informatik.uni-leipzig.de
http://limes.sf.net

Acknowledgement

This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
