Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized...

77
Detecting Path Anomalies in Sequential Data on Networks Tim LaRock tlarock.github.io [email protected] In collaboration with Tina Eliassi-Rad Giona Casiraghi Vahan Nanumyan 1 Ingo Scholtes Frank Schweitzer

Transcript of Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized...

Page 1: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Detecting Path Anomalies in Sequential Data on Networks

Tim LaRocktlarock.github.io

[email protected] collaboration with

Tina Eliassi-RadGiona CasiraghiVahan Nanumyan

1

Ingo Scholtes Frank Schweitzer

Page 2: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

This TalkMotivation: Understanding mechanisms behind sequential data on networks

Today:Motivate the study of path anomalies

Introduce de Bruijn graph representation of sequential data

Develop tractable null model to measure deviation of path data from expectation

Validate null model in synthetic data + compare with naïve baseline method

Application of methodology to a real system

2

Page 3: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Intuitive Example: Passenger Flight Data

3

Page 4: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

4

Itinerary 1

Itinerary 2

Itinerary 3

BOS JFK LAX

ATL JFK BTV

BOS JFK BTV

Intuitive Example: Passenger Flight Data

Page 5: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

5

x 10

x 1

x 0

Itinerary 1

Itinerary 2

Itinerary 3

BOS JFK LAX

ATL JFK BTV

BOS JFK BTV

Intuitive Example: Passenger Flight Data

Page 6: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

6

x 10

x 1

x 0

Itinerary 1

Itinerary 2

Itinerary 3

BOS JFK LAX

ATL JFK BTV

BOS JFK BTV

Intuitive Example: Passenger Flight Data

Underrepresented?

Page 7: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

7

x 10

x 1

x 0

Itinerary 1

Itinerary 2

Itinerary 3

BOS JFK LAX

ATL JFK BTV

BOS JFK BTV

Intuitive Example: Passenger Flight Data

Overrepresented?

Page 8: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

is significantly overrepresented?

8

Research Question

is significantly underepresented?

x 10

x 1

x 0

Itinerary 1

Itinerary 2

Itinerary 3

BOS JFK LAX

ATL JFK BTV

JFK BTVBOS

Given this pathway dataset, can we determine whether…

Page 9: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

is significantly overrepresented?

9

Research Question

is significantly underepresented?

x 10

x 1

x 0

Itinerary 1

Itinerary 2

Itinerary 3

BOS JFK LAX

ATL JFK BTV

JFK BTVBOS

Given this pathway dataset, can we determine whether…

In other words: Which paths are anomalous?

Page 10: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Problem: Path anomaly detection

10

For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.

Page 11: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Problem: Path anomaly detection

11

For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.

When k=2, this corresponds to comparing a random walk with a single step of memory to a memoryless (Markovian) random walk on G.

Page 12: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example

12

Three Goals:

1. Introduce de Bruijn graphs as representations of sequential data

2. Show how path anomalies emerge in a simple setting

3. Show how path anomalies can be detected through a random walk simulation approach

(Spoiler: Simulation approach is infeasible for real world datasets!)

Page 13: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example

13

Itinerary 1

Itinerary 2

Itinerary 3

BOS JFK LAX

ATL JFK BTV

BOS JFK BTV

Page 14: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example

14

Pathway 1

Pathway 2

Pathway 3

XA C

B DX

XA C

Page 15: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Data

15

XA C

B DX

XA C

p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>

p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>

p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>

S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>

pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>

Page 16: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Data

16

Total number of paths N = |S| = 235

<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>

p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>

p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>

p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>

S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>

pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>

XA C

B DX

XA C

Page 17: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Data to (first-order) graph

17

edge counts

30

XA

205

XB

135

CX

100

DX

XA C

B DX

XA C

p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>

p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>

p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>

S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>

pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>

Total number of paths N = |S| = 235

<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>

Page 18: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Data to (first-order) graph

18

edge counts

30

XA

205

XB

135

CX

100

DX A

BX

C

D

XA C

B DX

XA C

p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>

p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>

p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>

S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>

pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>

Total number of paths N = |S| = 235

<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>

Page 19: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Data to 2nd order de Bruijn graph

19

Path counts

30

XA X B

105

CX

100

DXC BA D

XA C

B DX

XA C

p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>

p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>

p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>

S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>

pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>

Total number of paths N = |S| = 235

<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>

Page 20: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Data to 2nd order de Bruijn graph

20

XA C

B DX

XA C

p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>

p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>

p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>

S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>

pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>

Total number of paths N = |S| = 235

<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>

30

Path counts

XA X B

105

CX

100

DXC BA D

XB

XA

DX

C30

X

100

Page 21: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

21

Toy Example: Path Anomalies via Simulations

For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.

Page 22: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Path Anomalies via Simulations

22

Simulate many random walk datasetsCompute expected frequency of each pathwayand subtract from observed value

A

BX

C

D XB

XA

DX

C30 - E[ ]

XXA C

100 - E[ ]

For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.

Page 23: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Toy Example: Path Anomalies via Simulations

23

Path counts

30

XA X

105

B CX

100

DXC BA D +12.57

+11.69

-12.57

-11.69

XA C

XA D

DXB

B CX

Deviation from Randomized Paths

XB

XA

DX

C30 - E[ ]

XXA C

100 - E[ ]

Page 24: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Challenges

24

Page 25: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Path Anomaly Detection: Challenges

Detecting path anomalies via simulations à computationally intensive

Result is expected value, no concrete notion of significance

Alternative: detect path anomalies analytically by developing a tractable null model

25

Page 26: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Null Model: Challenges

26

Traditional null models (e.g. configuration model) cannot be applied directly

Page 27: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Null Model: Challenges

27

Traditional null models (e.g. configuration model) cannot be applied directly

Edges between higher-order nodes can not be randomized by stub-matching

XBXA DCCX✅ 🚫XA

Page 28: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Null Model: Challenges

28

Traditional null models (e.g. configuration model) cannot be applied directly

Edges between higher-order nodes can not be randomized by stub-matching

Need to randomize edge weight distribution in de Bruijn graph models, since connectivity structure is fixed by 1st-order topology

XBXA DCCX✅ 🚫XA

Page 29: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

HYPA: Efficient Detection of Path Anomalies

29

Page 30: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Generalized Hypergeometric Ensemble

30Casiraghi et al. (2016, 2018)

Page 31: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Generalized Hypergeometric Ensemble

31

Generalization of the configuration model to weighted, directed networks.

Page 32: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Generalized Hypergeometric Ensemble

32

Generalization of the configuration model to weighted, directed networks.

Fixes the expected weight of every node, rather than the exact degree sequence.

Page 33: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Generalized Hypergeometric Ensemble

33

Generalization of the configuration model to weighted, directed networks.

Fixes the expected weight of every node, rather than the exact degree sequence.

Urn Problem Intuition:

○ Each pair of nodes that can possibly connect is assigned a color

DXXA

DXXB

XB CX

X CX

Urn

A

Page 34: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Generalized Hypergeometric Ensemble

34

Generalization of the configuration model to weighted, directed networks.

Fixes the expected weight of every node, rather than the exact degree sequence.

Urn Problem Intuition:

○ Each pair of nodes that can possibly connect is assigned a color

○ Add balls, where Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>

DXXA

DXXB

XB CX

X CX

Urn

A 30x135

205x100

205x135

30x100

Page 35: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Generalized Hypergeometric Ensemble

35

Generalization of the configuration model to weighted, directed networks.

Fixes the expected weight of every node, rather than the exact degree sequence.

Urn Problem Intuition:

○ Each pair of nodes that can possibly connect is assigned a color

○ Add balls, where

○ Draw m edges to sample a network from the urn, where

Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>

DXXA

DXXB

XB CX

X CX

Urn

A 30x135

205x100

205x135

30x100

Page 36: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Hypergeometric Ensemble

36

Pr(Xvw = f(v, w)) /✓

Kvw

f(v, w)

◆✓Pij Kij �Kvw

m� f(v, w)

<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>

Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>

Page 37: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Hypergeometric Ensemble

37

{Pr(Xvw = f(v, w)) /

✓Kvw

f(v, w)

◆✓Pij Kij �Kvw

m� f(v, w)

<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>

Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>

Probability of observing frequency f(v,w) given the entire weighted network structure.

Page 38: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Hypergeometric Ensemble

38

{

Pr(Xvw = f(v, w)) /✓

Kvw

f(v, w)

◆✓Pij Kij �Kvw

m� f(v, w)

<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>

Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>

Number of ways to pick f(v,w) multiedges from Kvw possible.

Page 39: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Hypergeometric Ensemble

39

{

Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>

Pr(Xvw = f(v, w)) /✓

Kvw

f(v, w)

◆✓Pij Kij �Kvw

m� f(v, w)

<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>

Number of ways to pick everything else.

Page 40: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Putting it all together: HYPA scores

40

Page 41: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Putting it all together: HYPA scores

41

If “close” to 0, then the pathway is underrepresented.

If “close” to 1, then pathway is overrepresented.

Page 42: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Putting it all together: HYPA scores

42

XB

XA

DX

CX0.99

0.97XB

XA

DX

C30 X

100

If “close” to 0, then the pathway is underrepresented.

If “close” to 1, then pathway is overrepresented.

+12.57

+11.69

-12.57

-11.69

XA C

XA D

DXB

B CX

Deviation from Randomized Paths

Observed Data Simulation Result HYPA Scores

Page 43: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Validation

43

Page 44: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Noise via Path Randomization

44

Page 45: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: Setup

45

Page 46: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: Setup

46

A

BX

C

D XB

XA

DX

CX

Start with an arbitrary first order topology, then construct the kth-order de Bruijn graph

k = 2

Page 47: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: Setup

47

A

BX

C

D XB

XA

DX

CX

Randomly choose some edges to label over-represented

k = 2

Page 48: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: Setup

48

A

BX

C

D XB

XA

DX

CX30

100

Assign heterogeneous weights based on label

k = 2

Page 49: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: Setup

49

A

BX

C

D XB

XA

DX

CX

Generate paths via random walks on this model, then evaluate ability of HYPA to detect injected anomalies (binary classifier).

30

100

k = 2

Page 50: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: ROC Example

50

Page 51: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: ROC Example

51

Injected Anomalies of Length 2

Page 52: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies: AUC Results

52

Page 53: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Application to Flight Data

53

Page 54: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Airlines

54

5% sample of all US domestic flights in 2018

Data from: http://www.transtats.bts.gov/Tables.asp?DB_ID=125

Page 55: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

AirlinesHypotheses:

1. Return flights should be over-represented, since people most often travel round trip.

55

Page 56: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Airlines: Return trips are over-represented

56

Fraction of over-represented return/non-return flights for various discrimination thresholds.

Page 57: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Airlines: Return trips are over-represented

57

Fraction of over-represented return/non-return flights for various discrimination thresholds.

Page 58: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

AirlinesHypotheses:

1. Return flights should be over-represented, since people most often travel round trip.

2. Over-represented non-return flights are due to regional/national hubs, since people need to fly from small airports → regional hub → large airport.

58

Page 59: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Airlines: Trip Balance

59

Page 60: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

AirlinesHypotheses:

1. Return flights should be over-represented, since people most often travel round trip.

2. Over-represented non-return flights are due to regional/national hubs, since people need to fly from small airports → regional hub → large airport.

3. “Efficient” paths are more likely to be over-represented.

60

Page 61: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Airlines: Efficiency

61

Page 62: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

References

Tim [email protected]

tlarock.github.iohttps://arxiv.org/abs/1905.10580

Thanks!

62

Scholtes, Ingo. "When is a network a network?: Multi-order graphical model selection in pathways and temporal networks." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

Casiraghi et al. "Generalized hypergeometric ensembles: Statistical hypothesis testing in complex networks." arXiv:1607.02441 (2016).

Casiraghi & Nanumyan. “Generalised hypergeometric ensembles of random graphs: the configuration model as an urn problem.” arXiv:1810.06495 (2018)

R. TransStat. Origin and destination survey database. http://www.transtats.bts.gov/Tables.asp?DB_ID=125, 2018.

Page 63: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

63

https://arxiv.org/abs/1905.10580

Page 64: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Definition: kth-order de Bruijn Graph

64

For a given graph G = (V,E) and positive integer k we define a k-th order

De Bruijn graph of paths in G as a graph Gk= (V k, Ek

), where (i) each node

~v := ~v0v1 . . . vk�1 2 V kis a path of length k � 1 in G, and (ii) (~v, ~w) 2 Ek

i↵

vi+1 = wi for i = 0, . . . , k � 2.

<latexit sha1_base64="nPo7/5HIQzxKitn0S+xtps/GiRw=">AAADPnicbVJNb9NAELXNVylfLRy5jEiQbOFGdjiAkCJVhVKOReqXVKfRZj1OFtu71u46URX5l3HhN3DjyIUDCHHlyK6TStAyl32aN/PezO6Oq4IpHUVfXO/a9Rs3b63dXr9z9979BxubD4+UqCXFQyoKIU/GRGHBOB5qpgs8qSSSclzg8Th/bfnjGUrFBD/Q5xUOSzLhLGOUaJMabboHb4UEAhM2Qw4TSaopJP7ewD8Kd4MkAMJTqIRi2vDAuMYJSujmXZgjpJgZV9Oc+HkSbOkpCJka+g3CjqzZhws9kUFF9FSZfqttVZW1bMnu3llu3M7yEHbP8qAbwnyKEsFnASChU+AiRdOWzJAuZs2rwRKMotkoTopUaAWz0SLfipsGEmNglIwBswbW1JoXyCfarmWqLLeaImyX85kxSvwL/bA9503Qiu0uxbLMVBgX9ixuYABzgxqTz8zNJT4bRCEsJwkh3+onQW99tNGJelEbcBXEK9BxVrE/2vicpILWJXJNC6LUaRxVerggUjNaYLOe1AorQnMywVMDOSlRDRft8zfw1GTSdppMcA1t9u+OBSmVOi/HprK073CZs8n/cae1zl4OF4xXtUZOl0ZZXYAWYP8SpEwi1cW5AYRK80co0CmRhGrz4+wlxJdXvgqO+r34ea//vt/Z3lldx5rz2Hni+E7svHC2nXfOvnPoUPej+9X97v7wPnnfvJ/er2Wp5656Hjn/hPf7D9a6/RE=</latexit>

Page 65: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Pseudocode

65

Page 66: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Naïve Baseline ComparisonFrequency-Based Anomaly Detection (FBAD)

Compute mean, µ, and standard deviation, σ, of kth order edge weights

Given scaling factor α, label edges as- Overrepresented if frequency is larger than µ + σα - Underrepresented if frequency is smaller than µ - σα

66

Page 67: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Synthetic Anomalies

67

Page 68: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Real Data

68

Page 69: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Exploring Motifs

69

Page 70: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Case Study: London TubeData:

- Origin → destination statistics between London Tube stations- (origin, destination, #observations)

- Shortest paths between stations- Assume people follow shortest paths

70

Page 71: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

London TubeHypothesis:

- People typically use public transportation to travel large geographic distances- Overrepresented pathways should cover larger distances

Test:

- Measure distance between every station- For 2nd order transitions A-B-C, compute distance between nodes A and C- Analyze distributions of distance in over vs. under represented transitions

- Expect to see distribution shifted towards higher values for over-represented transitions

71

Page 72: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

London Tube

72

Page 73: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

London Tube

73

Median distance between source and destination nodes in under/over represented transitions.

Page 74: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Constructing Ground TruthConstruct ground truth based on the method discussed earlier:

○ Randomize path data using k-1st order random walks

○ Compute kth-order path statistics

○ Repeat m times, noting the frequency of each path

○ Estimate multinomial distribution and its CDF from these statistics

○ If CDF(path) > threshold, label over-represented

74

Page 75: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Tube Data - Ground Truth

75

Page 76: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Computational complexity

76

O(N + |V |2�k1)

<latexit sha1_base64="ab5tZyX/wq/L2KgZL0VZSB19Lxg=">AAACAHicbVDLSsNAFJ3UV62vqAsXbgaLUBFKUgVdFt240gr2AW0aJpNJO3QyCTMToaTd+CtuXCji1s9w5984bbPQ1gMDh3Pu4c49XsyoVJb1beSWlldW1/LrhY3Nre0dc3evIaNEYFLHEYtEy0OSMMpJXVHFSCsWBIUeI01vcD3xm49ESBrxBzWMiROiHqcBxUhpyTUP7kq38BSOGqNupcN0zkeu3R2cuGbRKltTwEViZ6QIMtRc86vjRzgJCVeYISnbthUrJ0VCUczIuNBJJIkRHqAeaWvKUUikk04PGMNjrfgwiIR+XMGp+juRolDKYejpyRCpvpz3JuJ/XjtRwaWTUh4ninA8WxQkDKoITtqAPhUEKzbUBGFB9V8h7iOBsNKdFXQJ9vzJi6RRKdtn5cr9ebF6ldWRB4fgCJSADS5AFdyAGqgDDMbgGbyCN+PJeDHejY/ZaM7IMvvgD4zPHzTNlOI=</latexit>

Page 77: Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized Hypergeometric Ensemble 31 Generalization of the configuration model to weighted,

Scalability

77