Detecting Path Anomalies in Sequential Data on Networks · Casiraghi et al.(2016, 2018) Generalized...

Post on 10-Oct-2020

Detecting Path Anomalies in Sequential Data on Networks

Tim LaRock

tlarock.github.io

larock.t@husky.neu.edu

In collaboration with Tina Eliassi-Rad, Giona Casiraghi, Vahan Nanumyan, Ingo Scholtes, Frank Schweitzer

This Talk

Motivation: Understanding mechanisms behind sequential data on networks

Today:

Motivate the study of path anomalies

Introduce de Bruijn graph representation of sequential data

Develop tractable null model to measure deviation of path data from expectation

Validate null model in synthetic data + compare with naïve baseline method

Application of methodology to a real system

Intuitive Example: Passenger Flight Data

Itinerary 1: BOS → JFK → LAX (x 10)

Itinerary 2: ATL → JFK → BTV (x 1)

Itinerary 3: BOS → JFK → BTV (x 0)

Underrepresented?

Overrepresented?

Research Question

Given this pathway dataset, can we determine whether a given itinerary is significantly overrepresented or significantly underrepresented?

In other words: Which paths are anomalous?

Problem: Path anomaly detection

For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.


When k=2, this corresponds to comparing a random walk with a single step of memory to a memoryless (Markovian) random walk on G.
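To make the k=2 comparison concrete, the sketch below uses the itinerary counts from the flight example (assuming x 10, x 1, x 0 apply to the itineraries in the order listed) and compares each observed two-step path count with what a memoryless walk would predict from edge frequencies alone. The helper name `first_order_expectation` and the simple product-of-frequencies estimate are illustrative assumptions, not the null model developed later in the talk.

```python
from collections import Counter

# Observed itineraries with multiplicities (assumed to follow the slide order).
paths = {
    ("BOS", "JFK", "LAX"): 10,
    ("ATL", "JFK", "BTV"): 1,
    ("BOS", "JFK", "BTV"): 0,
}

def first_order_expectation(paths):
    """Expected count of each 2-step path if the second step ignores the first."""
    first = Counter()   # counts of first steps (u -> v)
    second = Counter()  # counts of second steps (v -> w)
    for (u, v, w), c in paths.items():
        first[(u, v)] += c
        second[(v, w)] += c
    expected = {}
    for (u, v, w) in paths:
        # Total observed mass leaving v on a second step.
        out_v = sum(c for (a, _), c in second.items() if a == v)
        p_next = second[(v, w)] / out_v if out_v else 0.0
        expected[(u, v, w)] = first[(u, v)] * p_next
    return expected

expected = first_order_expectation(paths)
for path, obs in paths.items():
    print(path, "observed:", obs, "expected: %.2f" % expected[path])
```

Under this memoryless estimate BOS → JFK → BTV is expected roughly once, so an observed count of 0 is only weak evidence on its own; deciding whether such a gap is significant is what the null model is for.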

Toy Example

Three Goals:

1. Introduce de Bruijn graphs as representations of sequential data

2. Show how path anomalies emerge in a simple setting

3. Show how path anomalies can be detected through a random walk simulation approach

(Spoiler: Simulation approach is infeasible for real world datasets!)

Toy Example: Data

$S = \{p_1, p_2, p_3, \ldots, p_N\}$

Pathway 1 ($p_1$): A → X → C

Pathway 2 ($p_2$): B → X → D

Pathway 3 ($p_3$): A → X → C

Total number of paths: $N = |S| = 235$

Toy Example: Data to (first-order) graph

Edge counts:

A → X: 30

B → X: 205

X → C: 135

X → D: 100

[Figure: first-order graph with nodes A, B, X, C, D]

Toy Example: Data to 2nd order de Bruijn graph

Path counts:

A → X → C: 30

B → X → C: 105

B → X → D: 100

[Figure: second-order de Bruijn graph with nodes AX, BX, XC, XD; edges weighted by the path counts]
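The construction of the first- and second-order graphs from path data can be sketched as follows. The dataset `S` is rebuilt from the path counts on the slide (30, 105, 100), and `kth_order_edges` is a hypothetical helper name; representing higher-order nodes as tuples is a choice made here for illustration.

```python
from collections import Counter

# Toy dataset consistent with the path counts on the slides (N = 235).
S = ([("A", "X", "C")] * 30
     + [("B", "X", "C")] * 105
     + [("B", "X", "D")] * 100)

def kth_order_edges(paths, k):
    """Count edges of the k-th order de Bruijn graph.

    Each node is a sub-path of k consecutive nodes; an edge joins two
    such sub-paths that overlap in k - 1 nodes, so it is a path of k edges.
    """
    counts = Counter()
    for p in paths:
        for i in range(len(p) - k):
            src = p[i:i + k]
            dst = p[i + 1:i + k + 1]
            counts[(src, dst)] += 1
    return counts

first_order = kth_order_edges(S, 1)   # ordinary weighted edges
second_order = kth_order_edges(S, 2)  # edges between first-order edges

for (src, dst), c in sorted(second_order.items()):
    print(src, "->", dst, c)
```

With k = 1 this recovers the edge counts of the first-order graph, and with k = 2 each edge corresponds to one observed path of length two.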

Toy Example: Path Anomalies via Simulations

Simulate many random walk datasets. Compute the expected frequency of each pathway and subtract it from the observed value, e.g. 30 - E[A → X → C] and 100 - E[B → X → D].

Deviation from Randomized Paths:

A → X → C: +12.57

A → X → D: -12.57

B → X → D: +11.69

B → X → C: -11.69
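One way such a simulation could be set up is sketched below. It is an assumption about the procedure, not the authors' code: each simulated dataset contains 235 two-step walks whose first and second steps are drawn independently, in proportion to the first-order edge counts. The names `simulate_dataset` and `expected_counts` are hypothetical.

```python
import random
from collections import Counter

observed = {
    ("A", "X", "C"): 30, ("A", "X", "D"): 0,
    ("B", "X", "C"): 105, ("B", "X", "D"): 100,
}
edge_weights = {("A", "X"): 30, ("B", "X"): 205, ("X", "C"): 135, ("X", "D"): 100}

def simulate_dataset(rng, n_paths):
    """Draw n_paths memoryless two-step walks through X."""
    first_edges = [("A", "X"), ("B", "X")]
    second_edges = [("X", "C"), ("X", "D")]
    starts = rng.choices(first_edges,
                         weights=[edge_weights[e] for e in first_edges], k=n_paths)
    ends = rng.choices(second_edges,
                       weights=[edge_weights[e] for e in second_edges], k=n_paths)
    counts = Counter()
    for (u, x), (_, w) in zip(starts, ends):
        counts[(u, x, w)] += 1
    return counts

def expected_counts(n_sims=2000, seed=0):
    """Average path counts over many simulated datasets."""
    rng = random.Random(seed)
    n_paths = sum(observed.values())
    totals = Counter()
    for _ in range(n_sims):
        totals.update(simulate_dataset(rng, n_paths))
    return {p: totals[p] / n_sims for p in observed}

expected = expected_counts()
deviation = {p: observed[p] - expected[p] for p in observed}
for p, d in deviation.items():
    print(p, "%+.2f" % d)
```

The deviations from a run like this need not match the slide values exactly, since they depend on how the random walks are generated and how many datasets are averaged.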

Challenges

Path Anomaly Detection: Challenges

Detecting path anomalies via simulations → computationally intensive

Result is an expected value, with no concrete notion of significance

Alternative: detect path anomalies analytically by developing a tractable null model

Null Model: Challenges

Traditional null models (e.g. configuration model) cannot be applied directly

Edges between higher-order nodes cannot be randomized by stub-matching

Need to randomize the edge weight distribution in de Bruijn graph models, since the connectivity structure is fixed by the 1st-order topology

[Figure: a valid (✅) and an invalid (🚫) connection between higher-order nodes]

HYPA: Efficient Detection of Path Anomalies

Generalized Hypergeometric Ensemble

Casiraghi et al. (2016, 2018)

Generalization of the configuration model to weighted, directed networks.

Fixes the expected weight of every node, rather than the exact degree sequence.

Urn Problem Intuition:

○ Each pair of nodes that can possibly connect is assigned a color

○ Add $K_{ij}$ balls of that color, where $K_{ij} = k_i^{\mathrm{out}} k_j^{\mathrm{in}}$

○ Draw m edges to sample a network from the urn

Urn contents for the toy example:

AX → XC: 30x135

BX → XD: 205x100

BX → XC: 205x135

AX → XD: 30x100
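The urn contents for the toy example can be reproduced with a few lines. This is a sketch under the definition $K_{ij} = k_i^{\mathrm{out}} k_j^{\mathrm{in}}$ given on the slide; the function name `urn_contents` and the tuple representation of second-order nodes are assumptions for illustration.

```python
# Observed weights between second-order nodes (pairs of first-order edges).
path_counts = {
    (("A", "X"), ("X", "C")): 30,
    (("A", "X"), ("X", "D")): 0,
    (("B", "X"), ("X", "C")): 105,
    (("B", "X"), ("X", "D")): 100,
}

def urn_contents(path_counts):
    """Number of balls K[v, w] = k_out(v) * k_in(w) for every admissible pair."""
    k_out, k_in = {}, {}
    for (v, w), c in path_counts.items():
        k_out[v] = k_out.get(v, 0) + c
        k_in[w] = k_in.get(w, 0) + c
    # Only pairs that can connect in the de Bruijn graph get a color:
    # the tail of v must equal the head of w.
    return {(v, w): k_out[v] * k_in[w]
            for v in k_out for w in k_in if v[1:] == w[:-1]}

K = urn_contents(path_counts)
m = sum(path_counts.values())  # number of draws

for (v, w), balls in K.items():
    print(v, "->", w, balls)
print("total balls:", sum(K.values()), "draws:", m)
```

The admissibility check is what encodes the constraint that the connectivity structure is fixed by the first-order topology: pairs that cannot form a path never enter the urn.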

Hypergeometric Ensemble

$$\Pr(X_{vw} = f(v,w)) \propto \binom{K_{vw}}{f(v,w)} \binom{\sum_{ij} K_{ij} - K_{vw}}{m - f(v,w)}, \qquad K_{ij} = k_i^{\mathrm{out}} k_j^{\mathrm{in}}$$

$\Pr(X_{vw} = f(v,w))$: Probability of observing frequency f(v,w) given the entire weighted network structure.

$\binom{K_{vw}}{f(v,w)}$: Number of ways to pick f(v,w) multiedges from $K_{vw}$ possible.

$\binom{\sum_{ij} K_{ij} - K_{vw}}{m - f(v,w)}$: Number of ways to pick everything else.
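A minimal numerical sketch of this distribution, using only the standard library. It assumes the proportionality is normalized by the usual hypergeometric constant $\binom{\sum_{ij} K_{ij}}{m}$; the function name `hypergeom_pmf` and the choice of the A → X → C edge as the example are illustrative.

```python
from math import comb

def hypergeom_pmf(f, K_vw, K_total, m):
    """Pr(X_vw = f) when drawing m balls from K_total, K_vw of which have color (v, w)."""
    if f < 0 or f > min(K_vw, m):
        return 0.0
    # Exact integer arithmetic; converted to float only by the final division.
    return comb(K_vw, f) * comb(K_total - K_vw, m - f) / comb(K_total, m)

# Toy example: the pair AX -> XC has 30 x 135 balls out of 235 x 235 in total.
K_total = 235 * 235
K_vw = 30 * 135
m = 235

pmf = [hypergeom_pmf(f, K_vw, K_total, m) for f in range(m + 1)]
mean = sum(f * p for f, p in enumerate(pmf))
print("sum of pmf:", sum(pmf))
print("mean:", mean, "vs m * K_vw / K_total:", m * K_vw / K_total)
```

The mean of the distribution is m · K_vw / ΣK, which coincides with the memoryless expectation for the path, so the ensemble supplies a full distribution around the quantity the simulations estimate.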

Putting it all together: HYPA scores

If “close” to 0, then the pathway is underrepresented.

If “close” to 1, then the pathway is overrepresented.

[Figure: three panels. Observed Data: second-order graph with edge weights 30 and 100. Simulation Result: Deviation from Randomized Paths (+12.57, +11.69, -12.57, -11.69). HYPA Scores: the same graph with edge scores 0.99 and 0.97.]

Validation


Noise via Path Randomization


Synthetic Anomalies: Setup

Toy example with k = 2 on first-order nodes A, B, C, D, X:

1. Start with an arbitrary first-order topology, then construct the kth-order de Bruijn graph.

2. Randomly choose some edges to label over-represented.

3. Assign heterogeneous weights based on label (the example uses weights 30 and 100).

4. Generate paths via random walks on this model, then evaluate the ability of HYPA to detect the injected anomalies (binary classifier).
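The setup steps can be sketched for k = 2 as follows. This is an illustration under assumptions, not the talk's implementation: the function names, the fraction of labeled edges, the uniform choice of start node, and the mapping of the weights (100 for labeled edges, 30 otherwise) are all guesses.

```python
import random
from collections import defaultdict

def second_order_edges(edges):
    """Second-order edges ((u, v), (v, w)): one per first-order path u -> v -> w."""
    return [((u, v), (x, w)) for u, v in edges for x, w in edges if v == x]

def inject_anomalies(ho_edges, fraction, base_weight, anomalous_weight, rng):
    """Label a random subset of edges over-represented and weight edges by label."""
    n_anomalous = max(1, round(fraction * len(ho_edges)))
    anomalous = set(rng.sample(ho_edges, n_anomalous))
    weights = {e: anomalous_weight if e in anomalous else base_weight
               for e in ho_edges}
    return anomalous, weights

def random_walk_paths(weights, n_paths, n_steps, rng):
    """Weighted random walks on the second-order graph, returned as node tuples."""
    succ = defaultdict(dict)
    for (src, dst), w in weights.items():
        succ[src][dst] = w
    starts = sorted(succ)  # second-order nodes with at least one outgoing edge
    paths = []
    for _ in range(n_paths):
        node = rng.choice(starts)  # assumed: uniform start node
        path = list(node)
        for _ in range(n_steps):
            if node not in succ:
                break
            targets = list(succ[node])
            node = rng.choices(targets, weights=[succ[node][t] for t in targets])[0]
            path.append(node[-1])
        paths.append(tuple(path))
    return paths

rng = random.Random(0)
first_order = [("A", "X"), ("B", "X"), ("X", "C"), ("X", "D")]
ho_edges = second_order_edges(first_order)
anomalous, weights = inject_anomalies(ho_edges, 0.25, 30, 100, rng)
paths = random_walk_paths(weights, 1000, 1, rng)
```

The set `anomalous` then serves as the positive class when scoring a detector on `paths`.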

Synthetic Anomalies: ROC Example


Injected Anomalies of Length 2
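For readers who want to reproduce this kind of evaluation, the area under the ROC curve can be computed directly from scores and injected labels. The sketch below uses the rank (Mann-Whitney) formulation of AUC; the function name and the example values are made up for illustration.

```python
def roc_auc(scores, labels):
    """AUC = probability that a random anomaly scores above a random non-anomaly.

    Ties count as one half. `labels` are truthy for injected anomalies.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one anomaly and one non-anomaly")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical scores for four edges, two of which were injected as anomalies.
perfect = roc_auc([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0])
mixed = roc_auc([0.9, 0.5, 0.3, 0.6], [1, 1, 0, 0])
```

An AUC of 1 means every injected anomaly received a higher score than every other edge; 0.5 corresponds to random guessing.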

Synthetic Anomalies: AUC Results


Application to Flight Data


Airlines


5% sample of all US domestic flights in 2018

Data from: http://www.transtats.bts.gov/Tables.asp?DB_ID=125

Airlines: Hypotheses

1. Return flights should be over-represented, since people most often travel round trip.

Airlines: Return trips are over-represented


Fraction of over-represented return/non-return flights for various discrimination thresholds.


Airlines: Hypotheses

2. Over-represented non-return flights are due to regional/national hubs, since people need to fly from a small airport → regional hub → large airport.

Airlines: Trip Balance


Airlines: Hypotheses

3. “Efficient” paths are more likely to be over-represented.

Airlines: Efficiency


Thanks!

Tim LaRock

larock.t@husky.neu.edu

tlarock.github.io

https://arxiv.org/abs/1905.10580

References

Scholtes, Ingo. "When is a network a network?: Multi-order graphical model selection in pathways and temporal networks." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

Casiraghi et al. "Generalized hypergeometric ensembles: Statistical hypothesis testing in complex networks." arXiv:1607.02441 (2016).

Casiraghi & Nanumyan. "Generalised hypergeometric ensembles of random graphs: the configuration model as an urn problem." arXiv:1810.06495 (2018).

R. TransStat. Origin and destination survey database. http://www.transtats.bts.gov/Tables.asp?DB_ID=125, 2018.

Definition: kth-order de Bruijn Graph

For a given graph $G = (V, E)$ and positive integer $k$ we define a $k$-th order de Bruijn graph of paths in $G$ as a graph $G^k = (V^k, E^k)$, where (i) each node $\vec{v} := \overrightarrow{v_0 v_1 \ldots v_{k-1}} \in V^k$ is a path of length $k-1$ in $G$, and (ii) $(\vec{v}, \vec{w}) \in E^k$ iff $v_{i+1} = w_i$ for $i = 0, \ldots, k-2$.
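The definition can be turned into a short construction. The sketch below is one possible reading, with a hypothetical function name; it assumes paths may revisit nodes (so a return trip A-B-A counts as a path) and represents each higher-order node as a tuple of k first-order nodes.

```python
from collections import defaultdict

def kth_order_de_bruijn(edges, k):
    """Return (nodes, edges) of the k-th order de Bruijn graph of paths in G.

    Nodes are tuples of k first-order nodes, i.e. paths of length k - 1 in G.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    succ = defaultdict(list)
    vertices = set()
    for u, v in edges:
        succ[u].append(v)
        vertices.update((u, v))
    # Grow all paths with k nodes by repeatedly extending along edges of G.
    paths = [(v,) for v in sorted(vertices)]
    for _ in range(k - 1):
        paths = [p + (w,) for p in paths for w in succ.get(p[-1], [])]
    # Extending a path by one step and dropping its first node yields a
    # target that overlaps it in k - 1 nodes, i.e. v_{i+1} = w_i.
    ho_edges = [(p, p[1:] + (w,)) for p in paths for w in succ.get(p[-1], [])]
    return paths, ho_edges

# Toy first-order topology with a single hub X.
G = [("A", "X"), ("B", "X"), ("X", "C"), ("X", "D")]
nodes2, edges2 = kth_order_de_bruijn(G, 2)
```

For this toy graph the second-order nodes are the four first-order edges, and each second-order edge corresponds to one path of length two through X.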

Pseudocode


Naïve Baseline Comparison

Frequency-Based Anomaly Detection (FBAD)

Compute the mean, µ, and standard deviation, σ, of the kth-order edge weights

Given a scaling factor α, label edges as

- Overrepresented if frequency is larger than µ + ασ
- Underrepresented if frequency is smaller than µ - ασ
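The FBAD rule is simple enough to sketch in a few lines. The function name, the use of the population standard deviation, and the example weights are assumptions for illustration.

```python
from statistics import mean, pstdev

def fbad(edge_weights, alpha):
    """Label each edge by comparing its frequency with mu +/- alpha * sigma."""
    values = list(edge_weights.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumed: population standard deviation
    labels = {}
    for edge, weight in edge_weights.items():
        if weight > mu + alpha * sigma:
            labels[edge] = "over"
        elif weight < mu - alpha * sigma:
            labels[edge] = "under"
        else:
            labels[edge] = "expected"
    return labels

# Hypothetical kth-order edge weights.
weights = {"e1": 1, "e2": 1, "e3": 1, "e4": 1, "e5": 10}
labels = fbad(weights, alpha=1.0)
```

Because the rule looks only at raw frequencies, it ignores how many observations the topology makes possible for each edge, which is the information the hypergeometric null model adds.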

Synthetic Anomalies


Real Data


Exploring Motifs


Case Study: London Tube

Data:

- Origin → destination statistics between London Tube stations
  - (origin, destination, #observations)

- Shortest paths between stations
  - Assume people follow shortest paths

London Tube

Hypothesis:

- People typically use public transportation to travel large geographic distances
  - Overrepresented pathways should cover larger distances

Test:

- Measure the distance between every pair of stations
- For 2nd-order transitions A-B-C, compute the distance between nodes A and C
- Analyze the distributions of distance in over- vs. under-represented transitions
  - Expect the distribution to be shifted towards higher values for over-represented transitions
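The test above can be sketched as follows, assuming each station has planar coordinates and each 2nd-order transition already carries an over/under label. The function names, the Euclidean distance, the coordinates, and the labels are hypothetical.

```python
from math import hypot
from statistics import median

def endpoint_distance(transition, coords):
    """Distance between the first and last node of a transition A-B-C."""
    a, _, c = transition
    (xa, ya), (xc, yc) = coords[a], coords[c]
    return hypot(xc - xa, yc - ya)

def median_distance_by_label(labels, coords):
    """Median A-to-C distance for over- and under-represented transitions."""
    groups = {"over": [], "under": []}
    for transition, lab in labels.items():
        if lab in groups:
            groups[lab].append(endpoint_distance(transition, coords))
    return {lab: median(dists) for lab, dists in groups.items() if dists}

# Hypothetical station coordinates and transition labels.
coords = {"A": (0, 0), "B": (1, 0), "C": (4, 0), "D": (1, 1)}
labels = {("A", "B", "C"): "over", ("A", "B", "D"): "under"}
medians = median_distance_by_label(labels, coords)
```

Under the hypothesis, the median for the "over" group should exceed the median for the "under" group.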

London Tube


Median distance between source and destination nodes in under/over represented transitions.

Constructing Ground Truth

Construct ground truth based on the method discussed earlier:

○ Randomize path data using (k-1)th-order random walks

○ Compute kth-order path statistics

○ Repeat m times, noting the frequency of each path

○ Estimate a multinomial distribution and its CDF from these statistics

○ If CDF(path) > threshold, label the path over-represented
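A simplified sketch of this procedure for k = 2 follows. It swaps the multinomial estimate for a plain empirical CDF of each path's simulated frequency, and it assumes the randomization keeps every path's start node and length. All names, the number of repetitions, and the threshold are hypothetical.

```python
import random
from collections import Counter, defaultdict

def count_transitions(paths, order):
    """Count sub-paths with order + 1 nodes (order-th order transitions)."""
    counts = Counter()
    for p in paths:
        for i in range(len(p) - order):
            counts[tuple(p[i:i + order + 1])] += 1
    return counts

def randomize_first_order(paths, rng):
    """Replace each path with a first-order random walk of the same length."""
    succ = defaultdict(Counter)
    for (u, v), c in count_transitions(paths, 1).items():
        succ[u][v] += c
    randomized = []
    for p in paths:
        walk = [p[0]]
        while len(walk) < len(p) and succ[walk[-1]]:
            nxt = succ[walk[-1]]
            walk.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
        randomized.append(tuple(walk))
    return randomized

def ground_truth(paths, m=200, threshold=0.95, seed=0):
    """Label a 2nd-order path over-represented if its empirical CDF > threshold."""
    rng = random.Random(seed)
    observed = count_transitions(paths, 2)
    samples = [count_transitions(randomize_first_order(paths, rng), 2)
               for _ in range(m)]
    labels = {}
    for path, freq in observed.items():
        cdf = sum(s[path] <= freq for s in samples) / m  # empirical CDF
        labels[path] = cdf > threshold
    return labels

# Hypothetical data: A always continues to C, B always continues to D.
paths = [("A", "X", "C")] * 20 + [("B", "X", "D")] * 20
labels = ground_truth(paths)
```

In this toy data the first-order walk sends travellers from X to C or D with equal probability, so observing A-X-C twenty times out of twenty is at the top of the simulated distribution and the path is labeled over-represented.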

Tube Data - Ground Truth


Computational complexity

$O(N + |V|^2 \lambda_1^k)$

Scalability
