Query-Based Outlier Detection in Heterogeneous Information Networks
Jonathan Kuck¹, Honglei Zhuang¹, Xifeng Yan², Hasan Cam³, Jiawei Han¹
¹ University of Illinois at Urbana-Champaign
² University of California at Santa Barbara
³ US Army Research Lab
• Heterogeneous information networks are networks composed of multi-typed, interconnected vertices and links
• Outlier detection aims to find vertices that deviate significantly from other vertices
• However, outliers in such a heterogeneous network could be defined in many different ways
– E.g. finding the outlier among the authors of this paper: Jonathan, Honglei, Xifeng, Hasan, Jiawei
• As users have their own intuition about which kinds of outliers they are interested in, allow users to specify queries for outlier detection
Research Challenges
• How do users interact with the system to specify their queries?
• How do we define a general outlierness measure for different queries?
• How do we efficiently find outliers for different queries?
Outline
• Basic Concepts and Notations
• Outlier Query Language
• NetOut: Outlierness Measure
• Processing Outlier Queries
• Experimental Results
• Summary
Basic Concepts and Notations
• Heterogeneous Information Network
– An information network with multiple types of vertices, $G = (V, E, T, \phi)$, where $V$ is the set of vertices; $E$ is the set of edges; $T$ is the set of types, and $\phi : V \to T$ assigns each vertex a type.
[Figure: network schema and an instantiated network]
Basic Concepts and Notations
• Meta-Path
– An ordered sequence of vertex types, denoted as $P$
– E.g. $P = (\text{Author}, \text{Paper}, \text{Venue})$
• Meta-Path Instantiation
– Use $\pi_P(v_i, v_j)$ to denote paths between two vertices $v_i, v_j$ with the types that align with the given meta-path $P$
– E.g. $\pi_P(\text{Zoe}, \text{KDD})$ is shown below in red solid lines
[Figure: network schema and an instantiated network]
Basic Concepts and Notations
• Neighborhood
– Based on a meta-path $P$, defined as $N_P(v_i) = \{ v_j \mid \pi_P(v_i, v_j) \neq \emptyset \}$
– E.g. $N_P(\text{Zoe}) = \{\text{KDD}, \text{ICDE}\}$
• Neighbor Vector
– Based on a meta-path $P$, the neighbor vector $\psi_P(v_i)$ describes how many paths there are from $v_i$ to each of its neighbors
– E.g. $\psi_P(\text{Zoe}) = \{\text{KDD}: 3, \text{ICDE}: 2\}$
[Figure: network schema and an instantiated network]
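The neighborhood and neighbor vector can be computed by walking the network one type at a time and carrying path counts along. The sketch below is only an illustration: the toy network (papers `p1` to `p5` and their venues) is hypothetical and chosen so that it reproduces the Zoe example, and the function names are not from the original work.

```python
from collections import Counter, defaultdict

# Hypothetical toy network: vertex types plus undirected edges.
vertex_type = {
    "Zoe": "Author",
    "p1": "Paper", "p2": "Paper", "p3": "Paper", "p4": "Paper", "p5": "Paper",
    "KDD": "Venue", "ICDE": "Venue",
}
edges = [("Zoe", paper) for paper in ("p1", "p2", "p3", "p4", "p5")]
edges += [("p1", "KDD"), ("p2", "KDD"), ("p3", "KDD"), ("p4", "ICDE"), ("p5", "ICDE")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)


def neighbor_vector(start, meta_path):
    """Count the paths from `start` that follow `meta_path`, grouped by end vertex."""
    if vertex_type[start] != meta_path[0]:
        raise ValueError("start vertex does not match the first type of the meta-path")
    counts = Counter({start: 1})
    for next_type in meta_path[1:]:
        step = Counter()
        for vertex, n_paths in counts.items():
            for neighbor in adjacency[vertex]:
                if vertex_type[neighbor] == next_type:
                    step[neighbor] += n_paths  # extend every path ending at `vertex`
        counts = step
    return dict(counts)


def neighborhood(start, meta_path):
    """Vertices reachable from `start` by at least one path along `meta_path`."""
    return set(neighbor_vector(start, meta_path))


P = ("Author", "Paper", "Venue")
print(sorted(neighbor_vector("Zoe", P).items()))  # [('ICDE', 2), ('KDD', 3)]
print(sorted(neighborhood("Zoe", P)))             # ['ICDE', 'KDD']
```

Carrying counts instead of enumerating paths keeps the work proportional to the number of edges touched at each step.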
Formalization of Outlier Queries
• Given a heterogeneous information network $G$
• A query can be formalized as $Q = (S_c, S_r, \mathcal{P}, w)$
• where
– $S_c \subseteq V$ is a set of (same-typed) candidate vertices
  • Specifies where outliers are chosen from
– $S_r \subseteq V$ is a set of (same-typed) reference vertices
  • Specifies a sample of normal vertices
  • Optional. By default it equals the candidate set
– $(\mathcal{P}, w)$ define a weighted set of meta-paths
  • Specifies how two vertices are compared
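As a rough illustration of this formalization, a query can be held in a small record type. The class and field names below are assumptions made for this sketch; the only behavior taken from the slide is that the reference set defaults to the candidate set.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

MetaPath = Tuple[str, ...]  # e.g. ("Author", "Paper", "Venue")


@dataclass(frozen=True)
class OutlierQuery:
    """Hypothetical container for Q = (S_c, S_r, P, w)."""
    candidates: FrozenSet[str]                  # S_c
    meta_paths: Tuple[MetaPath, ...]            # P
    weights: Tuple[float, ...]                  # w, one weight per meta-path
    references: Optional[FrozenSet[str]] = None # S_r, optional

    def __post_init__(self):
        if len(self.meta_paths) != len(self.weights):
            raise ValueError("each meta-path needs exactly one weight")
        if self.references is None:
            # Default from the slide: the reference set equals the candidate set.
            object.__setattr__(self, "references", self.candidates)


query = OutlierQuery(
    candidates=frozenset({"Sarah", "Rob", "Lucy"}),
    meta_paths=(("Author", "Paper", "Venue"),),
    weights=(1.0,),
)
print(query.references == query.candidates)  # True
```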
Outlier Query Language
• General Formulation
FIND OUTLIERS
FROM author{"C. Faloutsos"}.paper.author
COMPARE TO venue {"KDD"}.paper.author
JUDGED BY author.paper.venue
TOP 10;
• FROM: specifying the candidate set, $S_c = N_P(\text{C. Faloutsos})$ where $P = (\text{Author}, \text{Paper}, \text{Author})$
• COMPARE TO: specifying the reference set, $S_r = N_P(\text{KDD})$ where $P = (\text{Venue}, \text{Paper}, \text{Author})$
• JUDGED BY: specifying how vertices are compared; compare vertices by neighbor vector $\psi_P(v_i)$ where $P = (\text{Author}, \text{Paper}, \text{Venue})$
• TOP 10: only return top 10 outliers
Outlier Query Language (cont'd)
• Allow complicated judging standards
– E.g. multiple weighted neighbor vectors
JUDGED BY author.paper.venue: 0.6, author.paper.author: 0.4
• Allow operations on sets
– E.g. select by conditions
FROM venue {"KDD"}.paper.author AS A
WHERE COUNT(A.paper) > 10
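To make the clause structure concrete, the following sketch splits a query of the form shown above into its parts. It is not the authors' parser: the grammar is inferred from the slide examples only, straight double quotes are assumed, and the `AS` / `WHERE` set operations are deliberately left out.

```python
import re

# Assumed grammar: FIND OUTLIERS FROM <set> [COMPARE TO <set>] JUDGED BY <paths> [TOP <k>];
STATEMENT = re.compile(
    r"FIND\s+OUTLIERS\s+FROM\s+(?P<candidates>.+?)"
    r"(?:\s+COMPARE\s+TO\s+(?P<references>.+?))?"
    r"\s+JUDGED\s+BY\s+(?P<judged_by>.+?)"
    r"(?:\s+TOP\s+(?P<top>\d+))?\s*;",
    re.DOTALL,
)
SET_EXPR = re.compile(r'(?P<type>\w+)\s*\{\s*"(?P<name>[^"]+)"\s*\}(?P<path>(?:\.\w+)+)')


def parse_set(expr):
    """Parse e.g. venue {"KDD"}.paper.author into a start vertex and a meta-path."""
    match = SET_EXPR.fullmatch(expr.strip())
    if match is None:
        raise ValueError(f"unsupported set expression: {expr!r}")
    hops = tuple(match["path"].lstrip(".").split("."))
    return {"type": match["type"], "name": match["name"],
            "meta_path": (match["type"],) + hops}


def parse_judged_by(text):
    """Parse 'a.b.c: 0.6, a.b.a: 0.4' into (meta_path, weight) pairs; weight defaults to 1.0."""
    weighted = []
    for part in text.split(","):
        path, _, weight = part.partition(":")
        weighted.append((tuple(path.strip().split(".")),
                         float(weight) if weight.strip() else 1.0))
    return weighted


def parse_query(text):
    match = STATEMENT.search(text)
    if match is None:
        raise ValueError("not a FIND OUTLIERS statement")
    candidates = parse_set(match["candidates"])
    # COMPARE TO is optional; the reference set then defaults to the candidate set.
    references = parse_set(match["references"]) if match["references"] else candidates
    return {
        "candidates": candidates,
        "references": references,
        "judged_by": parse_judged_by(match["judged_by"]),
        "top": int(match["top"]) if match["top"] else None,
    }


QUERY = '''
FIND OUTLIERS
FROM author{"C. Faloutsos"}.paper.author
COMPARE TO venue {"KDD"}.paper.author
JUDGED BY author.paper.venue
TOP 10;
'''
print(parse_query(QUERY))
```

A regular expression is enough here because the example statements are flat; nested conditions such as `WHERE COUNT(A.paper) > 10` would need a proper grammar.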
NetOut: A General Network-Based Outlierness Measure
• Given a meta-path $P$ (in the JUDGED BY clause), with $P^{-1}$ denoting the reverse of $P$
• Define the connectivity between two vertices: $\kappa(v_i, v_j) = |\pi_{PP^{-1}}(v_i, v_j)|$
– The more paths between them, the more similar they are
• Define relative connectivity: $\sigma(v_i, v_j) = \dfrac{|\pi_{PP^{-1}}(v_i, v_j)|}{|\pi_{PP^{-1}}(v_i, v_i)|}$
– Compare the connectivity to self-connectivity
– Use self-connectivity as an expected connectivity to measure whether two vertices are unexpectedly connected or unexpectedly disconnected
• Outlier Measure: NetOut, $\mathrm{NetOut}(v_i) = \sum_{v_j \in S_r} \sigma(v_i, v_j)$
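Every path counted by the connectivity goes out along $P$ to some end vertex $u$ and comes back along $P^{-1}$, so connectivity can be expressed with neighbor vectors. The short derivation below is a sketch that assumes edges can be traversed in both directions; it writes $\kappa$ for connectivity, $\sigma$ for relative connectivity, and $\psi_P(v)[u]$ for the number of paths from $v$ to $u$ along $P$ (notation assumed here).

```latex
\begin{aligned}
\kappa(v_i, v_j) &= \sum_{u} \psi_P(v_i)[u]\,\psi_P(v_j)[u] \;=\; \psi_P(v_i) \cdot \psi_P(v_j) \\
\kappa(v_i, v_i) &= \psi_P(v_i) \cdot \psi_P(v_i) \;=\; \lVert \psi_P(v_i) \rVert_2^2 \\
\sigma(v_i, v_j) &= \frac{\psi_P(v_i) \cdot \psi_P(v_j)}{\lVert \psi_P(v_i) \rVert_2^2}
\end{aligned}
```

Note that $\sigma$ is not symmetric: it normalizes by the self-connectivity of $v_i$ only.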
Comparing NetOut to Other Measures

A toy example.

| Neighbor vector | VLDB | KDD | STOC | SIGGRAPH |
|---|---|---|---|---|
| Reference author(s) ×100 | 10 | 10 | 1 | 1 |
| Sarah | 10 | 10 | 1 | 1 |
| Rob | 0 | 1 | 20 | 20 |
| Lucy | 0 | 5 | 10 | 10 |
| *Joe | 0 | 0 | 0 | 2 |
| *Emma | 0 | 0 | 0 | 30 |

Outlier measure comparison. The lower the value, the more likely the vertex is an outlier.

| | NetOut | PathSim [1] | CosSim |
|---|---|---|---|
| Sarah | 100.00 | 100.00 | 100.00 |
| Rob | 6.24 | 9.97 | 12.43 |
| Lucy | 31.11 | 32.79 | 32.83 |
| Joe | 50.00 | 1.94 | 7.04 |
| Emma | 3.33 | 5.44 | 7.04 |

• Joe is not necessarily an interesting outlier, as those papers might simply be noise
• Emma is obviously an outlier and should be assigned a lower value
• NetOut assigns Emma (3.33) a lower value than Joe (50.00), whereas PathSim ranks Joe lower than Emma and CosSim gives both 7.04

[1] Sun, Yizhou, et al. "PathSim: Meta path-based top-k similarity search in heterogeneous information networks." VLDB 2011.
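The values in the comparison table can be recomputed from the neighbor vectors. The sketch below assumes that connectivity is the dot product of neighbor vectors, that PathSim is twice the connectivity divided by the sum of the two self-connectivities as in [1], that CosSim is plain cosine similarity, and that each score is summed over the 100 identical reference authors; under those assumptions the printed numbers match the table.

```python
import math

REFERENCE = (10, 10, 1, 1)  # VLDB, KDD, STOC, SIGGRAPH counts of each reference author
N_REFERENCE = 100           # number of identical reference authors
CANDIDATES = {
    "Sarah": (10, 10, 1, 1),
    "Rob": (0, 1, 20, 20),
    "Lucy": (0, 5, 10, 10),
    "Joe": (0, 0, 0, 2),
    "Emma": (0, 0, 0, 30),
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def netout(vec):
    # Sum over references of connectivity divided by the candidate's self-connectivity.
    return N_REFERENCE * dot(vec, REFERENCE) / dot(vec, vec)


def pathsim_sum(vec):
    # PathSim normalizes by both self-connectivities.
    return N_REFERENCE * 2 * dot(vec, REFERENCE) / (dot(vec, vec) + dot(REFERENCE, REFERENCE))


def cossim_sum(vec):
    return N_REFERENCE * dot(vec, REFERENCE) / math.sqrt(dot(vec, vec) * dot(REFERENCE, REFERENCE))


print(f"{'':<6}{'NetOut':>8}{'PathSim':>8}{'CosSim':>8}")
for name, vec in CANDIDATES.items():
    print(f"{name:<6}{netout(vec):8.2f}{pathsim_sum(vec):8.2f}{cossim_sum(vec):8.2f}")
```

Because NetOut divides only by the candidate's own self-connectivity, Emma's 30 SIGGRAPH papers pull her score down sharply, while Joe's two papers do not.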
Comparison on Real Data Set
• Apply to a network constructed from the DBLP data set
• Find outliers among Christos Faloutsos’ coauthors, in terms of their publishing venues
[Figure annotation: authors with very few papers; uninteresting outliers]
Query Processing
• The calculation of NetOut can be written as
$$\mathrm{NetOut}(v_i; Q) = \sum_{v_j \in S_r} \sigma(v_i, v_j) = \sum_{v_j \in S_r} \frac{\psi_P(v_i) \cdot \psi_P(v_j)}{\lVert \psi_P(v_i) \rVert_2^2} = \frac{1}{\lVert \psi_P(v_i) \rVert_2^2}\, \psi_P(v_i) \cdot \sum_{v_j \in S_r} \psi_P(v_j)$$
where $\sigma$ is the relative connectivity and $\psi_P(v_i)$ is the neighbor vector of $v_i$
– The sum $\sum_{v_j \in S_r} \psi_P(v_j)$: $O(|S_r|)$, only calculated once
– The remaining product: $O(|S_c|)$, calculated over all the candidate vertices
• Time complexity: $O(|S_c| + |S_r|)$
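The saving comes from summing the reference neighbor vectors once and reusing that sum for every candidate. The sketch below contrasts a direct double loop with that shortcut; the function names and the sparse-dictionary representation are assumptions for illustration, and every candidate vector is assumed to be non-empty.

```python
from collections import Counter


def netout_naive(candidate_vectors, reference_vectors):
    """Direct evaluation: one dot product per (candidate, reference) pair."""
    scores = {}
    for name, vec in candidate_vectors.items():
        self_conn = sum(c * c for c in vec.values())
        total = 0.0
        for ref in reference_vectors.values():
            total += sum(c * ref.get(key, 0) for key, c in vec.items()) / self_conn
        scores[name] = total
    return scores


def netout_fast(candidate_vectors, reference_vectors):
    """Sum the reference vectors once, then one dot product per candidate."""
    reference_sum = Counter()
    for ref in reference_vectors.values():   # O(|S_r|), done once
        for key, count in ref.items():
            reference_sum[key] += count
    scores = {}
    for name, vec in candidate_vectors.items():   # O(|S_c|)
        self_conn = sum(c * c for c in vec.values())
        scores[name] = sum(c * reference_sum[key] for key, c in vec.items()) / self_conn
    return scores


references = {f"ref{i}": {"VLDB": 10, "KDD": 10, "STOC": 1, "SIGGRAPH": 1} for i in range(100)}
candidates = {
    "Rob": {"KDD": 1, "STOC": 20, "SIGGRAPH": 20},
    "Emma": {"SIGGRAPH": 30},
}
print({name: round(score, 2) for name, score in netout_fast(candidates, references).items()})
# {'Rob': 6.24, 'Emma': 3.33}
```

Both functions return the same scores; the complexity statement treats the length of a neighbor vector as a constant.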
Pre-Materialization of Meta-Path
• We observe…
– Calculation of outlierness only has time complexity of $O(|S_c| + |S_r|)$
– Retrieving the neighbor vector $\psi_P(v_i)$ for each vertex is much more time consuming
• Pre-materialization
– Pre-store the materialization of all the length-2 meta-paths
Selective Pre-Materialization
• Storing the materialization of all length-2 meta-paths can be space-consuming
• Selective Pre-materialization
– From a given set of training queries, take vertices that frequently appear in the candidate sets (e.g. those that appear in more than 10% of queries)
– Pre-materialize all the length-2 meta-paths for these frequently appearing vertices
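The selection step above can be sketched as a frequency count over the training candidate sets followed by a lookup that falls back to on-the-fly computation. Everything named here is hypothetical: `materialize` stands in for whatever routine computes the length-2 meta-path results for one vertex, and the 10% cutoff is the example value from the slide.

```python
from collections import Counter


def frequent_vertices(training_candidate_sets, min_fraction=0.10):
    """Vertices that appear in more than `min_fraction` of the training queries."""
    counts = Counter()
    for candidate_set in training_candidate_sets:
        counts.update(set(candidate_set))  # count each vertex once per query
    cutoff = min_fraction * len(training_candidate_sets)
    return {vertex for vertex, n in counts.items() if n > cutoff}


class SelectiveIndex:
    """Stores materialized results only for the selected vertices."""

    def __init__(self, vertices, materialize):
        self._materialize = materialize
        self._stored = {vertex: materialize(vertex) for vertex in vertices}

    def lookup(self, vertex):
        if vertex in self._stored:
            return self._stored[vertex]      # pre-materialized
        return self._materialize(vertex)     # computed on demand, not stored


training = [{"a", "b"}, {"a"}] + [{"c"}] * 8   # 10 hypothetical training queries
selected = frequent_vertices(training)
print(sorted(selected))  # ['a', 'c']

index = SelectiveIndex(selected, materialize=lambda v: {"length2_paths_of": v})
print(index.lookup("b"))  # {'length2_paths_of': 'b'}
```

Raising the cutoff shrinks the index at the cost of more on-demand computation, which is the trade-off the threshold controls.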
Experimental Setup
• Data Set
– DBLP Data Set
  • 2,244,018 publications, 1,274,360 authors
  • Construct the network according to the schema above
  • Publishing venue information is included
  • Terms are extracted from publication titles
• Query Sets
– Design three different types of queries to find different kinds of outliers
– Generate 10,000 random queries for each type of query as a query set
Case Study
• With different queries, the outlier measure is able to capture different outliers accordingly
– E.g. finding outliers among Christos Faloutsos’ coauthors
[Figure: outlier results by comparing publishing venues]
[Figure: outlier results by comparing coauthor communities]
Efficiency Study
• Comparing strategies
– Baseline: No pre-materialization
– Pre-Materialization (PM): All length-2 meta-path instantiations are pre-computed and indexed
– Selective Pre-Materialization (SPM): Only a subset of instantiations with relative frequency larger than 0.01 are indexed
Parameter Studies for SPM
• Different thresholds for selective pre-materialization
Summary
• Propose a framework for query-based outlier detection in heterogeneous information networks
• Formalize the definition of a query and provide a query language for users to interact with the system
• Design a general outlierness measure NetOut which can effectively find interesting outliers
• Present an efficient implementation to process such queries
Thank you
26-03-2015
• Introduction
• Outlier Query Language
• NetOut Measure
• Query Processing
• Experimental Results