“OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit)
OLAP on Sequence Data
Presenter: Chun Kit Chui (Kit), The University of Hong Kong
Authors: Eric Lo, Ben Kao, Wai-Shing Ho, Sau-Dan Lee, Chun Kit Chui and David W. Cheung
Published in SIGMOD 2008, Vancouver, Canada.
OLAP on Sequence Data
Problem Motivation
Sequence Data Cube and Cuboids
Experimental evaluations
New OLAP operations
System architecture
Future work
Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature.
Examples: Web server access logs; stock market data (e.g. U.S. OIL FUND ETF, MEXCO ENERGY CORP).
Sequence Data
Web server access logs (Web retailer selling sportswear products)
Time | Member-ID | URL | Product | Product type | Brand
2008-1-01 00:01 | 688 | /product.html?pid=12800 | 12800 | Nike shoes | Nike
… | … | … | …
2008-1-01 00:02 | 688 | /product.html?pid=13250 | 13250 | Adidas shoes | Adidas
2008-1-01 00:10 | 14230 | /product.html?pid=324 | 324 | Puma shoes | Puma
… | … | … | …
2008-1-01 02:45 | 688 | /product.html?pid=12800 | 12800 | Nike shoes | Nike
… | … | … | …
2008-1-01 03:49 | 688 | /product.html?pid=329 | 329 | Adidas T-shirts | Adidas
… | … | … | …
2008-1-01 03:45 | 14230 | /checkout.xhtml | Nil | Nil | Nil
The product dimension is associated with a concept hierarchy in which the finest level of abstraction is product ID, followed by product type, and brand.
Browsing Sequence
From the access logs we can trace back the browsing sequences of all members.
Member 688: < Nike shoes, Adidas shoes, Nike shoes >
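Tracing a browsing sequence back from the log amounts to grouping the product-page visits by member and ordering each group by time. The following is a minimal Python sketch under stated assumptions: the `LOG` list is a hand-copied subset of the slide's rows, the month is zero-padded for parsing, and the function name `browsing_sequences` is hypothetical rather than part of the presented system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical in-memory copy of the slide's rows:
# (time, member_id, product_type); the checkout row has no product.
LOG = [
    ("2008-01-01 00:01", 688, "Nike shoes"),
    ("2008-01-01 00:02", 688, "Adidas shoes"),
    ("2008-01-01 00:10", 14230, "Puma shoes"),
    ("2008-01-01 02:45", 688, "Nike shoes"),
    ("2008-01-01 03:49", 688, "Adidas T-shirts"),
    ("2008-01-01 03:45", 14230, None),
]


def browsing_sequences(rows):
    """Group product-page visits by member and order each group by time."""
    by_member = defaultdict(list)
    for time_text, member_id, product_type in rows:
        if product_type is None:  # skip non-product pages such as checkout
            continue
        when = datetime.strptime(time_text, "%Y-%m-%d %H:%M")
        by_member[member_id].append((when, product_type))
    return {
        member: [product for _, product in sorted(visits)]
        for member, visits in by_member.items()
    }


sequences = browsing_sequences(LOG)
```

Here `sequences[688]` holds the product types visited by member 688 in time order, which is the sequence later matched against pattern templates.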
Manager: “I would like to know the number of members who did comparison shopping, and their distribution over all product web page to product web page pairs, within 2008 Quarter 1.”
The query is referring to a particular kind of pattern in the browsing sequences.
The comparison shopping semantics can be expressed by the pattern template < X, Y, X >.
Pattern template < X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | ?
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331
< Nike Shoes, Adidas Shoes, Nike Shoes > is one of the instantiations of the pattern template. Since the browsing sequence of member 688 contains the pattern, the sequence contributes 1 count to the cell.
Pattern template < X, Y, X > (instantiated patterns) | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 1
< Nike Shoes, Puma Shoes, Nike Shoes > | ?
< Nike Shoes, Nike Shoes, Nike Shoes > | ?
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | ?
< Adidas Shoes, Puma Shoes, Adidas Shoes > | ?
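How a sequence contributes to the cells of a pattern template can be sketched in Python as follows. This is an illustration only: it assumes contiguous (substring) matching, lets different symbols bind to the same value (as in < Nike Shoes, Nike Shoes, Nike Shoes >), and uses the hypothetical helper names `instantiations` and `count_cells`.

```python
from collections import Counter


def instantiations(sequence, template):
    """Return the instantiated patterns of `template` that occur as
    contiguous windows of `sequence` (substring matching is assumed)."""
    found = set()
    width = len(template)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        binding = {}
        # A window matches when every repeated symbol sees the same value.
        if all(binding.setdefault(symbol, value) == value
               for symbol, value in zip(template, window)):
            found.add(tuple(window))
    return found


def count_cells(sequences, template):
    """Each sequence adds 1 to every cell whose pattern it contains."""
    cells = Counter()
    for sequence in sequences:
        cells.update(instantiations(sequence, template))
    return cells


member_688 = ["Nike shoes", "Adidas shoes", "Nike shoes", "Adidas T-shirts"]
cells = count_cells([member_688], ("X", "Y", "X"))
```

With only member 688's sequence as input, the single non-empty cell is < Nike shoes, Adidas shoes, Nike shoes > with a count of 1. The window < Adidas shoes, Nike shoes, Adidas T-shirts > is rejected because X would have to bind to two different values.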
The aggregated number of members is counted and a tabulated view of the sequence data should be returned.
< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 200,000
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331
Sequence OLAP system
• Support “pattern based” grouping and aggregation.
[Diagram labels: Manager, Query, Sequence OLAP system, Result]
• Obtain query results in real time (OLAP feature).
Follow-up query (Manager): “So many members did comparison shopping between Nike shoes and Adidas shoes. I would like to further investigate whether those members browse one more product and, if so, what that product is.”
The new query can be expressed by appending a pattern symbol “Z” to form a new pattern template < X, Y, X, Z >. The result shows the statistics of one more browsing step after the comparison shopping between Nike Shoes and Adidas Shoes.
< X, Y, X, Z > with X = “Nike Shoes”, Y = “Adidas Shoes”, Z = Any | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes > | 15,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts > | 180,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes > | 9,000
… | …
The manager finds that the Adidas T-shirts page is the most popular page among the members who did comparison shopping between the Nike shoes and Adidas shoes pages.
• Provide OLAP operations to ease sequence analysis.
Manager: “The comparison shopping patterns displayed at the “product type” abstraction level are too detailed. I would like to view some higher-level statistics.”
A simple “roll up” operation on the pattern template transforms the summary statistics to the brand abstraction level.
Concept hierarchy example: the brand Nike covers the product types Nike shoes, Nike T-shirts, Nike Basketballs and Nike socks.
“Product type” abstraction level
< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 200,000
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331
“Brand” abstraction level
< X, Y, X > | # Members
< Nike, Adidas, Nike > | 3,150,000
< Nike, Puma, Nike > | 2,180,000
< Nike, Nike, Nike > | 19,000,000
… | …
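A roll-up on the pattern template can be pictured as replacing every item of every sequence by its parent in the concept hierarchy and then matching the template again. The Python sketch below is a hedged illustration: `BRAND_OF` is a hypothetical fragment of the hierarchy, the two sequences are invented, and contiguous matching of < X, Y, X > is assumed.

```python
from collections import Counter

# Hypothetical fragment of the product concept hierarchy (type -> brand).
BRAND_OF = {
    "Nike shoes": "Nike",
    "Nike T-shirts": "Nike",
    "Adidas shoes": "Adidas",
    "Adidas T-shirts": "Adidas",
    "Puma shoes": "Puma",
}


def xyx_cells(sequences):
    """Count, per <X, Y, X> cell, the sequences with a contiguous a, b, a window."""
    cells = Counter()
    for seq in sequences:
        windows = {(a, b, c) for a, b, c in zip(seq, seq[1:], seq[2:]) if a == c}
        cells.update(windows)  # a set, so each sequence counts once per cell
    return cells


def roll_up(sequences, parent_of):
    """Replace every item by its parent in the concept hierarchy."""
    return [[parent_of[item] for item in seq] for seq in sequences]


# Invented product-type level sequences.
type_level = [
    ["Nike shoes", "Adidas shoes", "Nike shoes"],
    ["Nike shoes", "Adidas shoes", "Nike T-shirts"],
]
by_type = xyx_cells(type_level)
by_brand = xyx_cells(roll_up(type_level, BRAND_OF))
```

The second sequence matches no product-type cell but does match < Nike, Adidas, Nike > after the roll-up, so the brand-level count (2) cannot be obtained by adding up product-type cells alone.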
Research Objective
To design and implement an OLAP system that is able to
• support “pattern based” grouping and aggregation;
• obtain query results in real time, especially optimized for interactive/iterative queries;
• provide OLAP operations to ease explorative analysis of sequence data.
Sequence OLAP views:
< X, Y, X > | # Members
< Nike, Adidas, Nike > | 315,000
< Nike, Puma, Nike > | 2,180,000
< Nike, Nike, Nike > | 189,000
… | …
< X, Y > | # Members
< Nike, Adidas > | 1,315,000
< Nike, Puma > | 6,480,000
< Nike, Nike > | 3,189,000
… | …
RFID Logs
Radio-frequency identification (RFID) is an automatic identification method, relying on storing and remotely retrieving data using devices called RFID tags.
The smart card system in public transit:
• Octopus card in Hong Kong, Orca card in Seattle (2009), etc.
• Electronic money: payment can be done easily by waving the card over the card reader.
• Travel histories of passengers are logged in a database.
• Generates a massive amount of sequence data.
[Diagram labels: Smart card, Card reader, Event Database]
Event Database
Time | Card-ID | Location | Action | Amount
2008-7-25 09:01 | Kit | Shatin | in | -
2008-7-25 09:25 | Kit | Central | out | - $5
… | … | … | … | …
2008-7-25 18:23 | Kit | Central Machine #10 | Add value | + $100
2008-7-25 18:25 | Kit | Central | in | -
… | … | … | … | …
2008-7-25 18:49 | Kit | Shatin | out | - $5
… | … | … | … | …
Marketing Manager: “The number of round-trip passengers and their distributions over all origin-destination station pairs within 2008 Quarter 4.”
Sequence OLAP system
• Support “pattern based” grouping and aggregation.
• Obtain query results in real time.
• Provide OLAP operations to ease explorative analysis.
Result: round trip statistics (stations level)
< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2,032
< Shatin, Admiralty, Admiralty, Shatin > | 1,982
… | …
< Admiralty, Central, Central, Admiralty > | 22,822
< Admiralty, Kowloon, Kowloon, Admiralty > | 10,020
Sequence Data Cuboid
A logical view of sequence data at a particular degree of summarization.
Preliminary
Sequence Cuboid (S-Cuboid)
• A logical view of sequence data at a particular degree of summarization.
• Sequences can be characterized by
  • the attribute values of the events in the sequence (e.g. time, spending, product type);
  • the subsequence/substring patterns they possess (e.g. < X, Y, X >, < X, Y, Y, X >).
An S-Cuboid
< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2
< Kowloon, Admiralty, Admiralty, Kowloon > | 9
… | …
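The difference between possessing a pattern as a substring and as a subsequence can be made concrete with a small Python sketch. The function names and the example trip are hypothetical; the only assumption taken from the slides is that both kinds of patterns exist.

```python
def contains_substring(sequence, pattern):
    """True when `pattern` occurs as a contiguous run inside `sequence`."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern
               for i in range(len(sequence) - n + 1))


def contains_subsequence(sequence, pattern):
    """True when the items of `pattern` occur in order, gaps allowed."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator up to the first match.
    return all(item in remaining for item in pattern)


# Invented trip with a stop at Admiralty in the middle.
trip = ["Shatin", "Central", "Admiralty", "Central", "Shatin"]
round_trip = ["Shatin", "Central", "Central", "Shatin"]

as_substring = contains_substring(trip, round_trip)
as_subsequence = contains_subsequence(trip, round_trip)
```

The intermediate stop breaks the contiguous match but not the in-order match, so the same sequence falls into the < Shatin, Central, Central, Shatin > cell only under subsequence semantics.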
Phase 1. Sequence Formation
An event selection step selects a set of relevant records and attributes.
Event Database
Time | Card-ID | Location | Action | Amount
2008-6-09 00:01 | Kit | Shatin | in | 0
2008-6-09 02:25 | Kit | Central | out | -5
… | … | … | … | …
2008-6-14 02:23 | Kit | Central Machine #10 | Add value | +100
2008-6-14 02:25 | Kit | Central | in | 0
… | … | … | … | …
2008-6-14 18:49 | Kit | Shatin | out | -5
… | … | … | … | …
After event selection (the “Add value” record is no longer present)
Time | Card-ID | Location | Action | Amount
2008-6-09 00:01 | Kit | Shatin | in | 0
2008-6-09 02:25 | Kit | Central | out | -5
… | … | … | … | …
2008-6-14 02:25 | Kit | Central | in | 0
… | … | … | … | …
2008-6-14 18:49 | Kit | Shatin | out | -5
… | … | … | … | …
A sequence formation step forms sequences from the event dataset.
Sequences can be formed per day and for each individual user. By doing this, we have a number of daily travel sequences for each user. E.g. S1 is Kit’s trip on Monday.
Sequence formation (User : Individual, Time : Day)
Seq ID | Sequence of events
S1 | < e1, e2, e102, e180 >
S2 | < e3, e7, e8, e12, e19, e232, e234, e235 >
S3 | < e4, e5, e9, e13, e14, e290, e292, e352 >
… | …
Sequences can also be formed according to the time dimension at the abstraction level of year, per individual user. E.g. S1 is Kit’s trips in 2008.
Sequence formation (User : Individual, Time : Year)
Seq ID | Sequence of events
S1 | < e1, e2, e102, e180, e1002, e1800, e1801, … >
S2 | < e3, e7, e8, e12, e19, e232, e234, e235, e2134, e2135 >
S3 | < e4, e5, e9, e13, e14, e290, e292, e352, e3252, … >
… | …
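Event selection followed by sequence formation at a chosen time granularity can be sketched in Python as below. This is a simplified illustration: the event ids, the `EVENTS` list and the function `form_sequences` are hypothetical, and only the day and year levels of the time dimension are modelled.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (event_id, time, card_id, location, action).
EVENTS = [
    ("e1", "2008-06-09 00:01", "Kit", "Shatin", "in"),
    ("e2", "2008-06-09 02:25", "Kit", "Central", "out"),
    ("e3", "2008-06-14 02:23", "Kit", "Central", "add value"),
    ("e4", "2008-06-14 02:25", "Kit", "Central", "in"),
    ("e5", "2008-06-14 18:49", "Kit", "Shatin", "out"),
]


def form_sequences(events, time_level):
    """Select in/out events, then form one time-ordered sequence per
    (card, time bucket), where the bucket is a day or a year."""
    bucket_format = {"day": "%Y-%m-%d", "year": "%Y"}[time_level]
    buckets = defaultdict(list)
    for event_id, time_text, card, _location, action in events:
        if action not in ("in", "out"):  # event selection
            continue
        when = datetime.strptime(time_text, "%Y-%m-%d %H:%M")
        buckets[(card, when.strftime(bucket_format))].append((when, event_id))
    return {key: [event_id for _, event_id in sorted(items)]
            for key, items in buckets.items()}


daily = form_sequences(EVENTS, "day")
yearly = form_sequences(EVENTS, "year")
```

At the day level Kit gets two short sequences, one per travel day; at the year level the same events form a single longer sequence.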
Phase 2. S-Cuboid construction
[Figure: the sequences formed with User : Individual and Time : Day (S1, S2, S3, …) arranged on a grid with axes “Time : day” and “User : individual”; users shown are Kit, Ben and Shing, and S1 is Kit’s trip on Monday.]
A sequence grouping step groups the sequences that share the same dimension values into a sequence group. E.g. travel sequences are grouped according to their fare groups.
[Figure: Sequence Grouping takes the grid with axes “Time : day” and “User : individual” to a grid with axes “Time : day” and “User : fare-group”; labels shown are Kit, Ben, Shing, Regular Group and Monday.]
The pattern grouping step further groups the sequences according to the “patterns” they possess.
[Figure: Pattern Grouping for the pattern < X, Y, Y, X > adds the axes “X (Location : station)” and “Y (Location : station)” to the sequence groups.]
Each cell represents an instantiated pattern, e.g. < Shatin, Central, Central, Shatin >. We assign a sequence to a cell if that sequence contains the instantiated pattern.
Event | Time | Card-ID | Location | Action | Amount
e1 | 2008-6-09 00:01 | Kit | Shatin | in | 0
e2 | 2008-6-09 02:25 | Kit | Central | out | -5
… | … | … | … | … | …
e102 | 2008-6-09 22:25 | Kit | Central | in | 0
… | … | … | … | … | …
e180 | 2008-6-09 23:49 | Kit | Shatin | out | -5
… | … | … | … | … | …
[Figure: in the grid for pattern < X, Y, Y, X >, the cell X = Shatin, Y = Central holds the sequences S1 and S3.]
Finally, an aggregation function is applied to the sequences in each cuboid cell.
[Figure: the cell X = Shatin, Y = Central of pattern < X, Y, Y, X > holds S1 and S3; aggregated value: Count: 2.]
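The three steps of Phase 2 (sequence grouping, pattern grouping, aggregation) can be sketched together in Python. Everything here is an assumption made for illustration: the three daily sequences, the `FARE_GROUP` table, the helper names, contiguous matching, and COUNT as the aggregation function.

```python
from collections import Counter

# Hypothetical daily sequences of station names: (card, day, stations).
SEQUENCES = [
    ("Kit", "Monday", ["Shatin", "Central", "Central", "Shatin"]),
    ("Ben", "Monday", ["Shatin", "Central", "Central", "Shatin"]),
    ("Shing", "Monday", ["Kowloon", "Central"]),
]
FARE_GROUP = {"Kit": "Regular", "Ben": "Regular", "Shing": "Regular"}  # hypothetical


def bind(window, template):
    """Return {symbol: value} if `window` instantiates `template`, else None."""
    binding = {}
    for symbol, value in zip(template, window):
        if binding.setdefault(symbol, value) != value:
            return None
    return binding


def build_cuboid(sequences, template, fare_group):
    symbols = list(dict.fromkeys(template))  # pattern dimensions, e.g. X, Y
    width = len(template)
    cuboid = Counter()
    for card, day, stations in sequences:
        global_key = (day, fare_group[card])  # sequence grouping
        cells = set()
        for start in range(len(stations) - width + 1):  # pattern grouping
            binding = bind(stations[start:start + width], template)
            if binding is not None:
                cells.add(tuple(binding[s] for s in symbols))
        for cell in cells:  # COUNT: one per sequence per cell
            cuboid[global_key + cell] += 1
    return cuboid


cuboid = build_cuboid(SEQUENCES, ("X", "Y", "Y", "X"), FARE_GROUP)
```

Each key of `cuboid` is (day, fare group, X, Y), so the result is a four-dimensional view; with the invented data above, the cell (Monday, Regular, Shatin, Central) receives a count of 2.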
4D S-Cuboid
< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2
< Shatin, Kowloon, Kowloon, Shatin > | 9
… | …
The 4D S-Cuboid has two kinds of dimensions:
Global Dimensions: time (day) and user (fare-group).
Pattern Dimensions: X and Y (Location : station).
Sequence Cuboid query language
Query: the number of round-trip passengers and their distributions over all origin-destination station pairs within 2007 Quarter 4.
This query specifies the construction of the S-Cuboid that answers the round-trip query in the running example.
The query has three parts: Sequence Formation, Sequence Grouping and Pattern Grouping.
Sequence Formation: form individual daily travel sequences.
Sequence Grouping: we specify the global dimensions in the sequence grouping step. Group the sequences with the same fare-group within the same day.
Pattern Grouping: group the sequences according to the pattern template < X, Y, Y, X >, where X and Y refer to the location dimension at the station abstraction level.
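The transcript does not preserve the query text itself, so the following Python sketch only mirrors the three parts described above as a plain data structure. The class `CuboidSpec` and all of its field names are invented for illustration and are not the syntax of the query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CuboidSpec:
    """Hypothetical container for an S-Cuboid specification."""
    aggregate: str
    formation: tuple      # sequence formation: (dimension, level) pairs
    global_dims: tuple    # sequence grouping: (dimension, level) pairs
    template: tuple       # pattern grouping: pattern template symbols
    pattern_dims: tuple   # (symbol, dimension, level) triples


round_trip = CuboidSpec(
    aggregate="COUNT",
    formation=(("user", "individual"), ("time", "day")),
    global_dims=(("user", "fare-group"), ("time", "day")),
    template=("X", "Y", "Y", "X"),
    pattern_dims=(("X", "location", "station"), ("Y", "location", "station")),
)


def dimensionality(spec):
    """Global dimensions plus pattern dimensions."""
    return len(spec.global_dims) + len(spec.pattern_dims)


dims = dimensionality(round_trip)
```

Two global dimensions plus two pattern dimensions give the four dimensions of the round-trip S-Cuboid.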
What exactly is a round-trip pattern? The predicates further increase the expressive power of pattern matching in the query language.
Labels on the query: pattern template, pattern dimensions, global dimensions.
The cell restriction defines how to deal with the situation where a data sequence contains multiple occurrences of a cell’s pattern. E.g. a sequence contributes 1 count whenever we can find one match of the pattern in the sequence.
E.g. Kit < Shatin, Central, Central, Shatin, Shatin, Central, Central, Shatin >
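The effect of a cell restriction can be illustrated on Kit's sequence with a short Python sketch. Contiguous matching is assumed, and the "count every occurrence" variant is shown only as a hypothetical contrast; the slides describe the one-count-per-sequence behaviour.

```python
def occurrences(sequence, pattern):
    """Number of contiguous positions at which `pattern` occurs."""
    n = len(pattern)
    return sum(sequence[i:i + n] == pattern
               for i in range(len(sequence) - n + 1))


kit = ["Shatin", "Central", "Central", "Shatin",
       "Shatin", "Central", "Central", "Shatin"]
cell = ["Shatin", "Central", "Central", "Shatin"]

# Restriction described on the slide: at most one count per sequence.
per_sequence = 1 if occurrences(kit, cell) > 0 else 0
# Hypothetical alternative: every occurrence counts.
per_occurrence = occurrences(kit, cell)
```

Kit's sequence contains the cell's pattern twice, so the two policies give 1 and 2 respectively for the same cell.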
Any change to the cuboid specification transforms the S-Cuboid into another one. E.g. changing the pattern template to (X, Y, Y, X, Z) generates another S-Cuboid.
Properties of S-Cuboids
• Exponential number of S-Cuboids: the length of the pattern template is infinite, e.g. pattern template (X, Y, Y, X, A, B, …). Recall that changing the pattern template essentially changes the cuboid specification and thus generates a new cuboid.
• Non-summarizable
In traditional OLAP systems, data are summarizable, i.e. summaries at a finer abstraction level can be used to construct the summary at a higher abstraction level.
Traditional OLAP: summarizable!
Finer summaries
Day | Mon | Tue | Wed | Thur | Fri | Sat | Sun
# Sales | 1 | 1 | 1 | 1 | 1 | 1 | 1
Coarser summaries
Whole week | # Sales: 7
Properties of S-Cuboids
• Infinite number of S-Cuboids: the number of pattern dimensions is infinite, e.g. pattern template (X, Y, Y, X, A, B, …).
• Non-summarizable
Sequence OLAP
Sequence Database
Seq ID | Sequence of events
Kit | < Kowloon, Central, Kowloon, Central >
Ben | < Kowloon, Central, Central, Kowloon >
The S-Cuboid with pattern template < X, Y, Z > (finer aggregates)
< X, Y, Z > | Count
< Kowloon, Central, Kowloon > | 1
< Kowloon, Central, Central > | 1
[Figure: the finer summaries < A, B, A > and < A, B, B > each have #Sequences = 1; the #Sequences of the coarser summary < A, B > is marked “?”.]
Can we compute the S-Cuboid with pattern < X, Y > (coarser summary) from the S-Cuboid with pattern < X, Y, Z > (finer summary) without looking at the sequence database?
S-Cuboid (coarser aggregates)
< X, Y > | Count
< Kowloon, Central > | ?
First sequence database
Seq ID | Sequence of events
Kit | < Kowloon, Central, Kowloon, Central >
Ben | < Kowloon, Central, Central, Kowloon >
S-Cuboid (finer aggregates): < Kowloon, Central, Kowloon > 1; < Kowloon, Central, Central > 1
S-Cuboid (coarser aggregates): < Kowloon, Central > 2
Second sequence database
Seq ID | Sequence of events
Kit | < Kowloon, Central, Kowloon, Central, Central >
Ben | < Kowloon, Admiralty >
S-Cuboid (finer aggregates): < Kowloon, Central, Kowloon > 1; < Kowloon, Central, Central > 1
S-Cuboid (coarser aggregates): < Kowloon, Central > 1
The problem is that we don’t know whether the counts in these two patterns are generated from the same sequence or from two different sequences.
“OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit)
Properties of S-Cuboids
Infinite number of S-cuboids: the number of pattern dimensions is infinite, e.g. pattern template (X,Y,Y,X,A,B,…).
Non-summarizable.

Traditional OLAP: summarizable!
             Mon   Tue   Wed   Thur   Fri   Sat   Sun     Whole week
# Sales      1     1     1     1      1     1     1       7

Sequence OLAP: given #Sequences = 1 for each of the finer summaries < A, B, A > and < A, B, B >, what is #Sequences for the coarser summary < A, B >?

Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?

Case 1
Sequence Database
Seq ID    Sequence of events
Kit       < Kowloon, Central, Kowloon, Central >
Ben       < Kowloon, Central, Central, Kowloon >
S-Cuboid (finer aggregates)
< X, Y, Z >                         Count
< Kowloon, Central, Kowloon >       1
< Kowloon, Central, Central >       1
S-Cuboid (coarser aggregates)
< X, Y >                            Count
< Kowloon, Central >                2

Case 2
Sequence Database
Seq ID    Sequence of events
Kit       < Kowloon, Central, Kowloon, Central, Central >
Ben       < Kowloon, Admiralty >
S-Cuboid (finer aggregates)
< X, Y, Z >                         Count
< Kowloon, Central, Kowloon >       1
< Kowloon, Central, Central >       1
S-Cuboid (coarser aggregates)
< X, Y >                            Count
< Kowloon, Central >                1

Non-summarizable! The problem is that we don’t know whether the counts in these two patterns are generated from the same sequence or from two different sequences.
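The two sequence databases on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `substrings` and `s_cuboid` are made up here, and the sketch assumes a pattern is counted once per sequence that contains it as a contiguous substring. It shows that the two finer cells from the slide have the same counts in both databases while the coarser cell does not.

```python
from collections import Counter

def substrings(seq, k):
    """Distinct contiguous length-k patterns that occur in one sequence."""
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def s_cuboid(db, k):
    """For each length-k pattern, count the sequences that contain it."""
    counts = Counter()
    for seq in db.values():
        counts.update(substrings(seq, k))  # each sequence contributes at most 1 per cell
    return counts

K, C, A = "Kowloon", "Central", "Admiralty"
db1 = {"Kit": [K, C, K, C], "Ben": [K, C, C, K]}
db2 = {"Kit": [K, C, K, C, C], "Ben": [K, A]}

finer1, finer2 = s_cuboid(db1, 3), s_cuboid(db2, 3)
coarser1, coarser2 = s_cuboid(db1, 2), s_cuboid(db2, 2)

# Same counts in the two finer cells shown on the slide ...
print(finer1[(K, C, K)], finer1[(K, C, C)], finer2[(K, C, K)], finer2[(K, C, C)])
# ... but different counts in the coarser cell < Kowloon, Central >.
print(coarser1[(K, C)], coarser2[(K, C)])
```

In the first database the two finer counts come from two different passengers, in the second from the same passenger, which is exactly the information the finer cells do not carry.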
Properties of S-Cuboids
Exponential number of S-cuboids: the length of the pattern template is infinite, e.g. pattern template (X,Y,Y,X,A,B,…). Full materialization is impossible!
Non-summarizable: coarser aggregates cannot be computed solely from the corresponding finer aggregates. Partial materialization is infeasible!
Research direction: precompute some other auxiliary data structures so that queries can be computed online using the pre-built data structures.
S-OLAP Specific Operations
Assist explorative analysis of the sequence data
S-OLAP specific operations
Navigate between cuboids with ease.
Traditional OLAP operations for Global Dimensions: SLICE, DICE, ROLL-UP, DRILL-DOWN, etc.
New S-OLAP operations for Pattern Dimensions / Pattern Template:
APPEND(X): (X,Y,Y) → (X,Y,Y,X)
DE-TAIL: (X,Y,Y,X) → (X,Y,Y)
PREPEND(Z): (X,Y,Y,X) → (Z,X,Y,Y,X)
DE-HEAD: (Q,Y,Y,X) → (Y,Y,X)
PATTERN-ROLL-UP(X): (X,Y,Y,X) → (X,Y,Y,X), with X at a coarser abstraction level
PATTERN-DRILL-DOWN(X): (X,Y,Y,X) → (x,Y,Y,x), with X at a finer abstraction level
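The pattern-template operations listed above can be sketched as small functions on tuples. This is a toy model, not the system's implementation: the function names, the `HIERARCHY` tuple and the per-symbol `levels` dictionary are assumptions introduced only for illustration.

```python
# A pattern template is modelled as a tuple of pattern symbols.
def append(template, symbol):
    return template + (symbol,)

def de_tail(template):
    return template[:-1]

def prepend(template, symbol):
    return (symbol,) + template

def de_head(template):
    return template[1:]

# Hypothetical two-level concept hierarchy, finest level first.
HIERARCHY = ("station", "line")

def pattern_roll_up(levels, symbol):
    """Move one pattern symbol to the next coarser abstraction level."""
    i = HIERARCHY.index(levels[symbol])
    return {**levels, symbol: HIERARCHY[min(i + 1, len(HIERARCHY) - 1)]}

def pattern_drill_down(levels, symbol):
    """Move one pattern symbol to the next finer abstraction level."""
    i = HIERARCHY.index(levels[symbol])
    return {**levels, symbol: HIERARCHY[max(i - 1, 0)]}

t = ("X", "Y", "Y")
t4 = append(t, "X")                       # ("X", "Y", "Y", "X")
levels = {"X": "line", "Y": "line"}
drilled = pattern_drill_down(levels, "Y")  # Y now at station level
```

APPEND and DE-TAIL are inverses of each other, as are PREPEND and DE-HEAD, so a user can step back and forth between neighbouring templates.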
Sequence OLAP
“Tell me the summary statistics of the single-trip travel patterns of passengers among different Rail Lines, please.”
Pattern template: < X, Y >
CUBOID by SUBSTRING(X,Y)
  WITH X as location at “Rail Lines”, Y as location at “Rail Lines”
  LEFT-MAXIMALITY (x1, y1)
  WITH x1.action = “in” AND y1.action = “out”
S-Cuboid 1 (10 * 10 cells): < X, Y >, X and Y at Line level
< X, Y >                              # Passenger
< Tsuen Wan Line, Island Line >       120,000
< Island Line, Tsuen Wan Line >       8,000
…                                     …
“More detailed statistics of passengers traveling from the Tsuen Wan Line to each of the Island Line stations, please.”
S-Cuboid 2 (1 * 14 cells), obtained by SLICE and P-DRILL-DOWN: < X, Y >, X at Line level, Y at Station level; X = “Tsuen Wan Line”, Y = “Island Line”
< X, Y >                              # Passenger
< Tsuen Wan Line, Central >           100,000
< Tsuen Wan Line, Admiralty >         8,300
< Tsuen Wan Line, Wan Chai >          4,030
< Tsuen Wan Line, Causeway Bay >      12,430
…                                     …
Instead of specifying the S-Cuboid construction query, a SLICE plus a P-DRILL-DOWN(Y) is done.
S-Cuboid 3 (1 * 14 * 14 cells), obtained by APPEND(Y): < X, Y, Y >, X at Line level, Y at Station level; X = “Tsuen Wan Line”, Y = “Island Line”
< X, Y, Y >                                   # Passenger
< Tsuen Wan Line, Central, Central >          90,000
< Tsuen Wan Line, Admiralty, Admiralty >      8,300
< Tsuen Wan Line, Wan Chai, Wan Chai >        4,030
< Tsuen Wan Line, Admiralty, Admiralty >      2,430
…                                             …
DE-TAIL returns from S-Cuboid 3 to S-Cuboid 2.
The S-OLAP operations not only assist the exploratory analysis of the sequence data; they also hide all the technical details of specifying the S-Cuboid query from the business users.
System Architecture
Event Dataset: The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.
Sequence Query Engine: The job of the Sequence Query Engine is to compose sets of event sequences out of the event dataset (Phase 1 in S-Cuboid construction).
Sequence Cache
User Interface: The User Interface provides certain user-friendly components to help a user specify an S-cuboid.
Sequence OLAP Engine and Cuboid Repository: Given an S-Cuboid query, the S-OLAP Engine consults a Cuboid Repository to see if such an S-cuboid has been previously computed and stored.
Queries and Results are exchanged between the User Interface and the Sequence OLAP Engine.
Auxiliary Data Structures: The S-OLAP Engine computes the S-cuboid with the help of certain Auxiliary Data Structures.
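The interplay between the S-OLAP Engine and the Cuboid Repository can be pictured as a simple lookup-before-compute step. The sketch below is an assumption-laden toy: the class name `SOLAPEngineSketch`, the use of the pattern template as the repository key, and the `compute_cuboid` callback are all hypothetical and are not taken from the prototype.

```python
class SOLAPEngineSketch:
    """Toy engine: consult the cuboid repository first, compute only on a miss."""

    def __init__(self, compute_cuboid):
        self._compute_cuboid = compute_cuboid  # stands in for the real cuboid computation
        self._repository = {}                  # cuboid repository: query key -> stored S-cuboid
        self.computed = 0                      # number of cuboids actually computed

    def query(self, template):
        key = tuple(template)
        if key not in self._repository:
            self._repository[key] = self._compute_cuboid(key)
            self.computed += 1
        return self._repository[key]

engine = SOLAPEngineSketch(lambda template: {"template": template})
first = engine.query(("X", "Y"))
second = engine.query(("X", "Y"))  # answered from the repository
```

A real query key would also have to cover the global dimensions, abstraction levels and predicates of the S-cuboid query, not just the template.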
Auxiliary Data Structures
Counter-based approach
Inverted indices approach
Counter-Based approach
Each cell in an S-cuboid is associated with a counter.
To determine the counters’ values, the entire set of sequences is scanned. For each sequence s, we determine the cells whose associated patterns are contained in s and increment each such counter by 1.
Basic and simple, but processing iterative queries requires counting from scratch.
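The counter-based scan described above can be sketched as follows. This is a minimal illustration under stated assumptions: patterns are matched as contiguous substrings, repeated template symbols must bind to the same event value, different symbols are allowed to bind to the same value, and the names `instantiate` and `counter_based` are made up for this sketch.

```python
from collections import Counter

def instantiate(window, template):
    """Return the cell (tuple of event values) if the window fits the template, else None."""
    binding = {}
    for value, symbol in zip(window, template):
        # A repeated symbol such as X in (X,Y,Y,X) must always see the same value.
        if binding.setdefault(symbol, value) != value:
            return None
    return tuple(window)

def counter_based(sequences, template):
    """Scan every sequence once and increment the counter of each cell it contains."""
    k = len(template)
    counters = Counter()
    for s in sequences:
        cells = set()  # a sequence increments each cell at most once
        for i in range(len(s) - k + 1):
            cell = instantiate(s[i:i + k], template)
            if cell is not None:
                cells.add(cell)
        counters.update(cells)
    return counters

sequences = [
    ["a", "b", "b", "a"],
    ["a", "b", "b", "a", "b", "b", "a"],
    ["a", "b", "a"],
]
xyyx = counter_based(sequences, ("X", "Y", "Y", "X"))
xy = counter_based(sequences, ("X", "Y"))
```

Every new template triggers another full pass over `sequences`, which is the "counting from scratch" cost noted on the slide.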
S-OLAP query evaluation
Inverted-Index Approach
Based on the fragment cube concept (X. Li, J. Han, and H. Gonzalez. VLDB 2004).
A set of inverted indices is created by pre-processing the data offline: Algorithm BuildIndex (see paper).
During query processing, the relevant inverted indices are joined in real time based on the matching pattern: Algorithm QueryIndices (see paper).
A by-product of answering a query is the creation of new inverted indices.
Newly built indices are useful for processing iterative S-OLAP operations (see paper for algorithms).
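The idea of joining inverted indices can be illustrated with a small sketch. It is not the paper's BuildIndex or QueryIndices algorithm: the functions `build_index`, `extend_index` and `cuboid` are hypothetical, and the sketch assumes substring patterns and a base index over length-2 patterns that stores, per sequence, the positions where the pattern starts.

```python
from collections import defaultdict

def build_index(db):
    """Length-2 inverted index: pattern -> {seq_id: set of start positions}."""
    index = defaultdict(lambda: defaultdict(set))
    for sid, seq in db.items():
        for i in range(len(seq) - 1):
            index[(seq[i], seq[i + 1])][sid].add(i)
    return index

def extend_index(index_k, index_2):
    """Join a length-k index with the length-2 index into a length-(k+1) index."""
    out = defaultdict(lambda: defaultdict(set))
    for pat, postings in index_k.items():
        for (a, b), postings2 in index_2.items():
            if a != pat[-1]:
                continue  # the pair must start where the longer pattern ends
            for sid in postings.keys() & postings2.keys():
                hits = {p for p in postings[sid] if p + len(pat) - 1 in postings2[sid]}
                if hits:
                    out[pat + (b,)][sid] = hits
    return out

def cuboid(index):
    """Cell count = number of sequences in the pattern's posting list."""
    return {pat: len(postings) for pat, postings in index.items()}

K, C = "Kowloon", "Central"
db = {"Kit": [K, C, K, C], "Ben": [K, C, C, K]}
idx2 = build_index(db)
sliced = {(K, C): idx2[(K, C)]}     # SLICE on < Kowloon, Central >
idx3 = extend_index(sliced, idx2)   # APPEND: only sequences in the sliced postings are touched
```

The extended index `idx3` is itself a new inverted index, so a further APPEND could start from it instead of from the raw sequences.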
Experiments
A prototype S-OLAP system was implemented using C++.
Real data: passenger traveling history; KDD Cup 2000 clickstream data from a web retailer selling legwear and legcare products (50,524 sequences).
KDD Cup 2000 Question 1: look for page-click patterns. We answer this question in an exploratory way via three iterative queries.
Qa: Look for the statistics of all 2-step navigations at the “page category” level.
The corresponding pattern template to capture the 2-step navigation semantics is <X,Y>.
Cuboid Qa (44*44 cells): < X, Y >, X and Y at “page category” level
< X, Y >                                    # User sessions
< Main page, Product Catalog >              6,524
…                                           …
< Product Catalog, Legwear Product >        2,201
…                                           …
< Main page, Promotion ad >                 852
…                                           …
< Product Catalog, Legcare Product >        150
Comparatively speaking, very few visitors browse from a product catalog page to a Legcare product page.
Qb: Since many visitors browse from the product catalog to a legwear product page, what exactly are the products they browse?
Operations: 1. SLICE, 2. P-DRILL-DOWN
Cuboid Qb (1*279 cells): < X, Y > (sliced), X at “page category” level; Y at “page” level
< X, Y >                                    # User sessions
< Product Catalog, Null >                   181
< Product Catalog, PID - 34839 >            172
< Product Catalog, PID - 34897 >            163
…                                           …
The most popular product that visitors browse from the catalog page is product 34839 (a DKNY skin legwear collection product).
Qc: APPEND(Z)
Cuboid Qc (1*279*279 cells): < X, Y, Z > (sliced), X at “page category” level; Y, Z at “page” level
< X, Y, Z >                                          # User sessions
…                                                    …
< Product Catalog, PID - 34839, PID - 34839 >        17
< Product Catalog, PID - 34839, PID - 34897 >        14
…                                                    …
The runtime of the Inverted-Index approach (II) is higher than that of the Counter-Based approach (CB) in Qa because we include the index precomputation time in Qa.
For the iterative queries, II has the advantage of processing only the sequences that contain the pattern < Product Catalog, Legwear Product >.
Experiments on synthetic data
Study the scalability of Counter-Based approach (CB) and Inverted-Index approach (II) under a series of APPEND operations
QA1: SUBSTRING(X,Y)
SLICE + APPEND → QA2: (X,Y,Z)
SLICE + APPEND → QA3: (X,Y,Z,A)
SLICE + APPEND → QA4: (X,Y,Z,A,B)
SLICE + APPEND → QA5: (X,Y,Z,A,B,C)
Both CB and II scale linearly w.r.t. the number of sequences. II outperformed CB on all datasets in this experiment.
CB scans the entire dataset once for each iterative query. For QA1, II does not need to scan any data sequences because the query can be answered directly from the inverted indices.
Plots: cumulative runtime; cumulative number of sequences scanned.
II precomputation time: less than 4 seconds in all cases.
Experiments on synthetic data
Vary: average sequence length (L); data distribution (skew factor); domain of the events (I).
P-ROLL-UP operation; P-DRILL-DOWN operation; <X,Y,Y,X> pattern templates; substring / subsequence pattern templates (see technical report).
Conclusion
We propose a new online analytical processing system for sequence data analysis (the S-OLAP system).
The proposed system is motivated by real-life problems: page click analysis, RFID log analysis, etc.
We defined the basic concepts: S-Cuboid, S-Cube.
We identified two properties of the S-Cube: an infinite number of S-Cuboids; non-summarizable.
We illustrated the usability of the proposed S-OLAP system through a prototype system that works on real data.
The End
Thank you!
Synthetic dataset generator
Synthetic sequence databases are synthesized in the following manner:
The generated sequence database has D sequences. Each sequence s in a dataset is generated independently.
The sequence length l, with mean L, is first determined by a random variable following a Poisson distribution.
Then, we repeatedly add events to the sequence until the target length l is reached.
The first event symbol is randomly selected according to a pre-determined distribution following Zipf’s law with parameters I and Θ, where I is the number of possible symbols and Θ is the skew factor.
Subsequent events are generated one after another using a Markov chain of degree 1. The conditional probabilities are pre-determined and are skewed according to Zipf’s law.
All the generated sequences form a single sequence group, which serves as the input data to the algorithms.
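The generation procedure above can be sketched in Python with only the standard library. Several details are assumptions made for this sketch and are not stated on the slide: Zipf weights are taken as 1 / rank^Θ over the I symbols, each symbol's transition row is a Zipf distribution over a randomly shuffled ranking, a drawn length of 0 is raised to 1, and the function names are hypothetical.

```python
import math
import random

def zipf_weights(n, theta):
    """Unnormalised Zipf weights over n ranks (theta = 0 gives a uniform distribution)."""
    return [1.0 / (rank ** theta) for rank in range(1, n + 1)]

def poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def generate(D, L, I, theta, seed=0):
    rng = random.Random(seed)
    symbols = list(range(I))
    first = zipf_weights(I, theta)
    # Pre-determined, Zipf-skewed conditional probabilities of the degree-1 Markov chain.
    transitions = {}
    for s in symbols:
        order = symbols[:]
        rng.shuffle(order)  # assumption: each symbol prefers a different successor ranking
        transitions[s] = (order, zipf_weights(I, theta))
    db = []
    for _ in range(D):
        length = max(1, poisson(rng, L))  # assumption: no empty sequences
        seq = [rng.choices(symbols, weights=first)[0]]
        while len(seq) < length:
            order, weights = transitions[seq[-1]]
            seq.append(rng.choices(order, weights=weights)[0])
        db.append(seq)
    return db

db = generate(D=100, L=5, I=10, theta=0.9, seed=1)
```

Fixing the seed makes a generated database reproducible, which helps when comparing the two approaches on the same input.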
Related Work
Sequence databases: PREDATOR (Seshadri, Livny, and Ramakrishnan; SIGMOD 94, VLDB 96); DEVise (Ramakrishnan et al.; SSDBM 98); TS-SQL (Sadri et al.; PODS 01)
OLAP: data-cube operator (Gray et al.; 95), iceberg-cube, star-schema, etc.
OLAP on unconventional data: RFID-cube (Gonzalez, Han, and Li; VLDB 06); Stream-cube (Chen et al.; VLDB 02); XML-cube (Wiwatwattana et al.; ICDE 07)