Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.
-
Upload
abigayle-pearson -
Category
Documents
-
view
216 -
download
0
Transcript of Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.
![Page 1: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/1.jpg)
Data Warehousing OverviewCS245 Notes 11
Hector Garcia-Molina
Stanford University
CS 2451
Notes11
![Page 2: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/2.jpg)
CS 245 Notes112
Outline
What is a data warehouse? Why a warehouse? Models & operations Implementing a warehouse
![Page 3: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/3.jpg)
CS 245 Notes113
What is a Warehouse?
Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history)
integrated time-varying non-volatile
more
![Page 4: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/4.jpg)
CS 245 Notes114
What is a Warehouse?
Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse
![Page 5: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/5.jpg)
CS 245 Notes115
Warehouse Architecture
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
![Page 6: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/6.jpg)
CS 245 Notes116
Motivating Examples
Forecasting Comparing performance of units Monitoring, detecting fraud Visualization
![Page 7: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/7.jpg)
CS 245 Notes117
Alternative to Warehousing
Two Approaches: Query-Driven (Lazy) Warehouse (Eager)
Source Source
?
![Page 8: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/8.jpg)
CS 245 Notes118
Query-Driven Approach
Client Client
Wrapper Wrapper Wrapper
Mediator
Source Source Source
![Page 9: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/9.jpg)
CS 245 Notes119
Advantages of Warehousing
High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse
Modify, summarize (store aggregates) Add historical information
![Page 10: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/10.jpg)
CS 245 Notes1110
Advantages of Query-Driven
No need to copy data less storage no need to purchase data
More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources
![Page 11: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/11.jpg)
CS 245 Notes1111
Warehouse Models & Operators
Data Models relational cubes
Operators
![Page 12: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/12.jpg)
CS 245 Notes1112
Star
customer custId name address city53 joe 10 main sfo81 fred 12 main sfo
111 sally 80 willow la
product prodId name pricep1 bolt 10p2 nut 5
store storeId cityc1 nycc2 sfoc3 la
sale oderId date custId prodId storeId qty amto100 1/7/97 53 p1 c1 1 12o102 2/7/97 53 p2 c1 2 11105 3/8/97 111 p1 c3 5 50
![Page 13: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/13.jpg)
CS 245 Notes1113
Star Schema
saleorderId
datecustIdprodIdstoreId
qtyamt
customercustIdname
addresscity
productprodIdnameprice
storestoreId
city
![Page 14: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/14.jpg)
CS 245 Notes1114
Terms
Fact table Dimension tables Measures
saleorderId
datecustIdprodIdstoreId
qtyamt
customercustIdname
addresscity
productprodIdnameprice
storestoreId
city
![Page 15: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/15.jpg)
CS 245 Notes1115
Dimension Hierarchies
store storeId cityId tId mgrs5 sfo t1 joes7 sfo t2 freds9 la t1 nancy
city cityId pop regIdsfo 1M northla 5M south
region regId namenorth cold regionsouth warm region
sType tId size locationt1 small downtownt2 large suburbs
storesType
city region
snowflake schema constellations
![Page 16: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/16.jpg)
CS 245 Notes1116
Cube
sale prodId storeId amtp1 c1 12p2 c1 11p1 c3 50p2 c2 8
c1 c2 c3p1 12 50p2 11 8
Fact table view: Multi-dimensional cube:
dimensions = 2
![Page 17: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/17.jpg)
CS 245 Notes1117
3-D Cube
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
dimensions = 3
Multi-dimensional cube:Fact table view:
![Page 18: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/18.jpg)
Operators
Traditional selection aggregation ...
Analysis clean data find trends ...
CS 245 Notes1118
Relational
Cube
![Page 19: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/19.jpg)
CS 245 Notes1119
Aggregates
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
• Add up amounts for day 1• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
81
![Page 20: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/20.jpg)
CS 245 Notes1120
Aggregates
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
• Add up amounts by day• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans date sum1 812 48
![Page 21: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/21.jpg)
CS 245 Notes1121
Another Example
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
• Add up amounts by day, product• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId
sale prodId date amtp1 1 62p2 1 19p1 2 48
drill-down
rollup
![Page 22: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/22.jpg)
CS 245 Notes1122
Aggregates
Operators: sum, count, max, min, median, ave
“Having” clause Using dimension hierarchy
average by region (within store) maximum by month (within date)
![Page 23: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/23.jpg)
CS 245 Notes1123
Cube Aggregation
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
c1 c2 c3p1 56 4 50p2 11 8
c1 c2 c3sum 67 12 50
sump1 110p2 19
129
. . .
drill-down
rollup
Example: computing sums
![Page 24: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/24.jpg)
CS 245 Notes1124
Cube Operators
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
c1 c2 c3p1 56 4 50p2 11 8
c1 c2 c3sum 67 12 50
sump1 110p2 19
129
. . .
sale(c1,*,*)
sale(*,*,*)sale(c2,p2,*)
![Page 25: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/25.jpg)
CS 245 Notes1125
c1 c2 c3 *p1 56 4 50 110p2 11 8 19* 67 12 50 129
Extended Cube
day 2 c1 c2 c3 *p1 44 4 48p2* 44 4 48
c1 c2 c3 *p1 12 50 62p2 11 8 19* 23 8 50 81
day 1
*
sale(*,p2,*)
![Page 26: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/26.jpg)
CS 245 Notes1126
Aggregation Using Hierarchies
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
region A region Bp1 56 54p2 11 8
customer
region
country
(customer c1 in Region A;customers c2, c3 in Region B)
![Page 27: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/27.jpg)
CS 245 Notes1127
Data Analysis
Decision Trees Clustering Association Rules
![Page 28: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/28.jpg)
CS 245 Notes1128
Decision Trees
sale custId car age city newCarc1 taurus 27 sf yesc2 van 35 la yesc3 van 40 sf yesc4 taurus 22 sf yesc5 merc 50 la noc6 taurus 25 la no
Example:• Conducted survey to see what customers were interested in new model car• Want to select customers for advertising campaign
trainingset
![Page 29: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/29.jpg)
CS 245 Notes1129
One Possibility
sale custId car age city newCarc1 taurus 27 sf yesc2 van 35 la yesc3 van 40 sf yesc4 taurus 22 sf yesc5 merc 50 la noc6 taurus 25 la no
age<30
city=sf car=van
likely likelyunlikely unlikely
YY
Y
NN
N
![Page 30: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/30.jpg)
CS 245 Notes1130
Another Possibility
sale custId car age city newCarc1 taurus 27 sf yesc2 van 35 la yesc3 van 40 sf yesc4 taurus 22 sf yesc5 merc 50 la noc6 taurus 25 la no
car=taurus
city=sf age<45
likely likelyunlikely unlikely
YY
Y
NN
N
![Page 31: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/31.jpg)
CS 245 Notes1131
Issues
Decision tree cannot be “too deep” would not have statistically significant amounts of
data for lower decisions
Need to select tree that most reliably predicts outcomes
![Page 32: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/32.jpg)
CS 245 Notes1132
Clustering
age
income
education
![Page 33: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/33.jpg)
CS 245 Notes1133
Clustering
age
income
education
![Page 34: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/34.jpg)
CS 245 Notes1134
Another Example: Text
Each document is a vector e.g., <100110...> contains words 1,4,5,...
Clusters contain “similar” documents Useful for understanding, searching
documents
internationalnews
sports
business
![Page 35: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/35.jpg)
CS 245 Notes1135
Issues
Given desired number of clusters? Finding “best” clusters Are clusters semantically meaningful?
e.g., “yuppies’’ cluster? Using clusters for disk storage
![Page 36: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/36.jpg)
CS 245 Notes1136
Association Rule Mining
tran1 cust33 p2, p5, p8tran2 cust45 p5, p8, p11tran3 cust12 p1, p9tran4 cust40 p5, p8, p11tran5 cust12 p2, p9tran6 cust12 p9
transactio
n
id custo
mer
id products
bought
salesrecords:
• Trend: Products p5, p8 often bough together• Trend: Customer 12 likes product p9
market-basketdata
![Page 37: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/37.jpg)
CS 245 Notes1137
Association Rule
Rule: {p1, p3, p8} Support: number of baskets where these
products appear High-support set: support threshold s Problem: find all high support sets
![Page 38: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/38.jpg)
Implementation Issues
ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution
What to materialize? Efficient Analysis
Association rule mining ...
CS 245 Notes1138
![Page 39: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/39.jpg)
CS 245 Notes1139
ETL: Monitoring Techniques
Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring
A
dvan
tage
s &
Dis
adva
ntag
es!!
![Page 40: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/40.jpg)
CS 245 Notes1140
ETL: Data Cleaning
Migration (e.g., yen dollars) Scrubbing: use domain-specific knowledge (e.g.,
social security numbers) Fusion (e.g., mail list, customer merging)
Auditing: discover rules & relationships(like data mining)
billing DB
service DB
customer1(Joe)
customer2(Joe)
merged_customer(Joe)
![Page 41: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/41.jpg)
41
More details: Entity Resolution
N: a A: b CC#: c Ph: e
e1
N: a Exp: d Ph: e
e2
![Page 42: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/42.jpg)
42
Applications
comparison shopping mailing lists classified ads customer files counter-terrorism
N: a A: b CC#: c Ph: e
e1
N: a Exp: d Ph: e
e2
![Page 43: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/43.jpg)
Why is ER Challenging?
Huge data sets No unique identifiers Lots of uncertainty Many ways to skin the cat
43
![Page 44: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/44.jpg)
Taxonomy: Pairwise vs Global
Decide if r, s match only by looking at r, s? Or need to consider more (all) records?
44
Nm: Pat SmithAd: 123 Main StPh: (650) 555-1212
Nm: Patrick SmithAd: 132 Main StPh: (650) 555-1212
Nm: Patricia SmithAd: 123 Main StPh: (650) 777-1111
or
![Page 45: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/45.jpg)
Taxonomy: Pairwise vs Global
Global matching complicates things a lot! e.g., change decision as new records arrive
45
Nm: Pat SmithAd: 123 Main StPh: (650) 555-1212
Nm: Patrick SmithAd: 132 Main StPh: (650) 555-1212
Nm: Patricia SmithAd: 123 Main StPh: (650) 777-1111
or
![Page 46: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/46.jpg)
Taxonomy: Outcome
Partition of records e.g., comparison shopping
Merged records
46
Nm: Pat SmithAd: 123 Main StPh: (650) 555-1212
Nm: Patricia SmithAd: 123 Main StPh: (650) 555-1212 (650) 777-1111Hair: Black
Nm: Patricia SmithAd: 132 Main StPh: (650) 777-1111Hair: Black
![Page 47: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/47.jpg)
47
Taxonomy: Outcome Iterate after merging
Nm: TomWk: IBMOc: laywerSal: 500K
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM
Nm: ThomasAd: 123 MaimOc: lawyer
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K
![Page 48: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/48.jpg)
Taxonomy: Record Reuse One record related to multiple entities?
48
Nm: Pat Smith Sr.Ph: (650) 555-1212
Ph: (650) 555-1212Ad: 123 Main St
Nm: Pat Smith Jr.Ph: (650) 555-1212
Nm: Pat Smith Sr.Ph: (650) 555-1212Ad: 123 Main St
Nm: Pat Smith Jr.Ph: (650) 555-1212Ad: 123 Main St
![Page 49: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/49.jpg)
Taxonomy: Record Reuse
Partitions
49
• Merges
r s t r
s
t
rs
st
![Page 50: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/50.jpg)
Taxonomy: Record Reuse
Partitions
50
• Merges
r s t r
s
t
rs
st
• Record reuse complex and expensive!
![Page 51: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/51.jpg)
51
Taxonomy: Multiple Entity Types
person 1 person 2
Organization B
Organization Abrother
member
business
member
![Page 52: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/52.jpg)
52
Taxonomy: Multiple Entity Types
p1
p2
p5
p7
a1
a2
a3
a5
a4
authors papers
same??
![Page 53: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/53.jpg)
53
Taxonomy: Exact vs Approximate
products
camerasresolvedcameras
CDs
books
...
resolvedCDs
resolvedbooks
...
ER
ER
ER
![Page 54: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/54.jpg)
54
Taxonomy: Exact vs Approximate
terrorists terroristssortby age
B Cooper 30match against
ages 25-35
![Page 55: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/55.jpg)
Implementation Issues
ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution
What to materialize? Efficient Analysis
Association rule mining ...
CS 245 Notes1155
![Page 56: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/56.jpg)
CS 245 Notes1156
What to Materialize?
Store in warehouse results useful for common queries
Example:day 2
c1 c2 c3p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
c1 c2 c3p1 56 4 50p2 11 8
c1 c2 c3p1 67 12 50
c1p1 110p2 19
129
. . .
total sales
materialize
![Page 57: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/57.jpg)
CS 245 Notes1157
Materialization Factors
Type/frequency of queries Query response time Storage cost Update cost
![Page 58: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/58.jpg)
CS 245 Notes1158
Cube Aggregates Lattice
city, product, date
city, product city, date product, date
city product date
all
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
c1 c2 c3p1 56 4 50p2 11 8
c1 c2 c3p1 67 12 50
129
use greedyalgorithm todecide whatto materialize
![Page 59: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/59.jpg)
CS 245 Notes1159
Dimension Hierarchies
all
state
city
cities city statec1 CAc2 NY
![Page 60: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/60.jpg)
CS 245 Notes1160
Dimension Hierarchies
city, product
city, product, date
city, date product, date
city product date
all
state, product, date
state, date
state, product
state
not all arcs shown...
![Page 61: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/61.jpg)
CS 245 Notes1161
Interesting Hierarchy
all
years
quarters
months
days
weeks
time day week month quarter year1 1 1 1 20002 1 1 1 20003 1 1 1 20004 1 1 1 20005 1 1 1 20006 1 1 1 20007 1 1 1 20008 2 1 1 2000
conceptualdimension table
![Page 62: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/62.jpg)
Implementation Issues
ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution
What to materialize? Efficient Analysis
Association rule mining ...
CS 245 Notes1162
![Page 63: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/63.jpg)
CS 245 Notes1163
Finding High-Support Pairs
Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets JWHERE I.basket = J.basket AND I.item < J.itemGROUP BY I.item, J.itemHAVING COUNT(I.basket) >= s;
![Page 64: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/64.jpg)
CS 245 Notes1164
Finding High-Support Pairs
Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets JWHERE I.basket = J.basket AND I.item < J.itemGROUP BY I.item, J.itemHAVING COUNT(I.basket) >= s;
WHY?
![Page 65: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/65.jpg)
CS 245 Notes1165
Example
basket itemt1 p2t1 p5t1 p8t2 p5t2 p8t2 p11... ...
basket item1 item2t1 p2 p5t1 p2 p8t1 p5 p8t2 p5 p8t2 p5 p11t2 p8 p11... ... ...
![Page 66: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/66.jpg)
CS 245 Notes1166
Example
basket itemt1 p2t1 p5t1 p8t2 p5t2 p8t2 p11... ...
basket item1 item2t1 p2 p5t1 p2 p8t1 p5 p8t2 p5 p8t2 p5 p11t2 p8 p11... ... ...
check ifcount s
![Page 67: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/67.jpg)
CS 245 Notes1167
Issues
Performance for size 2 rulesbasket item
t1 p2t1 p5t1 p8t2 p5t2 p8t2 p11... ...
basket item1 item2t1 p2 p5t1 p2 p8t1 p5 p8t2 p5 p8t2 p5 p11t2 p8 p11... ... ...
bigevenbigger!
Performance for size k rules
![Page 68: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/68.jpg)
CS 245 Notes1168
Association Rules
How do we perform rule mining efficiently?
![Page 69: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/69.jpg)
CS 245 Notes1169
Association Rules
How do we perform rule mining efficiently? Observation: If set X has support t, then
each X subset must have at least support t
![Page 70: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/70.jpg)
CS 245 Notes1170
Association Rules
How do we perform rule mining efficiently? Observation: If set X has support t, then
each X subset must have at least support t For 2-sets:
if we need support s for {i, j} then each i, j must appear in at least s
baskets
![Page 71: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/71.jpg)
CS 245 Notes1171
Algorithm for 2-Sets
(1) Find OK products those appearing in s or more baskets
(2) Find high-support pairs using only OK products
![Page 72: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/72.jpg)
CS 245 Notes1172
Algorithm for 2-Sets
INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s;
![Page 73: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/73.jpg)
CS 245 Notes1173
Algorithm for 2-Sets
INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s;
Perform mining on okBaskets SELECT I.item, J.item, COUNT(I.basket) FROM okBaskets I, okBaskets J WHERE I.basket = J.basket AND I.item < J.item GROUP BY I.item, J.item HAVING COUNT(I.basket) >= s;
![Page 74: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/74.jpg)
CS 245 Notes1174
Counting Efficiently
One way:
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
threshold = 3
![Page 75: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/75.jpg)
CS 245 Notes1175
Counting Efficiently
One way:
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
sort
basket I.item J.itemt3 p2 p3t3 p2 p8t1 p5 p8t2 p5 p8t3 p5 p8t2 p8 p11... ... ...
threshold = 3
![Page 76: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/76.jpg)
CS 245 Notes1176
Counting Efficiently
One way:
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
sort
basket I.item J.itemt3 p2 p3t3 p2 p8t1 p5 p8t2 p5 p8t3 p5 p8t2 p8 p11... ... ...
count &remove count I.item J.item
3 p5 p85 p12 p18... ... ...
threshold = 3
![Page 77: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/77.jpg)
CS 245 Notes1177
Counting Efficiently
Another way:
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
threshold = 3
![Page 78: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/78.jpg)
CS 245 Notes1178
Counting Efficiently
Another way:
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
scan &count
count I.item J.item1 p2 p32 p2 p83 p5 p85 p12 p181 p21 p222 p21 p23... ... ...
keep counterarray in memory
threshold = 3
![Page 79: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/79.jpg)
CS 245 Notes1179
Counting Efficiently
Another way:
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
remove count I.item J.item3 p5 p85 p12 p18... ... ...
scan &count
count I.item J.item1 p2 p32 p2 p83 p5 p85 p12 p181 p21 p222 p21 p23... ... ...
keep counterarray in memory
threshold = 3
![Page 80: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/80.jpg)
CS 245 Notes1180
Counting Efficiently
Another way:
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
remove count I.item J.item3 p5 p85 p12 p18... ... ...
scan &count
count I.item J.item1 p2 p32 p2 p83 p5 p85 p12 p181 p21 p222 p21 p23... ... ...
keep counterarray in memory
threshold = 3
![Page 81: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/81.jpg)
CS 245 Notes1181
Yet Another Way
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
(1)scan &hash &count
count bucket1 A5 B2 C1 D8 E1 F... ...
in-memoryhash table threshold = 3
![Page 82: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/82.jpg)
CS 245 Notes1182
Yet Another Way
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
(1)scan &hash &count
count bucket1 A5 B2 C1 D8 E1 F... ...
in-memoryhash table threshold = 3
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...
(2) scan &remove
![Page 83: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/83.jpg)
CS 245 Notes1183
Yet Another Way
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
(1)scan &hash &count
count bucket1 A5 B2 C1 D8 E1 F... ...
in-memoryhash table threshold = 3
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...
(2) scan &remove
false positive
![Page 84: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/84.jpg)
CS 245 Notes1184
Yet Another Way
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
(1)scan &hash &count
count bucket1 A5 B2 C1 D8 E1 F... ...
in-memoryhash table threshold = 3
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...
(2) scan &remove
count I.item J.item3 p5 p81 p8 p115 p12 p18... ... ...
(3) scan& count
in-memorycounters
false positive
![Page 85: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/85.jpg)
CS 245 Notes1185
Yet Another Way
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...
(1)scan &hash &count
count bucket1 A5 B2 C1 D8 E1 F... ...
in-memoryhash table threshold = 3
basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...
(2) scan &remove
count I.item J.item3 p5 p85 p12 p18... ... ...
(4) removecount I.item J.item
3 p5 p81 p8 p115 p12 p18... ... ...
(3) scan& count
in-memorycounters
false positive
![Page 86: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/86.jpg)
CS 245 Notes1186
Discussion
Hashing scheme: 2 (or 3) scans of data Sorting scheme: requires a sort! Hashing works well if few high-support pairs
and many low-support ones
![Page 87: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/87.jpg)
CS 245 Notes1187
Discussion
Hashing scheme: 2 (or 3) scans of data Sorting scheme: requires a sort! Hashing works well if few high-support pairs
and many low-support ones
item-pairs ranked by frequency
fre
que
ncy
threshold
iceberg queries
![Page 88: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/88.jpg)
Implementation Issues
ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution
What to materialize? Efficient Analysis
Association rule mining ...
CS 245 Notes1188
![Page 89: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/89.jpg)
Extra: Data Mining in the InfoLab
CS 245 Notes1189
quarters
Recommendations in CourseRank
user q1 q2 q3 q4u1 a: 5 b: 5 d: 5 u2 a: 1 e: 2 d: 4 f: 3u3 g: 4 h: 2 e: 3 f: 3u4 b: 2 g: 4 h: 4 e: 4u a: 5 g: 4 e: 4
![Page 90: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/90.jpg)
Extra: Data Mining in the InfoLab
CS 245 Notes1190
quarters
Recommendations in CourseRank
user q1 q2 q3 q4u1 a: 5 b: 5 d: 5 u2 a: 1 e: 2 d: 4 f: 3u3 g: 4 h: 2 e: 3 f: 3u4 b: 2 g: 4 h: 4 e: 4u a: 5 g: 4 e: 4
u3 and u4 are similar to u Recommend h
![Page 91: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/91.jpg)
Extra: Data Mining in the InfoLab
CS 245 Notes1191
quarters
Recommendations in CourseRank
user q1 q2 q3 q4u1 a: 5 b: 5 d: 5 u2 a: 1 e: 2 d: 4 f: 3u3 g: 4 h: 2 e: 3 f: 3u4 b: 2 g: 4 h: 4 e: 4u a: 5 g: 4 e: 4
Recommend d (and f, h)
![Page 92: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/92.jpg)
Sequence Mining
Given a set of transcripts, use Pr[x|a]to predict if x is a good recommendationgiven user has taken a.
Two issues...
CS 245 Notes1192
![Page 93: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/93.jpg)
Pr[x|a] Not Quite Right
CS 245 Notes1193
transcript containing1 -2 a3 x4 a -> x5 x -> a
Pr[x|a] = 2/3
Pr[x|a~x] = 1/2
target user’s transcript:[ ... a .... || unknown ]
recommend x?
![Page 94: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/94.jpg)
User Has Taken >= 1 Course
User has taken T= {a, b, c} Need Pr[x|T~x] Approximate as Pr[x|a~x b~x c~x ] Expensive to compute, so...
CS 245 Notes1194
![Page 95: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649d6e5503460f94a4f7b8/html5/thumbnails/95.jpg)
CourseRank User Study
CS 245 Notes1195
good, expected
good, unexpected
per
cent
ag
e o
f ra
ting
s