Rich Relational Data from Thin-Air
How to Fake It – PyData London, 6 Dec 2016
John Stinson
Why Simulate Data
- Early System Testing
- Cloud Migration
- Training Course
Device Type over Time
Source: https://www.nakono.com/tekcarta/company-profiles/microsoft-xbox-one/microsoft-xbox-360-and-xbox-one-developer-programme-market-adoption-and-analysis
Usage by Time of Day
Source: http://downloads.bbc.co.uk/mediacentre/iplayer/iplayer-performance-jan16.pdf
Tables
- users
- page
- device
- page_hit
Star Schema
Fact Table: page_hit
Dimension Tables: page, device, users
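One way to make the star schema concrete is to create it in SQLite. The sketch below is only an illustration: the column names follow the table slides in this talk, while the column types, the foreign-key clauses, and the helper name `create_star_schema` are assumptions that the slides do not specify.

```python
import sqlite3

# Assumed DDL: names follow the slides, types and constraints are guesses.
SCHEMA = """
CREATE TABLE device (
    device_id   INTEGER PRIMARY KEY,
    device_name TEXT NOT NULL
);
CREATE TABLE page (
    page_id  INTEGER PRIMARY KEY,
    url      TEXT NOT NULL,
    platform TEXT
);
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    first_name    TEXT,
    last_name     TEXT,
    date_of_birth TEXT,
    region        TEXT,
    gender        TEXT,
    email_address TEXT
);
-- Fact table: one row per page hit, pointing at the three dimensions.
CREATE TABLE page_hit (
    page_id              INTEGER REFERENCES page(page_id),
    user_id              INTEGER REFERENCES users(user_id),
    device_id            INTEGER REFERENCES device(device_id),
    start_access_time    TEXT,
    access_duration_mins REAL
);
"""


def create_star_schema(conn):
    """Create the three dimension tables and the fact table."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_star_schema(conn)

# Load the small device dimension shown in the talk.
conn.executemany(
    "INSERT INTO device VALUES (?, ?)",
    [(1, "Tablet"), (2, "Mobile Device"), (3, "Computer")],
)

table_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(table_names)
```

Keeping the dimensions small and the fact table narrow means the generated page hits only need to carry three integer keys plus the two measures.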
Device Table
device
device_id (PK)   device_name
1                Tablet
2                Mobile Device
3                Computer

devices = [
    {'device_id': 1, 'device_name': 'Tablet'},
    {'device_id': 2, 'device_name': 'Mobile Device'},
    {'device_id': 3, 'device_name': 'Computer'},
]
Page Table
page
page_id (PK): 214
url: http://.../radio, http://.../iplayer, …
platform: radio, iplayer
Chrome Browser History
C:\Users\<user>\AppData\Local\Google\Chrome\User Data\Default\History (Windows)
~/Library/Application Support/Google/Chrome/Default/History (Mac)
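Regular Expression
import re
# Assumed reading of the slide's pattern; its exact punctuation is uncertain.
url_pattern = r'https?://\S*?bbc\.co\.uk[a-zA-Z0-9\-/ _=]+'
url_list = re.findall(url_pattern, file_contents)
Example URLs:
http://bbc.co.uk
http://www.bbc.co.uk/iplayer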
Closer Look at Browser History
Read in SQLite dump as dictionary
import csv

reader = csv.DictReader(my_file, delimiter='\t')
print(reader.fieldnames)
# Keep only the BBC rows, as (url, title, visit_count) tuples
urls = [(row['url'], row['title'], row['visit_count'])
        for row in reader if 'bbc.co.uk' in row['url']]
for url, title, visitcount in urls:
    print(url, '|', title, '|', visitcount)

Output:
['id', 'url', 'title', 'visit_count', 'typed_count', 'last_visit_time', 'hidden', 'favicon_id']
http://www.bbc.co.uk/iplayer | BBC iPlayer | 53
http://www.bbc.co.uk/iplayer/guide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3
http://www.bbc.co.uk/iplayer/live/bbcone | BBC iPlayer - Watch BBC One live | 5
http://www.bbc.co.uk/iplayer/a-z | BBC iPlayer - A to Z - A | 4
http://www.bbc.co.uk/iplayer/a-z/a | BBC iPlayer - A to Z - A | 4
http://www.bbc.co.uk/iplayer/a-z/t | BBC iPlayer - A to Z - T | 1
http://www.bbc.co.uk/iplayer/a-z/i | BBC iPlayer - A to Z - I | 1
http://www.bbc.co.uk/iplayer/a-z/q | BBC iPlayer - A to Z - Q | 1
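Because the History file is itself a SQLite database, the same rows could probably be read without an intermediate dump. The sketch below assumes the database exposes a `urls` table with the columns printed above; the function name `bbc_urls` and the in-memory stand-in rows are hypothetical, used here so the example runs without touching a real browser profile. Against a real profile one would open a copy of the History file with `sqlite3.connect(path)`, since the browser may hold a lock on the original.

```python
import sqlite3


def bbc_urls(conn, domain="bbc.co.uk"):
    """Return (url, title, visit_count) rows whose URL contains the domain."""
    query = (
        "SELECT url, title, visit_count FROM urls "
        "WHERE url LIKE ? ORDER BY visit_count DESC"
    )
    return conn.execute(query, ("%" + domain + "%",)).fetchall()


# In-memory stand-in for the History file (hypothetical rows).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, title TEXT, visit_count INTEGER)"
)
demo.executemany(
    "INSERT INTO urls (url, title, visit_count) VALUES (?, ?, ?)",
    [
        ("http://www.bbc.co.uk/iplayer", "BBC iPlayer", 53),
        ("http://www.bbc.co.uk/iplayer/guide", "BBC iPlayer - TV Guide", 3),
        ("http://example.com/", "Example", 7),
    ],
)

rows = bbc_urls(demo)
for url, title, visit_count in rows:
    print(url, "|", title, "|", visit_count)
```

The visit counts are what make this useful for simulation: they give a ready-made popularity weighting for the page dimension.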
Users table
users
user_id (PK)     12346
first_name       Regen
last_name        Schattschneid
date_of_birth    1963-11-22
region           South East England
gender           m
email_address    regenschattschneid@coelurosauria.com
Surname
import random
last_name = random.choice(all_last_names_list)
Distribute Users among Regions
region_distribution = [
    ('NORTH EAST ENGLAND', 0.14),
    ('THE MIDLANDS', 0.13),
    ('NORTH WEST ENGLAND', 0.12),
    ('EAST ANGLIA', 0.11),
    ('SOUTH CENTRAL ENGLAND', 0.1),
    ('SOUTH EAST ENGLAND', 0.09),
    ('SCOTLAND', 0.08),
    ('LONDON', 0.08),
    ('SOUTH WEST ENGLAND', 0.08),
    ('WALES', 0.04),
    ('NORTHERN IRELAND', 0.03),
]
my_region = get_random_key(region_distribution)
Find Random Key given List of Key Probabilities
def get_random_key(list_of_distribution_tuples):
    rnd = random.random()
    cumulative_probability = 0.0
    # Walk the cumulative distribution until it passes the random draw
    for (key, probability) in list_of_distribution_tuples:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key
my_region = get_random_key(region_distribution)
gender_distribution = [('f', 0.51), ('m', 0.49)]
my_gender = get_random_key(gender_distribution)
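The cumulative-probability loop can also be expressed with the standard library's `random.choices` (Python 3.6 and later). This is an alternative sketch rather than the talk's own code; the name `get_random_key_weighted` and the seeded `random.Random` instance are assumptions added for illustration.

```python
import random
from collections import Counter

gender_distribution = [("f", 0.51), ("m", 0.49)]


def get_random_key_weighted(list_of_distribution_tuples, rng=random):
    """Pick one key with probability proportional to its weight."""
    keys = [key for key, _ in list_of_distribution_tuples]
    weights = [weight for _, weight in list_of_distribution_tuples]
    return rng.choices(keys, weights=weights, k=1)[0]


rng = random.Random(2016)  # fixed seed so the run is repeatable
counts = Counter(
    get_random_key_weighted(gender_distribution, rng) for _ in range(10000)
)
print(counts)
```

One practical difference: `random.choices` treats the numbers as relative weights, so they do not have to sum to exactly 1, whereas the hand-written loop can fall through and return `None` if floating-point rounding leaves the cumulative total slightly below the random draw.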
Generate a User
def get_random_user():
    gender = get_random_gender()
    first_name = get_random_first_name(gender)
    last_name = random.choice(all_last_names_list)
    email = get_random_email(first_name, last_name)
    region = get_random_region()
    return {'gender': gender,
            'first_name': first_name,
            'last_name': last_name,
            'email': email,
            'region': region}
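The helper functions called by `get_random_user` are not shown in the transcript, so the following is a self-contained sketch of how they might fit together. The name lists, email domains, `weighted_choice`, and the `first.last@domain` email format are all hypothetical stand-ins; only the overall shape (pick gender, then a matching first name, then surname and email) follows the slide.

```python
import random

# Hypothetical sample data; a real run would load much larger name lists.
FIRST_NAMES = {"f": ["Alice", "Priya", "Mei"], "m": ["Regen", "Tom", "Omar"]}
LAST_NAMES = ["Schattschneid", "Jones", "Patel"]
EMAIL_DOMAINS = ["example.com", "example.org"]
GENDERS = [("f", 0.51), ("m", 0.49)]


def weighted_choice(pairs, rng):
    """Pick one key from (key, weight) pairs."""
    keys, weights = zip(*pairs)
    return rng.choices(keys, weights=weights, k=1)[0]


def get_random_email(first_name, last_name, rng):
    """Assumed format: first.last@domain, all lower case."""
    domain = rng.choice(EMAIL_DOMAINS)
    return "{}.{}@{}".format(first_name.lower(), last_name.lower(), domain)


def get_random_user(user_id, rng):
    gender = weighted_choice(GENDERS, rng)
    first_name = rng.choice(FIRST_NAMES[gender])  # name list depends on gender
    last_name = rng.choice(LAST_NAMES)
    return {
        "user_id": user_id,
        "gender": gender,
        "first_name": first_name,
        "last_name": last_name,
        "email_address": get_random_email(first_name, last_name, rng),
    }


rng = random.Random(42)  # fixed seed so the run is repeatable
users = [get_random_user(user_id, rng) for user_id in range(1, 6)]
for user in users:
    print(user)
```

Passing the random generator in explicitly makes the output reproducible, which helps when the simulated data feeds system tests.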
Fact Table
page_hit
page_id (FK)            214
user_id (FK)            12346
device_id (FK)          1
start_access_time       2016-09-20 00:48:00
access_duration_mins    5791
PK
Hits per time of day
Normal Distribution
my_list = numpy.random.normal(mean, std_dev, samples)
numpy.random.normal(20, 0.5, 5000)
mean = 20 hours
std_dev = 0.5 hours
samples = 5000
see also random.normalvariate(mean, std_dev)
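As a rough illustration of the normal-distribution approach using only the standard library, the sketch below draws hit times around 20:00 with `random.normalvariate` and counts them per hour. Wrapping with `% 24` is an assumption added here so that a wide distribution would spill past midnight rather than produce invalid hours; the function name `sample_hit_hours` is hypothetical.

```python
import random
from collections import Counter


def sample_hit_hours(mean=20.0, std_dev=0.5, samples=5000, rng=random):
    """Draw hit times in hours and wrap them onto a 24-hour clock."""
    return [rng.normalvariate(mean, std_dev) % 24 for _ in range(samples)]


rng = random.Random(0)  # fixed seed so the run is repeatable
hours = sample_hit_hours(rng=rng)

# Bucket the fractional hours into whole hours of the day.
hits_per_hour = Counter(int(h) for h in hours)
busiest_hour = max(hits_per_hour, key=hits_per_hour.get)
print(sorted(hits_per_hour.items()))
```

With a standard deviation of half an hour nearly all hits land between 18:00 and 22:00, which is why a single normal curve gives a sharp evening peak but no daytime traffic.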
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]
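A list of 24 hourly weights like this can drive the `start_access_time` column directly. The following sketch is one possible way to do it, not the talk's implementation: it picks an hour in proportion to its weight and then a uniformly random second within that hour. The function name `random_start_access_time` and the example date are assumptions.

```python
import random
from datetime import datetime, timedelta

# Relative weights for hours 0-23 (an assumed reading of the slide's values).
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]


def random_start_access_time(day, rng):
    """Pick an hour by weight, then a uniform second within that hour."""
    hour = rng.choices(range(24), weights=hits_per_hour_peak, k=1)[0]
    return day + timedelta(hours=hour, seconds=rng.randrange(3600))


rng = random.Random(1)  # fixed seed so the run is repeatable
day = datetime(2016, 9, 20)
times = [random_start_access_time(day, rng) for _ in range(2000)]

# Count hits that start at 18:00 or later.
evening = sum(1 for t in times if t.hour >= 18)
print(times[0], evening)
```

Unlike a single normal curve, an explicit weight per hour can reproduce the small after-midnight tail and the daytime plateau as well as the evening peak.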
Page Hits per Hour 2013-2015
[Chart: page hits per hour of day (0-23) for 2013, 2014 and 2015; vertical axis 0 to 7000]
Device User Page
- Use get_random_key_according_to_distribution()
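Putting the pieces together, each fact row needs a device, a user, and a page chosen according to some distribution. The sketch below shows one way this might look. The per-year device shares, the page shares, page_id 215, user_id 12347, and the uniform choice of user are all hypothetical values made up for illustration; only the function name and the device ids (1 Tablet, 2 Mobile Device, 3 Computer) come from the slides.

```python
import random

# Hypothetical device shares per year, keyed by device_id.
DEVICE_SHARE_BY_YEAR = {
    2013: [(1, 0.15), (2, 0.25), (3, 0.60)],
    2014: [(1, 0.20), (2, 0.35), (3, 0.45)],
    2015: [(1, 0.22), (2, 0.45), (3, 0.33)],
}
# Hypothetical page popularity, keyed by page_id.
PAGE_SHARE = [(214, 0.7), (215, 0.3)]


def get_random_key_according_to_distribution(pairs, rng):
    """Pick one key from (key, weight) pairs."""
    keys, weights = zip(*pairs)
    return rng.choices(keys, weights=weights, k=1)[0]


def get_random_page_hit(year, user_ids, rng):
    """Build the foreign-key part of one page_hit row."""
    return {
        "page_id": get_random_key_according_to_distribution(PAGE_SHARE, rng),
        "user_id": rng.choice(user_ids),  # assumption: users are equally active
        "device_id": get_random_key_according_to_distribution(
            DEVICE_SHARE_BY_YEAR[year], rng
        ),
        "year": year,
    }


rng = random.Random(7)  # fixed seed so the run is repeatable
hits = [get_random_page_hit(2015, [12346, 12347], rng) for _ in range(1000)]
print(hits[0])
```

Making the device distribution depend on the year is what lets the simulated data show a trend, such as computers losing share to mobile devices over time.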
Device Hits per Year
[Chart: hits by device type (Tablet, Mobile Device, Computer) for 2013, 2014 and 2015; vertical axis 0 to 100]
- johnstinson99@gmail.com
Device Tabledevicedevice_id (PK) 1 2 3device_name Tablet Mobile Device Computer
devices = [device_id 1 device_name Tabletdevice_id 2 device_name Mobile Devicedevice_id 3 device_name Computer
]
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Page Table
pagepage_id (PK) 214url httpradio
httpiplayer hellipplatform radio iplayer
Chrome Browser History
CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)
Regular Expression
re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)
httpbbccoukhttpwwwbbccoukiplayer
Closer Look at Browser History
Read in SQLite dump as dictionary
reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)
urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)
[id url title visit_count typed_count last_visit_time hidden favicon_id]
httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
[Diagram: star schema with the fact table page_hit linked to the dimension tables page, device and users]
Fact Table
page_hit
    page_id (FK)            214
    user_id (FK)            12346
    device_id (FK)          1
    start_access_time       2016-09-20 00:48:00
    access_duration_mins    57.91
PK
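A fact row only has to reference keys that already exist in the dimension tables. The sketch below shows one way to build such rows; the key lists, the uniform choice of keys and times, and the 0.5 to 90 minute duration range are all simplifying assumptions for illustration.

```python
import random
from datetime import datetime, timedelta


def get_random_page_hit(page_ids, user_ids, device_ids, day, rng):
    """Build one page_hit row whose foreign keys exist in the dimensions."""
    start = day + timedelta(minutes=rng.randrange(24 * 60))
    return {
        "page_id": rng.choice(page_ids),
        "user_id": rng.choice(user_ids),
        "device_id": rng.choice(device_ids),
        "start_access_time": start.strftime("%Y-%m-%d %H:%M:%S"),
        "access_duration_mins": round(rng.uniform(0.5, 90.0), 2),
    }


rng = random.Random(7)
page_ids = [213, 214, 215]        # hypothetical keys from the page table
user_ids = [12345, 12346, 12347]  # hypothetical keys from the users table
device_ids = [1, 2, 3]            # Tablet, Mobile Device, Computer
day = datetime(2016, 9, 20)

page_hits = [
    get_random_page_hit(page_ids, user_ids, device_ids, day, rng)
    for _ in range(5)
]
for hit in page_hits:
    print(hit)
```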
Hits per time of day
Normal Distribution
my_list = numpy.random.normal(mean, std_dev, samples)
numpy.random.normal(20, 0.5, 5000)
    mean = 20 hours
    std_dev = 0.5 hours
    samples = 5000
see also random.normalvariate(mean, std_dev)
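The same evening peak can be produced with the standard library alone. This sketch uses random.normalvariate to place start times around 20:00; the function name get_peak_start_times, the wrap past midnight and the fixed date are assumptions for illustration.

```python
import random
from datetime import datetime, timedelta


def get_peak_start_times(day, mean_hour, std_dev_hours, samples, rng):
    """Draw start times clustered around mean_hour on the given day."""
    times = []
    for _ in range(samples):
        # Wrap values that fall outside 0-24 back into the same day.
        hour = rng.normalvariate(mean_hour, std_dev_hours) % 24
        times.append(day + timedelta(hours=hour))
    return times


rng = random.Random(20)
start_times = get_peak_start_times(datetime(2016, 9, 20), 20, 0.5, 5000, rng)

# The average start time should land close to the requested mean of 20:00.
observed_mean = sum(t.hour + t.minute / 60 for t in start_times) / len(start_times)
print(round(observed_mean, 1))
```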
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8, 4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]
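A list of 24 relative weights can be turned directly into hours of the day. The sketch below treats each entry as the weight for hours 0 to 23; the decimal points in the list are restored by assumption, and the helper get_random_hours is a name invented for the example.

```python
import random
from collections import Counter

# Relative hits per hour, index 0 = midnight (decimal points are an assumption).
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]


def get_random_hours(weights, samples, rng):
    """Draw hours of the day in proportion to the per-hour weights."""
    return rng.choices(range(len(weights)), weights=weights, k=samples)


rng = random.Random(3)
hours = get_random_hours(hits_per_hour_peak, 20_000, rng)
hits_by_hour = Counter(hours)
busiest_hour = max(hits_by_hour, key=hits_by_hour.get)
print(busiest_hour, hits_by_hour[busiest_hour])
```

Compared with a single normal curve, a hand-drawn weight list can follow a real usage chart more closely, including the long daytime plateau.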
Page Hits per Hour 2013-2015
[Chart: page hits per hour of day (0 to 23) for 2013, 2014 and 2015; y-axis 0 to 7000]
Device, User, Page
• Use get_random_key_according_to_distribution()
Device Hits per Year
[Chart: hits by device type (Tablet, Mobile Device, Computer) for 2013, 2014 and 2015; y-axis 0 to 100]
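A shift in device mix over the years can be simulated by keeping one distribution per year and picking from the one that matches the hit's date. In the sketch below the yearly shares are invented placeholders, not values read from the chart, and the signature given to get_random_key_according_to_distribution is an assumption.

```python
import random

# Hypothetical shares per year; real figures would come from a usage chart.
device_distribution_by_year = {
    2013: [("Tablet", 0.15), ("Mobile Device", 0.20), ("Computer", 0.65)],
    2014: [("Tablet", 0.25), ("Mobile Device", 0.25), ("Computer", 0.50)],
    2015: [("Tablet", 0.30), ("Mobile Device", 0.30), ("Computer", 0.40)],
}


def get_random_key_according_to_distribution(distribution, rng):
    """Walk the cumulative probabilities until the random draw is covered."""
    rnd = rng.random()
    cumulative = 0.0
    for key, probability in distribution:
        cumulative += probability
        if rnd <= cumulative:
            return key
    return distribution[-1][0]  # guard against floating-point rounding


rng = random.Random(5)
devices_2015 = [
    get_random_key_according_to_distribution(device_distribution_by_year[2015], rng)
    for _ in range(1000)
]
print({name: devices_2015.count(name) for name, _ in device_distribution_by_year[2015]})
```

The same pattern would apply to users and pages: one distribution per dimension, sampled once for each fact row.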
Star Schema
[Diagram: star schema with the fact table page_hit linked to the dimension tables page, device and users]
• johnstinson99@gmail.com
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Page Table
pagepage_id (PK) 214url httpradio
httpiplayer hellipplatform radio iplayer
Chrome Browser History
CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)
Regular Expression
re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)
httpbbccoukhttpwwwbbccoukiplayer
Closer Look at Browser History
Read in SQLite dump as dictionary
reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)
urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)
[id url title visit_count typed_count last_visit_time hidden favicon_id]
httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Fact Table
page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpynormal(mean std_dev samples)
numpynormal(20 05 5000)mean = 20 hours
std_dev = 05 hours
samples = 5000
see also randomnormalvariate(mean std_dev)
hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]
Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013 2014 2015
01000200030004000500060007000
Device User Page
bull Use
get_random_key_according_to_distribution()
Device Hits per Year
2013 2014 20150
102030405060708090
100
TabletMobile DeviceComputer
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
bull johnstinson99gmailcom
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpynormal(20 05 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32
-
Page Table
pagepage_id (PK) 214url httpradio
httpiplayer hellipplatform radio iplayer
Chrome Browser History
CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)
Regular Expression
re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)
httpbbccoukhttpwwwbbccoukiplayer
Closer Look at Browser History
Read in SQLite dump as dictionary
reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)
urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)
[id url title visit_count typed_count last_visit_time hidden favicon_id]
httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Fact Table
page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpynormal(mean std_dev samples)
numpynormal(20 05 5000)mean = 20 hours
std_dev = 05 hours
samples = 5000
see also randomnormalvariate(mean std_dev)
hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]
Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013 2014 2015
01000200030004000500060007000
Device User Page
bull Use
get_random_key_according_to_distribution()
Device Hits per Year
2013 2014 20150
102030405060708090
100
TabletMobile DeviceComputer
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
bull johnstinson99gmailcom
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpynormal(20 05 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32
-
Chrome Browser History
CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)
Regular Expression
re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)
httpbbccoukhttpwwwbbccoukiplayer
Closer Look at Browser History
Read in SQLite dump as dictionary
reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)
urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)
[id url title visit_count typed_count last_visit_time hidden favicon_id]
httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Fact Table
page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpynormal(mean std_dev samples)
numpynormal(20 05 5000)mean = 20 hours
std_dev = 05 hours
samples = 5000
see also randomnormalvariate(mean std_dev)
hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]
Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013 2014 2015
01000200030004000500060007000
Device User Page
bull Use
get_random_key_according_to_distribution()
Device Hits per Year
2013 2014 20150
102030405060708090
100
TabletMobile DeviceComputer
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
bull johnstinson99gmailcom
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpynormal(20 05 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32
-
Regular Expression
re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)
httpbbccoukhttpwwwbbccoukiplayer
Closer Look at Browser History
Read in SQLite dump as dictionary
reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)
urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)
[id url title visit_count typed_count last_visit_time hidden favicon_id]
httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Fact Table
page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpynormal(mean std_dev samples)
numpynormal(20 05 5000)mean = 20 hours
std_dev = 05 hours
samples = 5000
see also randomnormalvariate(mean std_dev)
hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]
Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013 2014 2015
01000200030004000500060007000
Device User Page
bull Use
get_random_key_according_to_distribution()
Device Hits per Year
2013 2014 20150
102030405060708090
100
TabletMobile DeviceComputer
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
bull johnstinson99gmailcom
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpynormal(20 05 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32
-
Closer Look at Browser History
Read in SQLite dump as dictionary
reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)
urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)
[id url title visit_count typed_count last_visit_time hidden favicon_id]
httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Fact Table
page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpynormal(mean std_dev samples)
numpynormal(20 05 5000)mean = 20 hours
std_dev = 05 hours
samples = 5000
see also randomnormalvariate(mean std_dev)
hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]
Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013 2014 2015
01000200030004000500060007000
Device User Page
bull Use
get_random_key_according_to_distribution()
Device Hits per Year
2013 2014 20150
102030405060708090
100
TabletMobile DeviceComputer
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
bull johnstinson99gmailcom
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpynormal(20 05 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32
-
Read in SQLite dump as dictionary
reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)
urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)
[id url title visit_count typed_count last_visit_time hidden favicon_id]
httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Fact Table
page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpynormal(mean std_dev samples)
numpynormal(20 05 5000)mean = 20 hours
std_dev = 05 hours
samples = 5000
see also randomnormalvariate(mean std_dev)
hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]
Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013 2014 2015
01000200030004000500060007000
Device User Page
bull Use
get_random_key_according_to_distribution()
Device Hits per Year
2013 2014 20150
102030405060708090
100
TabletMobile DeviceComputer
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
bull johnstinson99gmailcom
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpynormal(20 05 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32
-
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key
my_region = get_random_key(region_distribution)
gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)
Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
Fact Table
page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpynormal(mean std_dev samples)
numpynormal(20 05 5000)mean = 20 hours
std_dev = 05 hours
samples = 5000
see also randomnormalvariate(mean std_dev)
hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]
Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2013 2014 2015
01000200030004000500060007000
Device User Page
bull Use
get_random_key_according_to_distribution()
Device Hits per Year
2013 2014 20150
102030405060708090
100
TabletMobile DeviceComputer
Star Schema
page
device
page_hit
users
Fact Table
DimensionTables
bull johnstinson99gmailcom
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpynormal(20 05 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32
-
Users table
usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom
Surname
import random
last_name = randomchoice(all_last_names_list)
Distribute Users among Regions
region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]
my_region = get_random_key(region_distribution)
Find Random Keygiven List of Key Probabilities
Surname
import random
last_name = random.choice(all_last_names_list)
Distribute Users among Regions
region_distribution = [
    ('NORTH EAST ENGLAND', 0.14),
    ('THE MIDLANDS', 0.13),
    ('NORTH WEST ENGLAND', 0.12),
    ('EAST ANGLIA', 0.11),
    ('SOUTH CENTRAL ENGLAND', 0.1),
    ('SOUTH EAST ENGLAND', 0.09),
    ('SCOTLAND', 0.08),
    ('LONDON', 0.08),
    ('SOUTH WEST ENGLAND', 0.08),
    ('WALES', 0.04),
    ('NORTHERN IRELAND', 0.03),
]
my_region = get_random_key(region_distribution)
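Find Random Key given List of Key Probabilities
def get_random_key(list_of_distribution_tuples):
    # Walk the list, accumulating probability until it passes a random number in [0, 1)
    rnd = random.random()
    cumulative_probability = 0.0
    for (key, probability) in list_of_distribution_tuples:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key
    # Floating-point round-off can leave the total just under 1.0; fall back to the last key
    return list_of_distribution_tuples[-1][0]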
my_region = get_random_key(region_distribution)
gender_distribution = [('f', 0.51), ('m', 0.49)]
my_gender = get_random_key(gender_distribution)
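One way to check the cumulative-probability lookup is to draw many keys and compare the observed frequencies with the requested ones. This is a minimal sketch: the helper name `tally_draws`, the seed and the number of draws are assumptions for illustration, not part of the talk. The standard library's `random.choices`, which accepts weights directly, is shown as an alternative.

```python
import random
from collections import Counter


def get_random_key(list_of_distribution_tuples):
    """Return one key, chosen with the probability paired with it."""
    rnd = random.random()
    cumulative_probability = 0.0
    for key, probability in list_of_distribution_tuples:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key
    # Guard against floating-point round-off when the total is just under 1.0.
    return list_of_distribution_tuples[-1][0]


def tally_draws(distribution, draws):
    """Hypothetical helper: observed frequency of each key over many draws."""
    counts = Counter(get_random_key(distribution) for _ in range(draws))
    return {key: counts[key] / draws for key, _ in distribution}


gender_distribution = [("f", 0.51), ("m", 0.49)]

random.seed(0)  # assumed seed, only to make the run repeatable
observed = tally_draws(gender_distribution, 20000)

# Standard-library alternative: random.choices takes the weights directly.
keys = [key for key, _ in gender_distribution]
weights = [probability for _, probability in gender_distribution]
one_pick = random.choices(keys, weights=weights, k=1)[0]
```

With enough draws the observed frequencies should settle close to 0.51 and 0.49.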
Generate a User
def get_random_user():
    gender = get_random_gender()
    first_name = get_random_first_name(gender)
    last_name = random.choice(all_last_names_list)
    email = get_random_email(first_name, last_name)
    region = get_random_region()
    return {'gender': gender, 'first_name': first_name, 'last_name': last_name, 'email': email, 'region': region}
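The slide leaves the helper functions undefined. A self-contained sketch of how they might fit together follows. The name lists, the toy region weights, the `user_id` argument and the `example.com` email format are all assumptions made for illustration; the talk does not show them.

```python
import random

# Hypothetical name lists; a real run would load much longer lists from files.
first_names_by_gender = {
    "f": ["Alice", "Priya", "Maria"],
    "m": ["James", "Omar", "Daniel"],
}
all_last_names_list = ["Smith", "Jones", "Patel", "Evans"]
region_distribution = [("LONDON", 0.5), ("SCOTLAND", 0.3), ("WALES", 0.2)]  # toy weights


def get_random_first_name(gender):
    """Pick a first name that matches the already-chosen gender."""
    return random.choice(first_names_by_gender[gender])


def get_random_email(first_name, last_name):
    """Derive the email from the name so the columns stay consistent."""
    # example.com is reserved for documentation, so no real address is produced.
    return f"{first_name}.{last_name}{random.randint(1, 99)}@example.com".lower()


def get_random_user(user_id):
    """Build one row for the users dimension table (user_id is an assumed key)."""
    gender = random.choices(["f", "m"], weights=[0.51, 0.49], k=1)[0]
    first_name = get_random_first_name(gender)
    last_name = random.choice(all_last_names_list)
    regions = [name for name, _ in region_distribution]
    weights = [weight for _, weight in region_distribution]
    return {
        "user_id": user_id,
        "gender": gender,
        "first_name": first_name,
        "last_name": last_name,
        "email": get_random_email(first_name, last_name),
        "region": random.choices(regions, weights=weights, k=1)[0],
    }


users = [get_random_user(user_id) for user_id in range(1, 6)]
```

Choosing gender first and deriving the first name and email from earlier choices is what keeps each generated row internally consistent.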
Star Schema
Fact table: page_hit
Dimension tables: page, device, users
Fact Table
page_hit
page_id (FK) 214
user_id (FK) 12346
device_id (FK) 1
start_access_time 2016-09-20 00:48:00
access_duration_mins 5791
PK
Hits per time of day
Normal Distribution
my_list = numpy.random.normal(mean, std_dev, samples)
numpy.random.normal(20, 0.5, 5000)
mean = 20 hours
std_dev = 0.5 hours
samples = 5000
See also random.normalvariate(mean, std_dev)
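The same idea can be sketched with the standard library alone, using `random.normalvariate` in place of NumPy. The seed, the wrap-around past midnight and the per-hour tally are assumptions added for illustration.

```python
import random
from collections import Counter

random.seed(1)  # assumed seed for a repeatable run

mean_hour = 20.0  # peak viewing time, in hours
std_dev = 0.5     # spread, in hours
samples = 5000

# Draw start times; "% 24" wraps any value past midnight back into the day.
start_hours = [random.normalvariate(mean_hour, std_dev) % 24 for _ in range(samples)]

# Count hits per whole hour of the day.
hits_per_hour = Counter(int(hour) for hour in start_hours)
busiest_hours = [hour for hour, _ in hits_per_hour.most_common(2)]
```

With a standard deviation of half an hour, nearly all samples fall between 19:00 and 21:00, so a single normal curve gives one sharp evening peak and almost nothing for the rest of the day.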
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8, 4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]
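A 24-value profile like this can be used directly as sampling weights, one per hour of the day. The sketch below is one possible reading, not the speaker's code: the decimal points in the profile are restored by assumption, and the helper name `get_random_start_time`, the seed and the uniform spread within each hour are hypothetical.

```python
import random
from datetime import datetime, timedelta

# Relative hits for hours 0..23 (decimal points restored by assumption).
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]


def get_random_start_time(day):
    """Hypothetical helper: pick a time on `day`, weighting each hour by the profile."""
    hour = random.choices(range(24), weights=hits_per_hour_peak, k=1)[0]
    seconds_into_hour = random.randrange(3600)  # uniform within the chosen hour
    return day + timedelta(hours=hour, seconds=seconds_into_hour)


random.seed(2)  # assumed seed for a repeatable run
day = datetime(2016, 9, 20)
start_times = [get_random_start_time(day) for _ in range(10000)]

# Share of generated hits that start at 18:00 or later.
evening_share = sum(1 for t in start_times if t.hour >= 18) / len(start_times)
```

The weights do not need to sum to 1; `random.choices` normalises them, so the list can be edited by eye to reshape the daily curve.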
Page Hits per Hour 2013-2015
Chart: page hits by hour of day (0 to 23) for each of 2013, 2014 and 2015; vertical axis from 0 to 7000.
Device, User, Page
• Use get_random_key_according_to_distribution()
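Putting the pieces together, each fact row picks its foreign keys from the dimension tables according to a distribution. The following is a rough sketch under stated assumptions: the device ids 1 to 3 follow the device table shown earlier in the talk, while the page ids, all weights, the user id range, the duration range and the body of `get_random_key_according_to_distribution` are hypothetical.

```python
import random
from datetime import datetime, timedelta

# Toy dimension keys with assumed shares of hits.
device_distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]  # device_id, weight
page_distribution = [(2, 0.4), (14, 0.6)]             # page_id, weight
user_ids = list(range(1, 101))                        # assumed users table keys


def get_random_key_according_to_distribution(distribution):
    """Pick one key from (key, weight) pairs, in proportion to the weights."""
    keys = [key for key, _ in distribution]
    weights = [weight for _, weight in distribution]
    return random.choices(keys, weights=weights, k=1)[0]


def get_random_page_hit(day):
    """Build one fact row whose foreign keys all exist in the dimension tables."""
    start = day + timedelta(seconds=random.randrange(24 * 3600))
    return {
        "page_id": get_random_key_according_to_distribution(page_distribution),
        "user_id": random.choice(user_ids),
        "device_id": get_random_key_according_to_distribution(device_distribution),
        "start_access_time": start.strftime("%Y-%m-%d %H:%M:%S"),
        "access_duration_mins": round(random.uniform(0.5, 60.0), 2),
    }


random.seed(3)  # assumed seed for a repeatable run
page_hits = [get_random_page_hit(datetime(2016, 9, 20)) for _ in range(1000)]
```

Because every foreign key is drawn from an existing dimension key, the generated fact table can be joined to the dimensions without orphan rows. Swapping in a different `device_distribution` for each year would reproduce a shift in device mix over time.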
Device Hits per Year
Chart: hits per year (2013, 2014, 2015) for Tablet, Mobile Device and Computer; vertical axis from 0 to 100.
Star Schema
Fact table: page_hit
Dimension tables: page, device, users
• johnstinson99@gmail.com
- Rich Relational Data from Thin-Air
- Why Simulate Data
- Device Type over Time
- Usage by Time of Day
- Tables
- Star Schema
- Device Table
- Star Schema (2)
- Page Table
- Chrome Browser History
- Regular Expression
- Closer Look at Browser History
- Slide 13
- Read in SQLite dump as dictionary
- Star Schema (3)
- Users table
- Surname
- Distribute Users among Regions
- Find Random Key given List of Key Probabilities
- Generate a User
- Star Schema (4)
- Fact Table
- Hits per time of day
- Normal Distribution
- numpy.random.normal(20, 0.5, 5000)
- Slide 26
- Slide 27
- Page Hits per Hour 2013-2015
- Device User Page
- Device Hits per Year
- Star Schema (5)
- Slide 32