Rich Relational Data from Thin-Air: How to Fake It

PyData London, 6 Dec 2016

John Stinson

Why Simulate Data

• Early System Testing
• Cloud Migration
• Training Course

Device Type over Time

Source: https://www.nakono.com/tekcarta/company-profiles/microsoft-xbox-one/microsoft-xbox-360-and-xbox-one-developer-programme-market-adoption-and-analysis

Usage by Time of Day

Source: http://downloads.bbc.co.uk/mediacentre/iplayer/iplayer-performance-jan16.pdf

Tables

• users
• page
• device
• page_hit

Star Schema

[Diagram: the fact table page_hit in the centre, joined to the dimension tables page, device and users]

Device Table

device
device_id (PK)   device_name
1                Tablet
2                Mobile Device
3                Computer

devices = [
    {'device_id': 1, 'device_name': 'Tablet'},
    {'device_id': 2, 'device_name': 'Mobile Device'},
    {'device_id': 3, 'device_name': 'Computer'},
]

Page Table

page
page_id (PK)   214
url            http://radio, http://iplayer …
platform       radio, iplayer

Chrome Browser History

C:\Users\<user>\AppData\Local\Google\Chrome\User Data\Default\History (Windows)
~/Library/Application Support/Google/Chrome/Default/History (Mac)

Regular Expression

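import re

# Named url_pattern rather than re, so the re module is not shadowed
url_pattern = r'https?://(?:www\.)?bbc\.co\.uk[a-zA-Z0-9\-./_=]+'
url_list = re.findall(url_pattern, file_contents)

http://bbc.co.uk/
http://www.bbc.co.uk/iplayer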
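A minimal sketch of the same idea, assuming the contents of the History file have already been read into a string. The sample text and the exact pattern are illustrative assumptions rather than the ones used in the talk.

```python
import re

# Hypothetical sample standing in for the raw contents of the History file.
file_contents = (
    "junk http://www.bbc.co.uk/iplayer more junk "
    "https://www.bbc.co.uk/radio/player/bbc_radio_four?x=1 "
    "http://example.com/not-bbc http://bbc.co.uk/news"
)

# http or https, an optional "www.", the bbc.co.uk host, then any run of
# characters commonly found in a URL path or query string.
BBC_URL = re.compile(r"https?://(?:www\.)?bbc\.co\.uk[a-zA-Z0-9\-./_=?&]*")

# The group is non-capturing, so findall returns the full matched URLs.
url_list = BBC_URL.findall(file_contents)
for url in url_list:
    print(url)
```

Compiling the pattern under its own name keeps the `re` module available for later calls.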

Closer Look at Browser History

Read in SQLite dump as dictionary

import csv

reader = csv.DictReader(my_file, delimiter='\t')
print(reader.fieldnames)

urls = [(row['url'], row['title'], row['visit_count'])
        for row in reader if 'bbc.co.uk' in row['url']]
for url, title, visitcount in urls:
    print(url, '|', title, '|', visitcount)

['id', 'url', 'title', 'visit_count', 'typed_count', 'last_visit_time', 'hidden', 'favicon_id']

http://www.bbc.co.uk/iplayer | BBC iPlayer | 53
http://www.bbc.co.uk/iplayer/guide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3
http://www.bbc.co.uk/iplayer/live/bbcone | BBC iPlayer - Watch BBC One live | 5
http://www.bbc.co.uk/iplayer/a-z | BBC iPlayer - A to Z - A | 4
http://www.bbc.co.uk/iplayer/a-z/a | BBC iPlayer - A to Z - A | 4
http://www.bbc.co.uk/iplayer/a-z/t | BBC iPlayer - A to Z - T | 1
http://www.bbc.co.uk/iplayer/a-z/i | BBC iPlayer - A to Z - I | 1
http://www.bbc.co.uk/iplayer/a-z/q | BBC iPlayer - A to Z - Q | 1
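The same steps can be tried without a real history file. This is a sketch under assumptions: the tab-separated sample below is made up and carries only four of the columns, and `io.StringIO` stands in for the opened dump file.

```python
import csv
import io

# Hypothetical tab-separated dump of the history table; only four of the
# real columns are included to keep the sample short.
dump = (
    "id\turl\ttitle\tvisit_count\n"
    "1\thttp://www.bbc.co.uk/iplayer\tBBC iPlayer\t53\n"
    "2\thttp://example.com/\tExample\t7\n"
    "3\thttp://www.bbc.co.uk/iplayer/guide\tBBC iPlayer - TV Guide\t3\n"
)

reader = csv.DictReader(io.StringIO(dump), delimiter="\t")
fieldnames = reader.fieldnames  # column names come from the header row

# DictReader yields every value as a string, so convert the count explicitly.
bbc_rows = [
    (row["url"], row["title"], int(row["visit_count"]))
    for row in reader
    if "bbc.co.uk" in row["url"]
]

for url, title, visit_count in bbc_rows:
    print(url, "|", title, "|", visit_count)
```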

Users table

users
user_id (PK)    12346
first_name      Regen
last_name       Schattschneid
date_of_birth   1963-11-22
region          South East England
gender          m
email_address   regen.schattschneid@coelurosauria.com

Surname

import random

last_name = random.choice(all_last_names_list)

Distribute Users among Regions

region_distribution = [
    ('NORTH EAST ENGLAND', 0.14),
    ('THE MIDLANDS', 0.13),
    ('NORTH WEST ENGLAND', 0.12),
    ('EAST ANGLIA', 0.11),
    ('SOUTH CENTRAL ENGLAND', 0.1),
    ('SOUTH EAST ENGLAND', 0.09),
    ('SCOTLAND', 0.08),
    ('LONDON', 0.08),
    ('SOUTH WEST ENGLAND', 0.08),
    ('WALES', 0.04),
    ('NORTHERN IRELAND', 0.03),
]

my_region = get_random_key(region_distribution)

Find Random Key given List of Key Probabilities

def get_random_key(list_of_distribution_tuples):
    rnd = random.random()
    cumulative_probability = 0.0
    for (key, probability) in list_of_distribution_tuples:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key
    return key  # rounding can leave the total just below 1.0, so fall back to the last key

my_region = get_random_key(region_distribution)

gender_distribution = [('f', 0.51), ('m', 0.49)]
my_gender = get_random_key(gender_distribution)
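For comparison, the standard library's `random.choices` (Python 3.6 and later) accepts weights directly and does the cumulative lookup internally. This is a sketch, not the talk's code: it assumes the region probabilities are 0.14, 0.13 and so on, and `get_random_key_stdlib` is a hypothetical helper name.

```python
import random
from collections import Counter

# Assumed probabilities for each region; they sum to 1.0.
region_distribution = [
    ("NORTH EAST ENGLAND", 0.14), ("THE MIDLANDS", 0.13),
    ("NORTH WEST ENGLAND", 0.12), ("EAST ANGLIA", 0.11),
    ("SOUTH CENTRAL ENGLAND", 0.10), ("SOUTH EAST ENGLAND", 0.09),
    ("SCOTLAND", 0.08), ("LONDON", 0.08), ("SOUTH WEST ENGLAND", 0.08),
    ("WALES", 0.04), ("NORTHERN IRELAND", 0.03),
]


def get_random_key_stdlib(distribution, rng=random):
    """Pick one key, weighted by its probability (hypothetical helper)."""
    keys = [key for key, _ in distribution]
    weights = [probability for _, probability in distribution]
    return rng.choices(keys, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded so the run is repeatable
draws = 20000
sample = Counter(get_random_key_stdlib(region_distribution, rng) for _ in range(draws))
for region, count in sample.most_common():
    print(f"{region:25s} {count / draws:.3f}")
```

With enough draws, the observed share of each region should settle close to its configured probability.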

Generate a User

def get_random_user():
    gender = get_random_gender()
    first_name = get_random_first_name(gender)
    last_name = random.choice(all_last_names_list)
    email = get_random_email(first_name, last_name)
    region = get_random_region()
    return {'gender': gender, 'first_name': first_name, 'last_name': last_name,
            'email': email, 'region': region}
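A self-contained sketch of how the pieces might fit together. The name lists, the region weights, the `example.com` email domain and the `weighted_key` helper are all assumptions made for illustration; a real run would load much longer name lists from files.

```python
import random

# Hypothetical name lists and weights, made up for this sketch.
FIRST_NAMES = {"f": ["Alice", "Priya", "Mei"], "m": ["Omar", "Tom", "Luca"]}
LAST_NAMES = ["Smith", "Okafor", "Jones", "Patel"]
GENDERS = [("f", 0.51), ("m", 0.49)]
REGIONS = [("LONDON", 0.5), ("SCOTLAND", 0.3), ("WALES", 0.2)]


def weighted_key(distribution, rng):
    """Pick one key from (key, probability) pairs."""
    keys, weights = zip(*distribution)
    return rng.choices(keys, weights=weights, k=1)[0]


def get_random_user(user_id, rng):
    gender = weighted_key(GENDERS, rng)
    first_name = rng.choice(FIRST_NAMES[gender])  # first name depends on gender
    last_name = rng.choice(LAST_NAMES)
    return {
        "user_id": user_id,
        "gender": gender,
        "first_name": first_name,
        "last_name": last_name,
        # example.com is a reserved domain, so these addresses cannot be real
        "email_address": f"{first_name}.{last_name}@example.com".lower(),
        "region": weighted_key(REGIONS, rng),
    }


rng = random.Random(0)  # seeded so the run is repeatable
users = [get_random_user(user_id, rng) for user_id in range(1, 6)]
for user in users:
    print(user)
```

Passing the random generator in as a parameter makes the output repeatable, which helps when the fake data feeds automated tests.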

Fact Table

page_hit
page_id (FK)           214
user_id (FK)           12346
device_id (FK)         1
start_access_time      2016-09-20 00:48:00
access_duration_mins   5791

PK
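The star schema can be tried end to end in an in-memory SQLite database. This is a minimal sketch under assumptions: the column types, the reduced users table, the page URL and the duration value are placeholders chosen here, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (device_id INTEGER PRIMARY KEY, device_name TEXT);
    CREATE TABLE page   (page_id   INTEGER PRIMARY KEY, url TEXT, platform TEXT);
    CREATE TABLE users  (user_id   INTEGER PRIMARY KEY, first_name TEXT,
                         last_name TEXT, region TEXT);
    -- Fact table: one row per page hit, pointing at the three dimensions.
    CREATE TABLE page_hit (
        page_id   INTEGER REFERENCES page(page_id),
        user_id   INTEGER REFERENCES users(user_id),
        device_id INTEGER REFERENCES device(device_id),
        start_access_time    TEXT,
        access_duration_mins REAL
    );
""")

conn.executemany("INSERT INTO device VALUES (?, ?)",
                 [(1, "Tablet"), (2, "Mobile Device"), (3, "Computer")])
# The URL and the duration below are placeholder values for this sketch.
conn.execute("INSERT INTO page VALUES (214, 'http://www.bbc.co.uk/iplayer', 'iplayer')")
conn.execute("INSERT INTO users VALUES (12346, 'Regen', 'Schattschneid', 'South East England')")
conn.execute("INSERT INTO page_hit VALUES (214, 12346, 1, '2016-09-20 00:48:00', 12.5)")

# Join the fact table back to each dimension table.
rows = conn.execute("""
    SELECT u.first_name, d.device_name, p.platform, h.start_access_time
    FROM page_hit AS h
    JOIN users  AS u ON u.user_id   = h.user_id
    JOIN device AS d ON d.device_id = h.device_id
    JOIN page   AS p ON p.page_id   = h.page_id
""").fetchall()
print(rows)
```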

Hits per time of day

Normal Distribution

my_list = numpy.random.normal(mean, std_dev, samples)

numpy.random.normal(20, 0.5, 5000)

mean = 20 hours
std_dev = 0.5 hours
samples = 5000

See also random.normalvariate(mean, std_dev)
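A sketch of the same sampling using only the standard library's `random.normalvariate`, so it runs without NumPy. Wrapping with `% 24` and bucketing by whole hour are assumptions about how the samples would be turned into times of day.

```python
import random
from collections import Counter

rng = random.Random(1)  # seeded so the run is repeatable
mean, std_dev, samples = 20.0, 0.5, 5000

# Draw a time of day in hours; % 24 wraps anything past midnight back into range.
hit_times = [rng.normalvariate(mean, std_dev) % 24 for _ in range(samples)]

# Bucket each time into its whole hour to get hits per hour.
hits_per_hour = Counter(int(t) for t in hit_times)
for hour in sorted(hits_per_hour):
    print(f"{hour:02d}:00  {hits_per_hour[hour]}")
```

With a standard deviation of half an hour, nearly all hits land between 19:00 and 21:00, which is why a single normal curve gives a sharp evening peak rather than a full daily profile.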

hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]
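One way to use a 24-value list like this is as relative weights when picking the hour of each hit. This is a sketch, assuming the values are weights for hours 0 to 23 with the decimal points as written below; `get_random_hit_time` is a hypothetical helper.

```python
import random
from collections import Counter

# Relative weight for each hour 0-23 (assumed values, one per hour).
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]

rng = random.Random(7)  # seeded so the run is repeatable


def get_random_hit_time(rng):
    """Return (hour, minute) with the hour weighted by hits_per_hour_peak."""
    hour = rng.choices(range(24), weights=hits_per_hour_peak, k=1)[0]
    minute = rng.randrange(60)  # spread hits evenly within the hour
    return hour, minute


hours = Counter(get_random_hit_time(rng)[0] for _ in range(50000))
for hour in range(24):
    print(f"{hour:02d}:00  {hours[hour]}")
```

The weights do not need to sum to 1, because `random.choices` normalises them.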

Page Hits per Hour 2013-2015

[Chart: page hits for each hour of the day (0 to 23), shown for 2013, 2014 and 2015; vertical axis from 0 to 7000]

Device User Page

• Use get_random_key_according_to_distribution()
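A sketch of generating fact rows this way. Every distribution below is made up for illustration; only the device ids match the device table. Users are picked uniformly here to keep the example short, and the helper follows the shape of get_random_key rather than any code shown in the talk.

```python
import random
from datetime import datetime, timedelta

# Hypothetical distributions of (foreign key, probability).
device_distribution = [(1, 0.25), (2, 0.35), (3, 0.40)]
page_distribution = [(214, 0.6), (215, 0.4)]
user_ids = [12345, 12346, 12347]


def get_random_key(distribution, rng):
    rnd = rng.random()
    cumulative_probability = 0.0
    for key, probability in distribution:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key
    return distribution[-1][0]  # guard against floating-point shortfall


def get_random_page_hit(day, rng):
    # Start time is uniform over the day in this sketch.
    start = day + timedelta(minutes=rng.randrange(24 * 60))
    return {
        "page_id": get_random_key(page_distribution, rng),
        "user_id": rng.choice(user_ids),
        "device_id": get_random_key(device_distribution, rng),
        "start_access_time": start.strftime("%Y-%m-%d %H:%M:%S"),
        "access_duration_mins": round(rng.uniform(1, 90), 2),
    }


rng = random.Random(3)  # seeded so the run is repeatable
page_hits = [get_random_page_hit(datetime(2016, 9, 20), rng) for _ in range(1000)]
print(page_hits[0])
```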

Device Hits per Year

[Chart: device hits per year for 2013, 2014 and 2015 on a vertical axis from 0 to 100; series: Tablet, Mobile Device, Computer]

• johnstinson99@gmail.com

Page 2: Rich relational data from thin air   john stinson

Why Simulate Data

bull Early System Testingbull Cloud Migrationbull Training Course

Device Type over Time

Source httpswwwnakonocomtekcartacompany-profilesmicrosoft-xbox-onemicrosoft-xbox-360-and-xbox-one-developer-programme-market-adoption-and-analysis

Usage by Time of Day

Source httpdownloadsbbccoukmediacentreiplayeriplayer-performance-jan16pdf

Tables

bull usersbull pagebull devicebull page_hit

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Device Tabledevicedevice_id (PK) 1 2 3device_name Tablet Mobile Device Computer

devices = [device_id 1 device_name Tabletdevice_id 2 device_name Mobile Devicedevice_id 3 device_name Computer

]

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Page Table

pagepage_id (PK) 214url httpradio

httpiplayer hellipplatform radio iplayer

Chrome Browser History

CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)

Regular Expression

re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)

httpbbccoukhttpwwwbbccoukiplayer

Closer Look at Browser History

Read in SQLite dump as dictionary

reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)

urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)

[id url title visit_count typed_count last_visit_time hidden favicon_id]

httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Users table

usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom

Surname

import random

last_name = randomchoice(all_last_names_list)

Distribute Users among Regions

region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]

my_region = get_random_key(region_distribution)

Find Random Keygiven List of Key Probabilities

def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key

my_region = get_random_key(region_distribution)

gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)

Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Fact Table

page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791

PK

Hits per time of day

Normal Distribution

my_list = numpynormal(mean std_dev samples)

numpynormal(20 05 5000)mean = 20 hours

std_dev = 05 hours

samples = 5000

see also randomnormalvariate(mean std_dev)

hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]

Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

2013 2014 2015

01000200030004000500060007000

Device User Page

bull Use

get_random_key_according_to_distribution()

Device Hits per Year

2013 2014 20150

102030405060708090

100

TabletMobile DeviceComputer

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

bull johnstinson99gmailcom

  • Rich Relational Data from Thin-Air
  • Why Simulate Data
  • Device Type over Time
  • Usage by Time of Day
  • Tables
  • Star Schema
  • Device Table
  • Star Schema (2)
  • Page Table
  • Chrome Browser History
  • Regular Expression
  • Closer Look at Browser History
  • Slide 13
  • Read in SQLite dump as dictionary
  • Star Schema (3)
  • Users table
  • Surname
  • Distribute Users among Regions
  • Find Random Key given List of Key Probabilities
  • Generate a User
  • Star Schema (4)
  • Fact Table
  • Hits per time of day
  • Normal Distribution
  • numpynormal(20 05 5000)
  • Slide 26
  • Slide 27
  • Page Hits per Hour 2013-2015
  • Device User Page
  • Device Hits per Year
  • Star Schema (5)
  • Slide 32
Page 3: Rich relational data from thin air   john stinson

Device Type over Time

Source httpswwwnakonocomtekcartacompany-profilesmicrosoft-xbox-onemicrosoft-xbox-360-and-xbox-one-developer-programme-market-adoption-and-analysis

Usage by Time of Day

Source httpdownloadsbbccoukmediacentreiplayeriplayer-performance-jan16pdf

Tables

bull usersbull pagebull devicebull page_hit

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Device Tabledevicedevice_id (PK) 1 2 3device_name Tablet Mobile Device Computer

devices = [device_id 1 device_name Tabletdevice_id 2 device_name Mobile Devicedevice_id 3 device_name Computer

]

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Page Table

pagepage_id (PK) 214url httpradio

httpiplayer hellipplatform radio iplayer

Chrome Browser History

CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)

Regular Expression

re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)

httpbbccoukhttpwwwbbccoukiplayer

Closer Look at Browser History

Read in SQLite dump as dictionary

reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)

urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)

[id url title visit_count typed_count last_visit_time hidden favicon_id]

httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Users table

usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom

Surname

import random

last_name = randomchoice(all_last_names_list)

Distribute Users among Regions

region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]

my_region = get_random_key(region_distribution)

Find Random Keygiven List of Key Probabilities

def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key

my_region = get_random_key(region_distribution)

gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)

Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Fact Table

page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791

PK

Hits per time of day

Normal Distribution

my_list = numpynormal(mean std_dev samples)

numpynormal(20 05 5000)mean = 20 hours

std_dev = 05 hours

samples = 5000

see also randomnormalvariate(mean std_dev)

hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]

Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

2013 2014 2015

01000200030004000500060007000

Device User Page

bull Use

get_random_key_according_to_distribution()

Device Hits per Year

2013 2014 20150

102030405060708090

100

TabletMobile DeviceComputer

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

bull johnstinson99gmailcom

  • Rich Relational Data from Thin-Air
  • Why Simulate Data
  • Device Type over Time
  • Usage by Time of Day
  • Tables
  • Star Schema
  • Device Table
  • Star Schema (2)
  • Page Table
  • Chrome Browser History
  • Regular Expression
  • Closer Look at Browser History
  • Slide 13
  • Read in SQLite dump as dictionary
  • Star Schema (3)
  • Users table
  • Surname
  • Distribute Users among Regions
  • Find Random Key given List of Key Probabilities
  • Generate a User
  • Star Schema (4)
  • Fact Table
  • Hits per time of day
  • Normal Distribution
  • numpynormal(20 05 5000)
  • Slide 26
  • Slide 27
  • Page Hits per Hour 2013-2015
  • Device User Page
  • Device Hits per Year
  • Star Schema (5)
  • Slide 32
Page 4: Rich relational data from thin air   john stinson

Usage by Time of Day

Source httpdownloadsbbccoukmediacentreiplayeriplayer-performance-jan16pdf

Tables

bull usersbull pagebull devicebull page_hit

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Device Tabledevicedevice_id (PK) 1 2 3device_name Tablet Mobile Device Computer

devices = [device_id 1 device_name Tabletdevice_id 2 device_name Mobile Devicedevice_id 3 device_name Computer

]

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Page Table

pagepage_id (PK) 214url httpradio

httpiplayer hellipplatform radio iplayer

Chrome Browser History

CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)

Regular Expression

re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)

httpbbccoukhttpwwwbbccoukiplayer

Closer Look at Browser History

Read in SQLite dump as dictionary

reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)

urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)

[id url title visit_count typed_count last_visit_time hidden favicon_id]

httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Users table

usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom

Surname

import random

last_name = randomchoice(all_last_names_list)

Distribute Users among Regions

region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]

my_region = get_random_key(region_distribution)

Find Random Keygiven List of Key Probabilities

def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key

my_region = get_random_key(region_distribution)

gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)

Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Fact Table

page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791

PK

Hits per time of day

Normal Distribution

my_list = numpynormal(mean std_dev samples)

numpynormal(20 05 5000)mean = 20 hours

std_dev = 05 hours

samples = 5000

see also randomnormalvariate(mean std_dev)

hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]

Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

2013 2014 2015

01000200030004000500060007000

Device User Page

bull Use

get_random_key_according_to_distribution()

Device Hits per Year

2013 2014 20150

102030405060708090

100

TabletMobile DeviceComputer

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

bull johnstinson99gmailcom

  • Rich Relational Data from Thin-Air
  • Why Simulate Data
  • Device Type over Time
  • Usage by Time of Day
  • Tables
  • Star Schema
  • Device Table
  • Star Schema (2)
  • Page Table
  • Chrome Browser History
  • Regular Expression
  • Closer Look at Browser History
  • Slide 13
  • Read in SQLite dump as dictionary
  • Star Schema (3)
  • Users table
  • Surname
  • Distribute Users among Regions
  • Find Random Key given List of Key Probabilities
  • Generate a User
  • Star Schema (4)
  • Fact Table
  • Hits per time of day
  • Normal Distribution
  • numpynormal(20 05 5000)
  • Slide 26
  • Slide 27
  • Page Hits per Hour 2013-2015
  • Device User Page
  • Device Hits per Year
  • Star Schema (5)
  • Slide 32
Page 5: Rich relational data from thin air   john stinson

Tables

bull usersbull pagebull devicebull page_hit

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Device Tabledevicedevice_id (PK) 1 2 3device_name Tablet Mobile Device Computer

devices = [device_id 1 device_name Tabletdevice_id 2 device_name Mobile Devicedevice_id 3 device_name Computer

]

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Page Table

pagepage_id (PK) 214url httpradio

httpiplayer hellipplatform radio iplayer

Chrome Browser History

CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)

Regular Expression

re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)

httpbbccoukhttpwwwbbccoukiplayer

Closer Look at Browser History

Read in SQLite dump as dictionary

reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)

urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)

[id url title visit_count typed_count last_visit_time hidden favicon_id]

httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Users table

usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom

Surname

import random

last_name = randomchoice(all_last_names_list)

Distribute Users among Regions

region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]

my_region = get_random_key(region_distribution)

Find Random Keygiven List of Key Probabilities

def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key

my_region = get_random_key(region_distribution)

gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)

Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Fact Table

page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791

PK

Hits per time of day

Normal Distribution

my_list = numpynormal(mean std_dev samples)

numpynormal(20 05 5000)mean = 20 hours

std_dev = 05 hours

samples = 5000

see also randomnormalvariate(mean std_dev)

hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]

Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

2013 2014 2015

01000200030004000500060007000

Device User Page

bull Use

get_random_key_according_to_distribution()

Device Hits per Year

2013 2014 20150

102030405060708090

100

TabletMobile DeviceComputer

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

bull johnstinson99gmailcom

  • Rich Relational Data from Thin-Air
  • Why Simulate Data
  • Device Type over Time
  • Usage by Time of Day
  • Tables
  • Star Schema
  • Device Table
  • Star Schema (2)
  • Page Table
  • Chrome Browser History
  • Regular Expression
  • Closer Look at Browser History
  • Slide 13
  • Read in SQLite dump as dictionary
  • Star Schema (3)
  • Users table
  • Surname
  • Distribute Users among Regions
  • Find Random Key given List of Key Probabilities
  • Generate a User
  • Star Schema (4)
  • Fact Table
  • Hits per time of day
  • Normal Distribution
  • numpynormal(20 05 5000)
  • Slide 26
  • Slide 27
  • Page Hits per Hour 2013-2015
  • Device User Page
  • Device Hits per Year
  • Star Schema (5)
  • Slide 32
Page 6: Rich relational data from thin air   john stinson

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Device Tabledevicedevice_id (PK) 1 2 3device_name Tablet Mobile Device Computer

devices = [device_id 1 device_name Tabletdevice_id 2 device_name Mobile Devicedevice_id 3 device_name Computer

]

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Page Table

pagepage_id (PK) 214url httpradio

httpiplayer hellipplatform radio iplayer

Chrome Browser History

CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)

Regular Expression

re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)

httpbbccoukhttpwwwbbccoukiplayer

Closer Look at Browser History

Read in SQLite dump as dictionary

reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)

urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)

[id url title visit_count typed_count last_visit_time hidden favicon_id]

httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Users table

usersuser_id (PK) 12346first_name Regenlast_name Schattschneiddate_of_birth 1963-11-22region South East Englandgender memail_address regenschattschneidcoelurosauriacom

Surname

import random

last_name = randomchoice(all_last_names_list)

Distribute Users among Regions

region_distribution = [(NORTH EAST ENGLAND 014) (THE MIDLANDS 013) (NORTH WEST ENGLAND 012) (EAST ANGLIA 011) (SOUTH CENTRAL ENGLAND 01) (SOUTH EAST ENGLAND 009) (SCOTLAND 008) (LONDON 008) (SOUTH WEST ENGLAND 008) (WALES 004) (NORTHERN IRELAND 003)]

my_region = get_random_key(region_distribution)

Find Random Keygiven List of Key Probabilities

def get_random_key(list_of_distribution_tuples) rnd = randomrandom() cumulative_probability = 00 for (key probability) in list_of_distribution_tuples cumulative_probability += probability if rnd lt= cumulative_probability return key

my_region = get_random_key(region_distribution)

gender_distribution = [(lsquofrsquo 51) (lsquomrsquo 49)]my_gender = get_random_key(gender_distribution)

Generate a Userdef get_random_user() gender = get_random_gender() first_name = get_random_first_name(gender last_name = randomchoice(all_last_names_list) email = get_random_email(first_name last_name) region = get_random_region() return (gender gender first_name first_name last_name last_name email email region region)

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Fact Table

page_hitpage_id (FK) 214user_id (FK) 12346device_id (FK) 1start_access_time 2016-09-20 004800access_duration_mins 5791

PK

Hits per time of day

Normal Distribution

my_list = numpynormal(mean std_dev samples)

numpynormal(20 05 5000)mean = 20 hours

std_dev = 05 hours

samples = 5000

see also randomnormalvariate(mean std_dev)

hits_per_hour_peak = [35 25 1 05 04 04 15 2 3 33 36 38 4 4 4 5 6 7 8 9 9 10 95 55]

Page Hits per Hour 2013-20150 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

2013 2014 2015

01000200030004000500060007000

Device User Page

bull Use

get_random_key_according_to_distribution()

Device Hits per Year

2013 2014 20150

102030405060708090

100

TabletMobile DeviceComputer

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

bull johnstinson99gmailcom

  • Rich Relational Data from Thin-Air
  • Why Simulate Data
  • Device Type over Time
  • Usage by Time of Day
  • Tables
  • Star Schema
  • Device Table
  • Star Schema (2)
  • Page Table
  • Chrome Browser History
  • Regular Expression
  • Closer Look at Browser History
  • Slide 13
  • Read in SQLite dump as dictionary
  • Star Schema (3)
  • Users table
  • Surname
  • Distribute Users among Regions
  • Find Random Key given List of Key Probabilities
  • Generate a User
  • Star Schema (4)
  • Fact Table
  • Hits per time of day
  • Normal Distribution
  • numpynormal(20 05 5000)
  • Slide 26
  • Slide 27
  • Page Hits per Hour 2013-2015
  • Device User Page
  • Device Hits per Year
  • Star Schema (5)
  • Slide 32
Page 7: Rich relational data from thin air   john stinson

Device Tabledevicedevice_id (PK) 1 2 3device_name Tablet Mobile Device Computer

devices = [device_id 1 device_name Tabletdevice_id 2 device_name Mobile Devicedevice_id 3 device_name Computer

]

Star Schema

page

device

page_hit

users

Fact Table

DimensionTables

Page Table

pagepage_id (PK) 214url httpradio

httpiplayer hellipplatform radio iplayer

Chrome Browser History

CUsersltusergtAppDataLocalGoogleChromeUser DataDefaultHistory (Windows)~LibraryApplication SupportGoogleChromeDefaultHistory (Mac)

Regular Expression

re = httpsbbccouk[a-zA-Z0-9- _=]+url_list = refindall(re file_contents)

httpbbccoukhttpwwwbbccoukiplayer

Closer Look at Browser History

Read in SQLite dump as dictionary

reader = csvDictReader(my_file delimiter=t)print(readerfieldnames)

urls = [(row[url] row[title] row[visit_count]) for row in reader if bbccouk in row[url]]for url title visitcount in urls print(url | title | visitcount)

[id url title visit_count typed_count last_visit_time hidden favicon_id]

httpwwwbbccoukiplayer | BBC iPlayer | 53httpwwwbbccoukiplayerguide | BBC iPlayer - TV Guide - London - Saturday 10 September 2016 | 3httpwwwbbccoukiplayerlivebbcone | BBC iPlayer - Watch BBC One live | 5httpwwwbbccoukiplayera-z | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-za | BBC iPlayer - A to Z - A | 4httpwwwbbccoukiplayera-zt | BBC iPlayer - A to Z - T | 1httpwwwbbccoukiplayera-zi | BBC iPlayer - A to Z - I | 1httpwwwbbccoukiplayera-zq | BBC iPlayer - A to Z - Q | 1

Star Schema

Fact Table: page_hit

Dimension Tables: page, device, users

Users table

users
  user_id (PK)    12346
  first_name      Regen
  last_name       Schattschneid
  date_of_birth   1963-11-22
  region          South East England
  gender          m
  email_address   regenschattschneid@coelurosauria.com

Surname

import random

last_name = random.choice(all_last_names_list)

Distribute Users among Regions

region_distribution = [
    ('NORTH EAST ENGLAND', 0.14),
    ('THE MIDLANDS', 0.13),
    ('NORTH WEST ENGLAND', 0.12),
    ('EAST ANGLIA', 0.11),
    ('SOUTH CENTRAL ENGLAND', 0.1),
    ('SOUTH EAST ENGLAND', 0.09),
    ('SCOTLAND', 0.08),
    ('LONDON', 0.08),
    ('SOUTH WEST ENGLAND', 0.08),
    ('WALES', 0.04),
    ('NORTHERN IRELAND', 0.03),
]

my_region = get_random_key(region_distribution)

Find Random Key given List of Key Probabilities

def get_random_key(list_of_distribution_tuples):
    # Draw a number in [0, 1) and walk the cumulative distribution until it passes the draw.
    rnd = random.random()
    cumulative_probability = 0.0
    for (key, probability) in list_of_distribution_tuples:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key

my_region = get_random_key(region_distribution)

gender_distribution = [('f', 0.51), ('m', 0.49)]
my_gender = get_random_key(gender_distribution)
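As an alternative to the hand-written cumulative loop, the standard library's random.choices (Python 3.6 and later) performs the same weighted pick. This is a sketch, not the talk's implementation: the function name get_random_key_weighted and the rng parameter are hypothetical, and the 0.51 / 0.49 split assumes the gender weights are fractions of 1.

```python
import random
from collections import Counter

def get_random_key_weighted(list_of_distribution_tuples, rng=random):
    """Pick one key with probability proportional to its weight."""
    keys = [key for key, _ in list_of_distribution_tuples]
    weights = [weight for _, weight in list_of_distribution_tuples]
    return rng.choices(keys, weights=weights, k=1)[0]

gender_distribution = [("f", 0.51), ("m", 0.49)]

# A seeded generator makes the simulated data repeatable between runs.
rng = random.Random(42)
counts = Counter(get_random_key_weighted(gender_distribution, rng)
                 for _ in range(10_000))
print(counts)
```

One practical difference is that random.choices does not need the weights to sum to exactly 1, so a small floating-point shortfall in the total cannot leave a draw without a key.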

Generate a User

def get_random_user():
    gender = get_random_gender()
    first_name = get_random_first_name(gender)
    last_name = random.choice(all_last_names_list)
    email = get_random_email(first_name, last_name)
    region = get_random_region()
    return {'gender': gender, 'first_name': first_name, 'last_name': last_name,
            'email': email, 'region': region}
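The helper functions called by get_random_user are not shown on the slides, so here is a runnable sketch of how the pieces might fit together. Every name list, the reduced region list, the example.com e-mail domain and the pick helper are hypothetical placeholders chosen for illustration.

```python
import random

# All lists below are hypothetical placeholders, not data from the talk.
FIRST_NAMES = {"f": ["Alice", "Priya"], "m": ["Bob", "Tariq"]}
LAST_NAMES = ["Smith", "Jones", "Patel"]
REGIONS = [("LONDON", 0.5), ("SCOTLAND", 0.3), ("WALES", 0.2)]
GENDERS = [("f", 0.51), ("m", 0.49)]

def pick(distribution, rng):
    """Weighted pick from a list of (key, probability) tuples."""
    keys, weights = zip(*distribution)
    return rng.choices(keys, weights=weights, k=1)[0]

def get_random_user(user_id, rng):
    """Build one row for the users table as a dictionary."""
    gender = pick(GENDERS, rng)
    first_name = rng.choice(FIRST_NAMES[gender])  # first name depends on gender
    last_name = rng.choice(LAST_NAMES)
    return {
        "user_id": user_id,
        "gender": gender,
        "first_name": first_name,
        "last_name": last_name,
        "email_address": f"{first_name}.{last_name}@example.com".lower(),
        "region": pick(REGIONS, rng),
    }

rng = random.Random(2016)
users = [get_random_user(user_id, rng) for user_id in range(1, 6)]
for user in users:
    print(user)
```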

Star Schema

Fact Table: page_hit

Dimension Tables: page, device, users

Fact Table

page_hit
  page_id (FK)           214
  user_id (FK)           12346
  device_id (FK)         1
  start_access_time      2016-09-20 00:48:00
  access_duration_mins   5791

PK
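A fact row is only valid if each foreign key refers to a row that exists in its dimension table. The sketch below shows one way to guarantee that by drawing the keys from the lists of existing ids. It is an illustration under assumptions: the function name, the extra ids 215 and 12347, and the uniform choice of keys, start time and duration are all hypothetical simplifications.

```python
import random
from datetime import datetime, timedelta

def make_page_hits(n, page_ids, user_ids, device_ids, day, rng):
    """Build n fact rows whose foreign keys all point at existing dimension rows."""
    hits = []
    for _ in range(n):
        # Uniform start time within the day; a placeholder for a realistic distribution.
        start = day + timedelta(seconds=rng.randrange(24 * 60 * 60))
        hits.append({
            "page_id": rng.choice(page_ids),
            "user_id": rng.choice(user_ids),
            "device_id": rng.choice(device_ids),
            "start_access_time": start.strftime("%Y-%m-%d %H:%M:%S"),
            "access_duration_mins": round(rng.uniform(0.5, 60.0), 2),
        })
    return hits

rng = random.Random(7)
page_hits = make_page_hits(
    100,
    page_ids=[214, 215],        # 215 is a made-up second page
    user_ids=[12346, 12347],    # 12347 is a made-up second user
    device_ids=[1, 2, 3],
    day=datetime(2016, 9, 20),
    rng=rng,
)
print(page_hits[0])
```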

Hits per time of day

Normal Distribution

my_list = numpy.random.normal(mean, std_dev, samples)

numpy.random.normal(20, 0.5, 5000)

mean = 20 hours

std_dev = 0.5 hours

samples = 5000

see also random.normalvariate(mean, std_dev)
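The same sampling can be done with the standard library alone. This sketch uses random.normalvariate with the parameters quoted above (mean 20 hours, standard deviation 0.5 hours, 5000 samples). The function name is hypothetical, and wrapping the result into the range 0 to 24 is an added assumption so that a draw past midnight still maps to a valid hour.

```python
import random

def sample_access_hours(mean, std_dev, samples, rng):
    """Draw access times in hours from a normal distribution, wrapped into [0, 24)."""
    return [rng.normalvariate(mean, std_dev) % 24 for _ in range(samples)]

rng = random.Random(0)
hours = sample_access_hours(20, 0.5, 5000, rng)

average = sum(hours) / len(hours)
# Roughly two thirds of a normal sample falls within one standard deviation.
share_within_one_sd = sum(19.5 <= h <= 20.5 for h in hours) / len(hours)
print(round(average, 2), round(share_within_one_sd, 2))
```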

hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]
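A list of 24 relative weights like this one can drive the hour of each simulated hit. The sketch below treats entry i as the weight for hour i and samples hours with random.choices. The decimal points in the weights are reconstructed from the slide text and should be read as assumptions, as should the function name.

```python
import random
from collections import Counter

# Relative weight of each hour 0..23; decimal placement is a reconstruction.
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]

def sample_hit_hours(weights_by_hour, n, rng):
    """Sample n hours of the day, each with probability proportional to its weight."""
    return rng.choices(range(24), weights=weights_by_hour, k=n)

rng = random.Random(1)
hour_counts = Counter(sample_hit_hours(hits_per_hour_peak, 20_000, rng))
for hour in range(24):
    print(hour, hour_counts[hour])
```

The weights do not need to sum to 1 or 100, because random.choices normalises them internally.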

Page Hits per Hour 2013-2015

[Chart: page hits for each hour of the day (0-23), shown for 2013, 2014 and 2015; y-axis from 0 to 7000]

Device User Page

• Use get_random_key_according_to_distribution()

Device Hits per Year

[Chart: hits by device (Tablet, Mobile Device, Computer) for 2013, 2014 and 2015; y-axis from 0 to 100]

Star Schema

Fact Table: page_hit

Dimension Tables: page, device, users

• johnstinson99@gmail.com








  • Rich Relational Data from Thin-Air
  • Why Simulate Data
  • Device Type over Time
  • Usage by Time of Day
  • Tables
  • Star Schema
  • Device Table
  • Star Schema (2)
  • Page Table
  • Chrome Browser History
  • Regular Expression
  • Closer Look at Browser History
  • Slide 13
  • Read in SQLite dump as dictionary
  • Star Schema (3)
  • Users table
  • Surname
  • Distribute Users among Regions
  • Find Random Key given List of Key Probabilities
  • Generate a User
  • Star Schema (4)
  • Fact Table
  • Hits per time of day
  • Normal Distribution
  • numpynormal(20 05 5000)
  • Slide 26
  • Slide 27
  • Page Hits per Hour 2013-2015
  • Device User Page
  • Device Hits per Year
  • Star Schema (5)
  • Slide 32
Page 15: Rich relational data from thin air   john stinson

Users table

users
user_id (PK)     12346
first_name       Regen
last_name        Schattschneid
date_of_birth    1963-11-22
region           South East England
gender           m
email_address    regenschattschneid@coelurosauria.com

Page 16: Rich relational data from thin air   john stinson

Surname

import random

last_name = random.choice(all_last_names_list)
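A minimal runnable sketch of the surname step. The slide does not show the contents of all_last_names_list, so the five names below are placeholders, and the get_random_last_name wrapper and its rng parameter are assumptions added to make the draw repeatable.

```python
import random

# Placeholder data: the real all_last_names_list is not shown in the talk.
all_last_names_list = ["Smith", "Jones", "Taylor", "Brown", "Williams"]


def get_random_last_name(rng=random):
    """Return one surname; every entry in the list is equally likely."""
    return rng.choice(all_last_names_list)


rng = random.Random(42)  # seeded generator so repeated runs give the same names
sample = [get_random_last_name(rng) for _ in range(5)]
print(sample)
```

A uniform draw like this ignores how common each surname really is; a weighted draw would need a list of (name, probability) pairs.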

Page 17: Rich relational data from thin air   john stinson

Distribute Users among Regions

region_distribution = [
    ('NORTH EAST ENGLAND', 0.14),
    ('THE MIDLANDS', 0.13),
    ('NORTH WEST ENGLAND', 0.12),
    ('EAST ANGLIA', 0.11),
    ('SOUTH CENTRAL ENGLAND', 0.1),
    ('SOUTH EAST ENGLAND', 0.09),
    ('SCOTLAND', 0.08),
    ('LONDON', 0.08),
    ('SOUTH WEST ENGLAND', 0.08),
    ('WALES', 0.04),
    ('NORTHERN IRELAND', 0.03),
]

my_region = get_random_key(region_distribution)

Page 18: Rich relational data from thin air   john stinson

Find Random Key given List of Key Probabilities

def get_random_key(list_of_distribution_tuples):
    rnd = random.random()
    cumulative_probability = 0.0
    for (key, probability) in list_of_distribution_tuples:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key

my_region = get_random_key(region_distribution)

gender_distribution = [('f', 0.51), ('m', 0.49)]
my_gender = get_random_key(gender_distribution)
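The cumulative-probability lookup can be checked empirically by drawing many keys and counting them. The sketch below is one way to do that; the rng parameter and the final fallback return are assumptions that are not on the slide (the fallback covers the case where floating-point rounding leaves the probabilities summing to slightly less than 1).

```python
import random
from collections import Counter


def get_random_key(list_of_distribution_tuples, rng=random):
    """Pick a key with the probability given in its (key, probability) tuple."""
    rnd = rng.random()
    cumulative_probability = 0.0
    for key, probability in list_of_distribution_tuples:
        cumulative_probability += probability
        if rnd <= cumulative_probability:
            return key
    # Assumed safeguard: if rounding leaves the total just under 1.0,
    # return the last key instead of falling through with None.
    return list_of_distribution_tuples[-1][0]


gender_distribution = [("f", 0.51), ("m", 0.49)]

rng = random.Random(0)  # seeded so the check is repeatable
draws = 10_000
counts = Counter(get_random_key(gender_distribution, rng) for _ in range(draws))
share_f = counts["f"] / draws
print(counts, share_f)
```

The standard library's random.choices(keys, weights=...) performs the same kind of weighted draw and may be a simpler choice when a hand-written loop is not needed.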

Page 19: Rich relational data from thin air   john stinson

Generate a User

def get_random_user():
    gender = get_random_gender()
    first_name = get_random_first_name(gender)
    last_name = random.choice(all_last_names_list)
    email = get_random_email(first_name, last_name)
    region = get_random_region()
    return {'gender': gender, 'first_name': first_name, 'last_name': last_name, 'email': email, 'region': region}
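The slide calls several helpers whose bodies are not shown. The sketch below fills them in with small placeholder lookups so the whole step runs end to end; every list, the email format and the rng parameter are assumptions for illustration, not the talk's actual data or code.

```python
import random

# Placeholder lookup data (hypothetical; the talk's real lists are not shown).
FIRST_NAMES = {"f": ["Alice", "Priya"], "m": ["Regen", "Tom"]}
LAST_NAMES = ["Schattschneid", "Smith", "Jones"]
REGIONS = [("LONDON", 0.5), ("WALES", 0.5)]
EMAIL_DOMAINS = ["example.com", "example.org"]


def get_random_user(rng):
    """Build one user record as a dictionary of column name to value."""
    gender = rng.choices(["f", "m"], weights=[0.51, 0.49])[0]
    # The first name depends on gender, so the two columns stay consistent.
    first_name = rng.choice(FIRST_NAMES[gender])
    last_name = rng.choice(LAST_NAMES)
    # Assumed email format: first.last@domain, lower-cased.
    email = f"{first_name}.{last_name}@{rng.choice(EMAIL_DOMAINS)}".lower()
    region = rng.choices(
        [name for name, _ in REGIONS],
        weights=[weight for _, weight in REGIONS],
    )[0]
    return {
        "gender": gender,
        "first_name": first_name,
        "last_name": last_name,
        "email": email,
        "region": region,
    }


rng = random.Random(1)
# Sequential user_id values act as the primary key of the users table.
users = [dict(user_id=i, **get_random_user(rng)) for i in range(1, 4)]
for user in users:
    print(user)
```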

Page 20: Rich relational data from thin air   john stinson

Star Schema

[Diagram: the fact table page_hit at the centre, linked to the dimension tables page, device and users]

Page 21: Rich relational data from thin air   john stinson

Fact Table

page_hit
page_id (FK)            214
user_id (FK)            12346
device_id (FK)          1
start_access_time       2016-09-20 00:48:00
access_duration_mins    5791

PK
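To show how the fact table ties the dimension tables together, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The column types, the example URL and the duration value of 5.0 are assumptions chosen for illustration; the key values (214, 12346, 1) are the ones on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing written to disk
conn.executescript("""
CREATE TABLE device (device_id INTEGER PRIMARY KEY, device_name TEXT);
CREATE TABLE page   (page_id INTEGER PRIMARY KEY, url TEXT, platform TEXT);
CREATE TABLE users  (user_id INTEGER PRIMARY KEY, first_name TEXT,
                     last_name TEXT, region TEXT);
-- Fact table: one row per page hit, pointing at each dimension table.
CREATE TABLE page_hit (
    page_id              INTEGER REFERENCES page(page_id),
    user_id              INTEGER REFERENCES users(user_id),
    device_id            INTEGER REFERENCES device(device_id),
    start_access_time    TEXT,
    access_duration_mins REAL
);
""")

conn.execute("INSERT INTO device VALUES (1, 'Tablet')")
# Placeholder URL: the real page URL is truncated on the slide.
conn.execute("INSERT INTO page VALUES (214, 'http://example.com/iplayer', 'iplayer')")
conn.execute(
    "INSERT INTO users VALUES (12346, 'Regen', 'Schattschneid', 'South East England')"
)
conn.execute(
    "INSERT INTO page_hit VALUES (?, ?, ?, ?, ?)",
    (214, 12346, 1, "2016-09-20 00:48:00", 5.0),  # 5.0 is an illustrative duration
)

# Join the fact row back to its three dimensions.
rows = conn.execute("""
    SELECT d.device_name, u.region, p.platform, h.start_access_time
    FROM page_hit AS h
    JOIN device AS d ON d.device_id = h.device_id
    JOIN users  AS u ON u.user_id = h.user_id
    JOIN page   AS p ON p.page_id = h.page_id
""").fetchall()
print(rows)
```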

Page 22: Rich relational data from thin air   john stinson

Hits per time of day

Page 23: Rich relational data from thin air   john stinson

Normal Distribution

my_list = numpy.random.normal(mean, std_dev, samples)

Page 24: Rich relational data from thin air   john stinson

numpy.random.normal(20, 0.5, 5000)

mean = 20 hours

std_dev = 0.5 hours

samples = 5000

See also random.normalvariate(mean, std_dev)
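The same parameters can be tried with the standard-library call mentioned above, which avoids a NumPy dependency. This is a sketch only: the modulo-24 wrap and the per-hour count are assumptions added to turn the raw samples into something resembling a hits-per-hour table.

```python
import random
from collections import Counter

rng = random.Random(2016)  # seeded so the sample is repeatable

mean = 20.0     # peak at 20:00
std_dev = 0.5   # half an hour either side
samples = 5000

# Draw access times in hours; wrap with % 24 so a wide std_dev cannot
# produce a time outside the 0-24 hour day.
hit_times = [rng.normalvariate(mean, std_dev) % 24 for _ in range(samples)]

# Bucket each time into its hour of the day (0-23).
hits_per_hour = Counter(int(t) for t in hit_times)
for hour, count in sorted(hits_per_hour.items()):
    print(f"{hour:02d}:00  {count}")
```

With a standard deviation this small, nearly all hits land in the 19:00 and 20:00 buckets, which suggests why a single normal curve is too narrow to model a whole day on its own.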

Page 25: Rich relational data from thin air   john stinson

hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8, 4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]
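One way to use a 24-entry weight list like this is to treat each entry as the relative weight of its hour and draw the hour with random.choices. The sketch below assumes that reading; the get_random_start_time helper, the uniform spread within the hour and the example date are illustrative choices, not taken from the talk.

```python
import random
from datetime import datetime, timedelta

# Relative weight of each hour of the day, index 0 = midnight to 01:00.
hits_per_hour_peak = [3.5, 2.5, 1, 0.5, 0.4, 0.4, 1.5, 2, 3, 3.3, 3.6, 3.8,
                      4, 4, 4, 5, 6, 7, 8, 9, 9, 10, 9.5, 5.5]


def get_random_start_time(day, rng):
    """Return a datetime on `day`, with the hour drawn by weight."""
    hour = rng.choices(range(24), weights=hits_per_hour_peak)[0]
    # Assumption: hits are spread evenly within the chosen hour.
    seconds = rng.randrange(3600)
    return day + timedelta(hours=hour, seconds=seconds)


rng = random.Random(7)
day = datetime(2016, 9, 20)
times = [get_random_start_time(day, rng) for _ in range(2000)]

# Count how many hits fall in the evening peak (18:00 to 22:59).
evening = sum(1 for t in times if 18 <= t.hour <= 22)
print(evening, "of", len(times), "hits fall between 18:00 and 23:00")
```

The weights do not need to sum to 1, because random.choices normalises them internally.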

Page 26: Rich relational data from thin air   john stinson

Page Hits per Hour 2013-2015

[Chart: page hits for each hour of the day (0 to 23), shown for 2013, 2014 and 2015; vertical axis from 0 to 7000]

Page 27: Rich relational data from thin air   john stinson

Device User Page

• Use get_random_key_according_to_distribution()
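The slide names the function but does not show its signature or the distributions passed to it. The sketch below assumes it takes a list of (key, probability) tuples, and uses made-up device and page weights, to show how the three foreign keys of a page_hit row might be filled in.

```python
import random
from collections import Counter


def get_random_key_according_to_distribution(distribution, rng):
    """Assumed signature: pick a key from (key, probability) tuples by weight."""
    keys = [key for key, _ in distribution]
    weights = [weight for _, weight in distribution]
    return rng.choices(keys, weights=weights)[0]


# Hypothetical weights: the talk does not give these numbers.
device_distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]   # device_id -> share
page_distribution = [(214, 0.7), (215, 0.3)]           # page_id -> share
user_ids = [12345, 12346, 12347]                        # existing users


def get_random_page_hit(rng):
    """Build the foreign-key part of one page_hit fact row."""
    return {
        "page_id": get_random_key_according_to_distribution(page_distribution, rng),
        "user_id": rng.choice(user_ids),  # users picked uniformly in this sketch
        "device_id": get_random_key_according_to_distribution(device_distribution, rng),
    }


rng = random.Random(3)
page_hits = [get_random_page_hit(rng) for _ in range(1000)]
device_counts = Counter(hit["device_id"] for hit in page_hits)
print(device_counts)
```

Because every key is drawn from an existing dimension table, each generated fact row is guaranteed to satisfy its foreign-key constraints.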

Page 28: Rich relational data from thin air   john stinson

Device Hits per Year

[Chart: hits by device type (Tablet, Mobile Device, Computer) for 2013, 2014 and 2015; vertical axis from 0 to 100]

Page 29: Rich relational data from thin air   john stinson

Star Schema

[Diagram: the fact table page_hit at the centre, linked to the dimension tables page, device and users]

Page 30: Rich relational data from thin air   john stinson

• johnstinson99@gmail.com
