GoDataDriven
Proudly part of the Xebia Group
Building a Big Data DWH
Friso van Vollenhoven, CTO
Data Warehousing on Hadoop
“In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.”
-- Wikipedia
ETL
How to:
• Add a column to the facts table?
• Change the granularity of dates from day to hour?
• Add a dimension based on some aggregation of facts?
Schemas are designed with questions in mind.
Changing them requires redoing the ETL.
Schemas are designed with questions in mind.
Changing them requires redoing the ETL.
Push things to the facts level.
Keep all source data available at all times.
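One way to read this advice: if the raw events keep their full timestamps, changing the date granularity from day to hour becomes a re-aggregation rather than a new ETL run. A minimal sketch, assuming hypothetical event records with a `ts` field (the record layout and function name are illustrative, not from the talk):

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw facts: one record per event, full timestamp retained.
events = [
    {"ts": datetime(2013, 2, 22, 9, 15), "page": "/home"},
    {"ts": datetime(2013, 2, 22, 9, 40), "page": "/home"},
    {"ts": datetime(2013, 2, 22, 17, 5), "page": "/cart"},
    {"ts": datetime(2013, 2, 23, 8, 0), "page": "/home"},
]


def count_by(events, fmt):
    """Count events per time bucket; fmt is a strftime pattern defining the bucket."""
    return Counter(e["ts"].strftime(fmt) for e in events)


# Same source data, two granularities; no schema change needed.
per_day = count_by(events, "%Y-%m-%d")
per_hour = count_by(events, "%Y-%m-%d %H")
```

Had only the daily counts been stored, the hourly view could not be recovered from them.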
And now?
• MPP databases?
• Faster / better / more SAN?
• (RAC?)
distributed storage
distributed processing
metadata + query engine
EXTRACT
TRANSFORM
LOAD
• No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix Python standard library with any Java libs
• Flexible scheduling with dependencies
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control
Deployment
git push jenkins master
• Scheduling
• Simple deployment of ETL code
• Scalable
• Developer friendly
'februari-22 2013'
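The value above uses the Dutch month name "februari" (February). A minimal sketch of parsing such strings without depending on the system locale, assuming the layout is always `<month>-<day> <year>`; the helper name and that layout assumption are illustrative, not taken from the talk:

```python
import re
from datetime import date

# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

_PATTERN = re.compile(r"^([a-z]+)-(\d{1,2}) (\d{4})$")


def parse_dutch_date(text):
    """Parse e.g. 'februari-22 2013' into a date; raise ValueError otherwise."""
    match = _PATTERN.match(text.strip().lower())
    if not match:
        raise ValueError("unrecognised date: %r" % text)
    month_name, day, year = match.groups()
    if month_name not in DUTCH_MONTHS:
        raise ValueError("unknown month: %r" % month_name)
    return date(int(year), DUTCH_MONTHS[month_name], int(day))
```

Failing loudly on anything unexpected makes a format change visible when it happens, so it is not silently loaded as bad data.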
A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
þ
þ
TSV == thorn separated values?
þ == 0xFE
or -2, in Hive:
CREATE TABLE browsers (
  browser_id STRING,
  browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
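The link between the two notations can be checked directly: þ is the single byte 0xFE in Latin-1, and 0xFE read as a signed 8-bit value is -2. A small Python sketch; the sample record is made up for illustration:

```python
# þ encoded as Latin-1 is the single byte 0xFE.
THORN = "þ".encode("latin-1")

# The same byte interpreted as a signed 8-bit integer.
signed = int.from_bytes(THORN, "big", signed=True)

# Hypothetical thorn-separated record matching the browsers table layout.
line = b"42\xfeFirefox"
browser_id, browser = (field.decode("latin-1") for field in line.split(THORN))
```

Splitting on the raw byte avoids decoding issues; in UTF-8, þ would be two bytes and 0xFE alone is not valid.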
• The format will change
• Faulty deliveries will occur
• Your parser will break
• Records will be mistakenly produced (over-logging)
• Other people test in production too (and you get the data from it)
• Etc., etc.
• Simple deployment of ETL code
• Scheduling
• Scalable
• Independent jobs
• Fixable data store
• Incremental where possible
• Metrics
Independent jobs
source (external) → staging (HDFS): HDFS upload + move in place
staging (HDFS) → hive-staging (HDFS): MapReduce + HDFS move
hive-staging (HDFS) → Hive: Hive map external table + SELECT INTO
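The "upload + move in place" step can be illustrated with a local-filesystem stand-in: write to a temporary name, then rename into the final location, so a downstream job never sees a half-written file. This is a sketch of the pattern only; it uses the local filesystem, not the HDFS API, and the function name is hypothetical:

```python
import os
import tempfile


def upload_then_move(data: bytes, final_path: str) -> None:
    """Write to a temporary file in the target directory, then rename it into place."""
    target_dir = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Rename within one filesystem; atomic on POSIX systems.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file so it is never mistaken for a delivery.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because each stage only picks up files that have been moved into place, the jobs on either side can run on their own schedules.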
Out of order jobs
• At any point, you don’t really know what ‘made it’ to Hive
• Will happen anyway, because some days the data delivery is going to be three hours late
• Or you get half in the morning and the other half later in the day
• It really depends on what you do with the data
• This is where metrics + fixable data store help...
Fixable data store
• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives, drop the partition and re-insert
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything transactional, repair afterwards
• Use metrics to determine whether data is fit for purpose
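The drop-and-re-insert repair could be expressed as a pair of HiveQL statements generated per partition. A minimal sketch, assuming a hypothetical staging table that carries the partition column as a regular column; the table, column, and function names are illustrative, and values are assumed trusted (no escaping is done):

```python
def repair_partition_statements(table, staging_table, partition_col, value, columns):
    """Return HiveQL statements that drop one partition and re-insert it from staging."""
    cols = ", ".join(columns)
    spec = "{0}='{1}'".format(partition_col, value)
    return [
        # Remove whatever made it into the partition so far.
        "ALTER TABLE {0} DROP IF EXISTS PARTITION ({1})".format(table, spec),
        # Rebuild the partition from the staged data for the same key.
        "INSERT INTO TABLE {0} PARTITION ({1}) SELECT {2} FROM {3} WHERE {4}".format(
            table, spec, cols, staging_table, spec
        ),
    ]


statements = repair_partition_statements(
    "clicks", "clicks_staging", "dt", "2013-02-22", ["browser_id", "browser"]
)
```

Since the repair is scoped to one partition, it can be rerun whenever a late or corrected delivery arrives without touching other days.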
Metrics
Metrics service
• Job ran, so many units processed, took so much time
• e.g. 10GB imported, took 1 hr
• e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
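A rough sketch of how a job might report "units processed" and "time taken": a context manager that times the job and appends a record to a store. The in-memory `METRICS` list is a stand-in for a metrics service, and all names here are hypothetical:

```python
import time
from contextlib import contextmanager

METRICS = []  # stand-in for a metrics service; a real one would be remote


@contextmanager
def job_metrics(job_name):
    """Record how many units a job processed and how long it took."""
    record = {"job": job_name, "units": 0}
    start = time.monotonic()
    try:
        yield record
    finally:
        # Recorded even if the job fails, so partial runs are visible too.
        record["seconds"] = time.monotonic() - start
        METRICS.append(record)


with job_metrics("transform") as m:
    for _ in range(1000):  # pretend to transform 1000 records
        m["units"] += 1
```

Comparing these records across stages (for example, records transformed versus records inserted into a partition) is one way to judge whether a day's data is complete.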
GoDataDriven
We’re hiring / Questions? / Thank you!
Friso van Vollenhoven, CTO