“Big” numbers for GP today
• 70K/day – Query Rate
• 6.5 PB – Dataset Size
• +100 GB/s – Analysis Rate
• +3 GB/s – Net Loading Rate
• 100,000/s – Transaction Rate
• 56 TB/kW, 1.6 GB/s/kW – Power Rate
• 100s – Number of Data/Compute nodes
04/18/23 2
Things I’ve Heard
• Tiered computing
– Organizational / Political / Geographic boundaries require it
• Metadata computing for HEP
– “10TB sounds small but it’s not easy”
• Processing for Radio Astronomy, HEP
– Data intensive computing
– Requires an efficient pipeline from raw to consumables
Thoughts
• A lot of plumbing! Moving data around, pipeline processing
– Core engine should do this so the plumbing isn’t done over and over
• Need for specialized access methods and storage classes
• “Computing in data” is key to success
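One way to read "computing in data" is that each node reduces its own slice and only small partial results cross the network. The sketch below illustrates that idea for a global mean; the `Node` class, its method names, and the sample values are hypothetical and are not taken from the GP engine.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Node:
    """Hypothetical data node holding a local slice of measurements."""
    name: str
    values: List[float]

    def partial_sum_count(self) -> Tuple[float, int]:
        # Runs where the data lives; only two numbers leave the node.
        return sum(self.values), len(self.values)


def global_mean(nodes: Iterable[Node]) -> float:
    """Combine per-node (sum, count) partials into one global mean."""
    total, count = 0.0, 0
    for node in nodes:
        s, n = node.partial_sum_count()
        total += s
        count += n
    if count == 0:
        raise ValueError("no data on any node")
    return total / count


if __name__ == "__main__":
    cluster = [
        Node("node-a", [1.0, 2.0, 3.0]),
        Node("node-b", [4.0, 5.0]),
        Node("node-c", []),  # an empty node contributes (0.0, 0)
    ]
    print(global_mean(cluster))
```

The coordinator receives one (sum, count) pair per node regardless of how many rows each node stores, which is the point of pushing the computation to the data.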
GP Basic Features
• Access Methods
– Compression, Column Store, Heap Store, External Tables, Indexes (GiST, GIN, R-tree, Bitmap, B-Tree, …)
– Network Ingest / Export directly into parallel pipeline
– Logical Partitioning by Range, List
• Parallel Programming Languages
– SQL 2003 with Analytics
– Map Reduce in Perl, Python, C, SQL, …
– PL/R, Python, Perl, C, pgSQL, SQL, …
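As a rough illustration of logical partitioning by range, the sketch below routes rows to partitions by key and then scans only the partitions that can contain a requested key range. It is plain Python, not GP syntax; the boundary values and the function names `partition_for`, `load`, and `scan_range` are assumptions made for the example.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]

# Hypothetical boundaries: partition 0 holds keys below 100, partition 1
# holds [100, 200), partition 2 holds [200, 300), partition 3 holds >= 300.
BOUNDS = [100, 200, 300]


def partition_for(key: int) -> int:
    """Return the index of the range partition that owns this key."""
    return bisect_right(BOUNDS, key)


def load(rows: Iterable[Row]) -> Dict[int, List[Row]]:
    """Route each row to its partition."""
    parts: Dict[int, List[Row]] = {i: [] for i in range(len(BOUNDS) + 1)}
    for key, payload in rows:
        parts[partition_for(key)].append((key, payload))
    return parts


def scan_range(parts: Dict[int, List[Row]], lo: int, hi: int) -> List[Row]:
    """Return rows with lo <= key < hi, skipping partitions outside the range."""
    first = partition_for(lo)
    last = partition_for(hi - 1)
    hits: List[Row] = []
    for i in range(first, last + 1):
        hits.extend(r for r in parts[i] if lo <= r[0] < hi)
    return hits


if __name__ == "__main__":
    parts = load([(50, "a"), (150, "b"), (220, "c"), (310, "d")])
    print(scan_range(parts, 150, 250))
```

A scan for keys 150 to 249 touches only partitions 1 and 2 here; partitions 0 and 3 are never read, which is the benefit partitioning is meant to provide.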
From Enterprise Data Clouds
• Elastic / adaptive infrastructure for data warehousing and analytics
– IT Operations deploy pools of low-cost commodity infrastructure
• Physical servers, virtual infrastructure, or onramp to public cloud
– DBAs and Analysts provision sandboxes and warehouses in minutes
• Assemble the data they need (common, private, etc) for agile analytics
Proprietary & Confidential
[Figure: diagram labels: DBA, Analyst; Consumer Division, Packaged Goods, Finance; Infrastructure, Warehouses, IT Operations]
Use Case: Big Telco Data Mart Consolidation
Goals:
• Reduce maintenance and support costs from proliferation of data mart platforms
• Reduce risks and exposure due to data in shadow IT systems
• Break down silo walls: provide a unified way to find and access all data
Approach:
• Embrace data – encourage ‘physical consolidation’ in advance of data model unification
• Provide ‘self serve’ model to bring shadow IT into the light
• Allow unified data access and pragmatic ‘logical’ data model unification incrementally
[Figure: Data Sources; US-West, 100 nodes]
Use Case: Big Ad Network Project Sandboxes
Goals:
• Remove IT barriers to analyst productivity and value creation
• Dramatically reduce IT resource constraints and delays – i.e. realize ideas sooner
• Combine centralized ‘EDW’ data with freshly discovered feeds and other useful sources
Approach:
• Self-serve creation of project warehouses in minutes – and elastically expand as needed
• Load new data feeds without requiring formal modeling
• Bring together any data within the EDC – even if globally distributed – and analyze
[Figure: US-East, 100 nodes; Analyst’s New Warehouse; Analyst’s Private Data Feed; EDC; Self-Serve Dashboard]
GP is Software – Develop Now
• Download at:
– Gpn.greenplum.com
– Get the VMware image or use it on OSX, Linux, Solaris
Think Big. Think Fast.