Add A Billion Row Data Warehouse To Your App
description
Transcript of Add A Billion Row Data Warehouse To Your App
Add A Billion Row Data Warehouse To Your App
with Redshift, sql and duct tape
James Crisp, Tech Principal @ Getup
ContextGetUp! is an independent movement to build a progressive Australia and bring participation back into our democracy.
Lots of online campaigns, field actions, social media. Supported by small donations. 600K+ members. Budget, medicare, uni fees, barrier reef, forests etc.
Big Rails App• CMS + Petitions, emailing MPs,
donations etc• Email blasting & segmenting• Back office & mini-crm• 2 X [3 app, 2 worker, 1 DB servers], Au
and Sg
Data Warehouse… why??
• Reporting & exports• Experimental data science• Stop locking up transactional DB!!• More data sources (logs, CRM, ..) =>
customer• Different schema & faster queries
Options• Read-only replica of transactional DB• Mongo / Cassandra / ..• Hadoop, Pig, Hive• Elastic search• BIG Sql, eg Redshift
Why BIG Sql?• Team skills: tech & data scientists• Easy integration from SQL DB• Good hosted options• Fast performance, column based• Sets and aggregations• Can do JSON for less structured data
Why Redshift?• Fully hosted & managed multi-node• Fast & Column based, semi-compressed• Relatively cheap and easy to try• Good import options• Massively expandable• (Security & backup options)
What is Redshift… really?
• Heavily modified fork of PostgreSQL 8
• Specialised data storage & query engine
• Can use normal ODBC/JDBC/Postgres clients to connect
string connString = "Driver={PostgreSQL Unicode};" + String.Format("Server={0};Database={1};" + "UID={2};PWD={3};Port={4};SSL=true;Sslmode=Require", server, DBName, masterUsername, masterUserPassword, port);
OdbcConnection conn = new OdbcConnection(connString);conn.Open();
OdbcDataAdapter da = new OdbcDataAdapter(sql, conn);da.Fill(ds);dt = ds.Tables[0];foreach (DataRow row in dt.Rows){ // Do something useful}
$DBConnectionString = "Driver={PostgreSQL UNICODE}:Server=$MyServer;Port=$MyPort;Database=$MyDB;Uid=$MyUid;Pwd=$MyPass;"
$DBConn = New-Object System.Data.Odbc.OdbcConnection;$DBConn.ConnectionString = $DBConnectionString;$DBConn.Open();
$DBCmd = $DBConn.CreateCommand();$DBCmd.CommandText = "SELECT * FROM mytable;";$DBCmd.ExecuteReader();$DBConn.Close();
psql.exe -h $DBSERVER -U $DBUSER -d $DBName -f script.sql
How much does it cost?• Min cluster size 2, leader is free• 2 X 2TB HDD, 15G RAM: Syd $2.50/h, US
$1.70/h (reserved 1yr Syd $13.2K, US $8.8K)
• Up to 2.56TB flash or 16TB HDD and 244GB RAM per node. Cluster up to 1.6 PB.
Data Sources so far…
Transactional DBApplication request logs
Hooking up DB Data
RedshiftMySQL S3 LOADCSVs
Map & Dump CSVs
DB Server
Fire Drop/Create tables, Load data
Hooking up Request Logs
RedshiftS3
Map & LoadJSON
Upload JSON logs
App ServersFire data load
Demos
• Scripts & SQL for hooking up• AWS Redshift console• Connect & query
What have we used it for?
• Faster data science / reporting without locking up transactional DB
• Combining sharded tables• Request logs (browser agent, params, ..)• Marking/tracking segments +
debugging info
Other uses
• Data exploration with tools like Tableau• BI tools• Denormalised data• Dump in more data for single view of
user – FB info, CRM, etc
Conclusion
• Easy to set up and use (for Win/.NET too)
• Super fast• Reasonable price• Met our needs well so far
We are hiring atm!