Add A Billion Row Data Warehouse To Your App

Post on 23-Feb-2016

17 views 0 download

description

Add A Billion Row Data Warehouse To Your App. with Redshift, sql and duct tape. James Crisp, Tech Principal @ Getup. Context. GetUp ! is an independent movement to build a progressive Australia and bring participation back into our democracy . - PowerPoint PPT Presentation

Transcript of Add A Billion Row Data Warehouse To Your App

Add A Billion Row Data Warehouse To Your App

with Redshift, sql and duct tape

James Crisp, Tech Principal @ Getup

ContextGetUp! is an independent movement to build a progressive Australia and bring participation back into our democracy.

Lots of online campaigns, field actions, social media. Supported by small donations. 600K+ members. Budget, medicare, uni fees, barrier reef, forests etc.

Big Rails App• CMS + Petitions, emailing MPs,

donations etc• Email blasting & segmenting• Back office & mini-crm• 2 X [3 app, 2 worker, 1 DB servers], Au

and Sg

Data Warehouse… why??

• Reporting & exports• Experimental data science• Stop locking up transactional DB!!• More data sources (logs, CRM, ..) =>

customer• Different schema & faster queries

Options• Read-only replica of transactional DB• Mongo / Cassandra / ..• Hadoop, Pig, Hive• Elastic search• BIG Sql, eg Redshift

Why BIG Sql?• Team skills: tech & data scientists• Easy integration from SQL DB• Good hosted options• Fast performance, column based• Sets and aggregations• Can do JSON for less structured data

Why Redshift?• Fully hosted & managed multi-node• Fast & Column based, semi-compressed• Relatively cheap and easy to try• Good import options• Massively expandable• (Security & backup options)

What is Redshift… really?

• Heavily modified fork of PostgreSQL 8

• Specialised data storage & query engine

• Can use normal ODBC/JDBC/Postgres clients to connect

string connString = "Driver={PostgreSQL Unicode};" + String.Format("Server={0};Database={1};" + "UID={2};PWD={3};Port={4};SSL=true;Sslmode=Require", server, DBName, masterUsername, masterUserPassword, port);

OdbcConnection conn = new OdbcConnection(connString);conn.Open();

OdbcDataAdapter da = new OdbcDataAdapter(sql, conn);da.Fill(ds);dt = ds.Tables[0];foreach (DataRow row in dt.Rows){ // Do something useful}

$DBConnectionString = "Driver={PostgreSQL UNICODE}:Server=$MyServer;Port=$MyPort;Database=$MyDB;Uid=$MyUid;Pwd=$MyPass;"

$DBConn = New-Object System.Data.Odbc.OdbcConnection;$DBConn.ConnectionString = $DBConnectionString;$DBConn.Open();

$DBCmd = $DBConn.CreateCommand();$DBCmd.CommandText = "SELECT * FROM mytable;";$DBCmd.ExecuteReader();$DBConn.Close();

psql.exe -h $DBSERVER -U $DBUSER -d $DBName -f script.sql

How much does it cost?• Min cluster size 2, leader is free• 2 X 2TB HDD, 15G RAM: Syd $2.50/h, US

$1.70/h (reserved 1yr Syd $13.2K, US $8.8K)

• Up to 2.56TB flash or 16TB HDD and 244GB RAM per node. Cluster up to 1.6 PB.

Data Sources so far…

Transactional DBApplication request logs

Hooking up DB Data

RedshiftMySQL S3 LOADCSVs

Map & Dump CSVs

DB Server

Fire Drop/Create tables, Load data

Hooking up Request Logs

RedshiftS3

Map & LoadJSON

Upload JSON logs

App ServersFire data load

Demos

• Scripts & SQL for hooking up• AWS Redshift console• Connect & query

What have we used it for?

• Faster data science / reporting without locking up transactional DB

• Combining sharded tables• Request logs (browser agent, params, ..)• Marking/tracking segments +

debugging info

Other uses

• Data exploration with tools like Tableau• BI tools• Denormalised data• Dump in more data for single view of

user – FB info, CRM, etc

Conclusion

• Easy to set up and use (for Win/.NET too)

• Super fast• Reasonable price• Met our needs well so far

We are hiring atm!