Big data cloud architecture

Big Data in the Cloud

S&P Capital IQ

S&P Capital IQ combines two of our strongest brands: S&P, with its long history and experience in the financial markets, and Capital IQ, which is known among professionals globally for its comprehensive company and financial information and powerful analytical tools.

Agenda

• Creation of Excel Plug-in with Global Data, Global Sales and US based servers

• High Performance data gets for Big Historical Time Series Data.

• Q&A

S&P Capital IQ Excel Plug-in

• Excel Plug-in provides thousands of data points on demand

• Allows customers anywhere in the world to use our data assets on their desktops on demand

• It needs to be a fast user experience everywhere in the world

Global Customers US Data Center

From        To           Average Response Time (ms)
London      New Jersey   400
New York    New Jersey   30
Melbourne   New Jersey   800

Response times rounded.

Global Customers Global Data Center

From        To           Average Response Time (ms)
London      Ireland      400 to 40
New York    New Jersey   30 to 30
Melbourne   Singapore    800 to 60

Response times rounded.
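The effect of the regional data centers can be checked with a few lines of arithmetic. The sketch below uses only the rounded response times from the two tables; the function names are hypothetical and it is not the routing logic used in the product.

```python
# Average response times in milliseconds, taken from the two tables above.
# Only the routes that were actually reported are listed.
LATENCY_MS = {
    "London": {"New Jersey": 400, "Ireland": 40},
    "New York": {"New Jersey": 30},
    "Melbourne": {"New Jersey": 800, "Singapore": 60},
}


def nearest_data_center(customer_location):
    """Return (data_center, latency_ms) with the lowest reported latency."""
    routes = LATENCY_MS[customer_location]
    best = min(routes, key=routes.get)
    return best, routes[best]


def speedup(customer_location, baseline="New Jersey"):
    """How many times faster the best route is than the US-only baseline."""
    routes = LATENCY_MS[customer_location]
    _, best_ms = nearest_data_center(customer_location)
    return routes[baseline] / best_ms


for city in LATENCY_MS:
    dc, ms = nearest_data_center(city)
    print(f"{city}: {dc} at {ms} ms ({speedup(city):.1f}x vs New Jersey)")
```

On these figures London improves tenfold, Melbourne by roughly thirteen times, and New York is unchanged because it was already close to the New Jersey data center.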

Cloud Architecture

[Diagram labels: New Jersey DC; HTTPS; Secure connection; Data Synchronization]



How do we make it even faster?

[Diagram labels: Router; Pre-send data; Smart Cache]

- Move the data each customer uses most to their desktop.

- Automatically get the data for the customer.

- Learn to send the right data to the customer.

Smart Cache

1. User opts into Smart Cache.
2. The system pre-sends a data package to the customer.
3. User makes a request for data.
   a. Smart Cache checks locally first.
   b. If the data is not local, it is fetched from the cloud.
4. Smart Cache sends usage logs.
5. The pre-sent data package is altered for the customer.
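The request flow in steps 2 to 5 can be sketched as a small local-first cache. This is a minimal illustration only: the class, method names, keys and values are hypothetical, and whether a cloud result is stored locally afterwards is not stated in the slides, so the sketch does not do so.

```python
class SmartCache:
    """Local-first cache. Hypothetical names, not the product's actual API."""

    def __init__(self, fetch_from_cloud):
        self._local = {}                 # data pre-sent to the desktop
        self._fetch = fetch_from_cloud   # callable used on a local miss
        self.usage_log = []              # (key, "local" | "cloud") entries

    def load_package(self, package):
        """Steps 2 and 5: install or replace the pre-sent data package."""
        self._local = dict(package)

    def get(self, key):
        """Step 3: check locally first (a), otherwise go to the cloud (b)."""
        if key in self._local:
            self.usage_log.append((key, "local"))
            return self._local[key]
        value = self._fetch(key)
        self.usage_log.append((key, "cloud"))
        return value

    def flush_logs(self):
        """Step 4: hand the usage logs over and start a fresh log."""
        logs, self.usage_log = self.usage_log, []
        return logs


# Hypothetical data: a dict stands in for the cloud service.
cloud = {"IBM:price": 101.5, "MSFT:price": 55.2}
cache = SmartCache(cloud.__getitem__)
cache.load_package({"IBM:price": 101.5})

print(cache.get("IBM:price"))    # served from the local package
print(cache.get("MSFT:price"))   # not local, fetched from the cloud
```

The usage log records where each request was served from, which is the input the learning step needs in order to alter the next package.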

[Diagram: Router and Smart Cache, annotated with step numbers 1 to 5 (including 3a and 3b)]

Smart Caching Data

1. Collect logs from Smart Cache.
2. Collect and decrypt cloud and local usage logs.
3. Apply logs to Mahout.
4. Use customer profile.
5. Mahout comes out with an update suggestion list.
6. Customer-specific package is created.
7. Prepared package is ready for pickup by Smart Cache.
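The slides use Mahout for step 5 and do not say which Mahout algorithm was chosen. As a rough illustration of how usage logs can turn into an update suggestion list, the sketch below swaps in a simple item co-occurrence count in plain Python. The function names and the sample logs are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(usage_by_customer):
    """Count how often two data items are used by the same customer."""
    co = defaultdict(Counter)
    for items in usage_by_customer.values():
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def suggest_updates(customer, usage_by_customer, top_n=3):
    """Suggest items the customer has not used yet, ranked by co-occurrence."""
    co = build_cooccurrence(usage_by_customer)
    used = set(usage_by_customer[customer])
    scores = Counter()
    for item in used:
        for other, count in co[item].items():
            if other not in used:
                scores[other] += count
    # Sort by score (descending), then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical usage logs: customer id -> data items requested.
logs = {
    "c1": ["price", "volume", "eps"],
    "c2": ["price", "volume", "dividend"],
    "c3": ["price", "eps"],
    "c4": ["price", "volume"],
}
print(suggest_updates("c4", logs))
```

For customer c4 the suggestions are items that other customers with similar usage also request, which is the kind of list a customer-specific package could be built from.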

[Diagram: Router and Smart Cache, annotated with step numbers 1 to 7]

Smart Caching Data Lessons Learned

• The algorithm works similarly to a website matching engine for shopping.

• It differs in that the customer does not see the recommendations; they just have a faster experience.

• All data sets are used to learn but only large data sets are custom packaged for delivery

• Sometimes it is easier to just send the entire package when the data set is small enough and used by the customer.

• Don’t expect success on day 1 or day 30; the longer you learn, the more accurate it should become.

• Not a replacement for simple logic.

• The algorithm requires constant feeding and attention.

• There are cases where you can’t learn about your user, such as when they share IDs.

High Performance Data Gets


• Some data assets, due to their size, are still routed back to the US.

• Big data sets: ~10T of time series data.

• As those data assets became more popular, we needed to move the right data to the cloud.

• The data cannot be synchronized, so fast loads are required.

• Single-digit millisecond get times.


• Use Hadoop to learn which large data assets are used the most.

• Move the subset of data identified as the most used to the cloud.

• Fast loading of millions of records.

• Allow for single-digit millisecond data retrieval times.
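The first two bullets amount to counting requests per data asset and keeping the top of the list. The slides run this analysis on Hadoop; the sketch below is an in-memory stand-in for the same counting step, with a hypothetical log format and asset names.

```python
from collections import Counter


def most_used_assets(access_log, top_n):
    """Count requests per data asset and return the top_n most requested."""
    counts = Counter(asset for _user, asset in access_log)
    # Sort by count (descending), then by name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical access log: (user, asset) pairs.
access_log = [
    ("u1", "prices_daily"), ("u2", "prices_daily"), ("u3", "prices_daily"),
    ("u1", "estimates"), ("u2", "estimates"),
    ("u3", "ownership"),
]

# The assets selected here would be the subset moved to the cloud.
to_cloud = [asset for asset, _ in most_used_assets(access_log, top_n=2)]
print(to_cloud)
```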


HBase (http://hbase.apache.org/)

HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

HBase is similar to an RDBMS in that it has the concept of tables; however, the columns in an HBase table are not fixed in number or data type, and both can vary from one row to the next.

Cassandra (http://cassandra.apache.org/)

Apache Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.

Cassandra was open sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (Facebook engineer). In a lot of ways you can think of Cassandra as Dynamo 2.0 or a marriage of Dynamo and BigTable. Cassandra is in production use at Facebook but is still under heavy development.

We tried a similar POC using Cassandra with a smaller subset of data because of the hardware restrictions mentioned above. Unlike HBase and an RDBMS, there is no concept of a table. Instead we have columns, column families and keyspaces.
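The keyspace, column family and column hierarchy described here can be pictured as nested maps. The sketch below is an in-memory illustration of that shape only, not the Cassandra API; the column family name, row keys and values are hypothetical.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "time_series": {},  # column family (hypothetical name)
}


def put(column_family, row_key, column, value):
    """Store one column value; rows are created on first write."""
    keyspace[column_family].setdefault(row_key, {})[column] = value


def get(column_family, row_key, column):
    """Fetch one column value for one row key."""
    return keyspace[column_family][row_key][column]


# Row key = security id, column name = date. Rows need not share columns.
put("time_series", "SEC0001", "2012-01-03", 10.5)
put("time_series", "SEC0001", "2012-01-04", 10.7)
put("time_series", "SEC0002", "2012-01-04", 99.0)

print(get("time_series", "SEC0001", "2012-01-04"))
```

Pulling one security and one data point is a single keyed lookup in this layout, which is the access pattern measured as a "data get" in the comparison.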

http://hbase.apache.org/

http://research.google.com/archive/bigtable.html

http://cassandra.apache.org/

http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf

http://research.google.com/archive/bigtable-osdi06.pdf

http://www.allthingsdistributed.com/2008/12/eventually_consistent.html


• Data Get: time to pull 1 security and 1 data point

• Data Load: time taken to load 6 million securities

            Cassandra          HBase           Oracle
Data Get    400 microseconds   1 millisecond   5 seconds
Data Load   10 minutes         10 minutes      10 minutes
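The slides do not describe how the get times were measured. One plausible way to time a single-key get is sketched below, assuming a wall-clock median over repeated calls; the helper name is hypothetical and a Python dict stands in for the real store, so the number it prints says nothing about Cassandra, HBase or Oracle.

```python
import statistics
import time


def time_get(fetch, key, repeats=1000):
    """Median wall-clock time of fetch(key), in microseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch(key)
        samples.append((time.perf_counter() - start) * 1_000_000)
    return statistics.median(samples)


store = {"SEC0001": 10.5}  # in-memory stand-in for a data store client
median_us = time_get(store.get, "SEC0001")
print(f"median get: {median_us:.2f} microseconds")
```

Taking the median rather than the mean keeps a few slow outliers from dominating the result.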


• Virtual Oracle instances did not meet our performance needs.

• EMR needed for HBase was not cost-effective for data gets.

• HBase is difficult to implement in AWS due to the hardware requirements of Hadoop.

• Cassandra can be segmented logically for Big Data Assets with minimal to no performance degradation in AWS

Questions?
