Modern Big Data Systems for Machine Learning

Post on 18-Aug-2015


Modern Big Data Systems for Machine Learning

Antonio Roldao, Ph.D. CQF.

10/July/2015, Thomson Reuters, London, UK

About Me

http://anton.io @roldao

This Talk on Big Data Systems

Data
    Big Data as a Buzzword and the useless 4V’s
    Basic Aspects of Data
    Advanced Aspects of Data
    Small Data Innovations

Algorithms for Machine Learning
    ML Overview
    Optimization Problems
    Solving Systems of Linear Equations
    Accelerating ML Using Different Technologies

Distributed Computing
    Computing at Scale
    Platform Examples

Big Data

4Vs of BD?!
    Volume
    Variety
    Velocity
    Veracity

Too simplistic and technically useless!

“Any amount of data that is too big for Excel to process.”

1956 Hard-drive with 5 MB

Mostly a marketing buzzword that means different things to different people.

Understanding Data – Basic Storage formats

Uncompressed <-> Compressed
Unencrypted <-> Encrypted
Human-readable <-> Binary
Rigid <-> Templated <-> Self-describing
Mainly regular <-> Irregular
Different types and encodings…

Generation (write) modes
    parallel <-> sequential
    append-only
    in-place updates
    random inserts…

Consumption (read) modes
    parallel <-> sequential
    random <-> well defined access…

Understanding Data – Advanced

Represents:
    How concepts are connected (graph)
    How connections evolve with time (time series)

Bitemporal (e.g. value depends on time frame)
Time value of data (e.g. Useful today, but not tomorrow)
Sensitivity (e.g. Medical, Economical, Political, Privacy…)
Interdependency (e.g. one wrong bit destroys everything)
Cleanliness (e.g. how Noisy it is)
Truthfulness (e.g. how Accurate it is)
Redundancy (e.g. how safe does it need to be)
Density (e.g. how Redundant it is)
Accessibility (e.g. Local <-> Global)

Cost / Budget

Myriad of Data-stores/bases
    File-Systems
        local, distributed, p2p,…
        rom, tape, spindle, flash, ram,…
    Key-Value Stores
    Relational
    Object
    Geo-location
    Row-based
    Column-based
    Time-Series
    Graph-based
    ACID compliant or not
    Sharding Support
    Replication Support
    HA Support
    Blockchain
    LayerFS
    …

Recent Innovations in “Small-data”

XML (1996)
YAML (2001)
JSON
BSON
Google Protocol Buffers (initial release 2008)
Cap’n Proto
Thrift
Avro
FAST FIX/BFIX
Flat Buffers
Simple Binary Encoding (2014)
Dynamically Adaptive Encoding (Future)

http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop
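To make the trade-off between these formats concrete, here is a minimal sketch that contrasts a self-describing, human-readable encoding (JSON) with a rigid binary layout. The `tick` record, its field names, and the `LAYOUT` schema are hypothetical and chosen only for illustration; Python's standard `struct` module stands in for the schema-driven binary formats listed above and is not any of them.

```python
import json
import struct

# Hypothetical market-data record, used only for illustration.
tick = {"symbol": "ABCD", "price": 101.25, "size": 300}

# Self-describing and human-readable: field names travel with every record.
as_json = json.dumps(tick).encode("utf-8")

# Rigid binary layout: the schema lives outside the record.
# "<4sdI" = little-endian, 4-byte symbol, 8-byte float, 4-byte unsigned int.
LAYOUT = struct.Struct("<4sdI")
as_binary = LAYOUT.pack(tick["symbol"].encode("ascii"), tick["price"], tick["size"])


def decode_binary(buf):
    """Decode a record; only possible if the reader knows LAYOUT in advance."""
    symbol, price, size = LAYOUT.unpack(buf)
    return {"symbol": symbol.decode("ascii"), "price": price, "size": size}


print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as a fixed binary layout")
```

The binary record is smaller and cheaper to parse, but it cannot be read without the schema, and changing the schema breaks old readers unless the format is designed for evolution.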

Processing Data

Machine Learning

ML / AI – Boils down to…

Given an input (X) and/or state (S), produce an output (Y)

X may include an Index or Time element (e.g. time series)

S may include:
    a feedback-loop (e.g. reinforcement learning)
    a previously trained dataset (e.g. supervised learning)

Y divides into two types:
    predictions (e.g. weather, trading, ...)
    categorizations
        known categories (e.g. object/speech recognition, …)
        unknown categories (e.g. insight generation, …)

Dimensionality Reduction

Principal Component Analysis

First component

Subsequent components
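As a rough illustration of first and subsequent components, the sketch below computes principal components with a singular value decomposition in NumPy. The synthetic data set and the function name `pca` are assumptions made for the example; the slides do not say which formulation or library was used.

```python
import numpy as np


def pca(X, n_components):
    """Return (components, explained variances, projected data) via SVD."""
    Xc = X - X.mean(axis=0)                       # centre each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # rows are orthonormal directions
    variances = (s ** 2) / (X.shape[0] - 1)       # variance along each direction
    return components, variances[:n_components], Xc @ components.T


rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along the direction (1, 1): an assumption for the demo.
t = rng.normal(size=500)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))

components, variances, Z = pca(X, n_components=2)
print("first component:", components[0])
print("variance per component:", variances)
```

The first component is the direction of greatest variance; each subsequent component is the direction of greatest remaining variance that is orthogonal to all earlier ones.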

Clustering

k-Means

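Partition the observations x into k sets S = {S_1, …, S_k} so as to minimize the within-cluster sum of squares:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

where \mu_i represents the mean of the points in S_i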
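The k-means objective described above is commonly minimized with Lloyd's algorithm, which alternates between assigning points to the nearest mean and recomputing the means. The sketch below is one minimal NumPy version under that assumption; the function name `k_means`, its parameters, and the two synthetic blobs are illustrative and not taken from the talk.

```python
import numpy as np


def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the means with k distinct observations picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest mean.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each mean; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


rng = np.random.default_rng(1)
# Two well-separated synthetic blobs around (0, 0) and (10, 10).
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
centroids, labels = k_means(X, k=2)
print(centroids)
```

Lloyd's algorithm only finds a local minimum, so practical implementations usually run it from several initialisations and keep the best result.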

General AI

Genetic Algorithms

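For n mutations, select the mutation m_i that minimizes the difference between its output y_i and a given reference r:

m^* = \arg\min_{m_i} \lVert y_i - r \rVert

where y_i is the output produced by mutation m_i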
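The selection rule above can be sketched as a very small evolutionary loop. This is a mutation-only simplification (no crossover or population diversity), and the function `evolve`, the target function `f`, and all parameter values are hypothetical choices made for the example.

```python
import random


def evolve(f, r, parent, n_mutations=20, generations=200, sigma=0.5, seed=0):
    """Repeatedly mutate the best candidate and keep the one closest to r."""
    rng = random.Random(seed)
    best = parent
    for _ in range(generations):
        # Generate n mutations of the current best candidate.
        mutations = [best + rng.gauss(0.0, sigma) for _ in range(n_mutations)]
        # Keep the parent in the pool so the error never increases (elitism).
        pool = mutations + [best]
        # Select the candidate whose output is closest to the reference r.
        best = min(pool, key=lambda m: abs(f(m) - r))
    return best


def f(m):
    """Hypothetical system under optimization: output y = m squared."""
    return m * m


best = evolve(f, r=2.0, parent=5.0)
print(best, f(best))
```

Here the search looks for an input whose square is close to the reference 2.0; a fuller genetic algorithm would also recombine candidates and maintain a population rather than a single survivor.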

Artificial Neural Networks

Deep Convolutional Neural Network (d-CNN)

Optimization involving Stochastic Gradient Descent + Back-propagation
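A minimal sketch of the stochastic gradient descent half of this recipe, applied to a single linear layer so that the gradient can be written in closed form. In a deep network, back-propagation would supply the same gradients layer by layer through the chain rule; that part is not shown. The function name, hyperparameters, and synthetic data are assumptions for illustration.

```python
import numpy as np


def sgd_linear_regression(X, y, lr=0.05, epochs=50, batch_size=16, seed=0):
    """Fit y ~ X @ w + b by mini-batch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]          # forward pass residual
            grad_w = 2.0 * X[idx].T @ err / len(idx)
            grad_b = 2.0 * err.mean()
            w -= lr * grad_w                       # step against the gradient
            b -= lr * grad_b
    return w, b


rng = np.random.default_rng(42)
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3   # hypothetical ground truth
X = rng.normal(size=(400, 3))
y = X @ true_w + true_b + 0.01 * rng.normal(size=400)

w, b = sgd_linear_regression(X, y)
print(w, b)
```

Each update uses only a small random batch, which is what makes the method practical when the full data set does not fit in memory.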

All About Optimization

All these schemes involve solving for some constants that Minimize or Maximize some Cost function

Require fundamental Optimization algorithms such as:
    Direct Methods
        Combinatorial Algorithms
        Greedy Algorithm
        Minimax Algorithm with alpha-beta pruning
        …
    Iterative Methods
        Gradient Methods
        Karmarkar’s Algorithm
        …
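As one concrete instance from the list above, here is a short sketch of minimax with alpha-beta pruning. Representing the game tree as nested Python lists with numeric leaves is an assumption made to keep the example self-contained; the tree values are arbitrary.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax value of a game tree given as nested lists with numeric leaves."""
    if not isinstance(node, list):
        return node                                # leaf: static evaluation
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                      # opponent will avoid this branch
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if beta <= alpha:                          # maximizer already has better
            break
    return value


# Arbitrary two-level tree: the root maximizes, its children minimize.
tree = [[3, 5], [2, 9], [0, 7]]
best = alphabeta(tree, maximizing=True)
print(best)
```

Pruning never changes the value returned by plain minimax; it only skips subtrees that cannot affect the decision at the root.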

At the Core of Optimization…

…there is the solution of a system of linear equations of the form:

A x = b

with x subject to some constraints.

Solving it requires algorithms that fall into two categories:

    Direct Methods
        Gaussian elimination, LU, QR, Cholesky, LDL, …
    Iterative Methods
        MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
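To illustrate the two categories, the sketch below solves the same symmetric positive-definite system once with a direct method (Cholesky factorization) and once with an iterative method (conjugate gradient). The matrix is randomly generated for the demo, the constraints on x mentioned above are ignored, and the `conjugate_gradient` function is a textbook version written for this example rather than a library routine.

```python
import numpy as np


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A by conjugate gradient."""
    x = np.zeros_like(b)
    r = b - A @ x                    # residual
    p = r.copy()                     # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)    # optimal step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
A = M @ M.T + 50.0 * np.eye(50)      # symmetric positive-definite by construction
b = rng.normal(size=50)

# Direct method: factor A = L L^T, then solve L y = b and L^T x = y.
# np.linalg.solve keeps the example NumPy-only; a dedicated triangular
# solver would exploit the structure of L.
L = np.linalg.cholesky(A)
y = np.linalg.solve(L, b)
x_direct = np.linalg.solve(L.T, y)

# Iterative method: only needs matrix-vector products with A.
x_iter = conjugate_gradient(A, b)
print(np.max(np.abs(x_direct - x_iter)))
```

Direct methods give an answer in a fixed number of operations, while iterative methods refine an estimate and touch A only through matrix-vector products, which suits large sparse systems and parallel hardware.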

Accelerating Machine Learning

CPU GPGPU FPGA

Sequential Processing Parallel Processing

High Flexibility
High Abstractions
Many Libraries
…
Direct Methods

Ultra-Low-Latency
High Bandwidth
Fine grain optimization
...
Iterative Methods
Neural Networks
Markov Chains
Monte Carlo

Networked Computing Systems

Mainframe Computing
Cluster Computing
Distributed Computing
Grid Computing

Orbital Computing
Interstellar Computing
Galactic Computing
Inter-Universe Computing

Cloud Computing

Modern Big Data Systems – Basic Components

Dynamic (abstraction) + Statically-Typed (speed) Languages

Need to rethink and re-engineer main systems:
    Data & Code Stores
    Logging
    Code Revision and Deployment
    Compute Nodes and Brokers Management
    Graceful Failure and Recovery
    Credentials and Access Controls
    Task Schedulers
    Messaging Bus
    Web/Mobile Interfaces
    Regression Testing…

Containerize and Standardize Services

Examples – Modern Big Data Systems

Finance
    Athena/Hydra @ JP Morgan
    Quartz/Sandra @ Bank of America
    Slang/SecDB @ Goldman Sachs
    Optimus/DAL @ Morgan Stanley
    WSQ Tech @ n-prop shops & datapark.io @ quants / prop-shops

Machine Learning
    Alpha/DL @ Muse.Ai

Thank you

http://anton.io @roldao