Fraud Detection Architecture

Real Time Fraud Detection: Patterns and reference architectures

Ted Malaska // PSA
Gwen Shapira // Software Engineer

Overview

• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch

©2014 Cloudera, Inc. All rights reserved.


Gwen Shapira

• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap


• Ted Malaska (PSA at Cloudera)

• Hadoop for ~5 years

• Contributed to HDFS, MapReduce, YARN, HBase, Spark, Avro, Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop and Accumulo, and working on a Sentry patch

• Co-author of O’Reilly's Hadoop Application Architectures

• Worked with about 70 companies in 8 countries

• Marvel Fan Boy

• Runner

Hello



The Problem



Credit Card Transaction Fraud


Ikea Meatballs



Coupon Fraud



Video Game Strategy



Health Insurance Fraud


Review of the Problem

• Typical Atomic Card Fraud Detection
• Ikea Meatballs
• Multi-Coupon Combinations
• OP or Negative Video Game Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School


How Do We React?

• Human Brain at Tennis
  – Muscle Memory
  – Reaction Thought
  – Reflective Meditation



Overview of Key Technologies



Kafka



The Basics

• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers


Topics, Partitions and Logs


Each partition is a log


Each Broker has many partitions

[Diagram: copies of Partition 0, Partition 1 and Partition 2 spread across the brokers]


Producers load balance between partitions

[Diagram: a Client producing to Partition 0, Partition 1 and Partition 2, which are spread across the brokers]


Consumers

[Diagram: a Kafka Cluster with one Topic made of Partition A (File), Partition B (File) and Partition C (File). Consumer Group X and Consumer Group Y each contain consumers reading the partitions, and each partition carries a separate Offset X and Offset Y.]

• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
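The Kafka basics above (producers push, consumers pull, each partition is a log, offsets are kept per consumer group) can be sketched with a toy in-memory model. This is only an illustration in plain Scala, not the Kafka client API; `ToyKafkaTopic`, `Topic`, `push` and `pull` are hypothetical names.

```scala
import scala.collection.mutable

// Toy in-memory model of one Kafka topic; NOT the Kafka client API.
object ToyKafkaTopic {
  final class Topic(numPartitions: Int) {
    // Each partition is an append-only log of messages.
    private val logs =
      Vector.fill(numPartitions)(mutable.ArrayBuffer.empty[String])
    // One offset per (consumer group, partition); a missing entry means 0.
    private val offsets =
      mutable.Map.empty[(String, Int), Int].withDefaultValue(0)

    // Producers push: append to the end of one partition's log.
    def push(partition: Int, message: String): Unit =
      logs(partition) += message

    // Consumers pull: read up to `max` messages starting at the group's
    // offset, then advance the offset of that group only.
    def pull(group: String, partition: Int, max: Int): Vector[String] = {
      val start = offsets((group, partition))
      val batch = logs(partition).slice(start, start + max).toVector
      offsets((group, partition)) = start + batch.size
      batch
    }
  }

  def main(args: Array[String]): Unit = {
    val topic = new Topic(3)
    Seq("a", "b", "c").foreach(m => topic.push(0, m))
    println(topic.pull("groupX", 0, 2)) // Vector(a, b)
    println(topic.pull("groupY", 0, 2)) // Vector(a, b): group Y has its own offset
    println(topic.pull("groupX", 0, 2)) // Vector(c): group X resumes where it left off
  }
}
```

Because the offset is keyed by group, two groups can read the same partition independently, while order is only defined inside a single partition's log.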


Flume

Short Intro to Flume

Flume Agent:
• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr


Flume and/or Kafka


[Diagram: Upstream → Flume Source → Interceptor → Selector → Flume Channel → Flume Sink → Downstream. Three of the stages carry the label "Can Be Kafka".]


Interceptors

• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
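A few of the interceptor duties listed above (mask, extract, filter) can be sketched as small functions chained together. This is a plain-Scala illustration under assumptions, not Flume's `Interceptor` interface: the `Event` case class, the `Interceptor` type alias and the `type,cardNumber,amount` record layout are all hypothetical.

```scala
// Hypothetical event type and interceptor functions; NOT the Flume API.
object InterceptorSketch {
  final case class Event(headers: Map[String, String], body: String)

  // An interceptor passes on a (possibly modified) event or drops it.
  type Interceptor = Event => Option[Event]

  // Mask fields: keep only the last four digits of any 16-digit number.
  val maskCardNumber: Interceptor = e =>
    Some(e.copy(body =
      e.body.replaceAll("\\b\\d{12}(\\d{4})\\b", "************$1")))

  // Filter events: drop events whose body is blank.
  val dropEmpty: Interceptor = e =>
    if (e.body.trim.isEmpty) None else Some(e)

  // Extract fields: copy the first comma-separated field into a header.
  val extractType: Interceptor = e =>
    Some(e.copy(headers = e.headers + ("type" -> e.body.takeWhile(_ != ','))))

  // Run interceptors in order; stop as soon as one drops the event.
  def chain(interceptors: Seq[Interceptor])(event: Event): Option[Event] =
    interceptors.foldLeft(Option(event))((acc, i) => acc.flatMap(i))

  def main(args: Array[String]): Unit = {
    val steps = Seq(dropEmpty, maskCardNumber, extractType)
    println(chain(steps)(Event(Map.empty, "swipe,4111111111111111,42.50")))
    println(chain(steps)(Event(Map.empty, "   "))) // None: dropped by the filter
  }
}
```

Modelling each step as `Event => Option[Event]` keeps masking, extraction and filtering independent, so they can be reordered or reused per source.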

Spark Streaming

Spark Streaming Example


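import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, one for processing; an app name is required
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
// Micro-batch interval of one second
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()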


Spark Streaming Example


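import org.apache.spark.{SparkConf, SparkContext}

// Batch counterpart of the streaming job: same transformations on a SparkContext
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
// path: location of the input text, supplied by the caller
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// An RDD has no print(); collect to the driver and print each pair
wordCounts.collect().foreach(println)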

Spark Streaming

[Diagram: a DStream shown over three intervals: Pre-first Batch, First Batch and Second Batch. In each interval, Source → Receiver → RDD. Two of the intervals also show a single pass of Filter → Count → Print over the RDDs.]

Spark Streaming

[Diagram: the same three-interval DStream picture (Pre-first Batch, First Batch, Second Batch) with state added. Alongside the single pass of Filter → Count → Print, the batches are linked by Stateful RDD 1 and Stateful RDD 2.]
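The idea of a stateful RDD that is carried from one micro-batch to the next can be sketched without Spark. The following plain-Scala model is an assumption-laden illustration: `StatefulBatches` and `update` are hypothetical names, and the merge rule (a running word count) is chosen to match the word-count example rather than taken from the slides.

```scala
// Plain-Scala model of carrying state across micro-batches; no Spark involved.
object StatefulBatches {
  type State = Map[String, Int]

  // Per batch: count the words (the stateless part), then merge the
  // batch counts into the state left behind by the previous batch.
  def update(previous: State, batch: Seq[String]): State = {
    val batchCounts = batch.groupBy(identity).map { case (w, ws) => w -> ws.size }
    batchCounts.foldLeft(previous) { case (state, (word, n)) =>
      state.updated(word, state.getOrElse(word, 0) + n)
    }
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("swipe", "swipe", "refund"), Seq("swipe"))
    // scanLeft yields the empty state before the first batch,
    // then the state after each batch.
    val states = batches.scanLeft(Map.empty[String, Int])(update)
    states.foreach(println)
  }
}
```

Each new state depends only on the previous state and the current batch, which is what lets a streaming job keep running totals without re-reading old batches.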


Spark Streaming and HBase


[Diagram: the Driver holds the Configs. Each Worker Node runs an Executor whose Static Space holds the Configs and one HConnection, shared by the Tasks running in that Executor.]
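One way to read the diagram is that each executor JVM creates its HBase connection once, in static space, and every task reuses it. A minimal sketch of that pattern in plain Scala follows. `FakeConnection` is a hypothetical stand-in for an HBase `HConnection` (it only counts how often it is constructed), and `Holder`, `runTask` and the `zk-host:2181` quorum string are illustrative assumptions.

```scala
import java.util.concurrent.atomic.AtomicInteger

// FakeConnection is a hypothetical stand-in for an HBase HConnection.
object ExecutorConnection {
  // Counts constructions so the sharing can be observed.
  val created = new AtomicInteger(0)

  final class FakeConnection(val quorum: String) {
    created.incrementAndGet()
    def put(table: String, rowKey: String, value: String): String =
      s"$table/$rowKey=$value via $quorum"
  }

  // "Static space": a Scala object is one instance per JVM, so every task
  // in the same executor sees the same holder and the same connection.
  object Holder {
    private var conn: FakeConnection = null
    def get(quorum: String): FakeConnection = synchronized {
      if (conn == null) conn = new FakeConnection(quorum)
      conn
    }
  }

  // Stand-in for the body of a task: look up the shared connection and use it.
  def runTask(quorum: String, rowKey: String): String =
    Holder.get(quorum).put("profiles", rowKey, "seen")

  def main(args: Array[String]): Unit = {
    (1 to 4).foreach(i => println(runTask("zk-host:2181", s"card-$i")))
    println(s"connections created: ${created.get}")
  }
}
```

Only the small config value travels with the task. The connection itself is never serialized, which is why it lives in the executor's static space rather than in the driver.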


High Level Architecture



Real-Time Event Processing Approach


[Diagram of the overall architecture. Components: Clients ("Swipe here!"), Web App, Kafka, Flume Agents with Local Cache, HBase / Memory, Spark Streaming, HDFSEventSink and SolR Sink; Hadoop Cluster I; Hadoop Cluster II (Storage, Processing) with HDFS, Hive/Impala, Map/Reduce, Spark, SolR and Search. Labeled flows: Fetching & Updating Profiles; Adjusting NRT Stats; Batch Time Adjustments; Automated & Manual Review of NRT Changes and Counters; Automated & Manual Analytical Adjustments and Pattern Detection.]


NRT Processing



Focus on NRT First



[Diagram: the overall architecture again, with the region labeled "NRT Event Processing with Context" marked]


Streaming Architecture – NRT Event Processing


[Diagram: Flume Sources → Kafka (Initial Events Topic) → Flume Source acting as Kafka consumer → Flume Interceptor containing the Event Processing Logic, Local Memory and an HBase Client connected to HBase → Kafka producer → Kafka (Answer Topic)]

Able to respond within tens of milliseconds
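The interceptor in this pipeline answers from local memory when it can and only goes to HBase on a miss, which is what keeps the response time low. A minimal plain-Scala sketch of that lookup path follows. Everything named here is hypothetical: `ProfileStore` stands in for the HBase client, and the scoring rule (amount greater than a multiple of the card's average) is an invented placeholder, not the presenters' fraud logic.

```scala
import scala.collection.mutable

object ProfileLookup {
  final case class Profile(cardId: String, avgAmount: Double)
  final case class Swipe(cardId: String, amount: Double)

  // Stand-in for HBase: the authoritative but slower store. Counts reads.
  final class ProfileStore(data: Map[String, Profile]) {
    var reads = 0
    def get(cardId: String): Option[Profile] = { reads += 1; data.get(cardId) }
  }

  final class Processor(store: ProfileStore, threshold: Double) {
    // Local memory: profiles already fetched by this interceptor instance.
    private val localMemory = mutable.Map.empty[String, Profile]

    // Try local memory first; on a miss, fetch from the store and cache it.
    private def profileFor(cardId: String): Option[Profile] =
      localMemory.get(cardId).orElse {
        val fetched = store.get(cardId)
        fetched.foreach(p => localMemory.update(cardId, p))
        fetched
      }

    // Placeholder rule: flag swipes far above the card's usual amount.
    // Cards with no profile are flagged too (forall on None is true).
    def isSuspicious(s: Swipe): Boolean =
      profileFor(s.cardId).forall(p => s.amount > p.avgAmount * threshold)
  }

  def main(args: Array[String]): Unit = {
    val store = new ProfileStore(Map("c1" -> Profile("c1", 20.0)))
    val processor = new Processor(store, 5.0)
    println(processor.isSuspicious(Swipe("c1", 30.0)))  // false
    println(processor.isSuspicious(Swipe("c1", 500.0))) // true
    println(s"store reads: ${store.reads}")             // store reads: 1
  }
}
```

The second swipe for the same card never touches the store, so only the first event for a card pays the remote round trip.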


Partitioned NRT Event Processing


[Diagram: the same pipeline (Flume Sources → Kafka Initial Events Topic → Flume Source as Kafka consumer → Flume Interceptor with Event Processing Logic, Local Memory and HBase Client → Kafka producer → Kafka Answer Topic, backed by HBase), with three Producers, each using a Partitioner, writing to a Topic with Partition A, Partition B and Partition C]

Custom Partitioner

Better use of local memory
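A custom partitioner helps local memory because every event for a given key lands in the same partition, and therefore reaches the same interceptor and the same cache. A minimal sketch of key-based partitioning in plain Scala follows. It does not implement Kafka's partitioner interface; `CardPartitioner`, `partitionFor` and the choice of the card id as the key are assumptions for illustration.

```scala
// Key-based partitioning sketch; NOT Kafka's partitioner interface.
object CardPartitioner {
  // Map a key to a partition in [0, numPartitions).
  // floorMod keeps the result non-negative even for negative hash codes.
  def partitionFor(key: String, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val cards = Seq("card-1", "card-2", "card-1", "card-3", "card-2")
    // Repeated cards always print the same partition number.
    cards.foreach(c => println(s"$c -> partition ${partitionFor(c, 3)}"))
  }
}
```

With round-robin load balancing, events for one card would be spread over all partitions and each consumer would have to cache every profile. Keying by card lets each consumer cache only its own share.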


Completing the Puzzle



Micro Batching



[Diagram: the overall architecture again, with the regions labeled "Micro Batching" marked]


Complex Topologies


Spark 1.3: Kafka Direct Connection

[Diagram: Kafka → Kafka Direct Connection → Spark Streaming DAG topologies]

• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing

Spark 1.2: Kafka Receivers

[Diagram: Kafka → three Kafka Receivers → Spark Streaming DAG topologies]

• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery


MicroBatch Bad-Input Handling

[Diagram: four Kafka topics, each drawn as a log with offsets 0 to 13 and connected through DAG topologies: the incoming events topic, the bad events topic, the resolved events topic and the results topic]
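The bad-input pattern can be sketched as a split inside each micro-batch: records that parse continue toward the results topic, while records that do not are set aside for the bad events topic instead of failing the batch. The plain-Scala sketch below is an illustration under assumptions. The `cardId,amount` record layout and the names `BadInputRouting`, `parse` and `route` are hypothetical, and no Kafka or Spark API is used.

```scala
import scala.util.Try

object BadInputRouting {
  final case class Swipe(cardId: String, amount: Double)

  // Parse "cardId,amount". Left carries the raw record plus a reason,
  // so nothing is lost when the record is parked in the bad events topic.
  def parse(raw: String): Either[String, Swipe] =
    raw.split(",", -1) match {
      case Array(card, amount) if card.nonEmpty =>
        Try(amount.trim.toDouble).toOption
          .map(a => Swipe(card, a))
          .toRight(s"bad amount: $raw")
      case _ => Left(s"bad layout: $raw")
    }

  // Split one micro-batch into (bad events, parsed events).
  def route(batch: Seq[String]): (Seq[String], Seq[Swipe]) = {
    val parsed = batch.map(parse)
    (parsed.collect { case Left(err) => err },
     parsed.collect { case Right(s) => s })
  }

  def main(args: Array[String]): Unit = {
    val (bad, good) = route(Seq("c1,12.5", "garbage", "c2,abc"))
    println(s"to bad events topic: $bad")
    println(s"to processing:       $good")
  }
}
```

Records repaired by hand or by another job (the resolved events topic in the diagram) could go back through the same `parse` step, so fixed and fresh events share one code path.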


Ingestion



[Diagram: the overall architecture again, with the regions labeled "Ingestion" marked]


Ingestion


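[Diagram: a Kafka Cluster Topic with Partition A, Partition B and Partition C is read by three groups of Flume sinks: the Flume HDFS Sink (three sinks) writing to HDFS, the Flume SolR Sink (three sinks) writing to SolR, and the Flume HBase Sink (three sinks) writing to HBase]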


Reflective Thoughts



[Diagram: the overall architecture again, with the region labeled "Research and Searching" marked]
