Fraud Detection Architecture
Transcript of Fraud Detection Architecture
Real Time Fraud Detection: Patterns and reference architectures
Ted Malaska // PSA
Gwen Shapira // Software Engineer
Overview
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch
©2014 Cloudera, Inc. All rights reserved.
Gwen Shapira
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap
Hello
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
  – HDFS, MapReduce, Yarn, HBase, Spark, Avro,
  – Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
  – And working on a Sentry patch
• Co-author of O’Reilly Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
Review of the Problem
• Typical Atomic Card Fraud Detection
• Ikea Meat Ball
• Multi Coupons Combinations
• OP or Negative Video Games Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School
How do we React
• Human Brain at Tennis
  – Muscle Memory
  – Reaction Thought
  – Reflective Meditation
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
Each Broker has many partitions
[Diagram: brokers, each holding copies of Partition 0, Partition 1, and Partition 2]
Producers load balance between partitions
[Diagram: a Client sending to brokers that each hold Partition 0, Partition 1, and Partition 2]
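The load balancing on this slide can be sketched without a Kafka dependency. The following is a minimal in-memory model, not Kafka's producer API: `RoundRobinProducer` and `distribute` are hypothetical names, and rotation is only one possible balancing strategy for messages that carry no key.

```scala
object RoundRobinDemo {
  // Hypothetical in-memory stand-in for a producer; not the Kafka client API.
  final class RoundRobinProducer(numPartitions: Int) {
    require(numPartitions > 0, "need at least one partition")
    private var next = 0

    // Returns the partition chosen for this message, rotating through all partitions.
    def send(message: String): Int = {
      val partition = next
      next = (next + 1) % numPartitions
      partition
    }
  }

  // Sends every message and groups the messages by the partition they landed on.
  def distribute(messages: Seq[String], numPartitions: Int): Map[Int, Seq[String]] = {
    val producer = new RoundRobinProducer(numPartitions)
    messages
      .map(m => producer.send(m) -> m)
      .groupBy(_._1)
      .map { case (p, pairs) => p -> pairs.map(_._2) }
  }
}
```

Calling `RoundRobinDemo.distribute(Seq("a", "b", "c", "d"), 3)` spreads the four messages over partitions 0, 1, and 2, with the fourth message wrapping back to partition 0.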
Consumers
[Diagram: a Kafka Cluster with one Topic split into Partition A (File), Partition B (File), and Partition C (File), read by the Consumers of Consumer Group X and Consumer Group Y; each partition carries an "Offset X" and an "Offset Y" marker]
• Order retained within partition
• Order retained within partition but not over partitions
• Offsets are kept per consumer group
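To make the per-group offsets concrete, here is a minimal in-memory sketch. It is a model of the idea only, not the Kafka consumer API: `OffsetStore` and `poll` are hypothetical names, and a partition is represented as a plain `Vector`.

```scala
object ConsumerGroupDemo {
  // A partition modelled as an append-only, ordered sequence of messages.
  type Partition = Vector[String]

  // Hypothetical in-memory model: one committed offset per (group, partition).
  final class OffsetStore {
    private val offsets = scala.collection.mutable.Map.empty[(String, Int), Int]

    def committed(group: String, partition: Int): Int =
      offsets.getOrElse((group, partition), 0)

    def commit(group: String, partition: Int, offset: Int): Unit =
      offsets.update((group, partition), offset)
  }

  // Reads up to `max` messages for a group from one partition
  // and advances that group's offset only.
  def poll(store: OffsetStore, group: String, partitionId: Int,
           partition: Partition, max: Int): Vector[String] = {
    val from = store.committed(group, partitionId)
    val batch = partition.slice(from, from + max)
    store.commit(group, partitionId, from + batch.size)
    batch
  }
}
```

Because the offset key includes the group name, Consumer Group X reading ahead does not move Consumer Group Y's position in the same partition.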
Short Intro to Flume
Flume Agent: Sources → Interceptors → Selectors → Channels → Sinks
• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr
Flume and/or Kafka
[Diagram: Upstream → Flume Source → Interceptor → Selector → Flume Channel → Flume Sink → Downstream, with three "Can Be Kafka" labels]
Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
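Two of the interceptor roles listed above, masking and filtering, can be sketched as plain functions. This is an illustration under stated assumptions, not Flume's `Interceptor` interface: the `Event` case class, the comma-separated body layout, and every function name here are hypothetical.

```scala
object InterceptorDemo {
  // Simplified event; Flume's own Event type carries headers and a byte-array body.
  final case class Event(headers: Map[String, String], body: String)

  // Hypothetical interceptor shape: return None to drop (filter) the event.
  type Interceptor = Event => Option[Event]

  // Masks all but the last four characters of one comma-separated field.
  def maskCardNumber(fieldIndex: Int): Interceptor = { e =>
    val fields = e.body.split(",", -1)
    if (fieldIndex < 0 || fieldIndex >= fields.length) Some(e)
    else {
      val card = fields(fieldIndex)
      val masked = "*" * math.max(0, card.length - 4) + card.takeRight(4)
      Some(e.copy(body = fields.updated(fieldIndex, masked).mkString(",")))
    }
  }

  // Drops events that do not have the expected number of fields.
  def requireFields(n: Int): Interceptor =
    e => if (e.body.split(",", -1).length == n) Some(e) else None

  // Runs interceptors in order, stopping as soon as one drops the event.
  def chain(interceptors: Seq[Interceptor]): Interceptor =
    e => interceptors.foldLeft(Option(e))((acc, i) => acc.flatMap(i))
}
```

Modelling each interceptor as `Event => Option[Event]` keeps the chain composable: validation can run first and masking only sees events that survived it.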
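Spark Streaming Example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))     // one micro-batch per second
val lines = ssc.socketTextStream("localhost", 9999)  // text lines from a local socket
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)            // counts within each batch
wordCounts.print()
ssc.start()
ssc.awaitTermination()                               // keep the job running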
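Spark Streaming Example (batch equivalent, for comparison)
import org.apache.spark.{SparkConf, SparkContext}

// `path` is the location of the input text file, assumed to be defined elsewhere
val conf = new SparkConf().setMaster("local[2]").setAppName("BatchWordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)                     // 2 = minimum number of partitions
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)                // RDDs have no print(); collect to the driver first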
Spark Streaming
[Diagram: in each batch interval a Source feeds a Receiver, which produces an RDD; the RDDs accumulate into a DStream. Timeline: Pre-first Batch, First Batch, Second Batch. From the first batch on, a Single Pass of Filter → Count → Print runs over the batch.]
Spark Streaming
[Diagram: Source → Receiver → RDD in each batch interval, forming a DStream (Pre-first Batch, First Batch, Second Batch). Each batch runs a Single Pass of Filter → Count, and state is carried between batches as Stateful RDD 1 and Stateful RDD 2.]
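The idea of state carried from one batch to the next can be shown with ordinary collections. This is a sketch in the spirit of Spark Streaming's `updateStateByKey`, not a use of it: batches are plain sequences of `(key, count)` pairs, and the names `updateState` and `run` are hypothetical.

```scala
object StatefulBatchDemo {
  // Folds one micro-batch of (key, count) pairs into the running per-key totals.
  def updateState(state: Map[String, Long], batch: Seq[(String, Int)]): Map[String, Long] =
    batch.foldLeft(state) { case (acc, (key, n)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + n)
    }

  // Produces the state after each batch, analogous to a sequence of stateful RDDs.
  def run(batches: Seq[Seq[(String, Int)]]): Seq[Map[String, Long]] =
    batches.scanLeft(Map.empty[String, Long])(updateState).tail
}
```

Each element of the result plays the role of one stateful RDD: it is derived only from the previous state and the current batch.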
Spark Streaming and HBase
[Diagram: a Driver holding Configs, and two Worker Nodes. Each Worker Node runs an Executor whose Static Space holds the Configs and an HConnection, used by the Tasks on that executor.]
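The "static space" pattern in the diagram can be sketched as a lazily initialised holder object. This is a minimal model under stated assumptions: `FakeConnection` is a hypothetical stand-in for an expensive client such as an HBase connection, and no Spark or HBase API is used.

```scala
object StaticConnectionDemo {
  // Hypothetical stand-in for an expensive client such as an HBase connection.
  final class FakeConnection(val config: Map[String, String]) {
    def put(table: String, row: String, value: String): String =
      s"${config.getOrElse("quorum", "?")}/$table/$row=$value"
  }

  // One holder per JVM: a Scala object is initialised once per process,
  // so in Spark it would exist once per executor.
  object ConnectionHolder {
    private var connection: Option[FakeConnection] = None
    private var created = 0

    // Creates the connection on first use, then returns the same instance to every caller.
    def get(config: Map[String, String]): FakeConnection = synchronized {
      connection.getOrElse {
        created += 1
        val c = new FakeConnection(config)
        connection = Some(c)
        c
      }
    }

    def timesCreated: Int = synchronized(created)
  }

  // What a task might do per partition: fetch the shared connection, then write each record.
  def processPartition(config: Map[String, String], rows: Seq[String]): Seq[String] = {
    val conn = ConnectionHolder.get(config)
    rows.map(r => conn.put("profiles", r, "seen"))
  }
}
```

The point of the design is that only the small, serialisable configs travel from the driver; the connection itself is built on the worker and reused across tasks rather than opened per record.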
Real-Time Event Processing Approach
[Architecture diagram]
• Entry points: Clients (Swipe here!), Web App
• Components: Flume Agents, Local Cache, Kafka, HBase / Memory, Spark Streaming, HDFSEventSink, SolR Sink
• Clusters: Hadoop Cluster I; Hadoop Cluster II with Storage and Processing (HDFS, SolR, Hive/Impala, Map/Reduce, Spark, Search)
• Labelled flows: Fetching & Updating Profiles; Adjusting NRT Stats; Batch Time Adjustments; Automated & Manual Analytical Adjustments and Pattern detection; Automated & Manual Review of NRT Changes and Counters
Focus on NRT First
[Architecture diagram repeated from "Real-Time Event Processing Approach", with the label "NRT Event Processing with Context" added]
Streaming Architecture – NRT Event Processing
[Diagram: Flume Sources → Kafka (Initial Events Topic) → Flume Source (Kafka Consumer) → Flume Interceptor containing the Event Processing Logic, Local Memory, and an HBase Client talking to HBase → Kafka Producer → Kafka (Answer Topic)]
Able to respond within 10s of milliseconds
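The interplay of event processing logic, local memory, and the HBase client can be sketched as a read-through cache in front of a slower store. Everything here is an assumption for illustration: `ProfileStore` is a hypothetical stand-in for HBase, and the rule "flag an amount far above the card's running average" is invented for the example, not taken from the slides.

```scala
object NrtProcessingDemo {
  final case class Event(cardId: String, amount: Double)

  final case class Profile(count: Long, total: Double) {
    def average: Double = if (count == 0) 0.0 else total / count
    def add(amount: Double): Profile = Profile(count + 1, total + amount)
  }

  // Hypothetical stand-in for the HBase client: a slower, authoritative profile store.
  final class ProfileStore(initial: Map[String, Profile]) {
    private val data = scala.collection.mutable.Map.empty[String, Profile] ++= initial
    var reads = 0
    def get(cardId: String): Profile = { reads += 1; data.getOrElse(cardId, Profile(0, 0.0)) }
    def put(cardId: String, p: Profile): Unit = data.update(cardId, p)
  }

  // Event processing logic: check local memory first, fall back to the store on a miss.
  final class Processor(store: ProfileStore, factor: Double) {
    private val local = scala.collection.mutable.Map.empty[String, Profile]

    // Returns true when the event looks suspicious under the illustrative rule below.
    def process(e: Event): Boolean = {
      val profile = local.getOrElseUpdate(e.cardId, store.get(e.cardId))
      // Illustrative rule only: flag amounts far above this card's running average.
      val suspicious = profile.count > 0 && e.amount > factor * profile.average
      val updated = profile.add(e.amount)
      local.update(e.cardId, updated)
      store.put(e.cardId, updated)
      suspicious
    }
  }
}
```

Only the first event for a card pays for a store read; later events for the same card are answered from local memory, which is what makes a response time in the tens of milliseconds plausible.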
Partitioned NRT Event Processing
[Diagram: Flume Sources → Kafka (Initial Events Topic) → Flume Source (Kafka Consumer) → Flume Interceptor (Event Processing Logic, Local Memory, HBase Client talking to HBase) → Kafka Producer → Kafka (Answer Topic). Three Producers, each with a Partitioner, write to Partition A, Partition B, and Partition C of the Topic.]
• Custom Partitioner
• Better use of local memory
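Why a custom partitioner helps local memory can be shown with a small key-hash sketch: if the same key always maps to the same partition, the processor reading that partition sees all events for that key and its cache stays warm. This is not Kafka's partitioner interface; `partitionFor` and `assign` are hypothetical names, and hashing the key string is just one possible scheme.

```scala
object KeyPartitionerDemo {
  // Maps a key to a partition so that the same key always lands on the same partition.
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "need at least one partition")
    // floorMod keeps the result non-negative even when hashCode is negative.
    java.lang.Math.floorMod(key.hashCode, numPartitions)
  }

  // Groups event keys by the partition (and therefore the processor) that would receive them.
  def assign(keys: Seq[String], numPartitions: Int): Map[Int, Set[String]] =
    keys.groupBy(partitionFor(_, numPartitions)).map { case (p, ks) => p -> ks.toSet }
}
```

With a card ID as the key, every swipe of a given card is handled by one processor, so its profile needs to be cached in only one place.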
Micro Batching
[Architecture diagram repeated from "Real-Time Event Processing Approach", with "Micro Batching" labels added]
Complex Topologies
Spark 1.3: Kafka (Initial Events Topic) → Kafka Direct Connection → Spark Streaming (DAG Topologies)
• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing
Spark 1.2: Kafka (Initial Events Topic) → Kafka Receivers → Spark Streaming (DAG Topologies)
• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery
MicroBatch Bad-Input Handling
[Diagram: DAG Topologies connected to four Kafka topics, each drawn as offsets 0–13: incoming events topic, bad events topic, resolved events topic, results topic]
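One reading of this slide is that a record that fails to parse is diverted to the bad events topic instead of failing the whole micro-batch. The sketch below shows that routing with `Either` on in-memory collections; the `"cardId,amount"` input format and the names `Swipe`, `parse`, and `route` are assumptions, and no Kafka or Spark API is involved.

```scala
object BadInputDemo {
  final case class Swipe(cardId: String, amount: Double)

  // Parses "cardId,amount"; a Left carries the raw input plus the reason it was rejected.
  def parse(raw: String): Either[(String, String), Swipe] =
    raw.split(",", -1) match {
      case Array(id, amt) if id.nonEmpty =>
        scala.util.Try(amt.trim.toDouble).toOption match {
          case Some(a) => Right(Swipe(id, a))
          case None    => Left(raw -> s"amount is not a number: $amt")
        }
      case _ => Left(raw -> "expected two comma-separated fields")
    }

  // Splits one micro-batch into (bad events, parsed events) without failing the batch.
  def route(batch: Seq[String]): (Seq[(String, String)], Seq[Swipe]) = {
    val parsed = batch.map(parse)
    (parsed.collect { case Left(bad) => bad }, parsed.collect { case Right(ok) => ok })
  }
}
```

Keeping the raw input alongside the rejection reason means a bad record can be repaired and resubmitted later, which fits the presence of a separate resolved events topic on the slide.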
Ingestion
[Architecture diagram repeated from "Real-Time Event Processing Approach", with "Ingestion" labels added]
Ingestion
[Diagram: a Kafka Cluster Topic with Partition A, Partition B, and Partition C feeds three groups of three sinks: Flume HDFS Sink → HDFS, Flume SolR Sink → SolR, Flume HBase Sink → HBase]
Reflective Thoughts
[Architecture diagram repeated from "Real-Time Event Processing Approach", with the label "Research and Searching" added]