Post on 13-Feb-2017
Fighting Cyber Fraud with Hadoop
Niel Dunnage
Senior Solutions Architect
©2014 Cloudera, Inc. All rights reserved.
Summary
Big Data is an increasingly powerful enterprise asset with many potential use cases. Here we explore the relationship between big data and cyber security.
Quick facts
• Founded: 2008, by former employees of
• Employees: Over 750
• Global 24x7 Support: Follow-the-sun capability; proactive and predictive support programs; dedicated support engineers; support centers in NA, Europe & Asia
• Professional Services: World-class services delivery teams worldwide
• Mission Critical: Thousands of enterprise customers rely on Cloudera: 50% of the Fortune 50; 65% of the Fortune 500; top defense & intelligence agencies
• The Largest Ecosystem: Over 1200 members of our partner program, Cloudera Connect
• Cloudera University: Over 40,000 people trained around the world
• Open Source Leaders: Cloudera employees are founders of most of the Apache Hadoop ecosystem projects, and leading contributors to all of them
2008: Cloudera founded by Mike Olson, Amr Awadallah & Jeff Hammerbacher
2009: Hadoop creator Doug Cutting joins Cloudera
2009: Cloudera releases CDH, the first commercial Apache Hadoop distribution
2010: Cloudera Manager, the first management application for Hadoop
2011: Cloudera reaches 100 production customers
2011: Cloudera University expands to 140 countries
2012: Cloudera Enterprise 4, the standard for Hadoop in the enterprise
2012: Cloudera Connect reaches 300 partners
2014: The Enterprise Data Hub launched
2013: Cloudera Impala, Cloudera Navigator, Cloudera Search
2013: Tom Reilly joins as CEO
Over 800 partners in Cloudera Connect
CDH Cloudera Manager
CLOUDERA ENTERPRISE
ASK BIGGER QUESTIONS
ENTERPRISE DATA HUB
Leading the way in data management powered by Hadoop
Agenda
Data: The new oil
• How to scale out with unreliable workers
• How enterprises pool and share storage and computation resources
• Enterprise data governance and the maturing of Hadoop security
• Deploying machine learning at scale
• Empowering creative data science
This Morning
Cyber Security: Data is a valuable commodity
• DDoS
• Data Exfiltration
  • Confidential customer records
  • Transaction data
• Reputation attack
  • False flag
  • Fake data
• Insider Threat
False flag: “Operations designed to deceive in such a way that the operations appear as though they are being carried out by entities, groups or nations other than those who actually planned and executed them.” http://en.wikipedia.org/wiki/False_flag
• @security_511 has continued to support OpSaudi, claiming further attacks on websites connected to Saudi Aramco.
• The @SQLiNairb hacker has released a database dump from a US fantasy football website (http://www.fftoday.com/), claiming that it was timed to coincide with the NFL draft.
• Anonymous Italy and Operation Green Rights (OpGR) have released the contents of an email account connected to an Italian steel producer, in connection with accusations of pollution against the company.
Cloudera’s Approach to Hadoop Security
Comprehensive
• Standards-based Authentication
• Centralized, Granular Authorization
• Native Data Protection
• End-to-End Data Audit and Lineage
Compliance-Ready
• Meet compliance requirements
• HIPAA, PCI-DSS, …
• Encryption and key management
Transparent
• Security at the core
• Minimal performance impact
• Compatible with new components
• Insight with compliance
Operational Efficiency: Perform existing workloads faster, cheaper, better
Innovation and Advantage: Ask bigger questions in the pursuit of discovering something incredible
©2013 Cloudera, Inc. All Rights Reserved.
Enterprise Data Hub Use Cases
ETL Acceleration
EDW Optimization
Active Archive
OSINT Analysis
Fraud Detection
Deep Exploratory
BI
Historical Compliance
Log Processing
Performance Management
Risk Management
Our Design Strategy
The Enterprise Data Hub
One pool of data
One metadata model
One security framework
One set of system resources
A fully integrated Hadoop ecosystem
• Integration: REST (WebHDFS), File (FUSE), Flume, Sqoop
• Storage: HDFS, HBase/Accumulo (text, RCFile, Parquet, Avro, etc.; records)
• Resource Management: YARN
• Engines
  • Batch Processing: Spark, MapReduce, Hive & Pig
  • Stream Processing: Spark Streaming
  • Interactive SQL: Cloudera Impala
  • Interactive Search: Cloudera Search
  • Machine Learning: Spark MLlib, Mahout, Oryx
  • Math & Statistics: SAS, R
• Metadata, Navigator
• Security, Navigator, Sentry
graph.vertices.filter{case(id, _) => id==13669222}.collect
Select CPU_Met from application WHERE (USAGE > 1000) LEFT OUTER JOIN ON application_ID where application_type IS Non_Critical
Offence: Fraud Detection
Use Cases
• Distributed parallel execution with chained joins
• Historical processing at scale
• Machine learning: malware/anomaly detection, spam filters, etc.
• Combined real-time and batch predictors
Fully automated at scale
Big Data Economics
Ask bigger questions
• Agile (2 week cycle)
• Linear scaling
• Robust and economic crypto security
• Creative fail fast innovation
• Powers productivity insights
• Increasing infrastructure ROI
• Increasing business ROI
• Defeating fraudulent activity
• Evaluating risk
Ingest, Discover, Predict, Innovate
Data Ingest
• NRT Ingest
  • Flume: optimized to flow real time event data into the Hadoop cluster
  • Spark Streaming for near real time micro batch aggregations
  • Twitter streaming
  • Kafka
  • Log
  • API
• Bulk Load
  • Sqoop for structured
  • Fuse file system access
  • API
  • Web / Hue
• Data Enrichment
  • Flume interceptors
  • Kite Morphlines module
  • Configuration based interceptors that can enrich data, for example extracting facets, entity extraction, applying regulatory tags
Diagram labels: Client, Agent; collect, enrich, buffer, store
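Configuration-based enrichment of this kind can be pictured as a set of rules applied to each event before it is stored. The Python sketch below is illustrative only: it does not use the Flume or Kite Morphlines APIs, and the rule table, the regular expressions and the header names (`RULES`, `enrich`, `regulatory_tag`) are assumptions made for this example.

```python
import re

# Hypothetical, configuration-style rules: each maps a header name to a
# regular expression whose first group is copied into the event headers.
RULES = {
    "src_ip": re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b"),
    "http_status": re.compile(r'" (\d{3}) '),
}

# Hypothetical regulatory tagging: events that look like they carry a
# 16-digit card number are tagged for PCI-DSS handling.
CARD_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def enrich(event):
    """Return a copy of the event with extracted facets added as headers."""
    headers = dict(event.get("headers", {}))
    body = event["body"]
    for name, pattern in RULES.items():
        match = pattern.search(body)
        if match:
            headers[name] = match.group(1)
    if CARD_PATTERN.search(body):
        headers["regulatory_tag"] = "PCI-DSS"
    return {"headers": headers, "body": body}


# Made-up web server log line used as the event body.
log_line = ('10.1.2.3 - - [13/Feb/2014:10:00:00 +0000] '
            '"GET /pay?card=4111-1111-1111-1111 HTTP/1.1" 408 0')
enriched = enrich({"headers": {}, "body": log_line})
print(enriched["headers"])
```

Keeping the rules in a data structure rather than in code mirrors the configuration-driven style: new facets or tags can be added without touching the enrichment function.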
Near Real-Time Access to Threats
• View the geographic distribution of Slowloris DDoS attacks, taken from Apache web server logs
• Help isolate unpatched servers
• Identify the source of attacks
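// Count log events containing "408 Error" over 10-second windows
LogUtils.createStream(...)
  .filter(_.getText.contains("408 Error"))
  .countByWindow(Seconds(10))
// Keep only keys whose current count exceeds the historic count
stream.join(historicCounts).filter {
  case (word, (curCount, oldCount)) => curCount > oldCount
}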
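The same idea, counting 408 responses per time window and comparing them with a historic count, can be sketched in plain Python without Spark. Everything here is an assumption made for illustration: the event tuples, the per-source baseline in `historic_counts` and the function names are invented, and a real deployment would read from the streaming source rather than a list.

```python
from collections import Counter

WINDOW_SECONDS = 10  # same window length as Seconds(10) in the slide snippet


def count_408_by_window(events, window=WINDOW_SECONDS):
    """Count HTTP 408 responses per (source, window start) bucket.

    `events` is an iterable of (epoch_seconds, source, status) tuples.
    """
    counts = Counter()
    for ts, source, status in events:
        if status == 408:
            window_start = ts - (ts % window)
            counts[(source, window_start)] += 1
    return counts


def sources_above_baseline(current, historic):
    """Return sources whose count in any window exceeds their historic count."""
    flagged = set()
    for (source, _window_start), cur_count in current.items():
        if cur_count > historic.get(source, 0):
            flagged.add(source)
    return sorted(flagged)


# Made-up events: (epoch seconds, source IP, HTTP status)
events = [
    (100, "10.0.0.5", 408), (101, "10.0.0.5", 408), (103, "10.0.0.5", 408),
    (104, "10.0.0.9", 408), (105, "10.0.0.9", 200), (112, "10.0.0.5", 408),
]
historic_counts = {"10.0.0.5": 2, "10.0.0.9": 1}  # assumed per-window baseline

current = count_408_by_window(events)
print(sources_above_baseline(current, historic_counts))
```

A source is flagged only when its count in a window is strictly greater than the baseline, matching the `curCount > oldCount` filter.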
Machine Learning
Real-time, large-scale machine learning and predictive analytics infrastructure built on Hadoop
• Collaborative filtering and recommendation
• Classification and regression
• Clustering
VaR and Monte Carlo Simulations
“Under reasonable circumstances, how much can you expect to lose?”
• “Monte Carlo simulation involves posing thousands or millions of random market scenarios and observing how they tend to affect a portfolio of financial instruments”
• VaR (value at risk) is defined by a time period, a portfolio and a confidence level
• This technique is easily parallelizable and as such is a great fit for Hadoop, and Spark in particular
• Until recently it required complex MPI C++ code
• Easily implemented in Hadoop and feasible across hierarchies of financial instruments (P&L accounts)
• Backtest to validate the VaR
• Curation of market factors is important
• Can shape portfolio investments for instruments that trial as loss-making
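A minimal Monte Carlo sketch of the calculation, in plain Python, may help make the three inputs (time period, portfolio, confidence level) concrete. It is a toy under stated assumptions: returns are drawn as independent normal variables, which ignores the correlations that curated market factors would capture, and the portfolio weights, means and volatilities are made up.

```python
import random


def simulate_portfolio_return(weights, means, stdevs, rng):
    """One random scenario: draw an independent normal return per instrument."""
    return sum(
        w * rng.gauss(mu, sigma)
        for w, mu, sigma in zip(weights, means, stdevs)
    )


def monte_carlo_var(weights, means, stdevs, value, confidence=0.95,
                    trials=20000, seed=42):
    """Estimate value at risk as the loss at the given confidence level."""
    rng = random.Random(seed)
    losses = sorted(
        -value * simulate_portfolio_return(weights, means, stdevs, rng)
        for _ in range(trials)
    )
    # The VaR is the loss that is not exceeded in `confidence` of the trials.
    index = min(int(confidence * trials), trials - 1)
    return losses[index]


# Made-up two-instrument portfolio over a single time period.
weights = [0.6, 0.4]
means = [0.001, 0.0005]   # assumed mean returns for the period
stdevs = [0.02, 0.01]     # assumed volatilities for the period
var_95 = monte_carlo_var(weights, means, stdevs, value=1_000_000)
print(f"95% one-period VaR: {var_95:,.0f}")
```

Each trial is independent of the others, which is why the technique parallelizes so readily: the trials can be split across workers and only the resulting losses need to be gathered and sorted.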
Applying VaR Techniques to Cyber Threat Monitoring with Hadoop
• Historical event data processing at scale
• Hadoop as a service shared with financial governance applications
• Treat £££s spent on vendors and software like instruments and portfolios?
• Anomaly detection of network traffic by learning what is normal
• Siloed applications have previously made it hard to put a tangible value on financial risk
• Risk calculations tend towards the subjective, i.e. low (FIS APT), high (insider threat)
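Anomaly detection by "learning what is normal" can be illustrated with the simplest possible baseline: the mean and standard deviation of historical traffic, with new observations flagged when they fall too far from it. This Python sketch is an assumption-laden toy, not a description of any product feature; the traffic figures, the three-standard-deviation threshold and the function names are invented for the example.

```python
import statistics


def fit_baseline(history):
    """Learn what is 'normal' as the mean and standard deviation of history."""
    return statistics.mean(history), statistics.stdev(history)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Made-up history: outbound megabytes per hour for one host.
history = [120, 118, 125, 130, 122, 119, 127, 124]
baseline = fit_baseline(history)

for observed in (126, 480):
    print(observed, is_anomalous(observed, baseline))
```

An objective score like this is one way to move risk calculations away from purely subjective low/high labels, since the same rule is applied to every host and every period.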
Internal Threat Dashboard
Ranked List of High Risk Personnel:
Name            Risk Score
Kim Burgess     94
Guy Hughes      93
Jeff Maclaen    87
Ed Snowden      86
Mary Smith      82
Customers with Risk Scores that Recently Changed:
Name            Old Score   New Score
John Smith      34          94
Rob Jones       26          93
Jim Fisher      17          87
Henry Johnson   45          86
Sue Leefield    12          82
Overall Risk Assessment:
Risk Per Category:
• Online Banking Access
• Public Records
• Financial transaction rate
• Online Activity
• Social Media Activity
• Regular purchases
• Foreign Travel
Open Cases:
Name                        Risk Score   Customers
Dodgy Ecomm.biz             94           John Smith, Rob Jones
Brentford Shopping Centre   93           Jim Fisher, Henry Johnson
Analytics
Defense: Security Features
• Hadoop Security: Kerberos, with simplified deployment through Cloudera Manager
• Sentry: provides unified authorization with a single policy for Hive, Impala and Search
• HDFS extended ACLs and HBase cell-level access control
• Navigator Encrypt and Key Trustee deliver compliant data security (via the Gazzang acquisition)
• Navigator provides a data management layer including audit, access control reviews, data classification and discovery, and lineage
Kerberos Security
Perimeter Security
• Guarding access to the cluster itself
• Technical Concepts:
  • Authentication
  • Network isolation
Kerberos
• Kerberos: a computer network authentication protocol that works on the basis of tickets to allow nodes to prove their identity to each other in a secure manner, using encryption extensively
• Messages are exchanged between:
  • Client
  • Server
  • Kerberos Key Distribution Center (KDC)
  • Note: the KDC is not part of Hadoop, but most Linux distros come with the MIT Kerberos KDC
• Passwords are not sent across the network; instead, passwords are used to compute encryption keys
• Authentication status is cached (no need to send credentials with each request)
• Timestamps are essential to Kerberos (make sure system clocks are synchronized!)
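Two of these points, that passwords are turned into keys rather than sent, and that timestamps matter, can be illustrated with a small standard-library Python sketch. It is deliberately simplified and is not the Kerberos protocol: real Kerberos encrypts the timestamp and adds further key-derivation steps on top of PBKDF2, whereas this toy uses an HMAC as a stand-in. The principal, password and five-minute skew limit are assumptions chosen for the example.

```python
import hashlib
import hmac
import time

MAX_CLOCK_SKEW_SECONDS = 300  # five minutes, a common Kerberos default


def derive_key(password, salt, iterations=4096):
    """Derive a 256-bit key from a password; the password itself is never sent.

    Simplified: real Kerberos AES key derivation adds further steps on top
    of PBKDF2.
    """
    return hashlib.pbkdf2_hmac("sha1", password.encode(), salt.encode(),
                               iterations, dklen=32)


def make_authenticator(key, timestamp):
    """Client side: prove knowledge of the key with a MAC over a timestamp."""
    message = str(int(timestamp)).encode()
    return message, hmac.new(key, message, hashlib.sha256).digest()


def verify_authenticator(key, message, tag, now):
    """Server side: check the MAC, then reject stale or future timestamps."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return False
    return abs(now - int(message.decode())) <= MAX_CLOCK_SKEW_SECONDS


# Hypothetical principal; the salt mimics the realm + principal convention.
key = derive_key("correct horse", "EXAMPLE.COMalice")
now = time.time()
message, tag = make_authenticator(key, now)

print(verify_authenticator(key, message, tag, now))         # clocks in sync
print(verify_authenticator(key, message, tag, now + 3600))  # clock off by 1h
```

The second check fails even though the key is correct, which is why unsynchronized system clocks are a frequent cause of authentication failures on a secured cluster.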
Apache Sentry
Access Security Sentry
• Sentry provides unified authorization across multiple access paths
• A single authorization policy will be enforced for Impala, Hive and Search
• Role based access at Server, Database, Table or View granularity
• Multi-tenant: Separate policies for each database / schema
• Access
  • Defining what users and applications can do with data
  • Technical Concepts:
    • Permissions
    • Authorization
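Role-based authorization at server, database or table granularity can be sketched as a lookup against one shared policy. The Python below is a conceptual illustration only and does not use the Sentry API or its policy file format; the role names, group names and the tuple-based scopes are all hypothetical.

```python
# Hypothetical policy: role -> list of (privilege, scope) grants, where a scope
# is a (server, database, table) tuple and None means "everything below".
ROLE_GRANTS = {
    "analyst": [("SELECT", ("server1", "fraud", None))],
    "etl": [("ALL", ("server1", "staging", None)),
            ("SELECT", ("server1", "fraud", "transactions"))],
}
GROUP_ROLES = {"fraud_team": ["analyst"], "ingest_team": ["etl"]}


def scope_covers(granted, requested):
    """A grant covers a request if every non-wildcard level matches."""
    return all(g is None or g == r for g, r in zip(granted, requested))


def is_authorized(groups, privilege, requested_scope):
    """Check the single shared policy, whichever engine asks the question."""
    for group in groups:
        for role in GROUP_ROLES.get(group, []):
            for granted_privilege, granted_scope in ROLE_GRANTS.get(role, []):
                if (granted_privilege in ("ALL", privilege)
                        and scope_covers(granted_scope, requested_scope)):
                    return True
    return False


print(is_authorized(["fraud_team"], "SELECT", ("server1", "fraud", "cases")))
print(is_authorized(["fraud_team"], "INSERT", ("server1", "fraud", "cases")))
print(is_authorized(["ingest_team"], "INSERT", ("server1", "staging", "raw")))
```

Because every access path consults the same grants, a change to a role takes effect for all engines at once, which is the point of a unified policy.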
Cloudera Navigator
Visibility Cloudera Navigator
• Auditing and Access Management
  • View, grant and revoke permissions across the Hadoop stack
  • Identify access to a data asset around the time of a security breach
  • Generate an alert when a restricted data asset is accessed
• Lineage
  • Given a data set, trace back to the original source
  • Understand the downstream impact of purging/modifying a data set
• Metadata Tagging and Discovery
  • Search through metadata to find data sets of interest
  • Given a data set, view schema, metadata and policies
• Lifecycle Management
  • Automate periodic ingestion of data
  • Compress/encrypt a data set at rest
  • Purge a data set/replicate a data set to a remote site
• Visibility
  • Reporting on where data came from and how it’s being used
  • Technical Concepts:
    • Auditing
    • Lineage
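The two lineage questions, where a data set came from and what is affected downstream if it changes, amount to walking a dependency graph in opposite directions. The Python sketch below illustrates that idea only; it is not the Navigator API, and the data set names and the `DERIVED_FROM` mapping are hypothetical.

```python
# Hypothetical lineage edges: each data set maps to the inputs it was derived from.
DERIVED_FROM = {
    "risk_report": ["scored_customers"],
    "scored_customers": ["customers_clean", "transactions_clean"],
    "customers_clean": ["crm_export"],
    "transactions_clean": ["payments_feed"],
}


def original_sources(dataset, derived_from=DERIVED_FROM):
    """Walk upstream until reaching data sets with no recorded inputs."""
    sources, stack, seen = set(), [dataset], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = derived_from.get(current, [])
        if not parents:
            sources.add(current)
        stack.extend(parents)
    return sorted(sources)


def downstream_impact(dataset, derived_from=DERIVED_FROM):
    """Find every data set that directly or indirectly depends on `dataset`."""
    impacted, frontier = set(), {dataset}
    while frontier:
        frontier = {child for child, parents in derived_from.items()
                    if frontier & set(parents)} - impacted
        impacted |= frontier
    return sorted(impacted)


print(original_sources("risk_report"))
print(downstream_impact("payments_feed"))
```

The upstream walk answers "trace back to the original source"; the downstream walk lists what would need review before purging or modifying a data set.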
Encryption at rest: Navigator Encrypt and Key Trustee
• Encrypt any file or directory
  • AES-256 encryption
• Unique access controls
  • Process based, NOT users / groups
• 100% transparent
• Separation of duties
• Key management
  • AES encryption keys stored on a separate Key Trustee server
  • Key manager breach, data is safe
  • Data server breach, data is safe
Diagram labels: Linux Server / VM (Encrypt client): Linux File, Directory; AES-256 Encryption; Process Based ACLs; GPG; Linux Server / VM (Key Trustee Server)
©Gazzang gazzang.com/products/cloudencrypt-for-aws