High-Performance Analytics - myVertica · High-Performance Analytics ... Risk Control System OBIEE...

High-Performance Analytics: Managing Risk While Preparing for Hyper Growth

Wen He, ChinaPnR

Aug, 2016

WHO WE ARE

Leading Payment Solutions Provider for Vertical Industry

• Airline ticketing

• Mutual funds

• ……

Top Acquirer for Small & Micro Business

Acquiring (merchant service) business

• Wholesale market

• Shopping mall

• Dining

• Supermarket

• Life style

Integrated Finance Enabler

Wealth management

• Trust and asset management products

Credit Loan

• Small & micro loan

Account Custodian

• P2P

2006, 2011, 2013

“New Financial Eco-system” Strategy

‣ Provide “Fundamental Infrastructure” for the new financial eco-system

‣ Leverage distribution channels and customers to cross-sell

‣ Align with traditional and new financial institutions to integrate financial products

Fundamental Infrastructure

Infrastructure

Payment

Operation

Account

Data

THE EVOLUTION OF DATA ARCHITECTURE

Data Architecture v1.0

TTY Fund System

P2P Payment System

Financial Data

External Data

Data Analysis

OBIEE Reporting System

Data Query

DW1.0_v0.1_20160717

Oracle RAC

Standby Database

Synchronizing

Challenge

Payment Transaction Volume (RMB, Billion)

What We Need

• Fast data transfer

• High scalability

• Data analysis capabilities

• Stability and reliability

• Simple management

Data Architecture v2.0

TTY Fund System

P2P Payment System

Financial Data

External Data

Real-time Streaming Data

Storm Topology

Data API

Redis

Data Query

Data Portal

Risk Control System

OBIEE Reporting System

Data Analysis

DW2.0_v0.5_20160810

Oracle RAC

Kafka

Standby Database

DWH

Vertica

Vertica

Vertica

ETL

SRC

Benefit

• Overall data processing performance has improved by a factor of ten

• Data volume for analysis has expanded from a few months’ worth to 3-5 years’ worth

• Data extraction time has dropped from 4-5 hours to 1-2 hours, a reported improvement of 300 percent

• Data queries have become 100 times faster
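
The extraction figures above can be checked with a few lines of arithmetic. This is only a sketch of one way the "300 percent" number could be read: it assumes the slide compares the midpoints of the two quoted ranges, which the deck does not state. The function and variable names are illustrative.

```python
def speedup(before_hours, after_hours):
    """Return how many times faster the new run is than the old one."""
    return before_hours / after_hours

# Ranges quoted on the slide, in hours.
before_lo, before_hi = 4.0, 5.0
after_lo, after_hi = 1.0, 2.0

# Assumption: compare the midpoints of the two ranges.
mid_ratio = speedup((before_lo + before_hi) / 2, (after_lo + after_hi) / 2)

# Least and most favourable pairings of the range endpoints.
worst = speedup(before_lo, after_hi)  # 4 h -> 2 h
best = speedup(before_hi, after_lo)   # 5 h -> 1 h

print(f"midpoint speedup: {mid_ratio:.1f}x (range {worst:.1f}x to {best:.1f}x)")
```

Under the midpoint assumption the run is about three times faster; depending on which endpoints are paired, the ratio lies anywhere between the `worst` and `best` values.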

Data Architecture v3.0

Hadoop

DWH3

DW3.0_v0.7_20160810

Data API

e2

Data Query

Data Portal

Risk Control System

Data Analysis (R, Python)

Vertica DW3

Real-time Streaming Data

Storm

Redis

Kafka

Other System

TTY Fund System

P2P Payment System

Financial Data

External Data

Standby Database

Vertica SRC

Oracle RAC

OBIEE Reporting System

ETL

Log Analysis

Network Behavior Analysis

THE APPLICATION OF HPE VERTICA

Data Query and Data Reports

• Times of Data Query: 4,691

• Avg. Feedback Rate: 96.4%

• Total Data Reports: 386

• Period: Jan 2016 – Jun 2016

Business Mgmt, Finance, Risk Control, Operation Mgmt, PR, BUs

Data Portal

• Display financial indices, risk indicators, and operation indices for 400 business units

• Present data in a variety of forms, such as pie charts, line charts, and geographic maps

• Support queries over up to 3 years of historical data, with users getting insight in seconds

• Provide real-time analysis, tracking real-time trading trends and regional distribution

Data Portal - Topological Graph

Vertica SRC

Vertica DW3

Data API (Java)

ATAT

e2

Web Service

Web Data Portal

r2 DSL Engine

Standby Database

Data Report Module

Data Query Module

BI Module

e2 (Python)

IFS

ETL

Cache Management Module

Data Access Module

Data Storage Module

MySQL

Data Portal Console

Mobile Data Portal

Data Report

Report Download

Self Service Data Query

Modeling

BI

Data Drilling

User Management

System Setting

Property Management

DP_v0.1_20160810

DW3H

Hadoop

Data Portal - Time Series Analysis

• Extract trend components from the trading time series data and analyze performance

• Identify outliers and interpret abnormal changes in light of actual business activity

• Compare the trading time series with the same period last year and identify the differences
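
The three steps above can be sketched in plain Python. This is a minimal illustration, not the portal's actual method: it assumes a centered moving average as the trend, a z-score on the residuals as the outlier rule, and a simple relative change for the year-over-year comparison. The function names, the window size, the threshold, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev

def moving_average(series, window):
    """Centered moving average; near the edges it uses the neighbours that exist."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(mean(series[lo:hi]))
    return out

def find_outliers(series, window=5, z_threshold=2.5):
    """Indices whose residual from the trend is more than z_threshold std devs out."""
    trend = moving_average(series, window)
    residuals = [x - t for x, t in zip(series, trend)]
    sigma = pstdev(residuals)
    if sigma == 0:
        return []
    mu = mean(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - mu) / sigma > z_threshold]

def year_over_year(current, previous):
    """Relative change of each point versus the same point one year earlier."""
    return [(c - p) / p for c, p in zip(current, previous)]

# Hypothetical daily trading volumes with one spike at index 5.
daily = [100, 102, 101, 103, 104, 180, 105, 106, 107, 108, 109, 110]
outliers = find_outliers(daily)
yoy = year_over_year([110, 120], [100, 100])
print(outliers, yoy)
```

A moving average is pulled toward large spikes, so a rolling median or a seasonal decomposition may be a better trend estimate on real trading data.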

Data Portal - Data Drilling

• Summarize a body of data in multiple ways and show different representations of the same basic information

• Give deep insight into a problem and help find what caused it

Drill Down
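
A drill-down can be thought of as re-aggregating one cell of a summary at the next level of detail. The sketch below assumes flat transaction records with hypothetical `region`, `city`, and `amount` fields; it does not reflect the portal's actual schema or engine.

```python
from collections import defaultdict

def roll_up(records, keys):
    """Sum `amount` grouped by the given dimension keys."""
    totals = defaultdict(float)
    for rec in records:
        totals[tuple(rec[k] for k in keys)] += rec["amount"]
    return dict(totals)

def drill_down(records, parent_keys, parent_value, child_key):
    """Break one aggregated cell into the next, finer dimension."""
    subset = [r for r in records
              if tuple(r[k] for k in parent_keys) == parent_value]
    return roll_up(subset, [child_key])

# Hypothetical transaction records.
records = [
    {"region": "East", "city": "Shanghai", "amount": 120.0},
    {"region": "East", "city": "Nanjing", "amount": 80.0},
    {"region": "East", "city": "Shanghai", "amount": 30.0},
    {"region": "North", "city": "Beijing", "amount": 200.0},
]

by_region = roll_up(records, ["region"])
east_by_city = drill_down(records, ["region"], ("East",), "city")
print(by_region)
print(east_by_city)
```

The same two functions cover further levels, for example drilling from a city into merchants, as long as the records carry the extra dimension.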

Data Portal - Real-time Trading Volume Forecast

• The time series of historical P2P payment trading volume is significantly autocorrelated

• Test the log-transformed time series for stationarity and build an ARIMA model with dummy variables to predict the trading-volume trend over the next half hour

• Check whether the actual trading volume falls within the confidence interval and identify outliers to prevent risks

• Provide a reference for decision makers and reflect the development status of the business

P2P Payment Trading Volume Forecast
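
The forecast-then-check loop described on this slide can be illustrated with a much simpler stand-in. The sketch below fits an AR(1) model to first differences of the log series by ordinary least squares, which amounts to an ARIMA(1,1,0) on the logs without the dummy variables the slide mentions. It is not the presenters' model; the function names and sample volumes are hypothetical, and the 95% interval assumes roughly normal residuals.

```python
import math

def fit_ar1(diffs):
    """Least-squares fit of d[t] = c + phi * d[t-1] + e[t]."""
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx
    c = my - phi * mx
    resid = [b - (c + phi * a) for a, b in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return c, phi, sigma

def forecast_next(volumes, z=1.96):
    """One-step forecast as (low, point, high) in the original units."""
    logs = [math.log(v) for v in volumes]
    diffs = [b - a for a, b in zip(logs, logs[1:])]
    c, phi, sigma = fit_ar1(diffs)
    next_log = logs[-1] + c + phi * diffs[-1]
    return (math.exp(next_log - z * sigma),
            math.exp(next_log),
            math.exp(next_log + z * sigma))

def is_anomalous(actual, interval):
    """True when the observed volume falls outside the forecast interval."""
    low, _, high = interval
    return not (low <= actual <= high)

# Hypothetical half-hourly trading volumes.
history = [100, 104, 103, 108, 110, 109, 115, 118, 117, 123]
interval = forecast_next(history)
print(interval, is_anomalous(interval[1] * 3, interval))
```

In practice a statistics library would handle order selection, stationarity tests, and exogenous dummy regressors; the point here is only the shape of the forecast-and-compare step.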

Data Portal - User Behavior Tracking

Real-time regional distribution of trading

• Compute the trading volume of each region in real time, covering more than 600 cities

• Promptly reflect the effectiveness of marketing strategies such as coupons and promotional messages

• Help product managers locate key customers by comparing transaction amounts and observing activity in each region
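
Per-region volume can be maintained as a running total that is updated on every transaction event. The sketch below is a single-process illustration with a hypothetical `RegionalVolume` class and made-up events; the deck's architecture slides list Kafka, Storm, and Redis, but how they are wired together for this view is not shown.

```python
from collections import Counter

class RegionalVolume:
    """Running trading volume per city, updated one event at a time."""

    def __init__(self):
        self.totals = Counter()

    def record(self, city, amount):
        """Add one transaction to the city's running total."""
        self.totals[city] += amount

    def top(self, n):
        """Cities with the highest volume so far, largest first."""
        return self.totals.most_common(n)

    def share(self, city):
        """Fraction of all volume seen so far that came from this city."""
        total = sum(self.totals.values())
        return self.totals[city] / total if total else 0.0

# Hypothetical stream of (city, amount) events.
tracker = RegionalVolume()
for city, amount in [("Shanghai", 100.0), ("Hangzhou", 50.0), ("Shanghai", 50.0)]:
    tracker.record(city, amount)

print(tracker.top(1), tracker.share("Shanghai"))
```

Comparing `share` before and after a coupon campaign in one city is one simple way to read campaign effect from such totals.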

Customer Centric Data Warehouse

Fact Labels

Derived Insights

• User Loss Probability

• Product and Content Recommendations

• Credit Scores

• …

• Information Retrieval

• Complaints

• Marketing Times

• ...

• Customer Lifetime Value

• Activity Score

• Investment Style

• Risk Preference

• …

• Demographic Attributes

• Purchase Transactions

• Website Browsing

• …

360-Degree Customer View

Customer Level

Interaction Status

Customer Centric Data Warehouse - Overview

Browse the catalog

Click to buy

Pay the order

Complete transaction

Customer Centric Data Warehouse - Profile

Marketing Times: Once

Complaints: None

Name: Mr. Hong

Gender: Male

Age: 32

Location: Zhenjiang, Jiangsu

P2P investment: 3,604,676.07 RMB

Account Number: 10

Fund investment: 9,000.00 RMB

ARPU: 2,978.14

RFM analysis: Rank 3

Active Evaluation: Grade 6

Purchase Evaluation: Grade 2

Yield Preference: 10%-15%

Fund Type Reference: Stock Fund

2013.05.13 09:28:19: Opened fund investment account

2013.05.13 14:22:25: Invested in a stock fund, 1,000.00 RMB

2014.09.18 14:09:09: Opened P2P payment account

2014.09.18 15:00:32: Used online chat

2014.10.15 14:18:01: Received repayment, 7.39 RMB

2016.07.25 18:00:07: Last P2P payment, 37,000.00 RMB

Customer Centric Data Warehouse - Industry Report

2014-2015Q1

• High Costs of Reaching Targeted Customers

• Mobile Era

• Social Channels

2015Q2

• Chinese Dama

2015Q3

• Liquidity Preference

2015

• White-collar Workers

• Decision Making in Short Time

MACHINE LEARNING

Risk Identification

Supay illegal-cash-advance identification

• Screen variables such as average transaction amount, trading period, and number of credit cards used, based on risk control experience

• Apply PCA to avoid subjective interference and obtain the principal components based on contribution rate

• Compute the silhouette coefficient to determine the number of clusters, using both K-Means and K-Medoids clustering algorithms

• Compare the differences between groups and identify the group most strongly suspected of illegal cash advances
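
The cluster-count selection step can be sketched as follows. This is a minimal pure-Python illustration covering only K-Means and the silhouette coefficient; it skips the PCA and K-Medoids steps from the slide, uses a deterministic farthest-point initialisation as a simplification, and runs on made-up two-dimensional points standing in for merchant features.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain K-Means with farthest-point initialisation (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def best_k(points, candidates):
    """Candidate cluster count with the highest mean silhouette."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))

# Hypothetical (scaled) merchant features forming three groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1), (5.1, 5.2),
          (9.0, 1.0), (9.2, 0.9), (8.9, 1.1), (9.1, 1.2)]
chosen_k = best_k(points, [2, 3, 4])
labels = kmeans(points, chosen_k)
print(chosen_k, labels)
```

Once the groups are fixed, comparing their feature averages is what would single out a suspicious group; that comparison depends on business rules the slide does not spell out.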

‘Bad’ Signature Recognition

• Randomly draw over 60,000 samples and label each signature as good or bad

• Convert each sample to grayscale and normalize it to an n×n matrix

• Create a test set (2,000 samples) and a training set (838 samples), balancing the proportions of bad and good signatures

• Run a CNN model with TensorFlow and settle on the best model parameters through iteration

• Evaluate the performance of the classification model; the identification accuracy is about 90%
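
The preprocessing steps can be sketched without any image library. The example below assumes images arrive as nested lists of RGB tuples, uses the standard BT.601 luma weights for grayscale, nearest-neighbour resampling for the n×n normalization, and downsampling of the majority class for balancing. None of these choices is stated in the deck, and the CNN itself is left out.

```python
import random

def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to luma values scaled to [0, 1]."""
    return [[(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for r, g, b in row]
            for row in rgb_image]

def resize_nearest(gray, n):
    """Nearest-neighbour resample of a 2-D list to an n x n matrix."""
    h, w = len(gray), len(gray[0])
    return [[gray[i * h // n][j * w // n] for j in range(n)] for i in range(n)]

def balance(samples, labels, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for s, l in zip(samples, labels):
        by_label.setdefault(l, []).append(s)
    size = min(len(items) for items in by_label.values())
    out = [(s, l) for l, items in by_label.items() for s in rng.sample(items, size)]
    rng.shuffle(out)
    return out

# Hypothetical 2 x 4 RGB image.
image = [[(255, 255, 255), (0, 0, 0), (255, 0, 0), (0, 0, 255)],
         [(0, 255, 0), (255, 255, 255), (0, 0, 0), (128, 128, 128)]]
matrix = resize_nearest(to_grayscale(image), 2)

# Hypothetical labelled sample ids: five good signatures, two bad.
balanced = balance(["g1", "g2", "g3", "g4", "g5", "b1", "b2"],
                   ["good"] * 5 + ["bad"] * 2)
print(matrix, balanced)
```

The resulting n×n matrices would then be stacked into the input tensor for whatever CNN is trained on them.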

Speech Recognition

• Collect over 2,000 Mandarin speech samples in which the spoken content is the digits 0 to 9

• Process the audio with the mel-frequency cepstrum and extract features

• Set up mutually exclusive training and test sets

• Build an n-gram model from the training speech corpus using the Kaldi speech recognition toolkit

• Optimize the parameters of the GMM-HMM model and test its stability

• The identification accuracy is above 95%
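
Two of the steps above lend themselves to a short illustration: the mel scale that underlies mel-frequency cepstral features, and a train/test split with no shared samples. The sketch uses a common mel-scale formula and only places the filter center frequencies; it is not a full feature extractor and does not call Kaldi. The function names, filter count, frequency range, and split fraction are assumptions.

```python
import math
import random

def hz_to_mel(f_hz):
    """A common mel-scale formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_min, f_max):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * i) for i in range(1, n_filters + 1)]

def disjoint_split(sample_ids, test_fraction=0.2, seed=0):
    """Shuffle ids and cut once, so no id lands in both sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

centers = mel_filter_centers(10, 0.0, 8000.0)
train_ids, test_ids = disjoint_split(range(2000))
print([round(c) for c in centers], len(train_ids), len(test_ids))
```

Because the filters are evenly spaced in mels, they sit close together at low frequencies and spread out at high ones, which is the property the mel-frequency cepstrum relies on. For speech data, splitting by speaker rather than by sample is often preferred, but the slide does not say which was used.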

Text Analysis

• Scrape web pages, including P2P platform websites, news, and forums

• Delete duplicated and irrelevant web pages, tokenize the text, and apply filters

• Use a Naïve Bayes classifier for predictions, such as identifying whether a loan is unsecured

• Monitor opinions in news and forums and analyze the semantic orientation of text to improve early warning systems
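
The deduplicate, tokenize, and classify steps can be sketched with a small multinomial Naïve Bayes classifier using add-one smoothing. The tokenizer here is a plain English word splitter and the training texts are invented; the original work presumably handled Chinese text with a proper word segmenter, which this sketch does not attempt.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (English-only assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def deduplicate(pages):
    """Drop pages whose token sequence has already been seen."""
    seen, unique = set(), []
    for page in pages:
        key = " ".join(tokenize(page))
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labelled loan descriptions.
texts = ["loan backed by property collateral",
         "secured loan with vehicle collateral",
         "unsecured credit loan no collateral required",
         "unsecured personal loan without guarantee"]
labels = ["secured", "secured", "unsecured", "unsecured"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("unsecured loan without guarantee"))
```

Four training sentences are far too few for a real classifier; the example only shows how word counts per class turn into a prediction.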

THANKS

Creating New Finance Together