2017-24-03 Deutsche Bundesbank big_data_workshop

Giancarlo Nicola, Università di Pavia

Big Data Workshop, Bundesbank, Frankfurt, 24th March 2017

Textual data analysis in Finance


This research aims at integrating textual and numerical data to enhance bank distress forecasting

Description of the research project

Aim

> Enhanced predictive model for bank stress events

– Bank stress predictions by integrating textual and numerical data in a single framework of analysis

Tools

> Modeling tools (Python & Machine Learning)

– Document embeddings for semantic space representation

– Neural network for supervised classification task

Data

> Textual data

– 6M Reuters news articles

> Numerical data (12 variables, quarterly)

– European bank accounts, country-level banking sector data, country-level macro-financial data


We first applied an exploratory analysis to the textual dataset to understand the topics of discussion

Topics of the textual news dataset

Topic 1: Investing and market expectations

> Highest prob: softwar, network, communic, forwardlook, integr, distribut, organ, common, prospectus, program, design, rtgs, mgrs, matdebt, experi, nyse, enterpris, site, email, approxim

> Frex: softwar, network, communic, forwardlook, prospectus, rtgs, mgrs, matdebt, nyse, healthcar, innov, award, locat, user, digit, highgrad, baabbbbbb, capabl, advertis, educ

Topic 2: Market performance

> Highest prob: brussel, thirdquart, instrument, firsthalf, notif, auction, belgian, diari, firstquart, notifi, proxi, luxembourg, certif, secondquart, dexia, holder, aegon, offeroroffere, trigger, nomine

> Frex: brussel, thirdquart, firsthalf, notif, belgian, diari, notifi, proxi, luxembourg, dexia, offeroroffere, nomine, doubleclick, threshold, ceas, breakdown, inbev, tiger, omega, arcelormitt

Topic 3: Countries and sovereign

> Highest prob: greec, ireland, hong, kong, india, australia, conf, singapor, south, purchasesel, billiton, spain, chines, greek, russia, brazil, telecom, mexico, spanish, dubai

> Frex: hong, kong, conf, purchasesel, greek, russia, brazil, dubai, paidreceiv, indian, kosovo, russian, purchasesal, athen, nameeg, peso, emiss, taiwan, serbia, natixi

Topic 4: Bank crisis

> Highest prob: henderson, court, unaudit, fitch, switzerland, committe, jpmorgan, restructur, rescu, familiar, scheme, investig, socgen, settlement, formula, hire, fair, lehman, pension, legal

> Frex: henderson, court, unaudit, socgen, formula, bonus, buyout, pantlin, tier, lawyer, kerviel, probe, consortium, lawsuit, alleg, italian, senat, fraud, taxpay, virgin

Topic 5: Materials, currencies, indexes

> Highest prob: cell, gold, prefer, miner, sterl, tokyo, climb, nikkei, peak, downgrad, versus, auction, shed, sharpli, jone, sharp, survey, weaker, copper, tumbl

> Frex: cell, gold, sterl, climb, nikkei, versus, shed, weaker, copper, steadi, rebound, ounc, bernank, overnight, drag, payrol, straight, usytrr, gainer, threemonth


The most prevalent topics concern safe assets, materials and the bank crisis

Model result: Topics prevalence [2007-2010]

Figure: prevalence by topic title for the five topics: Investing and market expectations; Countries and sovereign; Market performance; Materials, currencies, indexes; Bank crisis


From the word clouds it is possible to gain further insight into the discussion topics

Word clouds: Topic 1: Investing and market expectations; Topic 2: Market performance; Topic 3: Countries and sovereign; Topic 4: Bank crisis; Topic 5: Materials, currencies, indexes


When comparing news related to Italian and German banks, Italian news is more focused on the bank crisis

Topics prevalence difference between Italy and Germany

Figure: prevalence difference shown for the topics Market performance and Bank crisis


MPS tweet topics evolution and prevalence over six months

Topics prevalence difference between Italy and Germany [2007-2010]

Deep learning bank distress in news and financial data

Big Data Workshop, Bundesbank, Frankfurt, 24th March 2017

Paola Cerchiello (1), Giancarlo Nicola (1), Samuel Rönnqvist (2), Peter Sarlin (3)

(1) University of Pavia, Italy; (2) Hanken School of Economics, Finland; (3) Åbo Akademi University, Finland


The model derives distress signals from textual and numerical data using a semantic deep-learning approach divided into two steps

Description of the two-step approach

Step 1: Learning semantic vectors through nearby-word prediction

Step 2: Learning to predict distress, based on semantic vectors of articles and financial data

Distress: event of capital injection, request for central bank intervention or default

Word embedding model
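To make "nearby-word prediction" concrete, the sketch below generates the (target, context) training pairs that a skip-gram style embedding model would learn from. This is only an illustration: the slides do not say which training variant was used, and the function name, the window size and the example sentence are all assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each word and its neighbours.

    `window` is the number of words considered on each side of the target
    (an illustrative choice, not a value taken from the slides).
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Invented example sentence, tokenised on whitespace.
sentence = "the bank requested a capital injection".split()
pairs = skipgram_pairs(sentence, window=1)
for target, context in pairs:
    print(target, "->", context)
```

An embedding model trained on such pairs adjusts each word's vector so that it becomes good at predicting its neighbours, which is what pulls words used in similar contexts close together.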


Word embeddings map words to a reduced semantic space where particular properties can be extracted by vector operations

Properties of word embeddings

> 200 - 800 dimensions depending on the task and the available corpus

> Words that are close in this space have similar semantic value

> A better-suited space for classification tasks, because the dimensionality is reduced while the semantic value of words is preserved
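As a rough illustration of closeness in the semantic space, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and the words are invented for readability; real embeddings would have the 200 to 800 dimensions mentioned above, and cosine similarity is assumed here as the distance measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (invented values, for illustration only).
vectors = {
    "bailout": [0.9, 0.1, 0.0],
    "rescue": [0.8, 0.2, 0.1],
    "copper": [0.0, 0.1, 0.9],
}


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


print(nearest("bailout", vectors))
```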


The model takes textual news data and numerical financial data as input to predict the distress condition of a bank

Scheme of textual and numerical data integration model

Reuters articles: document embeddings (unsupervised learning; Doc2Vec, Gensim implementation)

Financial numerical data

Neural network classification (supervised learning): 612 input nodes, 70 hidden nodes, 2 output nodes
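The classification step can be pictured as a small feed-forward network. The sketch below reads the node counts in the scheme as 612 input, 70 hidden and 2 output nodes, which is an interpretation of the diagram. The ReLU and softmax activations, the random untrained weights and every function name are assumptions; it only shows the shape of a forward pass, not the trained model.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights, for illustration only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = init_layer(612, 70, rng)  # input -> hidden
w2, b2 = init_layer(70, 2, rng)    # hidden -> output


def predict(features):
    """Return two class probabilities (e.g. tranquil vs. distress)."""
    hidden = relu(dense(features, w1, b1))
    return softmax(dense(hidden, w2, b2))


x = [rng.gauss(0.0, 1.0) for _ in range(612)]  # placeholder input vector
probs = predict(x)
print(probs)
```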


The news data are preprocessed in order to extract sentences related to banks and assign them to the corresponding bank

News data preprocessing

> News articles are linked to each bank on the basis of references to that bank in the article

> Each article is split into sentences and only the sentences referring to a bank are kept

> Sentences with multiple bank references are discarded

> Each kept sentence is transformed into a 600-dimensional sentence vector holding its semantic content

> The timestamp of each sentence is kept to enable aggregation of distress predictions on a monthly basis
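The filtering steps above can be sketched as follows. The alias table, the naive sentence splitter, the example article and all function names are hypothetical. The slides do not describe how bank references were actually detected, and the embedding step (sentence to 600-dimensional vector) is left out here.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical alias table: bank name -> lowercase strings that refer to it.
BANK_ALIASES = {
    "Dexia": ["dexia"],
    "SocGen": ["socgen", "societe generale"],
}


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def banks_mentioned(sentence):
    lowered = sentence.lower()
    return {bank for bank, aliases in BANK_ALIASES.items()
            if any(alias in lowered for alias in aliases)}


def extract_bank_sentences(article_text, published):
    """Keep only sentences that mention exactly one bank, with the timestamp."""
    kept = []
    for sentence in split_sentences(article_text):
        banks = banks_mentioned(sentence)
        if len(banks) == 1:  # no bank or several banks: discard
            kept.append((next(iter(banks)), published, sentence))
    return kept


def group_by_bank_month(records):
    """Group kept sentences by (bank, year, month) for monthly aggregation."""
    groups = defaultdict(list)
    for bank, published, sentence in records:
        groups[(bank, published.year, published.month)].append(sentence)
    return dict(groups)


# Invented article text, for illustration only.
article = ("Dexia asked for state support on Monday. "
           "Markets were calm. "
           "Dexia and SocGen shares both fell.")
records = extract_bank_sentences(article, date(2008, 9, 29))
groups = group_by_bank_month(records)
print(groups)
```

Only the first sentence survives here: the second mentions no bank and the third mentions two.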


The two steps first perform unsupervised feature learning (semantic vectors) and then a distress-signal classification task

Two-step paradigm

Figure labels: financial numerical data; layer sizes 612 and 70


We exploit numerical data to add contextual information to the news interpretation and classification task

Logic behind the integration of numerical and textual data

> The same news could affect banks differently depending on their country, their accounts and the state of the banking sector

> Numerical data

– Country Macrofinancial

– Banking sector

– Bank accounts
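One plausible way to combine the two sources is to append the 12 numerical variables to each 600-dimensional sentence vector, which would give the 612 input nodes shown in the model scheme. The slides do not confirm this arithmetic, so the sketch below treats it as an assumption. Matching a sentence to the financial data of the same calendar quarter (rather than a lagged one) is a further assumption, as are all names and placeholder values.

```python
def quarter_of(year, month):
    """Map a calendar month to its (year, quarter) key."""
    return (year, (month - 1) // 3 + 1)


def build_input(sentence_vector, bank, year, month, financials):
    """Concatenate a sentence vector with the bank's quarterly variables.

    `financials` is a hypothetical lookup: (bank, (year, quarter)) -> 12 values.
    """
    numeric = financials[(bank, quarter_of(year, month))]
    if len(sentence_vector) != 600 or len(numeric) != 12:
        raise ValueError("expected 600 text features and 12 numerical variables")
    return list(sentence_vector) + list(numeric)


# Placeholder data, for illustration only.
financials = {("BankA", (2008, 3)): [0.0] * 12}
sentence_vector = [0.0] * 600

x = build_input(sentence_vector, "BankA", 2008, 9, financials)
print(len(x))
```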


The numerical data include bank accounts, banking sector and macro-financial quarterly data for 62 banks from 2007 to 2014

Numerical dataset description

Model modification

> Bank accounts data, banking sector data, country macro-financial data

> Unique ID

> 12 variables

> 62 banks

> 2007-2014

> Quarterly data


To benchmark the model performance we apply the Usefulness measure

Definition of the Usefulness measure

> The Usefulness measure accounts for:

– Error preference

– Performance gain vs. baseline (absolute)

– Performance gain vs. perfect model (relative)

> Base loss (Lb) definition

– The best guess according to prior probabilities p(obs) and error preference µ

> Absolute usefulness (Ua)

– Absolute Usefulness Ua measures the gain vis-à-vis the baseline case, i.e. the base loss minus the model loss

> Relative usefulness (Ur)

– Relative Usefulness Ur relates the gain to that of a perfect model
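The formulas on this slide did not survive extraction. The sketch below therefore assumes the formulation commonly used for this measure in the early-warning literature: a loss that weights missed events and false alarms by the error preference µ and the class frequencies, a base loss equal to the better of always or never signalling, and usefulness as the gain over that base loss. The function name and the confusion-matrix counts are illustrative, not results from the study.

```python
def usefulness(tp, fp, tn, fn, mu):
    """Return (absolute, relative) usefulness from confusion-matrix counts.

    mu is the preference for avoiding missed events (0 < mu < 1).
    Assumes both classes occur in the data.
    """
    n = tp + fp + tn + fn
    p1 = (tp + fn) / n   # unconditional probability of a distress event
    p2 = 1.0 - p1        # unconditional probability of tranquillity
    t1 = fn / (tp + fn)  # share of distress events that were missed
    t2 = fp / (fp + tn)  # share of tranquil cases flagged as distress
    model_loss = mu * t1 * p1 + (1.0 - mu) * t2 * p2
    # Base loss: best guess using only priors and preferences
    # (never signalling costs mu * p1, always signalling costs (1 - mu) * p2).
    base_loss = min(mu * p1, (1.0 - mu) * p2)
    absolute = base_loss - model_loss
    relative = absolute / base_loss
    return absolute, relative


# Invented counts, for illustration only.
ua, ur = usefulness(tp=8, fp=10, tn=80, fn=2, mu=0.9)
print(round(ua, 3), round(ur, 3))

# A perfect model has zero loss, so its relative usefulness is 1.
ua_perfect, ur_perfect = usefulness(tp=10, fp=0, tn=90, fn=0, mu=0.9)
```

Under this formulation a relative usefulness of 43% would mean the model captures 43% of the gain that a perfect classifier would achieve over the base loss.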


The integration of numerical and textual data enhances the relative usefulness of the model

Preliminary results

> The Relative Usefulness obtained with the combination of textual and numerical data is on average 43%

Figure: bar chart of relative usefulness with values 43%, 31% and 12%


The sensitivity analysis shows that the results are stable for a large number of network configurations (Numeric + Text data case)

Sensitivity analysis

Figure: three panels, each plotting Relative Usefulness (vertical axis) against hidden nodes number (horizontal axis)

> Numerical data hidden layer size sensitivity (5 to 35 hidden nodes)

> Textual data hidden layer size sensitivity (35 to 65 hidden nodes)

> Numerical and textual data hidden layer size sensitivity (35 to 105 hidden nodes)


Conclusions and next steps

Pros

> Results show that adding textual data to numerical data increases predictive power in the distress prediction task: the model shows increased relative usefulness.

> The model is easily scalable: the more news data are added, the better the performance.

Cons

> News is still used inefficiently: splitting a single news item into several sentences impoverishes the information base.

Next steps

> Improve the aggregation of news data at bank and month level (e.g. feed the entire monthly news content to the model and let it decide which items are most important)



Thank you for your attention
