2017-24-03 Deutsche Bundesbank big_data_workshop
Giancarlo Nicola, Università di Pavia
Big Data Workshop, Bundesbank, Frankfurt, 24th March 2017
Textual data analysis in Finance
220150831_Pavia.pptx
This research aims to integrate textual and numerical data to improve bank distress forecasting
Description of the research project
Aim
> Enhanced predictive model for bank stress events
– Bank stress predictions by integrating textual and numerical data in a single framework of analysis
Tools
> Modeling tools (Python & Machine Learning)
– Document embeddings for semantic space representation
– Neural network for supervised classification task
Data
> Textual data
– 6M Reuters news articles
> Numerical data (12 variables, quarterly)
– European bank accounts, country-level banking sector data, country-level macrofinancial data
We first ran an exploratory analysis of the textual dataset to understand the topics discussed
Topics of the textual news dataset
Topic 1: Investing and market expectations
> Highest prob: softwar, network, communic, forwardlook, integr, distribut, organ, common, prospectus, program, design, rtgs, mgrs, matdebt, experi, nyse, enterpris, site, approxim
> FREX: softwar, network, communic, forwardlook, prospectus, rtgs, mgrs, matdebt, nyse, healthcar, innov, award, locat, user, digit, highgrad, baabbbbbb, capabl, advertis, educ
Topic 2: Market performance
> Highest prob: brussel, thirdquart, instrument, firsthalf, notif, auction, belgian, diari, firstquart, notifi, proxi, luxembourg, certif, secondquart, dexia, holder, aegon, offeroroffere, trigger, nomine
> FREX: brussel, thirdquart, firsthalf, notif, belgian, diari, notifi, proxi, luxembourg, dexia, offeroroffere, nomine, doubleclick, threshold, ceas, breakdown, inbev, tiger, omega, arcelormitt
Topic 3: Countries and sovereign
> Highest prob: greec, ireland, hong, kong, india, australia, conf, singapor, south, purchasesel, billiton, spain, chines, greek, russia, brazil, telecom, mexico, spanish, dubai
> FREX: hong, kong, conf, purchasesel, greek, russia, brazil, dubai, paidreceiv, indian, kosovo, russian, purchasesal, athen, nameeg, peso, emiss, taiwan, serbia, natixi
Topic 4: Bank crisis
> Highest prob: henderson, court, unaudit, fitch, switzerland, committe, jpmorgan, restructur, rescu, familiar, scheme, investig, socgen, settlement, formula, hire, fair, lehman, pension, legal
> FREX: henderson, court, unaudit, socgen, formula, bonus, buyout, pantlin, tier, lawyer, kerviel, probe, consortium, lawsuit, alleg, italian, senat, fraud, taxpay, virgin
Topic 5: Materials, currencies, indexes
> Highest prob: cell, gold, prefer, miner, sterl, tokyo, climb, nikkei, peak, downgrad, versus, auction, shed, sharpli, jone, sharp, survey, weaker, copper, tumbl
> FREX: cell, gold, sterl, climb, nikkei, versus, shed, weaker, copper, steadi, rebound, ounc, bernank, overnight, drag, payrol, straight, usytrr, gainer, threemonth
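The two word lists per topic rank the same vocabulary in different ways: by raw probability within the topic, and by a score that also rewards words that are exclusive to the topic. The deck does not say how its lists were computed; the sketch below assumes the commonly used FREX definition (a weighted harmonic mean of a word's frequency rank and exclusivity rank within a topic). The matrix `beta`, the vocabulary and all numbers are made up for illustration.

```python
def ecdf(values):
    """Empirical CDF: for each value, the share of values that are <= it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex_scores(beta, topic, weight=0.5):
    """FREX-style score of every word for one topic.

    beta[k][v] is the probability of word v under topic k (hypothetical input).
    weight is the share given to exclusivity; 1 - weight goes to frequency.
    """
    n_words = len(beta[topic])
    frequency = beta[topic]
    # Exclusivity: share of a word's total probability mass held by this topic.
    exclusivity = [
        beta[topic][v] / sum(row[v] for row in beta) for v in range(n_words)
    ]
    freq_rank = ecdf(frequency)
    excl_rank = ecdf(exclusivity)
    # Weighted harmonic mean of the two within-topic ranks.
    return [
        1.0 / (weight / excl_rank[v] + (1.0 - weight) / freq_rank[v])
        for v in range(n_words)
    ]


def top_words(beta, vocab, topic, by="prob", n=3):
    """Top-n words of a topic, ranked by probability or by FREX-style score."""
    scores = beta[topic] if by == "prob" else frex_scores(beta, topic)
    order = sorted(range(len(vocab)), key=lambda v: scores[v], reverse=True)
    return [vocab[v] for v in order[:n]]


# Toy topic-word matrix: two topics, four stemmed words (made-up numbers).
vocab = ["bank", "rescu", "gold", "court"]
beta = [
    [0.40, 0.30, 0.05, 0.25],
    [0.45, 0.05, 0.45, 0.05],
]
print(top_words(beta, vocab, 0, by="prob"))
print(top_words(beta, vocab, 0, by="frex"))
```

In this toy case "bank" is the most probable word in both topics, so it heads the probability ranking for topic 0 but drops behind the more exclusive "rescu" in the FREX-style ranking.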
The most prevalent topics concern safe assets, materials and the bank crisis
Model result: topic prevalence [2007-2010]
[Figure: prevalence by topic title. Topics shown: Investing and market expectations; Countries and sovereign; Market performance; Materials, currencies, indexes; Bank crisis]
The word clouds give further insight into the discussion topics
[Figure: word clouds for Topic 1: Investing and market expectations; Topic 2: Market performance; Topic 3: Countries and sovereign; Topic 4: Bank crisis; Topic 5: Materials, currencies, indexes]
Comparing news about Italian and German banks, the Italian news focuses more on the bank crisis
Topic prevalence difference between Italy and Germany
[Figure: topic prevalence difference; labelled topics: Market performance, Bank crisis]
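A prevalence difference of this kind can be illustrated by averaging per-document topic proportions within each country and subtracting. This is only a rough sketch: the slide does not say how the difference was estimated, and the country codes and proportions below are made up.

```python
from collections import defaultdict


def mean_prevalence(doc_topics):
    """Average topic proportions over a list of documents."""
    n_topics = len(doc_topics[0])
    return [
        sum(doc[k] for doc in doc_topics) / len(doc_topics)
        for k in range(n_topics)
    ]


topics = ["Market performance", "Bank crisis"]

# (country, topic proportions per document); made-up illustrative values.
docs = [
    ("IT", [0.3, 0.7]),
    ("IT", [0.4, 0.6]),
    ("DE", [0.6, 0.4]),
    ("DE", [0.7, 0.3]),
]

by_country = defaultdict(list)
for country, proportions in docs:
    by_country[country].append(proportions)

prevalence = {c: mean_prevalence(p) for c, p in by_country.items()}

# Positive values: the topic is more prevalent in Italian than in German news.
difference = {
    topic: prevalence["IT"][k] - prevalence["DE"][k]
    for k, topic in enumerate(topics)
}
print(difference)
```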
MPS tweet topics evolution and prevalence over six months
Topics prevalence difference between Italy and Germany [2007-2010]
Deep learning bank distress in news and financial data
Big Data Workshop, Bundesbank, Frankfurt, 24th March 2017
Paola Cerchiello (1), Giancarlo Nicola (1), Samuel Rönnqvist (2), Peter Sarlin (3)
(1) University of Pavia, Italy; (2) Hanken School of Economics, Finland; (3) Åbo Akademi University, Finland
The model derives distress signals from textual and numerical data using a semantic deep-learning approach divided into two steps
Description of the two-step approach
Step 1: Learning semantic vectors through nearby-word prediction
Step 2: Learning to predict distress, based on semantic vectors of articles and financial data
Distress: event of capital injection, request for central bank intervention or default
Word embedding model
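"Nearby-word prediction" in Step 1 can be pictured as turning each sentence into (word, neighbour) training pairs, which an embedding model then learns to predict. The sketch below only builds those pairs for a fixed window; the window size and the example sentence are assumptions, and the actual embedding training used in the deck (Doc2Vec) is not reproduced here.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up example sentence.
sentence = "the bank requested a capital injection".split()
pairs = context_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A model trained on such pairs places words that share contexts close together, which is the property the next slide relies on.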
Word embeddings map words to a reduced semantic space where particular properties can be extracted by vector operations
Properties of word embeddings
> 200-800 dimensions, depending on the task and the available corpus
> Words that are close in this space have similar semantic value
> A better space for classification tasks: dimensionality is reduced while the semantic value of words is preserved
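The "nearness" of words in the embedding space is usually measured with cosine similarity. A minimal sketch, using made-up 3-dimensional vectors in place of the 200-800 dimensional ones mentioned above:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings (hypothetical values chosen for illustration only).
embeddings = {
    "bailout": [0.9, 0.1, 0.0],
    "rescue": [0.8, 0.2, 0.1],
    "copper": [0.0, 0.1, 0.9],
}


def nearest(word, vectors):
    """Most similar other word according to cosine similarity."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
    )


print(nearest("bailout", embeddings))
```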
The model takes textual news data and numerical financial data as input to predict the distress condition of a bank
Scheme of textual and numerical data integration model
> Inputs: Reuters articles and financial numerical data
> Document embeddings (unsupervised learning): Doc2Vec, Gensim implementation
> Neural network classification (supervised learning): 612 input nodes, 70 hidden nodes, 2 output nodes
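The layer sizes in the scheme (612, 70 and 2 nodes) describe a small feed-forward classifier. The sketch below runs a single forward pass through a network of that shape with random, untrained weights. The tanh hidden activation, the softmax output and the weight initialisation are assumptions, since the slide gives only the node counts.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 612, 70, 2


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, weights, biases):
    """Affine transform: one weighted sum plus bias per output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Turn scores into probabilities that sum to one."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output):
    """Input -> tanh hidden layer -> softmax over the two output classes."""
    h = [math.tanh(v) for v in dense(x, *hidden)]
    return softmax(dense(h, *output))


rng = random.Random(0)
hidden = init_layer(N_INPUT, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_OUTPUT, rng)

# Stand-in input vector; a real one would come from the text and numerical data.
x = [rng.uniform(-1.0, 1.0) for _ in range(N_INPUT)]
probs = forward(x, hidden, output)
print(probs)
```

With untrained weights the two probabilities carry no information; training on labelled distress events is what would give them meaning.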
The news data are preprocessed to extract sentences related to banks and assign them to the corresponding bank
News data preprocessing
> News articles are linked to each bank based on references to that bank in the article
> Each article is split into sentences and only the sentences referring to a bank are kept
> Sentences with multiple bank references are discarded
> Each kept sentence is transformed into a 600-dimensional sentence vector holding its semantic content
> The timestamp of each sentence is kept to enable aggregation of distress predictions on a monthly basis
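The filtering steps above can be sketched as follows. The alias table, the naive sentence splitter and the example article are all hypothetical; the deck does not describe how bank references were actually detected, and the embedding step is left out.

```python
import re
from datetime import date

# Hypothetical alias table: bank name -> lowercase strings that refer to it.
BANK_ALIASES = {
    "Dexia": ["dexia"],
    "SocGen": ["socgen", "societe generale"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def banks_mentioned(sentence):
    """Banks whose aliases appear in the sentence."""
    lowered = sentence.lower()
    return [
        bank
        for bank, aliases in BANK_ALIASES.items()
        if any(alias in lowered for alias in aliases)
    ]


def extract_bank_sentences(article_text, published):
    """Keep sentences that mention exactly one bank, with bank and timestamp."""
    kept = []
    for sentence in split_sentences(article_text):
        banks = banks_mentioned(sentence)
        if len(banks) == 1:  # no reference or multiple references: discard
            kept.append({"bank": banks[0], "date": published, "sentence": sentence})
    return kept


# Made-up article text and publication date.
article = (
    "Dexia asked for state support on Monday. "
    "Dexia and SocGen shares both fell. "
    "Copper prices were steady."
)
records = extract_bank_sentences(article, date(2011, 10, 4))
print(records)
```

Only the first sentence survives: the second mentions two banks and the third mentions none.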
The two steps first perform unsupervised feature learning (semantic vectors) and then a distress-signal classification task
Two-step paradigm
[Diagram: semantic vectors and financial numerical data enter a network with layers of 612 and 70 nodes]
We exploit numerical data to add contextual information to the news interpretation and classification task
Logic behind the integration of numerical and textual data
> The same news could affect banks differently depending on their country, their accounts and the state of their banking sector
> Numerical data
– Country Macrofinancial
– Banking sector
– Bank accounts
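One plausible reading of the integration is that each 600-dimensional sentence vector is concatenated with the 12 numerical variables, giving the 612 input nodes shown in the model scheme. The sketch below assumes exactly that, plus a standardisation of the numerical block that the deck does not mention.

```python
SENTENCE_DIM = 600  # size of the sentence vector given in the deck
N_NUMERIC = 12      # number of numerical variables given in the deck


def standardize(values, means, stds):
    """Scale each numerical variable to zero mean and unit variance (assumed step)."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def build_input(sentence_vector, numeric_values, means, stds):
    """Concatenate text and numerical features into one 612-dimensional input."""
    if len(sentence_vector) != SENTENCE_DIM:
        raise ValueError("expected a 600-dimensional sentence vector")
    if len(numeric_values) != N_NUMERIC:
        raise ValueError("expected 12 numerical variables")
    return list(sentence_vector) + standardize(numeric_values, means, stds)


# Made-up inputs for illustration.
sentence_vector = [0.0] * SENTENCE_DIM
numeric_values = [float(i) for i in range(N_NUMERIC)]
means = [1.0] * N_NUMERIC
stds = [2.0] * N_NUMERIC

x = build_input(sentence_vector, numeric_values, means, stds)
print(len(x))
```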
The numerical data include quarterly bank accounts, banking sector and country macrofinancial data for 62 banks from 2007 to 2014
Numerical dataset description
Model modification
Bank accounts data, banking sector data, country macrofinancial data
Unique ID
12 variables
62 banks
2007-2014
Quarterly data
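Because the numerical data are quarterly while each sentence carries its own timestamp, the two have to be aligned somehow. A minimal sketch, assuming a lookup keyed by bank ID, year and quarter; the table layout, bank ID and values are hypothetical, and the deck does not say whether a reporting lag was applied.

```python
from datetime import date


def quarter_of(day):
    """Map a calendar date to (year, quarter)."""
    return day.year, (day.month - 1) // 3 + 1


# Hypothetical table: (bank id, year, quarter) -> 12 numerical variables.
numeric_data = {
    ("BANK_A", 2010, 1): [0.1] * 12,
    ("BANK_A", 2010, 2): [0.2] * 12,
}


def numeric_features(bank_id, day, table):
    """Numerical variables for the quarter containing the date, or None if missing."""
    year, quarter = quarter_of(day)
    return table.get((bank_id, year, quarter))


print(numeric_features("BANK_A", date(2010, 5, 15), numeric_data))
```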
To benchmark the model performance we apply the Usefulness measure
Definition of the Usefulness measure
> The Usefulness measure accounts for:
– Error preference
– Performance gain vs. baseline (absolute)
– Performance gain vs. perfect model (relative)
> Baseline loss (Lb) definition
– The best guess according to prior probabilities p(obs) and error preference µ
> Relative usefulness (Ur)
– Relative Usefulness Ur relates the gain to that of a perfect model
> Absolute usefulness (Ua)
– Absolute Usefulness Ua measures the gain vis-à-vis the baseline case
(Model loss)
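The formulas on this slide did not survive in the transcript. The sketch below assumes the loss-based definition of Usefulness common in the early-warning literature: the model loss weights missed events and false alarms by the error preference µ and the class priors, the baseline loss is the better of always or never signalling, Ua is the baseline loss minus the model loss, and Ur is Ua divided by the baseline loss. The confusion-matrix counts are made up.

```python
def usefulness(tp, fp, fn, tn, mu):
    """Return (model_loss, baseline_loss, absolute_usefulness, relative_usefulness).

    tp, fp, fn, tn: confusion-matrix counts, with distress as the positive class.
    mu: preference for avoiding missed distress events, between 0 and 1.
    """
    total = tp + fp + fn + tn
    p1 = (tp + fn) / total   # prior probability of distress, p(obs = 1)
    p2 = 1.0 - p1            # prior probability of tranquil periods
    t1 = fn / (tp + fn)      # share of distress events that were missed
    t2 = fp / (fp + tn)      # share of tranquil periods flagged by mistake
    model_loss = mu * t1 * p1 + (1.0 - mu) * t2 * p2
    # Best guess without a model: always signal or never signal.
    baseline_loss = min(mu * p1, (1.0 - mu) * p2)
    absolute = baseline_loss - model_loss
    relative = absolute / baseline_loss
    return model_loss, baseline_loss, absolute, relative


# Made-up counts for illustration.
loss, base, ua, ur = usefulness(tp=30, fp=20, fn=10, tn=140, mu=0.8)
print(round(loss, 3), round(base, 3), round(ua, 3), round(ur, 3))
```

A perfect model has zero loss, so its Ua equals the baseline loss and its Ur equals 1; a model no better than the baseline has Ur of 0 or below.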
The integration of numerical and textual data enhances the relative usefulness of the model
Preliminary results
> The Relative Usefulness obtained with the combination of textual and numerical data is on average 43%
[Chart values: 43%, 31%, 12%]
The sensitivity analysis shows that the results are stable for a large number of network configurations (Numeric + Text data case)
Sensitivity analysis
[Figure: three panels plotting Relative Usefulness against the number of hidden nodes]
> Numerical data hidden layer size sensitivity (5 to 35 hidden nodes; Relative Usefulness axis 0.00 to 0.40)
> Textual data hidden layer size sensitivity (35 to 65 hidden nodes; Relative Usefulness axis 0.08 to 0.14)
> Numerical and textual data hidden layer size sensitivity (35 to 105 hidden nodes; Relative Usefulness axis 0.05 to 0.45)
Conclusions and next steps
Pros
> Results show that adding textual data to numerical data increases predictive power in the distress prediction task: the model's relative usefulness increases.
> The model is easily scalable: the more news data are added, the better the performance.
Cons
> News is still used inefficiently: splitting a single news item into several sentences impoverishes the information base.
Next steps
> Improve the aggregation of news data at bank and month level (e.g. feed the entire monthly news content to the model and let it decide which parts are most important)
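As a reference point for the aggregation the deck wants to improve, sentence-level predictions can be pooled per bank and month with a plain average. The averaging rule, the bank ID and the probabilities below are assumptions; the deck does not state which aggregation it used.

```python
from collections import defaultdict
from datetime import date


def monthly_distress_signal(sentence_predictions):
    """Average sentence-level distress probabilities per (bank, year, month)."""
    buckets = defaultdict(list)
    for bank, day, probability in sentence_predictions:
        buckets[(bank, day.year, day.month)].append(probability)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


# (bank id, sentence date, predicted distress probability); made-up values.
predictions = [
    ("BANK_A", date(2010, 5, 3), 0.25),
    ("BANK_A", date(2010, 5, 20), 0.75),
    ("BANK_A", date(2010, 6, 1), 0.5),
]
signal = monthly_distress_signal(predictions)
print(signal)
```

A simple mean treats every sentence as equally informative, which is the limitation the next step aims to address.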
Information class classifier
Thank you for your attention