Data acquisition and FIRST datasets
Miha Grčar, Jožef Stefan Institute
FIRST Y3 Review Meeting
Activity in Y3
Ontology evolution
Data acquisition software (DacqPipe)
FIRST dataset of news & blogs
Luxembourg, Nov 2013, FIRST Y3 Review Meeting
Ontology evolution
[Figure: ontology evolution overview.]
• (Nearly) static part: FIRST ontology
  Indices, stocks, companies, geo-entities, actors…; sentiment vocabulary; topic taxonomies
• Dynamic part: models for sentiment classification*; models for canyon flow visualization
• Also shown: semantic & lexical resources, IDMS API; topic detection & tracking; active learning*; “Knowledge base”
* Smailović, Grčar, Lavrač, Žnidaršič: Stream-based active learning for sentiment analysis in the financial domain (to appear)
Data acquisition pipeline (DacqPipe)
• Resembles big data streaming architectures such as Twitter Storm
• Running continuously since April 2011
• Several scientific contributions
  • Boilerplate remover & gold standard dataset
  • Ontology & ontology-based information extractor
• Executable available at http://first.ijs.si/software/DacqPipeJun2013.zip
• Source code: https://github.com/project-first/dacqpipe
[Figure: pipeline diagram with the stages Read & parse, Clean, Syntactic analysis, Semantic preprocessing and Store. Components shown: RSS reader, HTML tokenizer, boilerplate remover & duplicate detector, language detector, filter, NLP pipe, OBIE, DB writer and DB. The component chain appears twice (parallel pipelines), together with an “Emit” step and a 0MQ channel.]
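The stage chain in the diagram can be pictured as a sequence of generators, each consuming the output of the previous stage. The sketch below is only an illustration: it is not DacqPipe code, and every function name and the toy logic (regex tag stripping, exact-text duplicate detection, whitespace tokenization) are assumptions made for the example.

```python
import re
from typing import Any, Dict, Iterable, Iterator, List

Doc = Dict[str, Any]


def read_feed(items: Iterable[Doc]) -> Iterator[Doc]:
    """Stand-in for the RSS reader: yields raw documents one at a time."""
    for item in items:
        yield dict(item)


def strip_html(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Toy cleaning stage: drop tags and collapse whitespace."""
    for doc in docs:
        text = re.sub(r"<[^>]+>", " ", doc["html"])
        doc["text"] = re.sub(r"\s+", " ", text).strip()
        yield doc


def drop_duplicates(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Toy duplicate detector: skip documents whose clean text was already seen."""
    seen = set()
    for doc in docs:
        if doc["text"] in seen:
            continue
        seen.add(doc["text"])
        yield doc


def tokenize(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Stand-in for the NLP pipe: lowercase whitespace tokenization."""
    for doc in docs:
        doc["tokens"] = doc["text"].lower().split()
        yield doc


def store(docs: Iterable[Doc], db: List[Doc]) -> int:
    """Stand-in for the DB writer: append to an in-memory list."""
    count = 0
    for doc in docs:
        db.append(doc)
        count += 1
    return count


# Hypothetical input documents, invented for the example.
feed = [
    {"url": "http://example.org/a", "html": "<p>Stocks rise</p>"},
    {"url": "http://example.org/b", "html": "<div>Stocks   rise</div>"},
    {"url": "http://example.org/c", "html": "<p>Bonds fall</p>"},
]
db: List[Doc] = []
stored = store(tokenize(drop_duplicates(strip_html(read_feed(feed)))), db)
print(stored, [d["url"] for d in db])
```

Because each stage is lazy, a document flows through the whole chain before the next one is read, which is the property that makes the design suitable for a continuous stream.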
Boilerplate removal
Streaming setting
Hypothesis
Web pages at similar Web addresses share common boilerplate, while main content is unique
URL Tree
[Figure: documents arrive as a stream and are placed in the URL tree.]
Example: http://www.bbc.co.uk/sports/story2371.html contains the text block “About us”.
How many times did I see “About us” in this part of the tree?
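The question on the slide can be made concrete with a small tree keyed by host name and URL path elements, where each node counts the text blocks seen on pages below it. This is a minimal sketch under assumptions: the class and method names are hypothetical, a "text block" is simply a string, and the slides do not say at which depth the tree is queried or what count marks a block as boilerplate.

```python
from collections import Counter
from urllib.parse import urlparse


class UrlTreeNode:
    def __init__(self):
        self.children = {}             # path element -> UrlTreeNode
        self.block_counts = Counter()  # text block -> number of pages containing it


class UrlTree:
    def __init__(self):
        self.root = UrlTreeNode()

    @staticmethod
    def path_elements(url):
        """Host first, then directory segments; the page name itself is dropped
        so that sibling pages share the same node (an assumption of this sketch)."""
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        return [parsed.netloc] + segments[:-1]

    def add_page(self, url, blocks):
        """Count each distinct block once per page at every node on the page's path."""
        unique_blocks = set(blocks)
        node = self.root
        node.block_counts.update(unique_blocks)
        for element in self.path_elements(url):
            node = node.children.setdefault(element, UrlTreeNode())
            node.block_counts.update(unique_blocks)

    def times_seen(self, url, block, depth):
        """How many pages under the first `depth` path elements contained `block`?"""
        node = self.root
        for element in self.path_elements(url)[:depth]:
            node = node.children.get(element)
            if node is None:
                return 0
        return node.block_counts[block]


# Hypothetical pages, invented for the example.
tree = UrlTree()
tree.add_page("http://www.bbc.co.uk/sports/story1.html", ["About us", "Team wins final"])
tree.add_page("http://www.bbc.co.uk/sports/story2.html", ["About us", "Coach resigns"])
tree.add_page("http://www.bbc.co.uk/news/story3.html", ["About us", "Election results"])

query = "http://www.bbc.co.uk/sports/story2371.html"
sports_count = tree.times_seen(query, "About us", depth=2)         # under /sports
site_count = tree.times_seen(query, "About us", depth=1)           # whole host
unique_count = tree.times_seen(query, "Team wins final", depth=2)  # main content
print(sports_count, site_count, unique_count)
```

Blocks with a high count in the relevant subtree are candidates for boilerplate, while blocks seen only once are likely main content, in line with the hypothesis above.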
Evaluation
Dataset
• 569,583 time-stamped documents (stream)
• 292,053 documents after URL normalization
• Oct 24 – Dec 19, 2011; 31 Web sites
• Part of the FIRST dataset of news & blogs
Gold standard
• 56,436 documents annotated with manually designed regexes tailored for specific Web sites
Evaluation
[Figure: evaluation results plot, with a point labelled “Reset”.]
Gold standard dataset: http://first.ijs.si/urltreedataset
Conclusion: Final results of WP3
• Data acquisition pipeline software (DacqPipe)
  Since April 2011
  https://github.com/project-first/dacqpipe
• FIRST dataset of news & blogs
  219 Web sites; ~15 million unique documents
  http://first.ijs.si/FIRSTDataset
• FIRST ontology
  Semantic + lexical part
  Information extraction + sentiment analysis
  http://first.ijs.si/FIRSTOntology/y3
Achim Klein (UHOH), 20 November, Luxembourg
Technical Presentations and Demos: Sentiment Analysis
Knowledge-based Sentiment Extraction
a) Direct sentiment
   Example: “I expect the S&P 500 to rise” → positive sentiment
   Addressed by rules
b) Indirect sentiment, using indicators
   Example: “I think U.S. interest rates will rise” → negative sentiment
   Addressed by ontology
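The difference between the two cases can be sketched in a few lines: a rule detects the direction of the expected movement, and a lookup decides whether the moving thing is a sentiment object itself (direct) or an indicator whose rise has a known effect (indirect). The dictionaries, the regular expression and the function names below are hypothetical stand-ins for the project's rules and ontology, covering only the two slide examples.

```python
import re

# Direction words handled by the toy rule (assumption of this sketch).
DIRECTION = {"rise": +1, "increase": +1, "fall": -1, "decrease": -1}

# Hypothetical list of direct sentiment objects.
SENTIMENT_OBJECTS = {"s&p 500"}

# Hypothetical mini "ontology": effect on sentiment when the indicator goes UP.
INDICATOR_EFFECT = {"interest rates": -1, "unemployment": -1}

MOVE = re.compile(r"\b(?:to|will)\s+(rise|increase|fall|decrease)\b")


def label(polarity):
    return "positive" if polarity > 0 else "negative"


def classify(sentence):
    """Return (kind, sentiment) or None if the toy rule does not fire."""
    text = sentence.lower()
    match = MOVE.search(text)
    if match is None:
        return None
    direction = DIRECTION[match.group(1)]
    for obj in SENTIMENT_OBJECTS:
        if obj in text:
            return ("direct", label(direction))
    for indicator, effect in INDICATOR_EFFECT.items():
        if indicator in text:
            return ("indirect", label(direction * effect))
    return None


direct = classify("I expect the S&P 500 to rise")
indirect = classify("I think U.S. interest rates will rise")
print(direct, indirect)
```

The indirect case only works because the lookup knows that rising interest rates count as negative, which is the kind of knowledge the slide attributes to the ontology.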
UC Retail Brokerage/Market Surveillance: Economic Indicators
• Debt to Equity, Dividend Yield, Earnings to Price Ratio, New Products, Profit Margin, Sales, …
• Interest Rate, Inflation, M2 Change Rate, Durable Goods Orders, Unemployment, Private Housing, New Building Permits, …
• Advance/Decline Ratio, Bear Flag, Break Out, Double Bottom, RSI, Support, Resistance, …
Example Insights: Unemployment Indicator
[Figure: Unemployment Indicator Volume over time, 1/1/2013 to 10/1/2013 (y-axis 0 to 180). Annotations: official US unemployment statistics release dates; record Greek unemployment numbers released.]
UC Reputational Risk: Reputation Indicators (Y3)
Positive and negative sample indicators per reputation topic:
• Social Responsibility
  Positive correlation: Charity, Donation, Education
  Negative correlation: Crime, Bullying, Slave
• Human Resources
  Positive correlation: Professional, Talent, Manpower
  Negative correlation: Lay off, Job cuts, Wrongdoers
• Business Behavior
  Positive correlation: Transparent, Responsible, Campaign
  Negative correlation: Debt, Foreclosure, Price-fixing
• Corporate Governance
  Positive correlation: Accountable, Tier 1 ratio, AML
  Negative correlation: Breach, Shady funds, Law suit
• Exposure on Critical Markets
  Positive correlation: Subsidy, Liquidity, Customers
  Negative correlation: Subprime Mortgage, CDS spread
Total number of indicators: 1451
Reputation Sentiment Classification Performance
                     Precision   Recall
Without Indicators   67.7%       23.7%
With Indicators      71.2%       44.9%
Higher recall of (indirect) sentiments by means of indicators
Reputational Insights: JPMorgan
[Figure: volume of Corporate Governance per day, 13.09.2013 to 04.11.2013 (y-axis 0 to 300).]
Annotated peaks:
• Corporate Governance, 19.09.2013: “Scandals cost JPMorgan $1 billion in fines” [REUTERS]
• Corporate Governance, 11.10.2013: “JPMorgan’s Dimon Posts First Loss on $7.2 Billion Legal Cost” [BLOOMBERG]
Fuzzy Sentiment Classification
1. Extract sentiment objects
   “Apple’s earnings are rising”
   “Sales might decrease because of the financial crisis”
2. Classify sentiment per object in each sentence
3. Generate machine-learning input: sentiments and words of all sentences that refer to the same object
4. Two separate document-level machine-learning fuzzy classifiers with 5 degrees of … (1) positive, (2) negative
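Step 3 of the procedure (collecting, per object, the sentence-level sentiments and the words of all sentences that mention it) can be sketched as a simple grouping. The function name, the tuple layout and the sentence-level labels are assumptions of this sketch; in particular, treating both slide sentences as referring to the object "Apple" is an illustrative choice, not something the slide states.

```python
from collections import defaultdict


def build_object_inputs(sentences):
    """Group sentence-level results by sentiment object.

    `sentences` is a list of (object, sentence_sentiment, text) tuples,
    i.e. the assumed output of steps 1 and 2.
    """
    grouped = defaultdict(lambda: {"sentiments": [], "words": []})
    for obj, sentiment, text in sentences:
        grouped[obj]["sentiments"].append(sentiment)
        grouped[obj]["words"].extend(text.lower().split())
    return dict(grouped)


# Sentence-level labels below are assumed for illustration.
sentence_results = [
    ("Apple", "positive", "Apple's earnings are rising"),
    ("Apple", "negative", "Sales might decrease because of the financial crisis"),
]
inputs = build_object_inputs(sentence_results)
print(inputs["Apple"]["sentiments"], len(inputs["Apple"]["words"]))
```

The resulting per-object record is what a document-level classifier, such as the two fuzzy classifiers of step 4, would take as input.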
Enhanced Gold Standard Corpus (Y3): Retail Brokerage/Market Surveillance
Corpus size: 409 (Y2), 1021 (Y3)
Improved hybrid sentiment classifier performance
       Precision   Recall
Y2     62.6%       54.2%
Y3     69.0%       62.4%
Main Results
• Deep knowledge-based sentiment analysis
• Specific to a feature of an object using rules (e.g., reputation of a company)
• Economic and reputation indicators improve classifier performance and provide valuable insights for users
• Glass-box approach with drill-down capabilities
• Best paper award at IEEE
• Fuzzy classifier with 5 degrees of positivity and negativity for better decision making
• Fuzzy-level Gold Standard Corpus
• Analyzed >3 million documents
• Open source available: git://github.com/project-first/semanticinformationextraction.git
Thank you
Luxembourg, November 20th, 2013
WP6 Technical Presentation & Demos
Marko Bohanec, Miha Grčar, Jan Muntermann, Michael Siering
WP6 Status End of Y2
[Figure: project timeline, months 19 to 36.]
• Mainly presenting basic stand-alone prototypes
• Presentation of the first models
• First visualisation components
WP6 Achievements Y3
[Figure: project timeline, months 19 to 36.]
T 6.2 / T 6.3 Machine Learning & Qualitative Models
• Refinements of qualitative models based on domain experts’ feedback
• Highly scalable implementations
• FIRST pipeline integration
• Delivery of D6.3 in M33
T 6.4 Visualisation Components
• Development of additional and revised visualisation components based on domain experts’ feedback
• Highly scalable implementations
• Delivery of D6.4 in M34
Agenda
Qualitative and quantitative models
• Reputational Risk Management
• Market Surveillance
Visualizations
• Retail Brokerage
Reputational Risk Problem Formulation (1/2)
General Area: Production and distribution of investment products and services by banks and other financial institutions.
Specific Use Case: Assessment of reputational risk (RI) based on assessments of MPS counterparties.
Reputational Risk: Risk arising from negative perception on the part of customers, counterparties, shareholders, investors, debt-holders, market analysts, other relevant parties or regulators that can adversely affect a bank’s ability to maintain existing, or establish new, business relationships and continued access to sources of funding.
Reputational Risk Problem Formulation (2/2)
Goal: to develop
• a multi-criteria model for the assessment of MPS reputational risk (RIM)
• that serves as the main component of corresponding DSS
Approach: expert modeling, qualitative multi-attribute modeling (method DEX)
[Figure: model inputs and output. Labels: time series; financial, trading data; sentiment data; evaluation; novelties.]
RIM: Main Components
[Figure: RIM architecture. Bank data on PRODUCT, CUSTOMER and COUNTERPART pass through basic data processing, then qualitative evaluation (DEXi model) giving qRI1, then aggregation (→ Customer → Product → Counterpart → Bank, using relative CUSTOMER numbers and relative PRODUCT volumes) giving RI.]
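RIM: Basic Data Processing
[Figure: data-flow diagram from BANK data on COUNTERPART, CUSTOMER and PRODUCT to the inputs of the qualitative evaluation. Blocks: PRODUCT VOLUMES, CUSTOMER VOLUMES, MISMATCHING, SENTIMENT, PERFORMANCE. Symbols shown include SLP, SSP, SPP, BP, SRI, w, f, ΔB, ΔM, P, RP, V1, VC, VP, TA, NP, TN; relative quantities RV1C, RVP and RNP are obtained by division (÷); qualitative outputs are qS, qP, qM and qRV1C.]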
RIM: Qualitative Evaluation
Aim: qualitative assessment of Reputational Index for one customer and product
Model: qualitative hierarchical rule-based DEX model
[Figure: attribute hierarchy with inputs qS, qP, qM and qRV1C; qP and qM are combined into qPM; the root is qRI1.]
Excerpt of decision rules (DEXi, RIM_D63.dxi, 10.7.13, page 2)

Rules 61 to 66 (end of a table with three inputs and one output; column headers not shown in the excerpt):
61   very-neg     medium:high   medium:high    high
62   *            very-high     >=high         very-high
63   >=low-neg    very-high     >=medium       very-high
64   >=high-neg   >=high        very-high      very-high
65   very-neg     >=medium      very-high      very-high
66   very-neg     very-high     >=medium-low   very-high

     qP           qM            qPM
1    <=low        in-line       in-line
2    in-line      low:medium    low
3    <=medium     low           low
4    medium       <=low         low
5    >=medium     in-line       low
6    in-line      high          medium
7    low:medium   medium        medium
8    >=high       low           medium
9    <=low        very-high     high
10   low          >=high        high
11   low:high     high          high
12   high         medium:high   high
13   >=high       medium        high
14   >=medium     very-high     very-high
15   very-high    >=high        very-high
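To show how such a rule table is read, the sketch below encodes the fifteen qP/qM → qPM rules and evaluates them with a small matcher for the condition notation (`<=x`, `>=x`, `a:b`, `*`, or a single value). Two things are assumptions of this sketch: the ordering of the value scale (in-line < low < medium < high < very-high, applied to all three attributes) and the helper names. It is not DEXi's own evaluation code.

```python
# Assumed ordinal scale; the excerpt does not state the ordering explicitly.
SCALE = ["in-line", "low", "medium", "high", "very-high"]
RANK = {value: i for i, value in enumerate(SCALE)}


def matches(condition, value):
    """Check one rule condition against a concrete scale value."""
    rank = RANK[value]
    if condition == "*":
        return True
    if condition.startswith("<="):
        return rank <= RANK[condition[2:]]
    if condition.startswith(">="):
        return rank >= RANK[condition[2:]]
    if ":" in condition:
        low, high = condition.split(":")
        return RANK[low] <= rank <= RANK[high]
    return rank == RANK[condition]


# (qP condition, qM condition, qPM result), rules 1 to 15 from the excerpt.
QPM_RULES = [
    ("<=low", "in-line", "in-line"),
    ("in-line", "low:medium", "low"),
    ("<=medium", "low", "low"),
    ("medium", "<=low", "low"),
    (">=medium", "in-line", "low"),
    ("in-line", "high", "medium"),
    ("low:medium", "medium", "medium"),
    (">=high", "low", "medium"),
    ("<=low", "very-high", "high"),
    ("low", ">=high", "high"),
    ("low:high", "high", "high"),
    ("high", "medium:high", "high"),
    (">=high", "medium", "high"),
    (">=medium", "very-high", "very-high"),
    ("very-high", ">=high", "very-high"),
]


def evaluate_qpm(qp, qm):
    """Return the qPM value on which all matching rules agree."""
    results = {out for cp, cm, out in QPM_RULES
               if matches(cp, qp) and matches(cm, qm)}
    if len(results) != 1:
        raise ValueError(f"no unique result for qP={qp}, qM={qm}: {results}")
    return results.pop()


# Evaluate the full 5 x 5 grid of input combinations.
grid = {(qp, qm): evaluate_qpm(qp, qm) for qp in SCALE for qm in SCALE}
print(grid[("low", "in-line")], grid[("in-line", "high")], grid[("very-high", "high")])
```

Under the assumed scale, every one of the 25 input combinations is covered by at least one rule and overlapping rules never disagree, which is what one would expect from a complete and consistent DEX rule table.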
RIM: Aggregation
Aim: gradually aggregate qRI1 into the overall Reputation Risk Index (RI):
• hierarchical aggregation: Customer → Product → Counterpart → Bank
• taking into account relative product volumes and relative customer numbers
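The slides do not give the aggregation formula, so the following is only one plausible reading: a weighted mean applied level by level, with relative customer numbers or relative product volumes as weights. The numeric scores, the weights and the function name are all invented for illustration.

```python
def weighted_aggregate(scores, weights):
    """Weighted mean of scores; weights need not sum to one."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical numeric encodings of qRI1 for two customers of one product,
# weighted by (assumed) relative customer numbers.
product_a = weighted_aggregate([1.0, 3.0], [0.75, 0.25])

# Product level to counterpart level, weighted by (assumed) relative product volumes.
counterpart = weighted_aggregate([product_a, 4.0], [0.5, 0.5])

print(product_a, counterpart)
```

Applying the same step once more, from counterparts to the bank, would give a single bank-level index while keeping the intermediate values available for drill-down.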
[Figure: aggregation steps C/P → PRODUCT, PRODUCT → COUNTERPART and COUNTERPART → BANK, starting from qRI1.]
RIM Reports: Topmost Level (Bank)
RIM Summary
Developed and implemented a decision support model component for the assessment of bank reputational risk
Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational)
Novel aspects:
• taking into account sentiment assessments of counterparts
• advancing the present RI assessment model
Benefits for the users:
• obtaining a comprehensive RI as time series for different groups (customers, products, counterparts, bank)
• ability to analyse and explain assessments at different levels by drilling down through the RIM hierarchy
Problem Formulation: Market Surveillance
Pump & Dump market manipulation: Manipulation of the share price by the dissemination of false positive information in order to take profit from an increased price level.
Pump & Dump Example (1/2)
“Shares can multiply dramatically in value over short time periods.”
“Thursday's pick is a story straight out of Hollywood!”
“SAPX - Wake Up, Put It On Your Screen NOW”
“Could this company be the next blockbuster?”
Source: http://newsletter.hotstocked.com/newsletters/view/Could_this_company_be_the_next_blockbuster_-92301
Pump & Dump Example (2/2)
Seven Arts Entertainment, Inc. (SAPX)
[Figure: chart annotated with “Shares Purchased” and “Shares Sold”.]
Pump & Dump campaign: July 24th – 28th, 2011
> 30 different recommendations
Source: Yahoo! Finance
How to address Pump & Dump Manipulations?
Qualitative Modeling
• Based on expert knowledge
• Qualitative attributes
• Decision problem divided into sub problems
• Goal: daily assessments
Quantitative Modeling
• Based on machine learning algorithms
• Quantitative attributes
• Goal: assessment of single documents
[Figure: attribute tree of the qualitative model. Basic attributes: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Capitalization, Market Segments, Sentiment, Content. Aggregated attributes: Black List, History, Market, Trading, News, Company, Financial Instruments, Comp_FinInst. Root: Pump & Dump.]
Qualitative Multi-Attribute Model Development
[Figure: attribute tree of the qualitative model, from the basic attributes (Country Black List to Content) over the aggregated attributes up to the root Pump & Dump.]
From initial DEXi Model (M24) to Processing of Data Stream (M33)
• The initial model structure was developed and distributed as DEXi files.
• Models can be applied within the DEXi environment only (M24).
• To enable the models to process large-scale data streams, a Java-based prototype was implemented (M33).
Definition of Data Sources: Regulatory Authorities web pages
Model Configuration and Evaluation (M24)
Model Configuration
• V-high: 3 configurations
• High: 9
• Medium: 7
• Low: 5
• V-low: 1
Evaluation
• 1700 OTC-traded companies
• Dataset: 01.2012 to 06.2013 (370 trading days)
• on average 157 alerts per day for v-high and high
Evaluation based on predefined configuration:
P&D value   Number   Percentage
v-high         482      0.66
high         57588     78.75
med          12342     16.88
low           2498      3.42
v-low          215      0.29
Sum          73125    100
Reconfiguration of the Rules in Y3
Model Configuration and Evaluation (M33)
Configuration:
• V-high: 3 configurations
• High: 7
• Medium: 8
• Low: 6
• V-low: 1
Evaluation:
• 1700 OTC-traded companies
• Dataset: 01.2012 to 09.2013 (435 trading days)
• on average 53 alerts per day for v-high and high
Evaluation results based on reconfigured model:
P&D value   Number   Percentage
v-high         982      0.8
high         22215     18.8
med          92049     78.0
low           2779      2.4
v-low           57      0.0
Sum         118082    100
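The reported alert rates follow directly from the tables: the v-high and high counts are added and divided by the number of trading days. The short check below reproduces the M24 and M33 figures from the numbers on the slides; only the function name is an addition.

```python
def alerts_per_day(v_high, high, trading_days):
    """Average number of v-high and high assessments per trading day."""
    return (v_high + high) / trading_days


# M24: predefined configuration, 370 trading days.
m24 = alerts_per_day(v_high=482, high=57588, trading_days=370)

# M33: reconfigured model, 435 trading days.
m33 = alerts_per_day(v_high=982, high=22215, trading_days=435)

# Share of all assessments that raise an alert (v-high or high).
m24_share = (482 + 57588) / 73125
m33_share = (982 + 22215) / 118082

print(round(m24), round(m33))
print(f"{m24_share:.1%} {m33_share:.1%}")
```

The reconfiguration therefore cuts the daily alert load to roughly a third, and the share of assessments rated v-high or high drops from about 79% to about 20%.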
Research Impact
Alic, I.; Siering, M.; Bohanec, M. (2013): Hot Stock or Not? A Qualitative Multi-Attribute Model to Detect Financial Market Manipulation. Proceedings of the 26th Bled eConference; Bled, Slovenia
Research Objective: Development of a Pump & Dump Classifier
Learning phase: Labeled documents → Training algorithm (Support Vector Machine) → Classifier
Application phase: New documents → Classifier → Predictions (suspicious? not suspicious?)
Steps:
• Evaluation of Training Documents
• Evaluation of Machine Learning Algorithms
• Integration in FIRST Pipeline
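The two phases can be illustrated with a very small text classifier. Note the substitution: the slides name a Support Vector Machine as the training algorithm, whereas this sketch uses a multinomial Naive Bayes with add-one smoothing because it fits in a few lines of plain Python. The training documents, labels and function names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Learning phase: count words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocabulary.update(tokens)
    return {"word_counts": word_counts,
            "class_counts": class_counts,
            "vocabulary": vocabulary}


def predict(model, text):
    """Application phase: return the most probable label for a new document."""
    tokens = text.lower().split()
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            # Add-one (Laplace) smoothing for unseen words.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration.
training = [
    ("this stock will multiply in value buy now", "suspicious"),
    ("hot pick shares explode buy now", "suspicious"),
    ("company reports quarterly earnings", "not suspicious"),
    ("board announces annual dividend", "not suspicious"),
]
model = train(training)
first = predict(model, "buy this hot stock now")
second = predict(model, "company announces quarterly dividend")
print(first, "|", second)
```

Swapping in another learner changes only `train` and `predict`; the split into a learning phase on labeled documents and an application phase on new documents stays the same.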
Evaluation of Training Documents
Event study: capital market reaction during / after pump and dump campaign
• significant abnormal returns during campaign
• price decrease after campaign has ended
Siering, M. (2013): All Pump, No Dump? The Impact of Internet Deception on Stock Markets. In: Proceedings of the 21st European Conference on Information Systems; Utrecht, Netherlands
Evaluation of Machine Learning Algorithms
Evaluation of different machine learning algorithms
• Neural Network: reduced feature set
• SVM: parameter optimisation according to Hsu et al. (2003)

                             Class suspicious            Class non-suspicious
Learner          Accuracy    Precision  Recall  F1       Precision  Recall  F1
Decision Tree    95.10       94.65      95.60   95.12    95.56      94.60   95.08
Naïve Bayes      97.30       96.28      98.40   97.33    98.36      96.20   97.27
k-NN, k=1        78.10       80.09      74.80   77.35    76.36      81.40   78.80
k-NN, k=2        73.60       68.67      86.80   76.68    82.07      60.40   69.59
k-NN, k=3        75.30       77.32      71.60   74.35    73.56      79.00   76.18
k-NN, k=4        74.20       71.68      80.00   75.61    77.38      68.40   72.61
k-NN, k=5        75.10       77.34      71.00   74.03    73.20      79.20   76.08
Neural Network   97.00       96.81      97.20   97.00    97.19      96.80   96.99
SVM              99.30       99.20      99.40   99.30    99.40      99.20   99.30
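As a reminder of how the columns relate, the sketch below computes precision, recall, F1 and accuracy from a confusion matrix. The counts are an assumption: the slide does not state class sizes, but 500 documents per class with 497 true positives and 496 true negatives is one confusion matrix that reproduces the SVM row for the class "suspicious".

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy (in %)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": round(100 * accuracy, 2),
        "precision": round(100 * precision, 2),
        "recall": round(100 * recall, 2),
        "f1": round(100 * f1, 2),
    }


# Assumed counts (not given on the slide): 500 suspicious and 500
# non-suspicious documents, positive class = "suspicious".
svm = metrics(tp=497, fp=4, fn=3, tn=496)
print(svm)
```

Swapping the roles of the two classes (tp=496, fp=3, fn=4, tn=497) gives the non-suspicious columns of the same row.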
Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification. National Taiwan University, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (accessed on 10/16/2011).
[Figure: computing requirements of different learners (10-fold cross validation, in sec.) for Decision Tree, Naive Bayes, k-NN, Neural Network and Support Vector Machine; y-axis 0 to 500.]
Integration in FIRST Pipeline
[Figure: integration of the quantitative Pump and Dump model into the FIRST pipeline. Components shown: data sources (internal data sources and external (Web) data sources), HTML pages, HTML preprocessing, boilerplate removal, language detection, near-duplicate removal, clean text, database, and the quantitative Pump and Dump model.]
Integration of Quantitative and Qualitative Models: Multi Classifier Approach
[Figure: attribute tree of the qualitative model, from the basic attributes (Country Black List to Content) up to the root Pump & Dump, shown as part of the multi-classifier approach.]
Market Surveillance Summary
Developed and implemented multi-classifier decision support component for the assessment of information-based market manipulation
Approach: expert modeling using a variety of modeling methods (qualitative, quantitative, hierarchical, relational) and machine learning
Novel aspects:
• taking into account sentiment assessments of published documents
• multi-classifier component integrates qualitative and quantitative models
Benefits for the users:
• obtaining the ability to monitor information-based market manipulation in market segments with a large number of financial instruments
Visualisation Components
Video
Online Demo
Retail Brokerage Summary
Developed and implemented visualisation components providing the basis for data/document-driven DSS in the Retail Brokerage use case scenario.
Approach: Clustering of document topics, aggregation of document sentiments and publication statistics.
Novel aspects:
• Visualisation components that condense social media activity
• Aggregation of media topics to explore social media contents
Benefits for the users:
• Explore activity, sentiment and topics of social media in a user-friendly way
Thank you for your attention!
Questions?