Case Studies in Creating Quant Models from Large Scale Unstructured Text by Sameena Shah
Dr. Sameena Shah ([email protected])
DISRUPTIONS
• LARGE SCALE DATA ANALYSIS: Hadoop, Spark
• NATURAL LANGUAGE PROCESSING: sentiment, context, text mining
• NOVEL/EFFICIENT ALGORITHMS: deep learning, topic modeling
• NOVEL DATA SETS: Twitter, satellite images
• ACCESSIBILITY
SOME LARGE SCALE TEXTUAL DATASETS
• Social media
• SEC filings
• News
• Court wires
• Patents
ANALYZING UNSTRUCTURED TEXT IN SEC FILINGS
• All public companies, domestic and foreign, trading on any of the US exchanges are required to file registration statements, periodic reports, insider-trading forms, and other forms describing significant changes with the SEC.
• These filings typically contain financial statements as well as large amounts of 'unstructured text' describing the past, present, and anticipated future performance of the company.
For example, a company may have changed its accounting methods to inflate its earnings, changed its fiscal year end to include some extra sales, shifted some expenses to a later period, included revenues that are not yet due, or expensed or capitalized certain items.
CAN WE
• Create an automated system that identifies "abnormal" sentences in filings, thereby alerting regulators/investors faster?
• Recognizing such sentences usually requires deep domain expertise, even for humans.
• Value is clear … but
• > 3 TB in compressed format
• Running a small subset of the data on a dual-core machine suggested the full run would take a few months.
TEXT MODELING ON HADOOP
• Reading compressed files through a custom input reader
• Parsing of sections
• Division into sentences and comparison across different reference groups
• Scoring each sentence w.r.t. the reference-group model
• Divergence of scores from the distribution of the reference group
• All this in under 30 minutes for 8 years of filings
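The scoring and divergence steps above can be sketched in miniature. The deck does not give the actual model, so this is a minimal illustration, assuming a smoothed unigram language model over a reference group and a z-score cutoff for "abnormal" sentences; the function names and the 2.0 threshold are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Unigram language model with add-one smoothing over a reference group."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def sentence_score(model, sentence):
    """Average negative log-likelihood; higher = less like the reference group."""
    words = sentence.lower().split()
    return -sum(math.log(model(w)) for w in words) / max(len(words), 1)

def flag_abnormal(sentences, reference, z_cut=2.0):
    """Flag sentences whose score diverges from the reference-group distribution."""
    model = train_unigram(reference)
    ref_scores = [sentence_score(model, s) for s in reference]
    mean = sum(ref_scores) / len(ref_scores)
    var = sum((x - mean) ** 2 for x in ref_scores) / len(ref_scores)
    std = math.sqrt(var) or 1.0
    return [s for s in sentences if (sentence_score(model, s) - mean) / std > z_cut]
```

In a Hadoop setting, the reference-group counts would be accumulated in the map/reduce phases and the per-sentence scoring parallelized across filings; here everything runs in-process for clarity.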
TEXT PROCESSING
• Use of text processing techniques to check for:
– Clarity in overall disclosure compared to peers
– Redundancy in language
– Comparison of the language model across sector and market-cap peers
– Comparison of the model with the company's own historical model
– Overly vague or 'boilerplate' disclosures in recognition of revenue
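One way to operationalize the redundancy/boilerplate checks above is sentence-level similarity against peer-group language. The slide does not specify the method, so this is a sketch under assumptions: bag-of-words cosine similarity and a hypothetical 0.8 match threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sentences as bag-of-words vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def boilerplate_ratio(sentences, peer_sentences, threshold=0.8):
    """Fraction of a filing's sentences that closely match peer-group language."""
    hits = sum(1 for s in sentences
               if any(cosine(s, p) >= threshold for p in peer_sentences))
    return hits / len(sentences) if sentences else 0.0
```

A high ratio suggests heavily recycled disclosure language; a very low ratio relative to sector peers may flag unusual wording worth review.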
SIGNALS FROM SOCIAL MEDIA
WINNING TRADERS
• Question:
– Can we find good traders and follow them to make money?
• Method:
– Identify trading-related tweets (buy/sell a specific stock)
– Evaluate traders based on past performance
– Follow their trades
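The first two method steps can be sketched as follows. The deck does not give the actual extraction rules or scoring, so this is an illustrative sketch: a simple regex for buy/sell calls on a cashtag, and a hit rate over past calls; the pattern and function names are assumptions.

```python
import re

# Hypothetical pattern: a direction word followed by a $TICKER cashtag.
TRADE_RE = re.compile(r'\b(buy|sell|long|short)\b.*?\$([A-Z]{1,5})', re.IGNORECASE)

def parse_trade(tweet):
    """Extract (side, ticker) from a trading-related tweet, else None."""
    m = TRADE_RE.search(tweet)
    if not m:
        return None
    side = 1 if m.group(1).lower() in ('buy', 'long') else -1
    return side, m.group(2).upper()

def trader_hit_rate(trades, realized_returns):
    """Fraction of a trader's past calls that were on the right side of the move."""
    hits = [1 for (side, ticker) in trades
            if side * realized_returns.get(ticker, 0.0) > 0]
    return len(hits) / len(trades) if trades else 0.0
```

Traders with a high hit rate over a trailing window would be the "winners" whose subsequent calls the strategy follows.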
WHY DO PEOPLE EXPRESS THEIR TRADING POSITIONS?
• Everyone has an opinion!
• Positive motivations
– Enhance reputation/brand
– Build a network by attracting other experts
– Benefit personal trading positions
• Negative motivations
– Hired to promote a position
– Nothing else to do …
The Winners strategy gained 9.48% while the S&P 500 lost 3.55%: a 13.03% difference.
Cost does cost you! (0.2% per transaction)
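The effect of that 0.2% per-transaction cost compounds with trading frequency. As a worked illustration (the transaction count below is hypothetical, not from the deck):

```python
def net_return(gross_return, n_transactions, cost_per_txn=0.002):
    """Compound a per-transaction cost of 0.2% into a gross strategy return."""
    return (1 + gross_return) * (1 - cost_per_txn) ** n_transactions - 1

# With the deck's 9.48% gross return, 10 round trips already
# erode roughly two percentage points of performance.
print(net_return(0.0948, 10))
```

This is why an active follow-the-traders strategy must clear its cost hurdle, not just beat the index gross.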
CONCLUSIONS
• While the Twitter signal-to-noise ratio is very low, targeted data collection and mining can be more promising.
• In event-based sentiment analysis, we assumed stock-market-related tweets posted after bad (good) news have negative (positive) polarity. This data can be used to train a supervised model.
• User-based analysis (following traders with a good record of trading based on their tweets) also showed that adopting traders' moves in the market could be a winning strategy.
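The event-based labeling assumption in the conclusions can be made concrete. This is a minimal sketch, not the paper's implementation: tweets inherit the polarity of the most recent earlier news event on the same ticker, producing weak labels for supervised training. The data layout and function name are assumptions.

```python
def label_tweets(tweets, events):
    """Weakly label stock-related tweets by the polarity of the preceding news event.

    tweets: list of (timestamp, ticker, text)
    events: list of (timestamp, ticker, polarity), polarity +1 (good) or -1 (bad)
    A tweet inherits the polarity of the most recent earlier event on its ticker;
    tweets with no preceding event are dropped.
    """
    labeled = []
    for ts, ticker, text in tweets:
        prior = [e for e in events if e[1] == ticker and e[0] <= ts]
        if prior:
            labeled.append((text, max(prior)[2]))  # most recent event wins
    return labeled
```

The resulting (text, polarity) pairs would then feed a standard supervised sentiment classifier.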
• M. Makrehchi, S. Shah, W. Liao. Stock Prediction Using Event Information from Twitter. In Web Intelligence, 2013.
Q & A
The Winners strategy gained 19.76% while the S&P 500 lost 3.55%.