Case Studies in Creating Quant Models from Large Scale Unstructured Text by Sameena Shah
Dr. Sameena Shah ([email protected])
DISRUPTIONS
• LARGE SCALE DATA ANALYSIS: Hadoop, Spark
• NATURAL LANGUAGE PROCESSING: sentiment, context, text mining
• NOVEL/EFFICIENT ALGORITHMS: deep learning, topic modeling
• NOVEL DATA SETS: Twitter, satellite images
• ACCESSIBILITY
SOME LARGE SCALE TEXTUAL DATASETS
• Social media
• SEC filings
• News
• Court wires
• Patents
ANALYZING UNSTRUCTURED TEXT IN SEC FILINGS
• All public companies, domestic and foreign, trading on any of the US exchanges are required to file registration statements, periodic reports, insider-trading forms, and other forms describing significant changes with the SEC.
• These filings typically contain financial statements as well as large amounts of 'unstructured text' describing the past, present, and anticipated future performance of the company.
For example, a company may have changed its accounting methods to inflate its earnings, changed its fiscal year end to include some extra sales, shifted some expenses to a later period, included revenues that are not yet due, or expensed or capitalized certain items.
CAN WE
• Create an automated system that identifies "abnormal" sentences in filings, thereby alerting regulators/investors faster?
• Recognizing such sentences usually requires deep domain expertise, even for humans.
• Value is clear … but
• > 3 TB in compressed format
• Running a small subset of the data on a dual-core machine suggested the full run would take a few months.
TEXT MODELING ON HADOOP
• Reading compressed files through a custom input reader
• Parsing of sections
• Division into sentences and comparison across different reference groups
• Scoring each sentence w.r.t. the reference-group model
• Divergence of scores from the distribution of the reference group
• All this in under 30 minutes for 8 years of filings
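The scoring and divergence steps above can be sketched in miniature. The deck does not give the actual model, so this is a minimal illustration, assuming a smoothed unigram language model over a reference group and a z-score cutoff for "abnormal" sentences; the function names and the 2.0 threshold are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Unigram language model with add-one smoothing over a reference group."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def sentence_score(model, sentence):
    """Average negative log-likelihood; higher = less like the reference group."""
    words = sentence.lower().split()
    return -sum(math.log(model(w)) for w in words) / max(len(words), 1)

def flag_abnormal(sentences, reference, z_cut=2.0):
    """Flag sentences whose score diverges from the reference-group distribution."""
    model = train_unigram(reference)
    ref_scores = [sentence_score(model, s) for s in reference]
    mean = sum(ref_scores) / len(ref_scores)
    var = sum((x - mean) ** 2 for x in ref_scores) / len(ref_scores)
    std = math.sqrt(var) or 1.0
    return [s for s in sentences if (sentence_score(model, s) - mean) / std > z_cut]
```

In a Hadoop setting, the reference-group counts would be accumulated in the map/reduce phases and the per-sentence scoring parallelized across filings; here everything runs in-process for clarity.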
TEXT PROCESSING
• Use of text processing techniques to check for:
– Clarity in overall disclosure compared to peers
– Redundancy in language
– Comparison of the language model across sector and market-cap peers
– Comparison of the model with the company's own historical model
– Overly vague or 'boilerplate' disclosures in recognition of revenue
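One way to operationalize the redundancy/boilerplate checks above is sentence-level similarity against peer-group language. The slide does not specify the method, so this is a sketch under assumptions: bag-of-words cosine similarity and a hypothetical 0.8 match threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sentences as bag-of-words vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def boilerplate_ratio(sentences, peer_sentences, threshold=0.8):
    """Fraction of a filing's sentences that closely match peer-group language."""
    hits = sum(1 for s in sentences
               if any(cosine(s, p) >= threshold for p in peer_sentences))
    return hits / len(sentences) if sentences else 0.0
```

A high ratio suggests heavily recycled disclosure language; a very low ratio relative to sector peers may flag unusual wording worth review.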
SIGNALS FROM SOCIAL MEDIA
WINNING TRADERS
• Question:
– Can we find good traders and follow them to make money?
• Method:
– Identify trading-related tweets (buy/sell a specific stock)
– Evaluate traders based on past performance
– Follow their trades
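The first two method steps can be sketched as follows. The deck does not give the actual extraction rules or scoring, so this is an illustrative sketch: a simple regex for buy/sell calls on a cashtag, and a hit rate over past calls; the pattern and function names are assumptions.

```python
import re

# Hypothetical pattern: a direction word followed by a $TICKER cashtag.
TRADE_RE = re.compile(r'\b(buy|sell|long|short)\b.*?\$([A-Z]{1,5})', re.IGNORECASE)

def parse_trade(tweet):
    """Extract (side, ticker) from a trading-related tweet, else None."""
    m = TRADE_RE.search(tweet)
    if not m:
        return None
    side = 1 if m.group(1).lower() in ('buy', 'long') else -1
    return side, m.group(2).upper()

def trader_hit_rate(trades, realized_returns):
    """Fraction of a trader's past calls that were on the right side of the move."""
    hits = [1 for (side, ticker) in trades
            if side * realized_returns.get(ticker, 0.0) > 0]
    return len(hits) / len(trades) if trades else 0.0
```

Traders with a high hit rate over a trailing window would be the "winners" whose subsequent calls the strategy follows.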
WHY DO PEOPLE EXPRESS THEIR TRADING POSITIONS?
• Everyone has an opinion!
• Positive motivations
– Enhance reputation/brand
– Build a network by attracting other experts
– Benefit personal trading positions
• Negative motivations
– Hired to promote a position
– Nothing else to do …
The Winners strategy gained 9.48% while the S&P 500 lost 3.55%: a 13.03% difference.
Cost does cost you! (0.2% per transaction)
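The effect of that 0.2% per-transaction cost compounds with trading frequency. As a worked illustration (the transaction count below is hypothetical, not from the deck):

```python
def net_return(gross_return, n_transactions, cost_per_txn=0.002):
    """Compound a per-transaction cost of 0.2% into a gross strategy return."""
    return (1 + gross_return) * (1 - cost_per_txn) ** n_transactions - 1

# With the deck's 9.48% gross return, 10 round trips already
# erode roughly two percentage points of performance.
print(net_return(0.0948, 10))
```

This is why an active follow-the-traders strategy must clear its cost hurdle, not just beat the index gross.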
CONCLUSIONS
• While the Twitter signal-to-noise ratio is very low, targeted data collection and mining can be more promising.
• In event-based sentiment analysis, we assumed stock-market-related tweets posted after bad (good) news have negative (positive) polarity. This data can be used to train a supervised model.
• User-based analysis (following traders with a good record of trading based on their tweets) also showed that adopting traders' moves in the market could be a winning strategy.
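The event-based labeling assumption in the conclusions can be made concrete. This is a minimal sketch, not the paper's implementation: tweets inherit the polarity of the most recent earlier news event on the same ticker, producing weak labels for supervised training. The data layout and function name are assumptions.

```python
def label_tweets(tweets, events):
    """Weakly label stock-related tweets by the polarity of the preceding news event.

    tweets: list of (timestamp, ticker, text)
    events: list of (timestamp, ticker, polarity), polarity +1 (good) or -1 (bad)
    A tweet inherits the polarity of the most recent earlier event on its ticker;
    tweets with no preceding event are dropped.
    """
    labeled = []
    for ts, ticker, text in tweets:
        prior = [e for e in events if e[1] == ticker and e[0] <= ts]
        if prior:
            labeled.append((text, max(prior)[2]))  # most recent event wins
    return labeled
```

The resulting (text, polarity) pairs would then feed a standard supervised sentiment classifier.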
• M. Makrehchi, S. Shah, W. Liao. Stock Prediction Using Event Information from Twitter. In Web Intelligence, 2013.
Q & A
The Winners strategy gained 19.76% while the S&P 500 lost 3.55%.