HKUST CSE Dept. Research Text Mining COMP630P Selected Paper Presentation Can the Content of Public News be used to Forecast Abnormal Stock Market Behavior? Presented by Louis Wong 24th March 2009

Page 1: HKUST CSE Dept. Research Text Mining COMP630P Selected Paper Presentation Can the Content of Public News be used to Forecast Abnormal Stock Market Behavior?

HKUST CSE Dept.

Research

Text Mining

COMP630P Selected Paper Presentation

Can the Content of Public News be used to Forecast Abnormal Stock Market Behavior?

Presented by Louis Wong

24th March 2009

Page 2:

Presentation Outline

• Background Information
• Main Idea of Papers
• Overview of Data Set
• Overview of Methodology
• Result Interpretation
• Pros & Cons of Proposed Methodology
• Possible Future Work

Page 4:

Background Information

Page 6:

Main Idea of Paper

• The paper delivers the following messages:
  • Traditional statistical models process numerical data efficiently but fail to process unstructured data such as news
  • Traditional processing of news by financial investors is inefficient and inconsistent
  • Market reaction to asset-specific news should be studied extensively

Page 8:

Data Set – Data Source

• Data Source: Bloomberg Professional Service
• Studied Target: S&P 500, FTSE 100 & ASX 100 (283 stocks in total)
• Data Range: July 2005 ~ November 2006
• Number of Articles: 500,000
• Number of News Source Providers: around 200
• Features:
  • Largest-scale data set for financial text mining experiments
  • My comment: data is king in data mining

Page 9:

Data Set – Data Preprocessing

• Data (news) are preprocessed as follows:
  • Numbers, URLs, e-mail addresses, meaningless symbols and formatting are removed
  • The Porter Stemmer algorithm is used to strip word suffixes:
    • For instance, Finance, Finances, Financed and Financing all refer to the word Finance. Without suffix removal, they would be treated as different words during processing.
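
The preprocessing steps above can be sketched roughly as follows. This is only an illustration, not the paper's pipeline: the regular expressions and the `crude_stem` helper are assumptions made here, and `crude_stem` is a simplified suffix stripper standing in for the full Porter Stemmer.

```python
import re

# Hypothetical patterns; the paper does not specify its exact cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]+")

# Simplified suffix list (NOT the full Porter algorithm), checked in order.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop URLs, e-mail addresses, numbers and symbols, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return [crude_stem(token) for token in text.split()]


print(preprocess("Finance, Finances, Financed & Financing"))
```

With this simplification, the four variants of Finance collapse to a single stem, which is the effect the slide describes.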

Page 11:

Methodology: GARCH

• When does news affect a stock's time series?

• In this paper, GARCH is used to model the volatility of the stock price series. When the forecasting error is larger than expected, the excess error is attributed to stock-specific news.

• Log return of the stock price series and realized volatility:

Delta T = time interval (minutes)
P = period
N = number of returns
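
The formulas on this slide were not preserved in the transcript. As a rough sketch of the two quantities it names, the code below computes log returns from a price series and an annualized realized volatility from those returns. The function names, the `periods_per_year` parameter and the annualization convention (root of the mean squared return, scaled to a year) are assumptions here and may differ from the paper's definition.

```python
import math


def log_returns(prices):
    """Log return between consecutive prices sampled every Delta T minutes."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def realized_volatility(returns, periods_per_year):
    """Annualized realized volatility from N returns (assumed convention)."""
    n = len(returns)
    mean_squared = sum(r * r for r in returns) / n
    return math.sqrt(periods_per_year * mean_squared)


prices = [100.0, 100.5, 100.2, 100.9]
rets = log_returns(prices)
print(rets, realized_volatility(rets, periods_per_year=252))
```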

Page 12:

Methodology: GARCH

• After the (annualized) volatility is computed, a GARCH model is used to forecast volatility from previous returns and volatility.

• When the forecasting error exceeds its mean by N standard deviations (the mean and SD are computed from the previous 20 trading days' data), that position on the time axis is flagged as abnormal behavior.

• GARCH parameters are optimized on 1 month of data.
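
The detection rule above might look like the sketch below. It uses a GARCH(1,1) recursion with fixed, hypothetical parameters purely for illustration; the paper fits GARCH(P, Q) models and optimizes the parameters, which is not shown. The function names, the variance seed and the `window` argument (a count of past errors rather than trading days) are assumptions.

```python
import statistics


def garch11_forecasts(returns, omega, alpha, beta):
    """One-step-ahead variance forecasts:
    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1].
    The first value is seeded with the sample variance (an assumption)."""
    sigma2 = [statistics.pvariance(returns)]
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2


def flag_abnormal(errors, window, n_sd):
    """Flag errors exceeding the trailing mean by n_sd standard deviations."""
    flags = []
    for t, err in enumerate(errors):
        history = errors[max(0, t - window):t]
        if len(history) < 2:  # not enough history for a standard deviation
            flags.append(False)
            continue
        mean = statistics.mean(history)
        sd = statistics.stdev(history)
        flags.append(sd > 0 and err > mean + n_sd * sd)
    return flags


errors = [1.0, 1.1, 0.9, 1.0, 1.05, 5.0]
print(flag_abnormal(errors, window=5, n_sd=3))
```

Only the last error in the sample stands far enough above its trailing mean to be flagged.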

Page 13:

Methodology: Alignment of Documents

• After an abnormal forecasting error is located, all news items published within the preceding Delta T minutes are labeled “interesting documents”. They are later used in the training set.

• A dictionary is created containing the unique words that occur in the documents.

• At the same time, two counts are computed for each term: the term count (dj), which is the frequency of the term across all documents, and dfj, which is the number of documents containing the term.
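
A small sketch of the alignment and counting steps described above. The function names and the use of plain minute offsets for timestamps are assumptions for illustration only.

```python
from collections import Counter


def label_interesting(doc_times, abnormal_times, delta_t):
    """A document is interesting if an abnormal event occurs within
    delta_t minutes after its publication time."""
    return [
        any(0 <= event - published <= delta_t for event in abnormal_times)
        for published in doc_times
    ]


def build_counts(tokenized_docs):
    """Return (term_count, doc_freq): d_j counts every occurrence of a term,
    df_j counts the documents that contain it at least once."""
    term_count = Counter()
    doc_freq = Counter()
    for tokens in tokenized_docs:
        term_count.update(tokens)
        doc_freq.update(set(tokens))
    return term_count, doc_freq


labels = label_interesting([0, 8, 20], abnormal_times=[10], delta_t=5)
d, df = build_counts([["oil", "price", "oil"], ["price"]])
print(labels, d["oil"], df["oil"])
```

The keys of `term_count` double as the dictionary of unique words.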

Page 14:

Methodology: Term Weighting

• After the dictionary is established, term weighting is applied to measure how important each term is.

• Three popular term weighting approaches are used:
  • Binary Version of Gain Ratio
  • ADBM25
  • TFIDF (Term Frequency Inverse Document Frequency)

Page 15:

Methodology: Binary Version of Gain Ratio

• This method normally selects the terms that provide the most information, that is, the terms that help us discriminate between documents most effectively.

• E(n, m) is a generic formula for computing an entropy value. For example, if the featured term “warrant” appears in 20 out of 30 documents, substitute n = 20 and m = 30.

dfj is the number of documents containing term j
dj is the frequency of term j in all documents
N is the total number of documents
R is the number of interesting documents
r is the number of interesting documents containing term j
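
The formula for E(n, m) did not survive in the transcript. A common reading, assumed here, is the binary entropy of the proportion n/m. Under that assumption, the “warrant” example can be worked as:

```python
import math


def entropy(n, m):
    """Binary entropy (in bits) of the proportion n/m; assumed form of E(n, m)."""
    if m <= 0 or not 0 <= n <= m:
        raise ValueError("require 0 <= n <= m and m > 0")
    p = n / m
    if p in (0.0, 1.0):
        return 0.0  # a pure split carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


print(round(entropy(20, 30), 4))  # 20 of 30 documents contain "warrant"
```

Under this reading E(20, 30) is about 0.918 bits, close to the maximum of 1 bit reached when exactly half of the documents contain the term.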

Page 16:

Methodology: Binary Version of Gain Ratio

Gain Ratio Formula: It consists of 3 parts:
• 1st Part: Entropy value of the ratio of interesting documents to uninteresting documents.
• 2nd Part: Entropy value of the ratio of interesting documents to uninteresting documents, both of which contain the desired term.
• 3rd Part: Entropy value of the ratio of uninteresting documents to documents, both of which contain the desired term.

dfj is the number of documents containing term j
dj is the frequency of term j in all documents
N is the total number of documents
R is the number of interesting documents
r is the number of interesting documents containing term j

Page 17:

Methodology: TFIDF

• A common term weighting formula
• The first part rewards the frequency of term occurrence
• The second part penalizes non-featured words {a, an, the}
• Computationally inexpensive {linear scanning}

dfj is the number of documents containing term j
dj is the frequency of term j in all documents
N is the total number of documents
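
The slide's formula image is missing, so the sketch below uses the textbook TFIDF form, d_j * log(N / df_j), as an assumption. The paper's exact variant (for example its log base or smoothing) may differ.

```python
import math


def tfidf(d_j, df_j, n_docs):
    """Textbook TFIDF: term frequency times log inverse document frequency."""
    return d_j * math.log(n_docs / df_j)


# A word found in every document (such as "the") gets zero weight,
# while a rarer word with the same frequency is weighted higher.
print(tfidf(5, 100, 100), tfidf(5, 10, 100))
```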

Page 18:

Methodology: ADBM25

• 1st part: Normalized frequency of the term, taking into account the length of the document containing the term and the average document length

• 2nd part: This also factors in r, R, dfj & N. Normally, it penalizes less important words that nevertheless appear in many interesting documents.

K1 & b = constants
dj = frequency of term j
dl = document length
avdl = average document length
R = number of interesting documents
r = number of interesting documents containing term j
dfj = number of documents containing term j
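
The ADBM25 formula itself is not preserved here. The two parts described above resemble the standard Okapi BM25 term-frequency component and the Robertson/Sparck Jones relevance weight, so the sketch below uses those standard forms. Whether ADBM25 matches them exactly is an assumption, as are the default values of `k1` and `b`.

```python
import math


def bm25_tf_part(tf, dl, avdl, k1=1.2, b=0.75):
    """Term frequency normalized by document length relative to the average."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avdl))


def rsj_weight(r, R, df_j, N):
    """Robertson/Sparck Jones relevance weight built from r, R, df_j and N."""
    numerator = (r + 0.5) * (N - df_j - R + r + 0.5)
    denominator = (df_j - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


def bm25_weight(tf, dl, avdl, r, R, df_j, N):
    """Product of the two parts (assumed combination)."""
    return bm25_tf_part(tf, dl, avdl) * rsj_weight(r, R, df_j, N)


print(bm25_weight(tf=2, dl=100, avdl=100, r=5, R=10, df_j=20, N=100))
```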

Page 19:

Term Weighting Ranking

• After the importance of each term is computed, the terms are ranked.
• The N most important words are used to represent each document as a binary vector. These vectors are used to train the following two models:
  • SVM light classifier, which learns from the training samples and classifies unseen documents in the testing phase.
  • Decision Tree (C4.5), an improvement on the older ID3 decision tree.
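
The ranking and vectorization step can be sketched as follows. The helper names and the tie-breaking rule (alphabetical order among equal weights) are assumptions. Training with SVM light or C4.5 is done by external tools and is not shown.

```python
def top_terms(weights, n):
    """Return the n highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


def binary_vector(tokens, vocabulary):
    """1 if the vocabulary term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


weights = {"oil": 2.0, "the": 0.1, "profit": 1.5}
vocab = top_terms(weights, 2)
print(vocab, binary_vector(["profit", "rise"], vocab))
```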

Page 21:

Result : Interesting Document Ratio

• Interesting observation: the US has the largest document set, as it has the largest stock market in the world. The UK is second and AU is last.

Page 22:

Result : Parameter Selection

• A shorter window size gives relatively higher classification accuracy
• This supports the EMH (Efficient Market Hypothesis) and RET (Rational Expectations Theory)
• It also increases the amount of noise

Page 23:

Result : Effect of Term Weighting & Classifier

Trade-off between the number of terms considered and accuracy

Page 24:

Result : ROC Curve

The proposed method outperforms the discrimination curve in each country

Page 25:

Result : Effect of Historical Window (Length of Training Set)

Parameters are fixed as follows: GARCH (P, Q) = (3, 3), Delta T is 5 minutes & SD = 6

Observation: More historical knowledge can improve the classification result

Page 26:

Result : Best Classifiers

• The following is a list of the best classifiers for each country's data:

Page 28:

Pros & Cons

• Pros:
  • Large-scale experiments: 500,000 documents are involved
  • The impact of news from different countries on the classification result is considered
  • Different classifiers and term weighting methods are used
  • Detailed study of the parameters and full interpretation

• Cons:
  • Relatively few newly proposed methods
  • The paper reads like an industrial paper or an empirical study

Page 30:

Future Work Suggested by Author

• First Area: Determine whether news can affect the co-movement behavior of assets

• Second Area: Stock trends can be grouped by sector. Can macro-economic news cause abnormal behavior in several stocks at the same time? For example, oil price related news with China Petrol, and oil price related news with Sinopec

• Third Area: In this paper, all news is published in English. Will the results be influenced by the language? For example, German with the DAX market and French with the CAC market

Page 31:

Q & A Session

Thank you for listening