Data Science for Computational Journalism (ranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf)
![Page 1: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/1.jpg)
Data Science for Computational Journalism
Chengkai Li
Associate Professor, Department of Computer Science and Engineering
Director, Innovative Database and Information Systems Research (IDIR) Laboratory
University of Texas at Arlington
PyData Dallas, April 26, 2015
![Page 2: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/2.jpg)
Research at the Innovative Database and Information Systems Research (IDIR) Laboratory
Research areas
o Big Data and Data Science (Database, Data Mining, Web Data Management, Information Retrieval)
Theme of current research
o building large-scale human-assisting and human-assisted data and information systems with high usability, high efficiency and applications for social good
Research directions
o computational journalism
o crowdsourcing and human computation
o data exploration by ranking/skyline/preference queries
o database testing
o entity search and entity query
o graph database usability
![Page 3: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/3.jpg)
Our Computational Journalism Project
o Started in 2010. Collaborative project with Duke, Google Research, HP Labs, Stanford
o Fact finding: finding and monitoring number-based facts pertinent to real-world events. The facts are leads to news stories.
o Fact checking: discovering and checking factual claims in debates, speeches, interviews, news
![Page 4: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/4.jpg)
FactWatcher
Tuple t for a new real-world event is appended to the database
Find constraint-measure pairs (C, M) such that t is in the contextual skyline

| Constraint | Measure |
| --- | --- |
| month=Feb | pts, ast, reb |
| opp_team=Nets | ast, reb |
| team=Celtics ∧ opp_team=Nets | ast, reb |
| … | … |

Generate factual claim
Wesley had 12 points, 13 assists and 5 rebounds on February 25, 1996 to become the first player with a 12/13/5 (points/assists/rebounds) in February.
http://en.wikipedia.org/wiki/Basketball
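The search on this slide can be sketched by brute force. This is only an illustration under stated assumptions: larger measure values count as better, a constraint binds a subset of dimension attributes to the new tuple's own values, and the function names and the two `history` rows are invented for the example. It is not FactWatcher's actual algorithm, and a real system would presumably prune this exponential enumeration.

```python
from itertools import combinations


def dominates(a, b, measures):
    """True if a is at least as good as b on every measure and strictly better on one."""
    return (all(a[m] >= b[m] for m in measures)
            and any(a[m] > b[m] for m in measures))


def skyline_contexts(t, history, dims, measures):
    """Return every (constraint, measure subset) under which no earlier tuple dominates t."""
    found = []
    for k in range(len(dims) + 1):
        for bound in combinations(dims, k):
            # The constraint fixes the chosen dimensions to t's own values.
            constraint = {d: t[d] for d in bound}
            context = [r for r in history
                       if all(r[d] == v for d, v in constraint.items())]
            for j in range(1, len(measures) + 1):
                for subset in combinations(measures, j):
                    if not any(dominates(r, t, subset) for r in context):
                        found.append((constraint, subset))
    return found


DIMS = ("month", "team", "opp_team")
MEASURES = ("pts", "ast", "reb")

# Hypothetical earlier tuples (invented numbers, for illustration only).
history = [
    {"month": "Feb", "team": "Bulls", "opp_team": "Knicks", "pts": 30, "ast": 5, "reb": 6},
    {"month": "Jan", "team": "Celtics", "opp_team": "Nets", "pts": 15, "ast": 14, "reb": 7},
]
# The new tuple from the slide: 12 points, 13 assists, 5 rebounds in February.
t = {"month": "Feb", "team": "Celtics", "opp_team": "Nets", "pts": 12, "ast": 13, "reb": 5}

pairs = skyline_contexts(t, history, DIMS, MEASURES)
```

With this toy history, the pair (month=Feb; pts, ast, reb) is reported because no earlier February tuple dominates t, while the unconstrained context is not, since the January tuple is better on all three measures.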
![Page 5: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/5.jpg)
Factual Claims
Prominent streaks
o “This month the Chinese capital has experienced 10 days with a maximum temperature in around 35 degrees Celsius – the most for the month of July in a decade.”
o “The Nikkei 225 closed below 10000 for the 12th consecutive week, the longest such streak since June 2009.”
Situational facts
o “Paul George had 21 points, 11 rebounds and 5 assists to become the first Pacers player with a 20/10/5 (points/rebounds/assists) game against the Bulls since Detlef Schrempf in December 1992.”
o “The social world’s most viral photo ever generated 3.5 million likes, 170,000 comments and 460,000 shares by Wednesday afternoon.”
Domains: politics, sports, weather, crimes, transportation, finance, social media analytics, publications
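A streak claim such as the Nikkei example reduces to finding maximal runs of consecutive values that satisfy a condition. The sketch below is one simple way to do that; the function name `runs` and the weekly closing values are made up for illustration and are not taken from the talk.

```python
def runs(values, predicate):
    """Return (end_index, length) for each maximal run of values satisfying predicate."""
    out, length = [], 0
    for i, v in enumerate(values):
        if predicate(v):
            length += 1
        else:
            if length:
                out.append((i - 1, length))
            length = 0
    if length:
        out.append((len(values) - 1, length))
    return out


# Hypothetical weekly closes; the last four weeks are all below 10000.
closes = [10200, 9900, 9800, 10100, 9950, 9900, 9700, 9600]
below = runs(closes, lambda c: c < 10000)

# A streak is "current" if it ends at the last observation,
# and "prominent" here if no earlier run is at least as long.
current_end, current_len = below[-1]
is_current = current_end == len(closes) - 1
is_longest = all(length < current_len for _, length in below[:-1])
```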
![Page 6: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/6.jpg)
http://idir.uta.edu/factwatcher/
![Page 7: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/7.jpg)
![Page 8: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/8.jpg)
People Make Claims All The Time
“… our Navy is smaller than it's been since 1917,” said Republican candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://www.thebrainchildgroup.com/
![Page 9: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/9.jpg)
Fact Checking is not Easy
“… our Navy is smaller than it's been since 1917,” said Republican candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf
![Page 10: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/10.jpg)
Fact Checking is not Easy
“… our Navy is smaller than it's been since 1917,” said Republican candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf
http://en.wikipedia.org/wiki/United_States_Navy
vs
![Page 11: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/11.jpg)
Existing Fact Checking Projects
Journalists and reporters spend a good amount of time on fact checking
o Politifact http://www.politifact.com/
o FactCheckEU https://factcheckeu.org/
o FullFact http://fullfact.org/
o Snopes http://www.snopes.com/info/whatsnew.asp
o Factcheck http://www.factcheck.org/
![Page 12: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/12.jpg)
ClaimBusters
Long-term goal
o (Partly) automate fact checking process
  sources (social media, interviews, debates, speeches, news) → classification & ranking → factual claims ranked by importance → checked by algorithms / journalists / citizens / crowd (e.g., Twitter users)
o Plan for Election 2016
Current progress
o Classification models for finding check-worthy factual statements
o Preliminary exploration of crowdsourcing fact-checking
![Page 13: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/13.jpg)
Factual Claim Classification
Dataset: presidential debates
o Source: http://www.debates.org/index.php?page=debate-transcripts
o All 30 debates (11 elections) in history: 1960, 1976-2012
o 20k sentences by presidential candidates: removed very short (< 5 words) sentences
Classify each sentence into 1 of 3 classes
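The short-sentence filter mentioned above might look like the following. This is a sketch that counts words by whitespace splitting, which is an assumption; the talk uses NLTK for wrangling and does not say how words were counted. The function name is hypothetical, and the sample sentences are taken from the examples slide.

```python
MIN_WORDS = 5  # sentences with fewer than 5 words are dropped


def drop_short_sentences(sentences, min_words=MIN_WORDS):
    """Keep only sentences with at least `min_words` whitespace-separated tokens."""
    return [s for s in sentences if len(s.split()) >= min_words]


sample = [
    "Hello, New Hampshire!",
    "More people are unemployed today than four years ago.",
    "My mother enjoys cooking.",
    "I will be tough on crime.",
]
kept = drop_short_sentences(sample)
```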
![Page 14: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/14.jpg)
Examples of Sentences
Important factual claims
o “We spend less on the military today than at any time in our history.”
o “The President’s position on gay marriage has changed.”
o “More people are unemployed today than four years ago.”
Unimportant factual claims
o “I was in Iowa yesterday.”
o “My mother enjoys cooking.”
o “I ran for President once before.”
Sentences with no factual claims (just opinions, questions & declarations)
o “Iran must not get nuclear weapons.”
o “7% unemployment is too high.”
o “My opponent is wishy-washy.”
o “I will be tough on crime.”
o “Why should we do that?”
o “Hello, New Hampshire!”
o “Our plan is to reduce tax rate by 10%.”
![Page 15: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/15.jpg)
Ground Truth Collection
Each sentence is labelled by two of many participants. The ground truth includes the sentence only if the two participants agreed on its class label.
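The agreement rule can be expressed in a few lines. In this sketch the record layout (sentence id, first label, second label) and the function name are assumptions made for illustration; the class names NFS, NO and YES are the ones used later in the talk.

```python
def build_ground_truth(labelled):
    """Keep a sentence only when both participants chose the same class label.

    `labelled` is an iterable of (sentence_id, label_a, label_b) tuples.
    Returns a dict mapping sentence_id to the agreed label.
    """
    return {sid: a for sid, a, b in labelled if a == b}


# Hypothetical labelling records.
records = [
    (1, "YES", "YES"),  # agreement: kept
    (2, "NO", "NFS"),   # disagreement: dropped
    (3, "NFS", "NFS"),  # agreement: kept
]
ground_truth = build_ground_truth(records)
```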
![Page 16: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/16.jpg)
How We Use Python
Data wrangling
o Use NLTK (Natural Language Toolkit) to transform debate files into structured data format
o Use mysql-python-connector to store extracted features into a MySQL database
o Use matplotlib to plot classifiers’ performance
Feature extraction
o Use AlchemyAPI (Python wrapper) to extract rich features of sentences: keywords, POS (part-of-speech) tags, sentiments, entities, concepts, taxonomy
Classification
o Use scikit-learn to build classification models
![Page 17: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/17.jpg)
Feature Extraction
Keywords, POS (part-of-speech) tags

```python
import nltk

sentence = 'The tax policy for the middle class is bad.'
pos = nltk.pos_tag(nltk.word_tokenize(sentence))
print(pos)
```

Output:

```
[('The', 'DT'), ('tax', 'NN'), ('policy', 'NN'), ('for', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('class', 'NN'), ('is', 'VBZ'), ('bad', 'JJ')]
```
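One plausible way to turn tagged tokens into classifier features is to count how often each tag occurs. The talk does not say how POS tags were encoded, so the counting scheme below is an assumption. The tagged list is copied from the slide output so that the example runs without NLTK or its data files.

```python
from collections import Counter

# Tagged tokens as shown on the slide.
pos = [('The', 'DT'), ('tax', 'NN'), ('policy', 'NN'), ('for', 'IN'),
       ('the', 'DT'), ('middle', 'NN'), ('class', 'NN'), ('is', 'VBZ'),
       ('bad', 'JJ')]

# Count occurrences of each tag; a missing tag reads as 0.
tag_counts = Counter(tag for _, tag in pos)

# Fixed tag order (hypothetical vocabulary) gives every sentence the same vector layout.
TAG_VOCAB = ['DT', 'IN', 'JJ', 'NN', 'VBZ', 'CD']
pos_vector = [tag_counts[tag] for tag in TAG_VOCAB]
```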
![Page 18: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/18.jpg)
Feature Extraction
Sentiments

```python
from alchemyapi import AlchemyAPI

alchemyapi = AlchemyAPI()
sentence = 'The tax policy for the middle class is bad.'
response = alchemyapi.sentiment('text', sentence)
sentiment = response['docSentiment']['score']
print(sentiment)
```

Output:

```
-0.6532
```
![Page 19: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/19.jpg)
Feature Extraction
Entities

```python
response = alchemyapi.combined('text', sentence, {'sentiment': 1})
print(response['entities'])
```

Output:

```
[{'sentiment': {'type': 'negative', 'score': '-0.653232'},
  'count': '1',
  'type': 'FieldTerminology',
  'relevance': '0.33',
  'text': 'tax policy'}]
```
![Page 20: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/20.jpg)
Feature Extraction
Concepts

```python
print(response['concepts'])
```

Output:

```
[{'opencyc': 'http://sw.opencyc.org/concept/Mx4rvViw25wpEbGdrcN5Y29ycA',
  'dbpedia': 'http://dbpedia.org/resource/Middle_class',
  'freebase': 'http://rdf.freebase.com/ns/m.01lbc_',
  'text': 'Middle class',
  'relevance': '0.921176'},
 {'dbpedia': 'http://dbpedia.org/resource/Social_class',
  'freebase': 'http://rdf.freebase.com/ns/m.07714',
  'text': 'Social class',
  'relevance': '0.869326'}]
```
![Page 21: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/21.jpg)
Feature Extraction
Taxonomy

```python
print(response['taxonomy'])
```

Output:

```
/law, govt and politics / legal issues / legislation
```
![Page 22: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/22.jpg)
Classification Models
Use scikit-learn to build classification models
o Naïve Bayes Classifier (NBC)
o Support Vector Machine (SVM): LinearSVC (linear kernel, multi-class classification)
o Random Forest Classifier (RFC): 200 trees in the forest (n_estimators = 200)
![Page 23: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/23.jpg)
Preliminary Experiments
3 classes
o NFS (non-factual-statement), NO (unimportant factual claim), YES (important factual claim)
5 categories of features
o K: keyword; ET: entity type; P: POS tag; C: concept; T: taxonomy
5 combinations of features (+sentiment, +length)
o K; K+P; K+P+ET; K+P+ET+C; K+P+ET+C+T
Instances
o 1571 sentences in ground truth
o training data : test data = 3:1
o 4-fold cross validation
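The five cumulative feature combinations can be generated and applied to a feature table as sketched below. Two things here are assumptions rather than details from the talk: that "(+sentiment, +length)" means both are included in every combination, and that feature columns carry a category prefix such as `K:` or `P:`. The column names and the helper `select_columns` are hypothetical.

```python
# Feature categories in the order they are added on the slide.
CATEGORIES = ["K", "P", "ET", "C", "T"]

# Cumulative combinations: K; K+P; K+P+ET; K+P+ET+C; K+P+ET+C+T
COMBINATIONS = [CATEGORIES[:i] for i in range(1, len(CATEGORIES) + 1)]
COMBINATION_NAMES = ["+".join(c) for c in COMBINATIONS]


def select_columns(columns, categories):
    """Pick columns whose prefix is in `categories`, plus sentiment and length."""
    keep = set(categories) | {"sentiment", "length"}
    return [c for c in columns if c.split(":", 1)[0] in keep]


# Hypothetical column names using a "<category>:<value>" convention.
columns = ["K:tax", "P:NN", "ET:Person", "C:Middle class",
           "T:/law", "sentiment", "length"]
k_p_columns = select_columns(columns, COMBINATIONS[1])
```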
![Page 24: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/24.jpg)
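Classification Using scikit-learn

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# cross_val_score lived in sklearn.cross_validation in 2015-era releases
from sklearn.model_selection import cross_val_score

# `data` is a pandas DataFrame; the last column (verdict) is the class attribute
features = data.columns[0:-1]

# Split train/test data (holdout), roughly 3:1
msk = np.random.rand(len(data)) <= 0.75
train = data[msk][features]
test = data[~msk][features]
train_verdict = data[msk].verdict
test_verdict = data[~msk].verdict

# Build and apply the model
clf = RandomForestClassifier(n_estimators=200)  # or GaussianNB() or LinearSVC()
clf.fit(train, train_verdict)
prediction = clf.predict(test)

# 4-fold cross validation: mean accuracy across the folds
cv = cross_val_score(clf, data[features], data.verdict, cv=4, scoring='accuracy').mean()
```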
![Page 25: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/25.jpg)
Results: Precision
[Charts for NBC, RFC and SVM]
![Page 26: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/26.jpg)
Results: Recall
[Charts for NBC, RFC and SVM]
![Page 27: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/27.jpg)
Results: F-Measure
[Charts for NBC, RFC and SVM]
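For reference, the three measures named on the results slides can be computed per class as follows. This is a plain-Python sketch with invented labels and predictions; it does not reproduce any numbers from the talk, and the F-measure here is the balanced F1, which is an assumption about what the slides report.

```python
def per_class_scores(truth, predicted, labels=("NFS", "NO", "YES")):
    """Return {label: (precision, recall, f_measure)} treating each class one-vs-rest."""
    scores = {}
    pairs = list(zip(truth, predicted))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        scores[label] = (precision, recall, f_measure)
    return scores


# Hypothetical true labels and classifier predictions.
truth = ["YES", "YES", "NO", "NFS", "NFS", "NFS"]
predicted = ["YES", "NO", "NO", "NFS", "NFS", "YES"]
scores = per_class_scores(truth, predicted)
```

In this toy case the YES class has one true positive, one false positive and one false negative, so its precision, recall and F-measure are all 0.5.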
![Page 28: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/28.jpg)
You are Invited
http://bit.ly/1FSj9pt
![Page 29: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/29.jpg)
Acknowledgment
UTA Students
o Naeemul Hassan
o Afroza Sultana
o Gensheng Zhang
o Joseph Minumol
o Jisa Sebastine
Collaborators
o Bill Adair (Duke)
o Pankaj Agarwal (Duke)
o Sarah Cohen (Columbia)
o James Hamilton (Stanford)
o Ping Luo (Chinese Academy of Sciences)
o Mark Tremayne (UTA)
o Min Wang (Google Research)
o Jun Yang (Duke)
o Cong Yu (Google Research)
![Page 30: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/30.jpg)
Acknowledgment
Funding sponsors
Disclaimer: This material is based upon work partially supported by the National Science Foundation Grants 1018865, 1117369 and 1408928, 2011 and 2012 HP Labs Innovation Research Awards, and the National Natural Science Foundation of China Grant 61370019. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.
![Page 31: Data Science for Computational Journalismranger.uta.edu/~cli/talks/PyData2015_ChengkaiLi.pdf · Data Science for Computational Journalism. Chengkai Li . Associate Professor, Department](https://reader034.fdocuments.in/reader034/viewer/2022042303/5ecee35756ef061a3268a277/html5/thumbnails/31.jpg)
Thank You! Questions?
http://ranger.uta.edu/~cli
http://idir.uta.edu
[email protected]
Please help us to label the data: http://bit.ly/1FSj9pt