The Noise and Value of Social Media Data:
An Applicable Solution
Angus W.H. CHEONG Ph.D. President, Macao Polling Research Association
Director, eRS e-Research & Solutions (Macao)
The 6th International Workshop on Internet Survey and Survey Methodology, 16-17 September 2014
Daejeon, Republic of Korea
Big data is an all-encompassing term for any
collection of data sets so large and complex
that it becomes difficult to process using on-
hand data management tools or traditional
data processing applications. (Wikipedia, 2014)
Social Media Data 2
Big Data: A Paradigm Shift
Past | Present
Sample | All Possible Data
Single Source | Connecting the Dots (Cross-Over)
Data Structured | Unstructured/Semi-Structured
Static | Real-time/Nearly Real-time
Lag Information | Lead Information
Data is Cost | Data is Corporate Asset
Causality (Why?) | Correlation (What?)
Accuracy | Uncertainty
Types of Big Data
Big reference data:
• Addresses, country list, yellow pages, etc.
Big transaction data:
• Telecommunications, e-commerce, chain supermarket, credit cards, etc.
Web logs / APPS :
• User viewing behaviors, selection behaviors, etc.
Sensor data (machine based):
• Radar, sonar, GPS unit, CCTV Camera, card reader, RFID, etc.
Social media data (human based):
• Facebook likes, Linkedin updates, tweets from SNS, UGC, etc.
Forms of Big Data
Big Data comprises:
• Internet data: non-structured
• Industry/enterprise data: structured
• IoT data: semi-structured
Technologies/Methods for handling Big Data
Transactional:
• Business Intelligence (BI)/Online Analytical Processing (OLAP)
• Cluster Analysis
• Data Mining
• SQL
• A/B Testing
Non-transactional:
• Textual Analysis
• Sentiment Analysis
• Content Analysis
• Network Analysis
Does Big data really replace sampling data?
In their book entitled Big Data: A Revolution That Will Transform How We Live, Work and Think, Kenneth Cukier and Viktor Mayer-Schonberger (2013) claimed that
“with less error from sampling we can accept more measurement error” and that “social sciences have lost their monopoly on making sense of empirical data, as big-data analysis replaces the highly skilled survey specialists of the past.”
“The new algorithmists will be experts in the areas of computer science, mathematics, and statistics; and they would act as reviewers of big data analyses and predictions.”
Google Flu Trends, a tool which predicts flu rates through
search queries.
The service is meant to give real-time information on seasonal
flu outbreaks by tracking a series of search terms that tend to
be used by people who are currently suffering from the flu.
This would provide a bit of lead time over the methods used
in the US and abroad, which aggregate monitoring data
from a large number of healthcare facilities.
http://www.google.org/flutrends/intl/en_us/
How did that go wrong?
1. Big Data Hubris
(Lazer et al., 2014)
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. …, quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data.
The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.
2. Algorithm dynamics
(Lazer et al., 2014)
Algorithm dynamics are the changes made by engineers to improve the commercial service and by consumers in using that service. Several changes in Google’s search algorithm and user behavior likely affected GFT’s tracking.
All empirical research stands on a foundation of measurement. Is the instrumentation actually capturing the theoretical construct of interest? Is measurement stable and comparable across cases and over time? Are measurement errors systematic? At a minimum, it is quite likely that GFT was an unstable reflection of the prevalence of the flu because of algorithm dynamics affecting Google’s search algorithm.
What happens to social media data?
The Pervasiveness of Social Media
Facebook, Twitter, Weibo and Google+ are at the center of the social media ecosystem.
Publishing with blogging platforms (WordPress, Blogger, Live Journal, TypePad, Over-Blog…) and wikis (Wikipedia, Wikia, Mahalo…)
Sharing services for pictures, links, videos, music, products… (Delicious, Tumblr, Instagram, Pinterest, TheFancy, YouTube, Vimeo, Vine, Spotify, Deezer, SoundCloud, MySpace, Slideshare…)
Discussing with knowledge platforms (Quora, Github, Reddit, StackExchange…), mobile chat applications (Skype, Kik, WhatsApp, SnapChat…), and their Asian counterparts (WeChat, Sina Weibo, Tencent Weibo, KakaoTalk, Line…)
Networking for B to C audiences (Badoo, Tagged…) and professionals (LinkedIn, Viadeo, Xing…), as well as Russian and Asian social networks (VKontakte, Qzone, RenRen, Mixi)
People are at a high degree of social presence
The new communication is a people-to-people
communication process without social and
geographical boundaries.
It is a social level communication pattern which is
already part of our social life.
Web 2.0 has opened an efficient channel for social information flow.
The content-richness, timeliness, reachability and the
many to many communication mode are the best-fit
facilitators of social information flow.
The participation in the above communication process
offers people a high degree of social presence.
Characteristics of social media Data
• Unstructured: data is not tabularized
• Variety: data from inconsistent sources
• Velocity: real-time nature of data
• Volume: huge volume of data
• Veracity: reliability and validity are uncertain
Numbers + Texts: e.g. number of likes/shares, posts
The challenges
Social media data is full of noise that may affect each step.
Most noise comes from data that is:
• Irrelevant
• Repetitive
• Fake
• Automated
Such data creates measurement errors and representation problems.
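The noise sources listed above can be partly screened with simple rules before any analysis. The following is a minimal sketch, not the presenter's method: the post fields (`author`, `text`), the keyword list and the `max_posts_per_author` cutoff are all hypothetical choices made for illustration. Fake content is not handled here, since it usually needs human judgment rather than a rule.

```python
from collections import Counter

def reduce_noise(posts, keywords, max_posts_per_author=3):
    """Drop irrelevant, repetitive and likely automated posts.

    posts: list of dicts with hypothetical keys "author" and "text".
    keywords: terms that mark a post as relevant to the issue studied.
    max_posts_per_author: illustrative cutoff for flagging automation.
    """
    keywords = [k.lower() for k in keywords]
    per_author = Counter(p["author"] for p in posts)
    seen = set()
    kept = []
    for post in posts:
        # Normalise case and whitespace so trivial variants match.
        text = " ".join(post["text"].lower().split())
        if not any(k in text for k in keywords):
            continue  # irrelevant: no keyword present
        if text in seen:
            continue  # repetitive: same text already kept
        if per_author[post["author"]] > max_posts_per_author:
            continue  # likely automated: unusually prolific author
        seen.add(text)
        kept.append(post)
    return kept

posts = [
    {"author": "a", "text": "Flu season is here"},
    {"author": "b", "text": "flu  season is here"},
    {"author": "c", "text": "Nice weather today"},
    {"author": "bot", "text": "Buy flu pills 1"},
    {"author": "bot", "text": "Buy flu pills 2"},
    {"author": "bot", "text": "Buy flu pills 3"},
    {"author": "bot", "text": "Buy flu pills 4"},
]
clean = reduce_noise(posts, keywords=["flu"])
```

With this toy input only the first post survives. In practice each rule would need tuning and checking against human-coded samples, because an over-eager filter introduces its own representation problem.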
How to reduce the noise?
Web Mining: Discover patterns from the Web
Web Mining (WM): Usage Mining, Content Mining, Structure Mining
• Web content mining aims to extract/mine useful information or
knowledge from web page contents.
• Web usage mining refers to the discovery of user access patterns
from Web usage logs.
• Web structure mining tries to discover useful knowledge from the
structure of hyperlinks.
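Two of the three branches above can be illustrated on a single page: content mining reads the visible text, and structure mining reads the hyperlinks. This is a minimal sketch using only Python's standard `html.parser` module; the `PageMiner` class and the sample HTML are hypothetical, and usage mining is left out because it needs server logs rather than page markup.

```python
from html.parser import HTMLParser

class PageMiner(HTMLParser):
    """Collect visible text (content) and hyperlinks (structure)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Structure mining: record the target of every anchor tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Content mining: keep non-empty text fragments.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)

html = '<p>Flu cases rise.</p><a href="http://example.org/a">source</a>'
miner = PageMiner()
miner.feed(html)
miner.close()
content = " ".join(miner.text_parts)
links = miner.links
```

A real crawler would also need to skip script and style blocks and resolve relative links; this sketch ignores both.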
A typical workflow of WM
• Collecting data from the web
• Structuring data
• Extracting useful data
• Discovering meaning
• Analysis & Insight
• Visualization
How to reduce the noise?
Issue | Solutions
Reliability & validity | Content analysis: inter-coder reliability & quality control
Measurement errors | Search queries, extraction & quality control
What is Intercoder Reliability?
Content analysis can be defined as “the systematic, objective
quantitative analysis of message characteristics” (Neuendorf,
2001, p.1).
Often it involves trained analysts, called coders, analyzing text,
video, or audio and describing the content according to a
group of open-ended (write-in) and close-ended (multiple
choice) variables.
Intercoder reliability refers to the extent to which two or more
independent coders agree on the coding of the content of
interest with an application of the same coding scheme.
Intercoder reliability is a critical component in the content
analysis of open-ended survey responses, without which the
interpretation of the content cannot be considered objective
and valid, although high intercoder reliability is not the only
criterion necessary to argue that coding is valid (Lavrakas, 2008).
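Agreement between two coders is often summarised with a chance-corrected statistic. The slides do not name one, so Cohen's kappa is used here purely as a common choice: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed share of agreement and p_e the agreement expected by chance from each coder's category frequencies. The function name and the sample codes below are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    n = len(coder_a)
    if n == 0 or n != len(coder_b):
        raise ValueError("both coders must label the same non-empty item set")
    # Observed agreement: share of items given the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal category frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)

coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
kappa = cohens_kappa(coder_a, coder_b)
```

Here the coders agree on 8 of 10 posts (p_o = 0.8) and chance agreement is 0.5, so kappa is about 0.6. Raw percentage agreement alone would overstate reliability, which is why a chance-corrected measure is usually reported.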
Quality control
Real-time coding
Real-time testing
Real-time surveillance
Real-time output
How does it operate?
[Figure: system architecture. Components: Internet/Web, Crawler, Web DB, Extraction, Indexing, I-DB, Analytics, Users. Labels: Order/Assignments; Real-time updating; Unstructured; Structured; Clustering, Categorization, Coding, Statistical analysis.]
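The extraction and indexing stages of such a system turn unstructured page text into structured, searchable records. The following is a minimal sketch under stated assumptions: crawled pages arrive as plain strings, and the `extract` and `build_index` helpers and their record fields are hypothetical. It does not describe how eMiner is actually implemented.

```python
from collections import defaultdict
import re

def extract(raw_pages):
    """Turn unstructured page text into structured records."""
    records = []
    for doc_id, raw in enumerate(raw_pages):
        # Lowercase and keep alphanumeric runs as tokens.
        tokens = re.findall(r"[a-z0-9]+", raw.lower())
        records.append({"id": doc_id, "tokens": tokens})
    return records

def build_index(records):
    """Map each token to the sorted ids of the records containing it."""
    index = defaultdict(set)
    for record in records:
        for token in record["tokens"]:
            index[token].add(record["id"])
    return {token: sorted(ids) for token, ids in index.items()}

raw_pages = ["Casino policy debate", "Policy on housing", "Housing prices"]
index = build_index(extract(raw_pages))
hits = index.get("policy", [])
```

Looking up "policy" returns the ids of the first two pages. Analytics such as clustering or coding would then run on the structured records rather than on the raw text.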
What Values can we mine?
The fifth V
The Web Mining part……
In the information-overloaded era,
the 5W1H no longer exist
No where: diversified sources
No when: instant data from media & user generated content (UGC)
No who: anonymous on the net
No what: out of focus of an issue
No how: hard to track an issue
No why: hard to find patterns, hard to test theories empirically, hard to know the stories behind
The Unstructured Web!
From a public opinion perspective……
The Total Public Opinion (TPOP) Framework
Methods:
• Standard general public opinion surveys
• Deliberation: Deliberative Polling
• Public forums, discussion groups: World Café
• Content analysis
• Textual analysis
• Web Mining (WM): ePOP eMiner
Opinion types:
• General public opinion: the silent majority; representativeness
• Refined opinion: deliberative, detailed, informed voices
• Traditional media: editorials & commentary; elites, mainstream voices
• Social media: opinions from grass roots, elites, activists, etc.
The Total Public Opinion (TPOP) Framework
The Total Public Opinion (TPOP) framework is a holistic
approach to study the opinions generated in a certain
society concerning a certain or a series of social
issue(s)/event(s).
It refers to the all-round access to diversified opinions held by
people in different populations through different channels.
From a methodological point of view, it includes data
collection methods such as CATI telephone interviews, face-
to-face interviews, content analysis and dynamic web
mining from the Internet.
A sophisticated e-opinion mining platform, ePOP eMiner, has
been developed to implement dynamic surveillance and
analysis of online opinions and media content using
this approach.
Web Mining (Web Content Mining)
The value of social media data in the TPOP approach
TPOP components:
• CATI (new samples and panel)
• Field survey (on site)
• In-depth interview (from FS respondents)
• Focus group (from CATI)
• Web mining (social media)
Effectiveness
TPOP: CATI, field survey, in-depth interview, focus group, web mining
Take campaign monitoring as an example: eMiner is used to do the content analysis and to report to clients during the monitoring process, so that they know the ongoing situation and can adjust their strategies and refine the campaign process.
Demonstration of eMiner
Conclusions and further challenges
Big data, and social media data in particular, is not a replacement for sampling data. Unless its reliability and validity are considered, the results are meaningless.
Transforming huge amounts of textual data into quantifiable and meaningful information should not be merely an algorithmic or machine job.
Quality control in all scientific research is the base for having quality data for quality analysis, and social media data is no exception.
Based on the above, the value of social media data can be discovered when the combination of knowledge of data and machine works well.
How to make big data small or to do sampling is another big challenge.
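One possible way to "make big data small" is to draw a fixed-size uniform random sample from a stream whose length is not known in advance. Reservoir sampling (Algorithm R) is named here only as an illustration; the slides do not propose it, and the function name and parameters are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

posts = (f"post-{n}" for n in range(10_000))
subset = reservoir_sample(posts, k=5, seed=42)
```

The result is a uniform sample of the posts that were collected, not of the population behind them, so the representation problems discussed earlier still apply.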
Thank you very much!
About Macao Polling Research Association, MPRA
A non-profit research-oriented association founded by Dr. Angus Cheong and his colleagues in 2010, aiming at promoting professional practice of survey research and fostering collaboration with researchers and professionals in empirical research.
Signature longitudinal survey projects:
• Macao Happiness Index, an annual survey since 2010
• Macao Confidence Index, a monthly survey since Jan 2010
http://www.macaopolls.org/
eRS, an innovative research & solutions company founded by Dr. Angus Cheong in Macao.
In the BIG DATA era, eRS believes that the best professional research solutions (S) come from the effective integration of the application of information technology (e) and scientific social research methods.
eRS currently serves clients from all over the world on survey research, social analytics, data mining and consultancy.