The Noise and Value of Social Media Data: An Applicable Solution

Angus W.H. CHEONG Ph.D.
President, Macao Polling Research Association
Director, eRS e-Research & Solutions (Macao)
[email protected]

The 6th International Workshop on Internet Survey and Survey Methodology
16-17 September 2014, Daejeon, Republic of Korea

Source: kostat.go.kr/iwsm/download/2014/3-4. W.H. Cheong.pdf



Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. (Wikipedia, 2014)


Big Data: A Paradigm Shift

Past → Present

• Sample → All Possible Data
• Single Source → Connecting the Dots (Cross-Over)
• Data Structured → Unstructured/Semi-Structured
• Static → Real-time/Nearly Real-time
• Lag Information → Lead Information
• Data is Cost → Data is Corporate Asset
• Causality (Why?) → Correlation (What?)
• Accuracy → Uncertainty


Types of Big Data

Big reference data:

• Addresses, country list, yellow pages, etc.

Big transaction data:

• Telecommunications, e-commerce, chain supermarket, credit cards, etc.

Web logs / apps:

• User viewing behaviors, selection behaviors, etc.

Sensor data (machine based):

• Radar, sonar, GPS unit, CCTV Camera, card reader, RFID, etc.

Social media data (human based):

• Facebook likes, LinkedIn updates, tweets from SNS, UGC, etc.


Forms of Big Data

[Diagram: Big Data is shown as comprising Internet data, industry/enterprise data, and IoT data, with the labels non-structured, structured, and semi-structured.]


Technologies/Methods for handling Big Data

Transactional:

• Business Intelligence (BI)/Online Analytical Processing (OLAP)
• Cluster Analysis
• Data Mining
• SQL
• A/B Testing

Non-transactional:

• Textual Analysis
• Sentiment Analysis
• Content Analysis
• Network Analysis


Does Big data really replace sampling data?

In their book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier (2013) claimed that

“with less error from sampling we can accept more measurement error” and that “social sciences have lost their monopoly on making sense of empirical data, as big-data analysis replaces the highly skilled survey specialists of the past.”

“The new algorithmists will be experts in the areas of computer science, mathematics, and statistics; and they would act as reviewers of big data analyses and predictions.”


Google Flu Trends (GFT) is a tool that predicts flu rates from search queries.

The service is meant to give real-time information on seasonal flu outbreaks by tracking a series of search terms that tend to be used by people who are currently suffering from the flu.

This would provide a bit of lead time over the methods used in the US and abroad, which aggregate monitoring data from a large number of healthcare facilities.

http://www.google.org/flutrends/intl/en_us/
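As a rough illustration of the idea, the sketch below fits a straight line between the log-odds of a flu-related query share and the log-odds of a flu visit rate, then uses the fit to estimate the rate for a new week. The log-odds linear form, the variable names, and every number are assumptions made here for illustration; none of them come from the slides.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: maps any real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic weekly data (hypothetical): share of searches that are
# flu-related, and share of doctor visits that are flu-related.
query_share = [0.002, 0.004, 0.006, 0.010, 0.015]
flu_visit_rate = [0.010, 0.018, 0.025, 0.040, 0.055]

intercept, slope = fit_line(
    [logit(q) for q in query_share],
    [logit(r) for r in flu_visit_rate],
)


def predict(share):
    """Estimate the flu visit rate from a new week's query share."""
    return inv_logit(intercept + slope * logit(share))


print(round(predict(0.008), 4))
```

A fit like this is only as stable as the relationship between searching and being ill, and that stability is what the following slides question.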


What went wrong?

1. Big Data Hubris (Lazer et al., 2014)

“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. …, quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data.

The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.

2. Algorithm Dynamics (Lazer et al., 2014)

Algorithm dynamics are the changes made by engineers to improve the commercial service and by consumers in using that service. Several changes in Google’s search algorithm and user behavior likely affected GFT’s tracking.

All empirical research stands on a foundation of measurement. Is the instrumentation actually capturing the theoretical construct of interest? Is measurement stable and comparable across cases and over time? Are measurement errors systematic? At a minimum, it is quite likely that GFT was an unstable reflection of the prevalence of the flu because of algorithm dynamics affecting Google’s search algorithm.
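The question of systematic measurement error can be made concrete with a small worked example. Assume, for illustration only, that each of n observations equals the true value \mu plus a constant systematic error b plus independent random noise with mean zero and variance \sigma^2; these symbols are introduced here and do not appear in the slides.

```latex
y_i = \mu + b + \varepsilon_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\\[6pt]
\mathbb{E}[\bar{y}] - \mu = b, \qquad
\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}
\\[6pt]
\operatorname{MSE}(\bar{y})
  = \bigl(\mathbb{E}[\bar{y}] - \mu\bigr)^2 + \operatorname{Var}(\bar{y})
  = b^2 + \frac{\sigma^2}{n}
```

Under these assumptions, collecting more data shrinks only the second term. The squared systematic error b^2 stays the same however large n becomes, and dependencies among observations would slow even the shrinkage of the variance term.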


What happens to social media data?


The Pervasiveness of Social Media


Facebook, Twitter, Weibo and Google+ are at the center of the social media ecosystem.

Publishing with blogging platforms (WordPress, Blogger, Live Journal, TypePad, Over-Blog…) and wikis (Wikipedia, Wikia, Mahalo…)

Sharing services for pictures, links, videos, music, products… (Delicious, Tumblr, Instagram, Pinterest, TheFancy, YouTube, Vimeo, Vine, Spotify, Deezer, SoundCloud, MySpace, Slideshare…)

Discussing with knowledge platforms (Quora, Github, Reddit, StackExchange…), mobile chat applications (Skype, Kik, WhatsApp, SnapChat…), and their Asian counterparts (WeChat, Sina Weibo, Tencent Weibo, KakaoTalk, Line…)

Networking for B to C audiences (Badoo, Tagged…) and professionals (LinkedIn, Viadeo, Xing…), as well as Russian and Asian social networks (VKontakte, Qzone, RenRen, Mixi)


People are at a high degree of social presence

The new communication is a people-to-people communication process without social and geographical boundaries.

It is a social level communication pattern which is already part of our social life.

Web 2.0 has opened an efficient channel for social information flow.

The content-richness, timeliness, reachability and the many-to-many communication mode are the best-fit facilitators of social information flow.

The participation in the above communication process offers people a high degree of social presence.


Characteristics of social media Data

• Unstructured: data is not tabularized
• Variety: data from inconsistent sources
• Velocity: real-time nature of data
• Volume: huge volume of data
• Veracity: reliability and validity are uncertain

Numbers + Texts: e.g. number of likes/shares, posts


The challenges

Full of noise that may affect each step.

Most noise comes from data that is

• Irrelevant
• Repetitive
• Fake
• Automated

Such data creates measurement errors and representation problems.
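Three of the noise types listed above lend themselves to simple rule-based screening, sketched below. This is a minimal illustration and not the method used by any tool named in these slides: the post fields (`author`, `text`), the keyword set, and the per-author threshold are all hypothetical, and the token matching assumes whitespace-delimited text. Detecting fake content generally needs human judgment or richer signals, so it is left out.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, drop URLs and punctuation, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def reduce_noise(posts, keywords, max_posts_per_author=3):
    """Keep posts that pass three simple noise checks."""
    per_author = Counter(post["author"] for post in posts)
    seen_bodies = set()
    kept = []
    for post in posts:
        body = normalize(post["text"])
        if set(body.split()).isdisjoint(keywords):
            continue  # irrelevant: no topic keyword present
        if per_author[post["author"]] > max_posts_per_author:
            continue  # possibly automated: unusually prolific author
        if body in seen_bodies:
            continue  # repetitive: same text after normalization
        seen_bodies.add(body)
        kept.append(post)
    return kept


# Made-up posts for illustration.
posts = [
    {"author": "a1", "text": "The new bus fare is too high!"},
    {"author": "a2", "text": "the new bus fare is too high"},
    {"author": "a3", "text": "Lovely weather today"},
    {"author": "bot", "text": "Cheap fare deals 1 http://spam.example/1"},
    {"author": "bot", "text": "Cheap fare deals 2 http://spam.example/2"},
    {"author": "bot", "text": "Cheap fare deals 3 http://spam.example/3"},
    {"author": "bot", "text": "Cheap fare deals 4 http://spam.example/4"},
]

kept = reduce_noise(posts, keywords={"fare"})
print([post["author"] for post in kept])
```

Each rule trades precision for recall: a strict threshold also drops genuinely active users, which is one reason the slides pair automated extraction with quality control.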


How to reduce the noise?


Web Mining: Discover patterns from the Web

Web Mining (WM): Usage Mining, Content Mining, Structure Mining

• Web content mining aims to extract/mine useful information or knowledge from web page contents.

• Web usage mining refers to the discovery of user access patterns from Web usage logs.

• Web structure mining tries to discover useful knowledge from the structure of hyperlinks.
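To make two of these definitions concrete, the sketch below pulls the visible text (the raw material of content mining) and the hyperlinks (the raw material of structure mining) out of a single page using only Python's standard library. The class name `PageMiner` and the sample HTML are made up for illustration; usage mining is not shown because it works on server logs rather than page markup.

```python
from html.parser import HTMLParser


class PageMiner(HTMLParser):
    """Collect visible text and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # visible text fragments, in document order
        self.links = []        # href values of <a> tags
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only fragments.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


html = """<html><head><style>p {color: red}</style></head>
<body><p>Public hearing on <a href="/transport">transport</a> policy.</p>
<a href="https://example.org/forum">Forum</a></body></html>"""

miner = PageMiner()
miner.feed(html)
text = " ".join(miner.text_parts)
links = miner.links

print(text)
print(links)
```

Real pages are far messier than this sample, so a production crawler would also need encoding detection, boilerplate removal, and link normalization.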


A typical workflow of WM

1. Collecting data from the web
2. Structuring data
3. Extracting useful data
4. Discovering meaning
5. Analysis & Insight
6. Visualization


How to reduce the noise?

Issue: Reliability & validity
Solutions: Content analysis: inter-coder reliability & quality control

Issue: Measurement errors
Solutions: Search queries, extraction & quality control


What is Intercoder Reliability?

Content analysis can be defined as “the systematic, objective, quantitative analysis of message characteristics” (Neuendorf, 2001, p.1).

Often it involves trained analysts, called coders, analyzing text, video, or audio and describing the content according to a group of open-ended (write-in) and close-ended (multiple choice) variables.

Intercoder reliability refers to the extent to which two or more independent coders agree on the coding of the content of interest when applying the same coding scheme.

Intercoder reliability is a critical component in the content analysis of open-ended survey responses, without which the interpretation of the content cannot be considered objective and valid, although high intercoder reliability is not the only criterion necessary to argue that coding is valid (Lavrakas, 2008).
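The slides do not name a specific reliability statistic. As one common choice, the sketch below computes simple percent agreement and Cohen's kappa, which corrects agreement for what two coders would reach by chance. The function names, the sentiment labels, and the ten coded items are hypothetical and serve only as an illustration.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of items on which both coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("coders must code the same non-empty set of items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders: (p_o - p_e) / (1 - p_e)."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: both coders pick the same category independently.
    p_expected = sum(
        freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical sentiment codes assigned to the same ten posts.
coder_a = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neg"]

print(percent_agreement(coder_a, coder_b))
print(round(cohens_kappa(coder_a, coder_b), 3))
```

Kappa handles exactly two coders and nominal categories; projects with more coders or ordinal codes would need a different statistic, and the acceptable threshold is a judgment the research team has to set.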


Quality control

Real-time coding

Real-time testing

Real-time surveillance

Real-time output


How does it operate?

[Diagram: system architecture. Components shown: Internet/Web, Crawler, Web DB, Extraction, Indexing, I-DB, Analytics, Users. Other labels: Order/Assignments; Real-time updating; Unstructured; Structured; Clustering, Categorization, Coding, Statistical analysis.]


What Values can we mine?


The fifth V


The Web Mining part……

In the information-overloaded era, the 5W1H no longer exist:

No where: diversified sources

No when: instant data from media & user generated content (UGC)

No who: anonymous on the net

No what: out of focus of an issue

No how: hard to track an issue

No why: hard to find patterns, hard to test theories empirically, hard to know the stories behind

The Unstructured Web!


From a public opinion perspective……


The Total Public Opinion (TPOP) Framework

Methods:

• Standard general public opinion surveys
• Deliberation: deliberative polling
• Public forums, discussion groups: World Café
• Content analysis
• Textual analysis
• Web Mining (WM): ePOP eMiner

Opinion sources:

• Social media: opinions from grass roots, elites, activists, etc.
• Traditional media: editorials & commentary; elites, mainstream voices
• General public opinion: the silent majority; representativeness
• Refined opinion: deliberative, detailed, informed voices


The Total Public Opinion (TPOP) Framework

The Total Public Opinion (TPOP) framework is a holistic approach to studying the opinions generated in a given society concerning one or a series of social issues or events.

It refers to all-round access to the diversified opinions held by people in different populations through different channels.

From a methodological point of view, it includes data collection methods such as CATI telephone interviews, face-to-face interviews, content analysis and dynamic web mining from the Internet.

A sophisticated e-opinion mining platform, ePOP eMiner, has been developed to implement dynamic surveillance and analysis of online opinions and media content using this approach.


Web Mining (Web Content Mining)


The value of social media data in the TPOP approach

TPOP combines:

• CATI (new samples and panel)
• Field survey (on site)
• In-depth interview (from FS respondents)
• Focus group (from CATI)
• Web mining (social media)


Effectiveness

[Diagram: TPOP with CATI, field survey, in-depth interview, focus group, and web mining.]

Take campaign monitoring as an example: eMiner is used to do the content analysis, and results are reported to the clients during the monitoring process so that they can follow the ongoing situation, adjust their strategies, and refine the campaign process.


Demonstration of eMiner


Conclusions and further challenges

Big data, social media data in particular, is not a replacement for sampling data. Without considering its reliability and validity, the results are meaningless.

Transforming a huge amount of textual data into quantifiable and meaningful information should not be merely an algorithmic or machine task.

Quality control in all scientific research is the base for having quality data for quality analysis, and social media data is no exception.

Based on the above, the value of social media data can be discovered when knowledge of the data and machine processing work well together.

How to make big data small or to do sampling is another big challenge.
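One standard way to "make big data small" is to draw a uniform random sample from a stream without storing the whole stream. The sketch below uses reservoir sampling (Algorithm R); it is offered as an illustration of the challenge named above and is not a technique described in the slides. The function name and parameters are chosen here for the example.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # both endpoints inclusive
            if slot < k:
                sample[slot] = item
    return sample


# Example: sample 100 post IDs from a stream of 10,000.
sampled_ids = reservoir_sample(range(10_000), k=100, seed=1)
print(len(sampled_ids))
```

A uniform sample of posts is still not a representative sample of people: it inherits whatever coverage and noise problems the stream has, so the reliability and validity checks discussed earlier apply here as well.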


About Macao Polling Research Association, MPRA

A non-profit research-oriented association founded by Dr. Angus Cheong and his colleagues in 2010, aiming at promoting professional practice of survey research and fostering collaboration with researchers and professionals in empirical research.

Signature longitudinal survey projects:

• Macao Happiness Index, an annual survey since 2010
• Macao Confidence Index, a monthly survey since Jan 2010

http://www.macaopolls.org/


eRS, an innovative research & solutions company founded by Dr. Angus Cheong in Macao.

In the BIG DATA era, eRS believes that the best professional research solutions (S) come from the effective integration of the application of information technology (e) and scientific social research methods.

eRS currently serves clients from all over the world on survey research, social analytics, data mining and consultancy.
