
International Journal of Market Research Vol. 55 Issue 6

© 2013 The Market Research Society

DOI: 10.2501/IJMR-2013-000

FORUM

Identifying the real differences of opinion in social media sentiment

Annie Pettit
Research Now

Received (in revised form): 4 September 2013

This study examined the differences in social media sentiment based on author gender, age and country. After creating ten category-generic datasets, millions of social media verbatims from thousands of websites were collected, cleaned of spam, and scored into five-point sentiment scales. The results showed that women exhibit more positive sentiment, older people exhibit more positive sentiment, and Australians exhibit more positive sentiment, while Americans share more negative sentiment. The differences were small but clear, suggesting that research methodologists should apply correction factors to ensure that their results more accurately reflect differences of opinion as opposed to differences of word choice. Business users of social media data can be reassured that correction factors are not required to improve the accuracy of their research.

Introduction

Scales are a cornerstone of market research. Every survey labels and categorises opinions based on scales from 1 to 5, 1 to 7, or even 1 to 100, helping us determine the strength and distribution of opinions on any topic imaginable. They’re how we determine that 49% of people like Coca-Cola and 42% of people like Pepsi, that men like watching sports more than women do, that younger people like electronics more than older people do, and that Canadians like Shania Twain more than Brits or Americans do. Or do they?

One of the problems with scales is that people interpret them differently. I have a rule that I never choose ‘strongly agree’ or ‘strongly disagree’ except in the rarest 1% of circumstances. But other people are far more liberal, opting to use each scale option an equal percentage of the time. Similarly, when men describe once per week as being ‘often’, have we considered that women may instead describe once per month as ‘often’? Do these types of differences manifest themselves among older and younger people? People from Canada vs people from the United States? If one group of people has a naturally different way of responding to scales than another group of people, how are we to know whether any differences in scores are due to differences of opinion as opposed to differences in how they use scales?

Fortunately, it’s an easy thing to research. After running a survey, scores from different groups, whether they’re gender groups or age groups, can be quickly compared. Neutral topics such as soap, carpets or cell phones – products that everyone knows and uses – should be the underlying topic of the research so that no demographic bias is introduced.

But are these neutral topics really neutral? Perhaps women have an inherent affinity towards beauty care products like soap, or women are more tuned into carpeting because it affects the aesthetics of the home, or men are more likely to appreciate cell phones because they like electronics more. These are stereotyped affinities, but who’s to say whether they’re true or false? Even if you run a ‘generic’ survey about ‘generic’ soap and find that women are more likely to check off ‘strongly agree’, how can you know whether women like soap more or the ‘strongly agree’ box more? It’s not a neutral dataset.

Here is where social media sentiment comes into play. Social media datasets still need to be based on some search criterion, and that criterion could be BlackBerry phones, Dove shampoo, McDonald’s hamburgers or any other brand familiar to the internet world. But there could always be unknown underlying demographic brand differences affecting those results. The spectrum could be broadened by collecting data about cell phones in general, shampoo in general or hamburgers in general but, again, the problem of unknown demographic differences remains. There is, however, a wonderful solution to the problem, a solution that is unique to social media data.

When it comes to social media opinions, there are no constraints or biases related to choosing a data source (e.g. we don’t have to worry how the panel company or email list provider sources their responders) or a topic (e.g. Dove users, soap users, beauty product users). Non-specific data can be selected from thousands of websites reflecting millions of topics, virtually neutralising any effect of data source or research topic.

With social media data, datasets can be created based on completely neutral criteria. Instead of creating datasets about brand ‘A’, we can create datasets about the word ‘the’. And since test-retest reliability is essential, multiple neutral datasets based on criteria such as ‘this’, ‘that’ and ‘there’ can be created. Because these are grammatical essentials that cannot be avoided in everyday language, demographically different groups of people have no predisposition to use these words differently. Just try counting the words ‘the’, ‘this’, ‘that’ or ‘there’ in this introduction.

Without a validated measure of brain activity, we can’t truly know if one group of people has inherently more or fewer positive opinions than another group, but we can evaluate how those opinions are expressed in words. And so we progress to the next step. Let’s understand how men and women share their opinions online in written words. Let’s learn whether younger people naturally use more negative words or whether Canadians naturally use more positive words. Let’s learn what our underlying word usage in social media is, so that we may better discriminate between differences of opinion and differences of scale usage.

Method

Datasets

Ten datasets were created, each one consisting of a random sample of English comments/verbatims collected from social media websites. Five datasets were created on the basis that each verbatim contained a neutral grammatical target word (e.g. ‘there’, ‘then’, ‘this’, ‘just’, ‘that’), while the other five datasets contained a neutral category target word (e.g. ‘seven’, ‘circle’, ‘dog’, ‘blue’, ‘house’).
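As an illustration of this dataset-construction step, the sketch below shows how collected comments could be filtered into category-generic datasets by whole-word matching on each target word. It is illustrative only: the sample verbatims and function names are hypothetical, and the paper does not describe its actual collection tooling.

    # Illustrative sketch of target-word dataset construction (not the actual system)
    import re

    GRAMMATICAL_TARGETS = ["there", "then", "this", "just", "that"]
    CATEGORY_TARGETS = ["seven", "circle", "dog", "blue", "house"]

    def build_dataset(verbatims, target_word):
        """Keep every verbatim containing the target as a whole word (case-insensitive)."""
        pattern = re.compile(r"\b" + re.escape(target_word) + r"\b", re.IGNORECASE)
        return [text for text in verbatims if pattern.search(text)]

    # Hypothetical sample of collected social media comments
    sample = [
        "Okay, you wanna just meet there?",
        "There's no way I'm doin that.",
        "I've been to that show seven times.",
    ]

    datasets = {word: build_dataset(sample, word)
                for word in GRAMMATICAL_TARGETS + CATEGORY_TARGETS}
    print(len(datasets["there"]))   # number of verbatims in the 'there' dataset -> 2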

Target verbatims for one of the grammatical datasets might or might not refer to a brand, person or company, and could just as likely not mention any recognisable products or concepts. The only requirement was that they include the generic target word. For instance, verbatims for the ‘there’ dataset included:

• Okay, you wanna just meet there?
• There’s no way I’m doin that.
• It was there responsibility to bring enough copies and they screwed up. [Note the grammatical error in the usage of ‘there’.]


Similarly, verbatims for the ‘seven’ dataset included:

• He graduated from there seven years ago.
• I’ve been to that show seven times.
• There seven a recipe for bacon cookies. [Note that ‘seven’ appears here as the result of a typo.]

Each of the ten datasets contained between 500,000 and 2,600,000 records dated between 1 January 2012 and 31 December 2012, and reflected thousands of blogging, microblogging, video, photo, review, news and commenting sites. Major sites offering hundreds and thousands of verbatims, such as Facebook, Twitter, YouTube, Blogspot, Wordpress, Tumblr and Reddit, as well as virtually unknown websites providing just one or two verbatims, all served as data sources.

All data were collected, cleaned and scored using Conversition’s proprietary cleaning and scoring systems.

Spam elimination system

Using an automated spam elimination process ensured that every piece of data and every dataset was cleaned of the most serious chunks of spam in an identical, objective way. Verbatims eliminated by this system included those with unusual punctuation (e.g. ‘jerseys, pants, rolex, rayban, tickets, watches’), repetitious wording (e.g. ‘britney britney gaga gaga beiber beiber’), unusual sales content (e.g. ‘buy free sales cheap discount best price’), non-normal amounts of profanity and vulgarity, and various other features that independent observers would agree were spam. After spam was eliminated, every remaining verbatim was included in every analysis as appropriate.
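A minimal sketch of what such rule-based screening might look like is shown below, assuming simple heuristics for repetitious wording, sales content and keyword-list punctuation. The thresholds and word lists are illustrative assumptions, not Conversition’s actual rules.

    # Illustrative rule-based spam screening (thresholds and word lists are assumptions)
    import re

    SALES_TERMS = {"buy", "free", "sales", "cheap", "discount", "price"}

    def looks_like_spam(text):
        """Flag verbatims showing the spam features described above."""
        words = re.findall(r"[a-z']+", text.lower())
        if not words:
            return True
        # Repetitious wording, e.g. 'britney britney gaga gaga'
        if len(set(words)) / len(words) < 0.5:
            return True
        # Unusual sales content: several commercial terms in a single verbatim
        if sum(w in SALES_TERMS for w in words) >= 3:
            return True
        # Unusual punctuation: long comma-separated keyword lists
        if text.count(",") >= 5 and len(words) <= text.count(",") + 2:
            return True
        return False

    verbatims = ["buy free sales cheap discount best price",
                 "jerseys, pants, rolex, rayban, tickets, watches",
                 "I really like my new phone"]
    cleaned = [v for v in verbatims if not looks_like_spam(v)]
    print(cleaned)   # -> ['I really like my new phone']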

Sentiment scoring system

The sentiment scoring system used is a proprietary system fine-tuned to measure marketing concepts such as purchasing, pricing, recommending and advertising. It assigns continuous sentiment scores to a variety of language components including emoticons, slang, acronyms, incorrect spellings, phrases, clichés, hashtags and more. As with any sentiment system, human or automated, it is not 100% valid, though blind inter-rater reliability tests have pegged the system at 65% to 70% accurate compared to best-in-class human scoring systems, which are around 85% accurate.


Across millions of verbatims, the law of large numbers ensures that the results are likely to be reliable, though not at the individual level nor for small datasets.

Sentiment scores ranged on a continuous scale from 1.000 (most negative) to 5.000 (most positive). Scores were rounded to the nearest whole number; for example, scores of 1.7 and 2.1 were both rounded to 2. This left five distinct scores: 1, 2, 3, 4 and 5. As such, there were five groups of scores reflecting strong negative, moderate negative, neutral, moderate positive and strong positive opinions. Neutral verbatims included various types of verbatim, from those with truly neutral sentiment (e.g. ‘I really don’t care if I get one or not’) to those with no sentiment (e.g. ‘there it goes’) and those with split sentiment (e.g. ‘it sucks but I kind of like it too’). Every verbatim collected and retained was assigned a sentiment score.
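The bucketing step can be expressed in a few lines; the sketch below assumes standard nearest-integer rounding (the scoring model that produces the continuous scores is proprietary, so the raw scores here are made up).

    # Sketch of the bucketing step: continuous 1.000-5.000 scores onto a five-point scale
    SCALE_LABELS = {1: "strong negative", 2: "moderate negative",
                    3: "neutral", 4: "moderate positive", 5: "strong positive"}

    def bucket(score):
        """Round a continuous sentiment score onto the five-point scale, clamped to 1-5."""
        return min(5, max(1, int(round(score))))

    for raw in (1.7, 2.1, 3.0, 4.6):
        print(raw, "->", bucket(raw), SCALE_LABELS[bucket(raw)])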

Demographic data

There are a variety of ways to obtain demographic data for social media authors, including IP addresses, website domain names (e.g. amazon.ca, amazon.uk, amazon.au), inferring from the verbatim (e.g. ‘I went to Cambridge today’), and capturing user-specified profiles. None of these methods is capable of collecting every piece of demographic data from every author, and for various reasons none of them is 100% valid. For this study, age, gender and country data were collected from social media profiles where that data was made publicly available by the website. In this study, about 47% of verbatims were associated with a gender, 0.4% with an age and 15% with a country. Only results from demographic groups with sufficient sample sizes are shared.
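A hedged sketch of how such fragments might be combined into a single demographic record follows, assuming a country guess from the website’s domain plus whatever the public profile volunteers. The field names and domain mapping are assumptions for illustration, not the system’s actual implementation.

    # Illustrative assembly of a demographic record from public profile data and the source domain
    from urllib.parse import urlparse

    # Hypothetical mapping from country-coded domain endings to countries
    TLD_TO_COUNTRY = {".ca": "Canada", ".uk": "UK", ".au": "Australia",
                      ".ie": "Ireland", ".us": "US"}

    def country_from_url(url):
        """Guess a country from the website domain; return None when there is no signal."""
        host = urlparse(url).netloc
        for tld, country in TLD_TO_COUNTRY.items():
            if host.endswith(tld):
                return country
        return None

    def demographics(profile, source_url):
        """Combine self-declared profile fields with a domain-based country guess."""
        return {
            "gender": profile.get("gender"),          # available for roughly 47% of verbatims
            "birth_year": profile.get("birth_year"),  # available for roughly 0.4%
            "country": profile.get("country") or country_from_url(source_url),  # roughly 15%
        }

    print(demographics({"gender": "female"}, "http://www.amazon.ca/review/123"))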

Statistical analysis

As with most social media listening research, many of the sample sizes were so large that they would render all statistical tests significant at p < 0.01. Consequently, no significance tests were done. Readers are asked instead to consider the results in terms of effect sizes – the magnitude of differences among the groups. Recommendations are provided throughout, but readers are advised to apply their own experiences to consider at what point they would recommend their client take a different strategy as a result of the observed differences.
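In practice this means reading results the way the sketch below does: compute the percentage-point gap at each scale point between two groups and judge its practical magnitude, rather than testing it for significance. The two distributions are illustrative, in the spirit of the gender results reported later.

    # Sketch: compare two groups by effect size (percentage-point gaps), not significance tests
    def point_differences(dist_a, dist_b):
        """Percentage-point gap at each scale point, group B minus group A."""
        return {k: round(dist_b[k] - dist_a[k], 1) for k in dist_a}

    # Illustrative five-point sentiment distributions (percentages)
    group_a = {1: 0.5, 2: 15.2, 3: 61.1, 4: 22.7, 5: 0.5}
    group_b = {1: 0.4, 2: 13.3, 3: 58.1, 4: 27.3, 5: 0.9}

    gaps = point_differences(group_a, group_b)
    print(gaps)                                   # per-point gaps
    print(max(abs(v) for v in gaps.values()))     # largest gap, in points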


Results

Gender

After removing data that did not contain gender information, sample sizes for the ten male groups ranged from 77,000 to 900,000, while those for the female groups ranged from 86,000 to 880,000.

Sentiment distributions were calculated for each of the ten datasets. With one exception (out of 50 comparisons), and then only by 0.04 percentage points, all ten datasets produced identical trends. As such, the ten distributions were averaged together into one aggregate result.

Table 1 shows that men trended towards the negative side of the distribution (by 0.2 points and 1.8 points) whereas women trended towards the positive side of the distribution (by 4.6 points and 0.4 points).

Table 1  Scale differences by gender

                           Male        Female      Difference
Sample size                3,105,471   2,663,954
1 (Strong negative) (%)    0.53        0.35        –0.2
2 (Moderate negative) (%)  15.2        13.3        –1.8
3 (Neutral) (%)            61.1        58.1        –3.0
4 (Moderate positive) (%)  22.7        27.3        4.6
5 (Strong positive) (%)    0.48        0.91        0.4

Implications
Regardless of whether men are more negative or their word choice is more negative, the norm for men is that their social media sentiment scores are inherently more negative than women’s scores. Comparing the genders on a norm-vs-norm basis means that when men rate a brand slightly lower than women do, the two genders are actually rating the brand in the same way. If we are to fairly compare the opinions of men and women in reference to a brand or category, we must apply a correction factor equal to the differences seen here.

Application
In the following example using Nissan automotive data, applying the correction changes some of the relationships among the scores. Table 2 shows that, after decreasing the male negative scores by 0.2 and 1.8 points (the natural difference discovered above), the male score becomes less than, as opposed to greater than, the female score. And, after increasing the male positive score by 4.6 and 0.4 points, it is now greater than, as opposed to equal to, the female score.


Where one might have concluded that women held more positive opinions about Nissan than men, one might now more correctly conclude that men and women have similarly positive opinions.

Table 2  Scores corrected for inherent gender differences

Nissan scores              Female   Male   Correction for male data   Male corrected
1 (Strong negative) (%)    1.1      1.0    –0.20                      0.9
2 (Moderate negative) (%)  10.2     9.5    –1.80                      8.4
3 (Neutral) (%)            42.6     39.5   –3.00                      39.6
4 (Moderate positive) (%)  42.9     46.6   4.60                       47.5
5 (Strong positive) (%)    3.2      3.3    0.40                       3.6
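A minimal sketch of the correction idea is shown below, assuming the correction factor for each scale point is simply added to the male brand distribution. The paper does not spell out exactly how the published ‘Male corrected’ column was produced, so this additive reading is an approximation of the idea rather than a reproduction of Table 2, and the brand distribution here is hypothetical.

    # Sketch only, not the author's exact procedure: shift a male brand distribution
    # by norm-derived correction factors before comparing it with the female one.
    CORRECTION_FOR_MALE = {1: -0.2, 2: -1.8, 3: -3.0, 4: 4.6, 5: 0.4}  # gaps from Table 1

    def correct_male_distribution(male_scores):
        """Apply the per-point correction factors to a male brand distribution."""
        return {k: round(male_scores[k] + CORRECTION_FOR_MALE[k], 1)
                for k in male_scores}

    # Hypothetical brand distribution for male authors (percentages by scale point)
    brand_male = {1: 1.0, 2: 9.5, 3: 40.0, 4: 46.0, 5: 3.5}
    print(correct_male_distribution(brand_male))   # compare against the female distribution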

Theoretically, the differences between the corrected and uncorrected scores are tantalising and thought provoking. Men and women clearly exhibit sentiment differences in their social media conversations. However, in the business world, these differences are small, and give marketers and brand managers confidence to treat gender differences as true differences, not demographic differences that would require an additional stage of transforming data prior to drawing conclusions.

Age

Because of the consistent trends among the ten groups in the gender analysis, the ten datasets were combined into one aggregate dataset for the age analysis as well. This helped alleviate the sample size problem as only a tiny percentage of people share their date of birth in their social media profiles. The total sample size was greater than 64,000, while individual age groups ranged from 488 to 21,644.

Social media authors were assigned to groups based on the decade of their birth. Those born in 2000 or later do participate in social media but, for personal protection and privacy reasons, they are not included in the data and were not part of these analyses. Similarly, while social media authors born prior to 1930 also participate in social media, their sample size was insufficient to show reliable results (n = 66). Also, although a sample of this size is typically adequate in survey research, caution is advised when interpreting the data from those born in the 1930s (n = 488) because it is relatively small here.
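The grouping rule itself is straightforward; a sketch of the cohort assignment under the stated exclusions follows (the birth years shown are hypothetical).

    # Sketch of the cohort assignment: group authors by birth decade,
    # dropping those born in 2000 or later and those born before 1930.
    def birth_cohort(birth_year):
        """Return a decade label such as '1960s', or None if the author is excluded."""
        if birth_year >= 2000 or birth_year < 1930:
            return None
        return f"{(birth_year // 10) * 10}s"

    birth_years = [1934, 1968, 1987, 1995, 2004, 1921]
    print([birth_cohort(y) for y in birth_years])
    # -> ['1930s', '1960s', '1980s', '1990s', None, None]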

Table 3 shows an interesting age trend. Except for the oldest age group – probably a symptom of the small sample size – authors offered more positive verbatims as they aged. This came at the expense of fewer neutral verbatims, while the percentage of negative verbatims remained essentially stable across the age groups.

Table 3  Scale differences by age

                           Born 1930s   Born 1940s   Born 1950s   Born 1960s   Born 1970s   Born 1980s   Born 1990s
Sample size                488          2,676        4,337        6,492        8,307        21,644       20,156
1 (Strong negative) (%)    0.0          0.3          0.1          0.2          0.2          0.2          0.3
2 (Moderate negative) (%)  15.0         9.4          9.7          10.1         10.3         10.4         10.2
3 (Neutral) (%)            65.5         66.0         66.9         68.1         68.4         69.9         71.9
4 (Moderate positive) (%)  19.4         23.9         23.2         21.4         21.0         19.3         17.4
5 (Strong positive) (%)    0.0          0.3          0.2          0.1          0.1          0.2          0.2



We cannot determine whether this trend is associated with these specific cohorts (i.e. a culture effect) and will shift from decade to decade as the authors age, or whether it is rooted in age (i.e. an age effect). This determination will be possible only by examining differences across future years – yet another difficulty, as social media data is still in its infancy and data from the last three years is not necessarily directly comparable.

Implications
As with gender, correction factors can be applied to normalise sentiment from each of the age groups and allow us to compare differences of opinion with the age effect removed. Brand scores would change once the correction is applied but, given these results, the changes would be of interest to academics and would not require business users to adjust their marketing strategies.

Country

As with the previous analysis, the ten datasets were combined into one aggregate dataset. Five countries having English as their dominant language were selected for comparison, thereby staving off the additional complication of accounting for word choices among people for whom English is a second language. Sample sizes for the individual countries ranged from 9,000 to 960,000.

Differences among the five English-speaking countries were slight (see Table 4), the most noticeable ones being less negativity among Australian and Irish authors (12.6% negative and 11.9% negative), more negativity among Americans (14.6% negative), and more polarisation among UK authors (1.3% strong negative and 1.1% strong positive). The Canadian stereotype of peacekeeping played out as their scores were generally equal to or within the extremes of the other countries.

Table 4  Scale differences by country

                           Australia   Canada   Ireland   UK        US
Sample size                29,372      54,505   9,323     155,068   962,270
1 (Strong negative) (%)    0.8         0.9      0.5       1.3       1.3
2 (Moderate negative) (%)  11.8        12.1     11.4      12.3      13.3
3 (Neutral) (%)            64.1        62.0     64.9      59.3      61.6
4 (Moderate positive) (%)  22.7        24.2     22.5      26.1      22.9
5 (Strong positive) (%)    0.6         0.9      0.6       1.1       0.9



Implications
Among these five countries, the largest correction factors are 5.7 points (Ireland vs UK, neutral) and 3.6 points (Ireland vs UK, moderate positive), which are large enough to change trends between uncorrected and corrected data. However, once again, although the differences are large enough to pique the interest of academic researchers, they are probably insufficient to cause business users to change their marketing tactics.

Cautions
As with any research, the list of confounds that could fundamentally change the findings is endless. Those listed below are a few that are relevant to this study.

• Given the unlimited combinations and permutations of age, gender, country, website source, and myriad other differentiating variables in social media, secondary variables were not held constant during each analysis.

• Given that comparisons can be made only for people who share their demographic data, we must accept that people who are open with their personal data may use words in a different way (e.g. more emotive, more excitable) than more reserved people.

• It is impossible to manually confirm the validity of millions of authors’ demographic profiles. We must trust that the spam processes removed data that was more likely to contain demographic errors, and accept that no matter the research process used all volunteered data have a non-zero error rate.

• Though the generalised theory probably transcends the specific listening system used here, the specific results are unique to this system’s collecting, cleaning and sentiment processes. These processes may remove more or less spam than other systems, thereby changing the distribution of sentiment scores seen elsewhere. Similarly, the sentiment system is uniquely positioned for the type of data that this system focuses on and so may generate correction factors that are different from those of other systems. It is likely that each social media data provider would generate different correction values.



• Finally, it is impossible to truly know how people’s opinions differ. Word choice does not equal true opinion, but rather an individual’s attempt to quickly find a familiar word that best suits their current frame of mind.

Conclusions

We know that different groups of people think differently, talk differently and answer surveys differently. Now it is clear that people write differently and share their opinions in social media differently. In today’s culture of ‘due yesterday’, it is desirable to measure social media sentiment quickly, as is, but this can easily lead to misinterpretations. The following generalisations will help readers better interpret social media results that differ based on demographic groups.

• When it comes to gender, expect men to be slightly more negative and neutral. Take notice when men’s scores are equal to women’s scores, because it means that the men actually hold slightly more positive opinions than the women.

• When it comes to age, expect sentiment to become more positive and less neutral as authors age. When all age groups exhibit similar sentiment, it means that the older people actually hold relatively more negative opinions.

• When it comes to country, expect Australian and Irish people to be slightly less negative, Americans to be slightly more negative and Brits to be slightly more polarising. Expect Canadians to be the peacekeepers in the middle.

Small differences between demographic groups are important to academic and theoretical researchers. To know how much men and women, or older and younger people, differ in their online word choices is intriguing and thought provoking. Like it or not, cultural norms tell us which words are appropriate to use given our demographic characteristics, and every culture and generation invokes its unique take on language. Understanding these differences helps us to better relate to one another, and create brands and marketing strategies that are more relevant to each group of people.

However, outside of academia, differences of these magnitudes are insufficient for clients to change their marketing approaches. Business owners generally want to see differences of 5 percentage points or more between groups before they take differential actions. Differences of that magnitude were rare in this study. Efficient and effective marketing strategies must be based on large and meaningful differences of opinion, not slightly different opinions.

While practitioners should keep these differences in mind as they conduct their research, it is important to recognise that, for the most part, they do not need to change their marketing strategies to account for demographic differences in social media word choices. Their interpretation of data should continue without undertaking the extra step of data transformation. Chances are that when a practitioner decides that the differences are large enough to warrant a unique marketing strategy, the observed differences will be much larger than those discovered here.

About the author

Annie Pettit is Vice President, Research Standards at Research Now and Chief Research Officer at Peanut Labs. She specialises in social media listening research, survey methods and research data quality, and is a globally sought-after conference speaker. Annie is Editor in Chief of MRIA’s Vue magazine. She writes the LoveStats marketing research blog, and is the author of The Listen Lady, a novel about social media research.

Address correspondence to: Annie Pettit, Research Now, 3080 Yonge Street, Suite 2000, Toronto, Ontario, Canada, M4N 3N1

Email: [email protected]
Twitter: @LoveStats