Big Data Final Paper - Warriors Final

Project Final Report

Exploratory Analysis of Social Media Images to Inform Product Innovation, Marketing & Promotions

Big Data Analytics Summer 2015

Matthew Blough, Eric DeFina, Zixin Mao, Sandilya Tumma

8/12/2015

Abstract

Social Listening is an established activity allowing organizations to generate

consumer/customer insights and make more informed business decisions from public data in

social media. While it has traditionally relied on text analytics tools, the rise of platforms such as

Instagram, Pinterest, Snapchat, Tumblr, and more has transformed the content, and therefore

data, generated in social media. As such, the analysis of unstructured data from images will be

critical to “social listening” on today’s platforms to fully understand context, sentiment,

meaning, and more. Through this research, we will explore whether we can use big data

platforms to read, analyze (trends, commonalities) and summarize unstructured data from social

media images to develop insights that feed business and marketing decisions for an online travel

agency (e.g., Travelocity).

Introduction

Social Listening is an activity that allows brands and organizations to learn from public

data generated by consumers in social networks. By mining this unstructured data, companies

can generate insights from observing online consumer conversation, and then use these insights

to make smarter and more informed business decisions, such as product innovations, decisions

and changes, marketing campaigns, promotional offers, and more.

Over the past couple of years, however, there has been a transformational shift in the

content published by consumers to social networks. With the rise of platforms such as

Instagram, Pinterest, Snapchat, and more, social conversation has become dominated by visual

communication and content. In addition, “traditional” social networks such as Facebook and

Twitter have also seen an influx of visual posts versus traditional comments, tags, and other text-based

content. In 2014, 500 million image-based posts were shared each day in social media, often

without the pairing of text to provide context to the image. This shift has been largely enabled

by the advancement and adoption of smartphones, as well as faster data connection speeds. For

users, visuals are easier and faster to consume.

In order to continue to mine the full sphere of social media for business insights and

questions, we must go beyond text analytics and use big data tools to collect and analyze

imagery quickly and efficiently. As image content now dominates the social web, it will be

critical to understand the context, sentiment and meaning of images in the same way tools have

historically parsed this data from text.

Business Significance

Market research, and specifically understanding consumer needs and the market

environment, has long been a tenet of running a successful and profitable business. In 2014

alone, the market research industry generated over $40 billion in sales globally. Today, the

Internet and the rise of social media have created new opportunities for research and insights. It is no

longer necessary, in many cases, to set up formal, and expensive, studies in order to understand

and listen to consumers. In addition, the scale and size of the data has offered the ability to

analyze behavior of much larger groups of people compared to the smaller sample sizes of

traditional research studies. Since access to social media has become freely available to

interested organizations, many have turned to the analysis of this massive public data set as a

new form of consumer, market, and competitive insights.

Advertising has been a key activity for large online travel agencies to convert consumers

and drive sales. Expedia and Travelocity alone spent over $4 billion on advertising in 2014.

To make that advertising most effective, it is critical to understand consumer insights

and create advertisement plans that can drive consumers down the purchase funnel, from product

awareness, to actually purchasing a trip. Analysis of social media images can provide these key

insights that can bolster our advertisement effectiveness and ultimately sales. By knowing what

types of images consumers are posting, and what those images consist of, we can draw

conclusions about what travel options consumers are looking for, who people most commonly travel

with, and what activities they are doing. This information, far more insightful than the transactional

data we have traditionally had access to, can be used to create more engaging

advertising creative and promotional bundles that better meet the wants and needs of our

target audience.

Problem Statement

An online travel agency, TravelWeb, would like to determine the most effective new

advertising and promotion campaign based on consumers’ travel behavior, activity and trends, in

order to increase sales. Based on information extracted from social media imagery, we want to

answer:

• What imagery and creative should we be using for our marketing campaigns, ads,

website imagery and social media content?

• What deals and packages should we be offering?

• How should we structure our offers to best meet the wants and desires of

consumers?

• What bundles and deals should we create?

Methodology

The process of this project is to take a large set of unstructured data from social media in

the form of images, transform it into a structured data set by using a computer vision algorithm,

then analyze the structured data set via data mining techniques in order to gain insights into

consumer compositions and preferences.

The first step of our methodology is hypothesis development; in other words, we needed

to outline our business interests. Our hypotheses span several topics. One is to understand

what type of imagery, whether hiking, camping, skiing, or cruise travel, can be used to create

marketing campaigns or ads for prospective clients. Another is to see who people are

traveling with in order to understand how to cater to their needs and interests on vacation trips.

A third is to determine what kinds of bundles should be offered in terms of activities, foods, and excursions, based on

these social media images. For example, if people take more pictures while water skiing

than while sitting around a campfire because of the thrill of the activity, travel companies

can offer packages for water sports such as jet skiing in order to maximize revenue from that specific activity.

The dataset of images has already been given to us. Following Figure 1.2, the next task is

to take the images and run them through the Microsoft ComputerVision API to extract structured

quantitative and qualitative information from each picture. This provides information on facial

attributes such as gender and age, image colors, object categories, and a score indicating how

confident each prediction is. The dataset has over 150,000 images, and a big data

platform is useful for processing these pictures quickly and efficiently. Since API calls are made

per image and one output in JSON format is produced for each image, we end up with 150,000

individual JSON records.
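
The per-image extraction loop described above might look like the following minimal sketch. The `analyze` callable is a stand-in for the actual ComputerVision API request, whose endpoint, key, and parameters are not given in this report; the function name, folder arguments, and the added `image` field are hypothetical.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def extract_records(image_dir, output_dir, analyze):
    """Run `analyze` on every image in image_dir and write one JSON file per image.

    `analyze` is a placeholder for the vision API call: it receives the raw
    image bytes and must return a JSON-serializable dict.
    """
    image_dir, output_dir = Path(image_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        record = analyze(image_path.read_bytes())
        record["image"] = image_path.name  # remember which image produced the record
        out_path = output_dir / (image_path.stem + ".json")
        out_path.write_text(json.dumps(record))
        written.append(out_path)
    return written
```

Passing the API call in as a function keeps the loop independent of any particular service and makes it possible to try the loop on a handful of images with a dummy `analyze` before spending real API calls.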

Figure 1.1

While the JSON records we obtained from the ComputerVision API are structured data, in

order to conveniently perform analysis, the data set needs to be further transformed into a

relational data structure. This takes two steps. The first step is to aggregate the

individual JSON records into one single file. This is necessary due to the flexible nature of the JSON

format, i.e., it does not require individual files to share the same set of fields. Therefore, in

order to make sure fields from all the files are included, these files need to be properly

aggregated. The second step is to convert the single JSON data file into a simple relational

structure with columns and rows. To accomplish this, we utilized a tool named Konklone.
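
As an illustration of the two steps above, the following sketch aggregates a folder of JSON records and flattens them into one CSV table whose header is the union of all fields. It is not the tool used in the project; the function names and the underscore-joined column naming are assumptions made for the example.

```python
import csv
import json
from pathlib import Path

def flatten(record, prefix=""):
    """Turn nested dicts/lists into one flat dict with underscore-joined keys."""
    flat = {}
    items = record.items() if isinstance(record, dict) else enumerate(record)
    for key, value in items:
        name = f"{prefix}{key}"
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat

def json_files_to_csv(json_dir, csv_path):
    """Combine every *.json file in json_dir into one CSV with a shared header."""
    rows = [flatten(json.loads(p.read_text())) for p in sorted(Path(json_dir).glob("*.json"))]
    # Records may have different fields, so the header is the union of all keys.
    columns = sorted({key for row in rows for key in row})
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

Records that lack a field simply get an empty cell in that column, which is what makes the aggregated table safe to load into a relational tool.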

[Process diagram labels: Background Information; Hypothesis Statement; Business Insight; Big Data usefulness; Vision API; Python to extract raw data; Clean up the data for analysis (ETL process); Data Mining Tool IBM SPSS Modeler (Classification, Clustering, Association); Exploratory analysis; Find insight for business value; Business decisions for advertising companies]

After a relational database is constructed, it is time to perform analytics. The platforms

we selected are Microsoft Azure and its Machine Learning Studio component. Azure is a

powerful big data platform with easy navigation and access to numerous plug-ins. Its Machine

Learning Studio allows us to apply different types of analytical techniques to the data. We can

easily perform descriptive analytics by slicing and dicing the data set using SQLite queries and then

calculating statistics on the slices. At a more advanced level, we can create and run data mining

models such as classification, clustering and association. Here we can attempt to find

patterns, trends, and correlations that are useful to the business question at hand. The

business implication is to see what kinds of images consumers are taking and begin advising

travel companies and agencies on how to better target their advertisements to specific

activities and leisure events. The goal is to help these companies increase revenue and

maximize profits by avoiding wasted spend on promoting the wrong activities. Why

promote parasailing at a location that isn’t suited to it, rather than at

offshore islands where consumers are posting parasailing pictures every day?

Figure 1.2

Project Domain

The project domain is broken down by the ETL process - Extraction, Transformation,

Load. Extraction is the challenging part of the process, where we must connect the online

vision API to Microsoft Azure. This allows us to feed images into Python so that the script

can pull information from the API and give us an output JSON file for each image. Python

runs a loop through all the images in the directory folder and writes out

thousands of files for analysis. Once we have compiled these JSON files, we are ready

to transform them into useful data. One option here is to go

through Amazon Web Services' MapReduce in order to combine the data into one big file. Once

we have one big file, we can convert it to a CSV file, open it in Microsoft Excel,

and clean it up as a proper dataset. This dataset will be our primary source of analysis once

we load it into the IBM SPSS Modeler application. The load process takes us to the modeler,

where we can perform exploratory analysis and find key insights into the data. The key findings

are what will be useful to businesses promoting their vacation packages in a more efficient

manner and spending resources where they find it best to maximize revenue.

Analytical Methods

We are looking to use three main categories of analytics: classification, clustering, and

association. Classification will allow us to understand how categories of color, objects, age, and

scores appear together across a large collection of images. Which image characteristics are most

common in social media images? Are pictures of younger people more common than pictures of

older people? Classification can help us understand the trends in these

images. Another tool we can look into is clustering. Clustering allows us to group together

images whose characteristics are closely related. This can be by color or

by category, to name just two. The association method gives us connection analysis across

images of various categories. Age, gender, and color attributes can be analyzed to see which

combinations of characteristics are closely associated with each other. Characteristics that

are closely associated should show a high confidence percentage, which indicates how closely they

are related. One example: are there more pictures of buildings alone than

pictures of faces with buildings in the background?
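
To make the confidence measure concrete, the sketch below computes support and confidence for a rule such as "building implies face" over a toy set of tagged images. The tags and data are invented for illustration, and the helper names are hypothetical; this is the standard definition of rule confidence, not output from the project's model.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) = support(A and B) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

# Each image is represented by the set of tags attached to it (made-up data).
images = [
    {"building", "face"},
    {"building", "face"},
    {"building"},
    {"beach", "face"},
]
print(confidence(images, {"building"}, {"face"}))
```

Here two of the three images tagged "building" also contain a face, so the rule's confidence is two thirds.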

Output/Results

We were interested in finding insightful results through exploratory analysis and seeing

whether we could identify any patterns or trends in the dataset of image information. To start,

we built a simple model in Microsoft Azure to compute descriptive statistics for the

variables most relevant to the insight. Figure 1.3 shows the different nodes connected:

we imported our dataset in the Reader node and connected it to the Project Columns node to select

the columns we were most interested in. The Project Columns node resembles a

“filter” node from other applications and lets us concentrate on the specific variables that

are of value to us. Lastly, the Descriptive Statistics node was connected to the Project

Columns node in order to output the statistical values of our categories for analysis.

Figure 1.3

In order to get some deeper insights, we needed to drill down the data set by slicing and

dicing it. For instance, an interesting aspect to look at is consumer composition by gender,

gender association and age groups. Figure 1.4 shows us the different slices we created for our

analysis. For example, we ran the following query to isolate records about images with two male

faces.
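
The original query is not reproduced in this text. As a hedged illustration only, a slice of this kind could be written as follows, using Python's built-in `sqlite3` module and a hypothetical table layout (`face_count`, `face1_gender`, `face2_gender` are assumed column names, not the project's actual schema).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (image TEXT, face_count INTEGER, "
    "face1_gender TEXT, face2_gender TEXT)"
)
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?, ?)",
    [
        ("a.jpg", 2, "Male", "Male"),
        ("b.jpg", 2, "Male", "Female"),
        ("c.jpg", 1, "Male", None),
    ],
)

# Keep only images with exactly two detected faces, both labeled male.
two_male = conn.execute(
    "SELECT image FROM images "
    "WHERE face_count = 2 AND face1_gender = 'Male' AND face2_gender = 'Male'"
).fetchall()
print(two_male)
```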

Another important use for queries is to separate trustworthy data from noisy data.

This matters because, although computer vision is getting more accurate day by day, it is still

far from 100% accurate. Therefore, we need to filter out data that

have very low prediction accuracy or ambiguous categories that are too generic for any

meaningful analysis. Take the following query, for example: it removes records in which the

ComputerVision API produced a prediction score lower than 10% in the first object

category. In addition, it removes records that are categorized as “abstract” or “others”.
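
The query itself did not survive in this text. A sketch of an equivalent filter, assuming hypothetical columns `category1_name` and `category1_score` (with the score stored as a fraction between 0 and 1), might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (image TEXT, category1_name TEXT, category1_score REAL)"
)
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [
        ("a.jpg", "outdoor", 0.85),
        ("b.jpg", "abstract", 0.90),  # too generic, dropped
        ("c.jpg", "food", 0.05),      # score below 10%, dropped
        ("d.jpg", "Others", 0.40),    # too generic, dropped
        ("e.jpg", "beach", 0.10),
    ],
)

# Drop low-confidence predictions and the catch-all categories.
trusted = conn.execute(
    "SELECT image FROM images "
    "WHERE category1_score >= 0.10 "
    "AND lower(category1_name) NOT IN ('abstract', 'others') "
    "ORDER BY image"
).fetchall()
print(trusted)
```

Lowercasing the category name before comparing is a defensive choice in case the labels are not consistently cased in the flattened data.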

Figure 1.5 shows the detailed results for the various statistics. The count,

median, mode, range, min, max, and average are displayed for numeric variables. This is especially useful

for variables such as category score and face age, where numeric values are given

and we can identify patterns: for example, the average age of the faces detected

in this collection of images is around 30, while individual face ages

range from as low as 1 to as high as 96.
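
For readers who want to reproduce this kind of summary outside Azure, the statistics listed above can be computed with Python's standard `statistics` module. The ages below are invented sample values, not the project's data, and `describe` is a hypothetical helper.

```python
import statistics

def describe(values):
    """Return the summary statistics listed above for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

face_ages = [1, 24, 30, 30, 35, 96]  # made-up ages, not the project's data
summary = describe(face_ages)
print(summary)
```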

Figure 1.4

Figure 1.5

Clustering analysis helps identify records with similar characteristics and group them together. With image

clustering, one can identify the different types of images that are similar to each other through

various measurements. One of them is the Euclidean distance, which measures how

far apart two records or cluster centers are. Figure 1.6 shows how the cluster model was built in

Microsoft Azure. Reader and Project Columns nodes were once again inserted to select the

variables of interest. We were most interested in three category columns along with

their main color categorization. Then we added the “Train Clustering Model” node in order to

group the records into four different clusters. This process took some extra time, as it

had to train the model before we could extract results from it. Even so, running it in Azure was five

times as fast as running it on other platforms such as IBM SPSS Modeler. This was a major

advantage to us from a big data perspective: 120,000+ images can be processed more quickly

in Azure than on other platforms. The K-Means Clustering node was added to cluster

our final results. The Metadata Editor was used to name the clusters with the numeric

values 1, 2, 3, and 4.
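
The idea behind the clustering step can be sketched in a few lines of plain Python. This is a generic k-means with Euclidean distance on one-hot encoded category and color values, not the Azure module's implementation; the rows are invented, the helper names are hypothetical, and two clusters are used instead of four to keep the toy data small.

```python
import math

def one_hot(rows):
    """Encode rows of categorical values as 0/1 vectors (one slot per column value)."""
    vocab = sorted({(col, value) for row in rows for col, value in enumerate(row)})
    index = {key: i for i, key in enumerate(vocab)}
    vectors = []
    for row in rows:
        vec = [0.0] * len(vocab)
        for col, value in enumerate(row):
            vec[index[(col, value)]] = 1.0
        vectors.append(vec)
    return vectors

def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# (category, dominant color) pairs; made-up rows for illustration only.
rows = [("beach", "Blue"), ("food", "Yellow"), ("beach", "White"), ("food", "Green")]
labels = kmeans(one_hot(rows), k=2)
print(labels)
```

On this toy input the two beach rows end up in one cluster and the two food rows in the other, because rows sharing a category value are closer to each other in the encoded space.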

Figure 1.6

The clustering results were exported in CSV format from the K-Means Clustering node.

From there we compiled the CSV table into the clustering results shown in Figure 1.7. The four

clusters group categories and colors that are close to each other in

characteristics. The clusters are distinctly different from one another and show which category names

are associated with which colors.

Figure 1.7

Cluster 1: Outdoor, Building, Street, Tree, Text; colors: Grey, Black

Cluster 2: Food, Drinks, Crowd, People; colors: Yellow, Blue, Green, Black

Cluster 3: Abstract, Others

Cluster 4: Beach, Water, Sky; colors: Blue, White

Scope & Limitations

The main limitation of this project is the amount of data. With more data, the scores generated

would be more robust, resulting in greater precision and accuracy in analyzing the images.

Moreover, our images come from just one source, which is another limitation. While the old adage, “a

picture is worth a thousand words,” may hold true, we are hoping to narrow results to find the

most important descriptive elements for analyzing an image. Additionally, due to the lack of

specificity of the ComputerVision API output, we were unable to perform classification on the data

for enhanced insights.

Policy/Managerial Implications

Improved picture recognition and description can allow managers to bolster “social

listening” on today’s media platforms to better understand the context and sentiment of an image and the reason

a picture is shared. With enhanced image recognition, managers can make

more informed business decisions, such as product innovations and changes,

marketing campaigns, and promotional offers, and can emulate successful imagery in their own

campaigns. Moreover, this project can help understand which elements of an image make it go

viral. Greater analysis and understanding of what customers take pictures of and what they share

also allows a business to create a better product search for customers.

Conclusions & Future Research

Images are the new text on the web. They are easy to share and more engaging than text.

The trend will continue in favor of images and we believe that analysis of images will grow

tremendously in the coming years. Expanding on the importance of social listening, more

insights can be drawn from interpreting pictures. Enhancements in image perspective analysis,

along with GPS and sentiment overlays, will improve clustering and classification in order to better predict

what appeals to specific customers and increase sales. Social media company Snapchat, an

ephemeral photo and video sharing app, currently charges $400,000 for ad space in a story

generating 20 million views. Meanwhile, Facebook has used its massive store of photos to

develop a way to recognize people in photos even if their faces are obstructed, identifying

individuals with 83% accuracy using a method dubbed PIPER, an acronym for pose invariant

person recognition. As the quantity of images shared online increases, the quality of the

algorithms processing photos will improve and bolster analysis. Understanding why pictures were taken, which

elements are important, what sparked the moment, and how to better react and cater to what

customers desire are the driving forces behind how image analytics will proceed in the future.

Sources

1. http://www.fastcompany.com/3000794/rise-visual-social-media

2. http://blogs.adobe.com/digitalmarketing/social-media/visual-social-snapchat-pinterest-and-the-rise-of-media-rich-marketing/

3. http://wersm.com/visual-web-the-next-big-thing/

4. http://blogs.wsj.com/digits/2015/06/23/facebook-claims-photo-recognition-breakthrough/

5. http://recode.net/2015/06/17/snapchats-making-some-pretty-serious-money-from-live-stories/

6. https://www.esomar.org/uploads/industry/reports/global-market-research-2014/ESOMAR-GMR2014-Preview.pdf

7. http://skift.com/2015/02/20/priceline-and-expedias-advertising-arms-race-in-2014/

8. Mary Meeker, 2014 Internet Trends Report

9. https://www.forrester.com/Big+Datas+Big+Meaning+For+Marketing/quickscan/-/E-res114782

10. http://www.forbes.com/sites/groupthink/2015/05/01/visual-listening-social-medias-next-frontier/3/
