Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data...

21
Technical Appendix Catch the Pink Flamingo Analysis Produced by: Prabhat Tripathi Acquiring, Exploring and Preparing the Data Data Exploration Data Set Overview The table below lists each of the files available for analysis with a short description of what is found in each one. File Name Description Fields ad-clicks.csv Contain records of all clicks on ads by players Timestamp: the date-time when the event (click) occurred txId: a unique id within this file userSessionId: the id of the user session where this click occurred teamid: id of the team of the player who clicked on the ad adCategory: the category of the ad clicked buy-clicks.csv All in-app purchase entries Timestamp: the time when the purchase occurred txId: a unique if within this data set userSessionId: the id of the user session where this purchase occurred buyId: the id of the item bought userId: user that purchased the item price: price of the item purchased users.csv Contains all unique users in the game Timestamp: the date-time when each user “first” played the game (when the record was created) id: unique id given to each user nick: nick name chosen by the user

Transcript of Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data...

Page 1: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Technical Appendix

Catch the Pink Flamingo Analysis Produced by: Prabhat Tripathi

Acquiring, Exploring and Preparing the Data

Data Exploration Data Set Overview The table below lists each of the files available for analysis with a short description of what is found in each one.

File Name Description Fields

ad-clicks.csv Contain records of all clicks on ads by players

Timestamp: the date-time when the event (click) occurred txId: a unique id within this file userSessionId: the id of the user session where this click occurred teamid: id of the team of the player who clicked on the ad adCategory: the category of the ad clicked

buy-clicks.csv All in-app purchase entries Timestamp: the time when the purchase occurred txId: a unique if within this data set userSessionId: the id of the user session where this purchase occurred buyId: the id of the item bought userId: user that purchased the item price: price of the item purchased

users.csv Contains all unique users in the game

Timestamp: the date-time when each user “first” played the game (when the record was created) id: unique id given to each user nick: nick name chosen by the user

Page 2: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

twitter: twitter handle of the user dob: date of birth country: two letter country code

team.csv All teams terminated in the game

teamid: unique id assigned to each team name: name of the team teamCreationTime: when team record was created teamEndTime: timestamp when the “last” member left the team currentLevel: current level reached by the team

team-assignments.csv Player-team mapping table time: when the user joined the team team: team id userid: user id Assignmentid: a unique id assigned to each user-team mapping

level-events.csv Logs of teams completing game levels

time: when a level start or end occurred eventid: a uniqueid assigned to each record teamid: team id level: level started or completed eventType: start or end

user-session.csv Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one started.

userSessionid: a unique id for the session. assignmentid: the team assignment id for the user to the team. startTimeStamp: a timestamp denoting when the session started. endTimeStamp: a timestamp denoting when the session ended. team_level: the level of the team during this session. platformType: the type of platform of the user during this session.

Page 3: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

game-clicks.csv Contains all clicks by users time: when the click occurred. clickid: a unique id for the click. userid: the id of the user performing the click. usersessionid: the id of the session of the user when the click is performed. isHit: denotes if the click was on a flamingo (value is 1) or missed the flamingo (value is 0)

Aggregation Amount spent buying items 21407

# Unique items available to be purchased 6

A histogram showing how many times each item is purchased:

A histogram showing how much money was made from each item:

Page 4: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Filtering A histogram showing total amount of money spent by the top ten users (ranked by how much money they spent).

The following table shows the user id, platform, and hit-ratio percentage for the top three buying users: Rank User Id Platform Hit-Ratio

(%)

1 2229 iphone 11.5970%

Page 5: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

2 12 iphone 13.0682%

3 471 iphone 14.5038%

Page 6: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Data Classification Analysis

Data Preparation Analysis of combined_data.csv

Sample Selection Item Amount

# of Samples 4619

# of Samples with Purchases 1411

Attribute Creation A new categorical attribute was created to enable analysis of players as broken into 2 categories (HighRollers and PennyPinchers). A screenshot of the attribute follows:

<Fill In: Describe the design of your attribute in 1-3 sentences.> Here price_categories attribute is added as “numeric binner” data manimulation node. This attribute partitions rows into two bins – HighRollers (AvgPrice > 5.0) and PennyPichers (AvgPrice <= 5.0). The new attribute, price_categories is “appended” at the end.

Page 7: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

The creation of this new categorical attribute was necessary because <Fill in 1-2 sentences>. We are creating a model to predict some characteristics of “highroller” users. In order to create this model and learn this behavior using train data-set, we would need to create this new attribute.

Attribute Selection The following attributes were filtered from the dataset for the following reasons:

Attribute Rationale for Filtering

userId This id attribute should not have any impact on the prediction model.

userSessionId This Id attribute (column) should not have any influence on whether a user is highroller or pennypincher.

Avg(price) We have derived attribute (price-category) from this attribute. IT is therefore no longer needed for analysis.

Data Partitioning and Modeling The data was partitioned into train and test datasets. The train data set was used to create the decision tree model. The trained model was then applied to the test dataset. This is important because we would want to test our model on data that was “not” used to train it. When partitioning the data using sampling, it is important to set the random seed because it ensures that you will get the same partitions every time you execute this node. This is important to get reproducible results. A screenshot of the resulting decision tree can be seen below:

Page 8: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Evaluation A screenshot of the confusion matrix can be seen below:

Page 9: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Confusion matrix provides comparison between “actual” and “predicted” values of the test attribute (price-categories in this case). Out of the test data, which is 40% of the total 1411 or ~565 rows, we see that our model predicted:

1. 308 PennyPinchers were correctly predicted but 2. 27 PennyPinchers were incorrectly predicted as HighRollers 3. 192 HighRollers were correctly predicted but 4. 38 HighRollers were incorrectly predicted as PennyPinchers

Analysis Conclusions The final KNIME workflow is shown below:

What makes a HighRoller vs. a PennyPincher? From the analysis, it appears that “platformType” predicts HighRoller vs PennyPincher. iPhone users are more likely to be HighRollers.

Specific Recommendations to Increase Revenue

1. Promote game more to iPhone users (ios platform)

2. There are still android high-rollers so do not stop android platform development – just the focus and marketing energy should be on iphone platform.

Page 10: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Yes, the analysis clearly identifies that iPhone users are HighRollers and other platformType users are PennyPinchers. Yes, and at least one is related to marketing to iPhone users. Few HighRollers are seen outside of iphone users. This maybe due to lack of useful application or hardness for use from extra-iphone-platforms. We should inspect the application or interface problem for extra-iphone-users.

Page 11: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Clustering Analysis

Attribute Selection

Training Data Set Creation

The training data set used for this analysis is shown below (first 5 lines): Dimensions of the training data set (rows x columns) : 531 x 4 # of clusters created: 2

Cluster Centers

These clusters can be differentiated from each other as follows: First number (field1) is each array refers to number of sessions, second number (field 2) is the hit ratio whereas the third number (field 3) is the revenue per user.

Page 12: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Therefore: compare first number in two clusters to find our how users in two clusters differ in terms of their game engagements? compare second number to find out how users differ in terms of game expertise levels in the two clusters? compare the third number to find out how much revenue users in two clusters generate? Cluster 1 is different from the others in that… In the first cluster (cluster 1) players play the game more often (almost 50% more) but they generate revenue more than 5 times. The hit ratio in two clusters appears to be similar and therefore does not have much differentiation. Cluster 2 is different from the others in that… Players in clusters are less engaged and generate less revenue compared to players in cluster 1.

Recommended Actions We are assuming that Eglence Inc. gets paid for showing ads and for hosting in-app purchase items.

Page 13: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Graph Analytics Analysis

Modeling Chat Data using a Graph Data Model (Describe the graph model for chats in a few sentences. Try to be clear and complete.)

The above model shows the nodes and relationships on the Chat-Data graph model. In the following description, nodes are marked bold. User creates/joins/leaves TeamChatSessions Every TeamChatSessions are owned by a Team User creates ChatItems which are always part of an existing TeamChatSession a ChatItem can be in ResponseTo another ChatItem and may Mention a User

User TeamChatSession

Team

ChatItem

ChatItem

UserJoins/Leaves

CreatesSession

CreateCh

at

OwnedByPartOf

Joins/Leaves

CreatesSession

Men=oned

ResponseTo

PartOf

Page 14: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Creation of the Graph Database for Chats Describe the steps you took for creating the graph database. As part of these steps

i) Write the schema of the 6 CSV files 1) File: chat_create_team_chat.csv Description: A line is added to this file when a player creates a new chat with their team. User -> CreatesSession -> TeamChatSession Example: userid, teamid, TeamChatSessionID, timestamp 559,48,6288,14567 876,15,6289,24244 1166,68,6290,65522 2) File: chat_item_team_chat.csv Description: User creates a ChatItem which as a part of an existing TeamChatSession User -> CreateChat -> ChatItem ChatItem -> PartOF -> ChatSessionId Example: userid, teamid, timestamp 1956,6299,6305 2081,6296,6311 1166,6290,6316 3) File: chat_join_team_chat.csv Creates an edge labeled "Joins" from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Joins edge.

Page 15: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

User -> Joins -> TeamChatSession Example: userid, TeamChatSessionID, teamstamp 559,6288,12345 876,6289,15468 1166,6290,15648 4) File: chat_leave_team_chat.csv User -> Leaves -> TeamChatSession Creates an edge labeled "Leaves" from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Leaves edge. Example: userid, chatid, timestamp 1244,6821,1464241204.0 1074,6838,1464243024.0 350,6777,1464246654.0 5) File: chat_mention_team_chat.csv ChatItem -> Mentioned -> User Each row represent a chat item that mentioned a user Example: ChatItem, userid, timeStamp 6349,2508 6366,2491

Page 16: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

6371,104 6) File: chat_respond_team_chat.csv A line is added to this file when a player responds to a chat post. ChatItem -> ResponseTo -> ChatItem Example: chatid1, chatid2,timestamp 6326,6305,21564 6364,6326,54544 6371,6366,54567

ii) Explain the loading process and include a sample LOAD command Loading process includes:

- Put all .csv files into the neo4j database folder (when you launch neo4j it shows its database dir, the files need to be inside that)

- One my one load each csv file and create the node – edges as per the schema of the file The following example LOAD command, for instance, loads the data file that contains relationship between two ChatITems. Each line describes the ChatItem id (Column 1, index 0) which is in ResponseTo another ChatItem id given in column 2 (index 1). LOAD CSV FROM "file:///chat_respond_team_chat.csv" AS row MERGE (ci:ChatItem {id: toInt(row[0])}) MERGE (ci2:ChatItem {id: toInt(row[1])}) MERGE (ci)-[:ResponseTo{timeStamp: row[2]}]->(ci2)

iii) Present a screenshot of some part of the graph you have generated. The graphs must include clearly visible examples of most node and edge types. Below are two acceptable examples. The first example is a rendered in the default Neo4j distribution, the second has had some nodes moved to expose the edges more clearly. Both include examples of most node and edge types.

Page 17: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

Finding the longest conversation chain and its participants Report the results including the length of the conversation (path length) and how many unique users were part of the conversation chain. Describe your steps. Write the query that produces the correct answer.

1) How many chats are involved in it? The longest path length with relation ResponseTo Query: match p=(a:ChatItem)-[:ResponseTo*]->(c:ChatItem) return length(p) order by length(p) desc limit 1; Result: 9 • That means, 10 chatitems are involved in the longest chat path.

2) How many users participated in this chain? Cypher command:

match p=(ci:ChatItem)-[:ResponseTo*]->(ci2:ChatItem) where length(p) = 9 with p match (d:ChatItem)<-[r:CreateChat]-(u:User) where d in nodes(p)

Page 18: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

return count(distinct u); - note that the path length 9 is from the part 1.

* Result: 5 users Analyzing the relationship between top 10 chattiest users and top 10 chattiest teams Describe your steps from Question 2. In the process, create the following two tables. You only need to include the top 3 for each table. Identify and report whether any of the chattiest users were part of any of the chattiest teams. Chattiest Users match (u:User)-[:CreateChat]->(ci:ChatItem) return u, count(distinct ci) as chats order by chats desc limit 3;

Users Number of Chats

394 115 2067 111 1087 109 Chattiest Teams match (ci:ChatItem)-[:PartOf]->(s:TeamChatSession)-[:OwnedBy]->(t:Team) return t, count(distinct ci) as chatitems order by chatitems desc limit 3;

Teams Number of Chats

82 1324

Page 19: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

185 1036 112 957 Finally, present your answer, i.e. whether or not any of the chattiest users are part of any of the chattiest teams. Top 10 Chattiest teams: ╒═════════╤═════════╕ │t │chatitems│ ╞═════════╪═════════╡ │{id: 82} │1324 │ ├─────────┼─────────┤ │{id: 185}│1036 │ ├─────────┼─────────┤ │{id: 112}│957 │ ├─────────┼─────────┤ │{id: 18} │844 │ ├─────────┼─────────┤ │{id: 194}│836 │ ├─────────┼─────────┤ │{id: 129}│814 │ ├─────────┼─────────┤ │{id: 52} │788 │ ├─────────┼─────────┤ │{id: 136}│783 │ ├─────────┼─────────┤ │{id: 146}│746 │ ├─────────┼─────────┤ │{id: 81} │736 │ └─────────┴─────────┘ Top 10 chattiest Users: ╒══════════╤═════╕ │u │chats│ ╞══════════╪═════╡ │{id: 394} │115 │ ├──────────┼─────┤ │{id: 2067}│111 │ ├──────────┼─────┤ │{id: 209} │109 │ ├──────────┼─────┤ │{id: 1087}│109 │ ├──────────┼─────┤ │{id: 554} │107 │

Page 20: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

├──────────┼─────┤ │{id: 1627}│105 │ ├──────────┼─────┤ │{id: 516} │105 │ ├──────────┼─────┤ │{id: 999} │105 │ ├──────────┼─────┤ │{id: 461} │104 │ ├──────────┼─────┤ │{id: 668} │104 │ └──────────┴─────┘ Once we find team for all these 10 users, we see that:

- Only userId 999 belongs to a team that is in top 10 chattiest team.

How Active Are Groups of Users? Creating InteractsWith Relationship: 1. One user mentioned another user in a chat match (u1:User)-[cs:CreateChat]->(ci:ChatItem)-[:Mentioned]->(u2:User) create (u1)-[:InteractsWith]->(u2)

2. One user created a chatItem in response to another user’s chatItem match (u1:User)-[cs:CreateChat]->(ci1:ChatItem)-[:ResponseTo]-(ci2:ChatItem) with u1,ci1,ci2 match (u2)-[:CreateChat]-(ci2) create (u1)-[:InteractsWith]->(u2)

Page 21: Acquiring, Exploring and Preparing the Data...Acquiring, Exploring and Preparing the Data Data Exploration ... The creation of this new categorical attribute was necessary because

3. Delete Self relationships Match (u1)-[r:InteractsWith]->(u1) delete r

Most Active Users (based on Cluster Coefficients)

User ID Coefficient 209 0.95238 554 0.90476 1078 0.8