
Streamwatch: evaluation of a

twitter-based music

recommendation system.

Ruben F. de Vries
10260218

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904

1098 XH Amsterdam

Supervisors
Isaac Sijaranamual

dhr. dr. Evangelos Kanoulas

Informatics Institute
Faculty of Science

University of Amsterdam
Science Park 904

1098 XH Amsterdam

June 26th, 2015


Abstract

Streamwatchr is an online music service that provides real-time information about music listening behaviour around the world on its website, where the collected songs can be listened to. After each song, Streamwatchr recommends a follow-up song, which turns the listening feature into an Internet radio. How well does this music recommender work, and how well does it work in comparison to more popular music recommender systems like YouTube and LastFM? The first question leads to an arbitrary answer, since not enough data is available and since any absolute answer to the question "how well does something work?" is inherently arbitrary. The second question has a relative answer, which is obtained by developing two different methods: implicit and explicit comparison. Implicit comparison implies that the users are not aware that the three music recommender systems are being compared, whereas with explicit comparison they are. An A/B testing framework was set up locally for the implicit comparison, but it was not run in practice and has thus produced no results. The explicit comparison was done by developing a separate website where the three music recommender systems each return a follow-up song given an initial song, which can then be rated on how good the follow-up song is and on whether it is original. Results show that Streamwatchr does not score as well as YouTube and LastFM on the quality of the follow-up song, but scores twice as well on originality.


Contents

1 Introduction 4

2 Literature Review 5

2.1 Music Recommender Systems 5
2.2 Evaluation 6
2.3 Radio 6

3 Method 7

3.1 Research Question 1 7
3.2 Research Question 2 9

3.2.1 Explicit 9
3.2.1.1 YouTube 9
3.2.1.2 LastFM 10
3.2.1.3 Streamwatchr 11
3.2.1.4 Playing songs 12
3.2.1.5 Rating songs 12
3.2.1.6 In Summary 12
3.2.1.7 Design choices 13

3.2.2 Implicit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Results 16

5 Conclusion 17

6 Discussion and future work 17

7 Repository 18


1 Introduction

Streamwatchr is an online music service that provides real-time information about music listening behaviour around the world. This is achieved by collecting and analysing tweets from Twitter users. All the real-time listening data is converted into statistics on Streamwatchr's website, which enables the user to find the most popular, currently listened-to and unexpected songs.

Music has been integrated into our lives since the introduction of portable music players (the Discman, MP3 players, etc.). Nowadays a separate portable music player is unnecessary; music can be streamed or played on any sort of smartphone. This means nearly infinite access to music, wherever the user may be. Because of this accessibility, the focus has shifted from just listening to music to also sharing what is listened to. By analysing this non-private sharing information, trends and flows of music listeners can be derived.

Besides real-time statistics about listened-to songs, there is the possibility to play music on Streamwatchr's website. Every listened-to song is followed by a recommended song, which makes the listening function of Streamwatchr an online 'radio'. These follow-up songs consist only of songs that have been tweeted at least once and occur in MusicBrainz[6], and are thus listened to by a sharing user. This means that the Streamwatchr radio is a radio provided by the users, for the users.

New-song recommendation is the main focus of this bachelor thesis. This main focus can be divided into two research questions. Firstly, it is currently not known how well the recommendation works: is the recommended song a good follow-up to the previous song? Secondly, the computed performance from the previous research question needs to be compared to other recommendation systems, to learn how well it works in comparison to more popular systems like LastFM and YouTube. More music recommendation systems are available today (Spotify, Pandora, etc.), but LastFM was chosen because it was developed especially for recommending music, and YouTube for its popularity. Involving more than two systems would not have been possible within the available time, since other music recommender systems are not as easily accessible as LastFM and YouTube.

For the first research question it is necessary to give a definition of a 'good' recommended song. This paper will show that it is not possible to define how appropriate a recommended song is, given the limited dataset that Streamwatchr has provided. A different approach has therefore been taken: not the recommended songs themselves are rated, but the total user experience of the radio function.

While the first research question has an absolute answer, the second has a more relative one. Streamwatchr is compared to the recommendation systems of YouTube and LastFM in an implicit and an explicit way. The implicit method uses A/B testing in combination with the experience measure created for the first research question, while for explicit testing a separate webpage was set up where users can explicitly rate all three recommender systems by comparing recommended songs given an initial song.

2 Literature Review

2.1 Music Recommender Systems

The current music recommender of Streamwatchr is based on the Google News Personalization ranking[1]. This news recommendation system is a domain-independent approach, which makes it possible for Streamwatchr to adopt it. Where Google uses the news articles that users read, Streamwatchr uses the songs that are listened to and tweeted. Three different algorithms are used for generating a recommendation: MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. Details about these algorithms are left out of this section, since they are not covered during this project.

Music recommendation can be done in various ways. Besides domain-independent approaches, like the Google News Personalization recommender, there are domain-specific approaches. Not every (specific) music recommender uses the same input signals. Knowledge about these different signals, and how they are addressed, can lead to insights for the improvement and comparison of Streamwatchr's current recommender.

The Local Implicit Feedback Mining recommender from Yang is based on online and offline signals, in combination with a supervised learning algorithm[9]. For every user it is known which songs they like by means of a rating: an offline signal. The date, time and location of a user listening to a song were collected as online signals. Using the online signals as predictor data and the offline data as labels, a supervised learning algorithm was trained. The evaluation of this learning algorithm was done by computing precision and recall. Where the gathered online signals are in accordance with the data gathered by Streamwatchr, the offline labels are not: Streamwatchr does not know for every user which music they like.

Lee et al. (2011) also developed a music recommender using offline data, namely music playlists[5]. Their method consists of combining playlists from different users, where each playlist is divided into a head (the X most-listened songs) and a tail (the remaining songs). Suppose there is a user1 and a user2. If the head of the playlist of user1 occurs in the tail of user2, the head of user2 might be interesting for user1. The evaluation of this music recommender was done by presenting users a list of 20 recommended tracks, which could then be rated (offline evaluation). Streamwatchr does not currently provide a playlist function, though this might be a useful upgrade for its music recommender.
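The head/tail idea can be sketched as follows. This is a hypothetical minimal illustration of the matching rule described above, not the actual implementation of Lee et al.:

```python
def head_tail_recommend(playlists, user, head_size=3):
    """Sketch of the head/tail playlist rule: if songs from `user`'s head
    (most-listened songs) appear in another user's tail (remaining songs),
    then that other user's head may be interesting for `user`."""
    head = set(playlists[user][:head_size])
    known = set(playlists[user])
    recommendations = []
    for other, playlist in playlists.items():
        if other == user:
            continue
        tail = set(playlist[head_size:])
        if head & tail:  # overlap: the other user also listens to `user`'s favourites
            recommendations.extend(s for s in playlist[:head_size] if s not in known)
    return recommendations
```

With small toy playlists this returns the head of any user whose tail overlaps the querying user's head, minus songs already known.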


2.2 Evaluation

An online/implicit evaluation framework needs to be set up for the evaluation of the three music recommender systems. Controlled experiments with online website users are often developed in combination with the A/B testing framework. With this method it is examined which of the possibilities is more efficient. To provide a clear explanation, the framework will be explained by means of an example:

Suppose you own a website where electronic devices are sold and you are not satisfied with the number of products sold. You want to know whether the place of the purchase button on the product page affects sales numbers. The online A/B testing framework can be used to perform such an experiment. Here, the users (experimental units) are divided when entering the website into group A or group B, where users from group A are presented with a product page where the purchase button is displayed on the left side, and users in group B with a page where the button is displayed on the right side. Over time, comparing the sales of both concepts can lead to the answer of where to display the buy button.
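As an illustration, the split into groups is typically made deterministic, so a returning visitor keeps seeing the same variant. A minimal sketch (the function and group names are hypothetical):

```python
import hashlib

def assign_group(user_id: str, groups=("A", "B")) -> str:
    """Deterministically assign an experimental unit to a group.

    Hashing the user id instead of flipping a coin on every visit
    ensures the same visitor always lands in the same group.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return groups[int(digest, 16) % len(groups)]
```

Because the assignment depends only on the user id, the experiment can be analysed per group without storing any assignment state.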

With this online framework it can thus be determined, at the end of the study, which option (A or B) is the most efficient and is therefore preferred over the other[3].

Besides online/implicit evaluation there is offline/explicit evaluation, where the experimental units are more explicitly involved in the research. With offline/explicit evaluation the users are aware that multiple systems are being compared to each other and are explicitly asked to rate the systems, whereas with implicit comparison they are not. The experimental users are therefore more focused on the compared systems (in this paper, three music recommender systems) and on the purpose of the experiment, which leads to a more reliable outcome that can then be related and compared to the online/implicit evaluation outcome.

2.3 Radio

The initiators of Streamwatchr have added music playback and song recommendations to create a radio function with songs that have been tweeted by people all over the world. This means that users of Streamwatchr are presented with a radio that is implicitly made by music-sharing people.

Radio is originally a form of wireless telecommunication in which a radio channelspreads messages in the form of radio waves. The first radio was successfullydeveloped in the late 19th century by Guglielmo Marconi, who built upon earlierwork of Nikola Tesla and Heinrich Hertz.

The first radios were analogue; the transmitted radio waves were modulated using either amplitude modulation (AM) or frequency modulation (FM). This technique was no longer necessary with the introduction of the digital signal, also referred to as DAB[8].

The newest medium for listening to the radio is the Internet: Internet radio. Every popular radio station in the Netherlands can now be listened to on the website www.nederland.fm.

Streamwatchr differs from the other radio stations with respect to providing music. Where at popular stations like Radio 538 and SkyRadio the music is provided by radio DJs, at Streamwatchr the music is provided by an algorithm and, indirectly, by the people around the world who have shared their music tastes on Twitter.

3 Method

3.1 Research Question 1

For answering the first research question (how well does the music recommender of Streamwatchr work?) it is necessary to define what an appropriate follow-up song is. The first intention was to train an unsupervised learning algorithm on the data collected from Streamwatchr's website, to cluster listened-to songs as appropriate follow-up songs or not.

The saved data from Streamwatchr's website was already visually accessible as statistics and graphs via their back-end website, but for training an algorithm the raw data was needed, which could be extracted by executing the following GET request:

curl --silent --user SecretAPIKey: \
  http://zookst22.science.uva.nl:8008/api/v0/query \
  -d @sw.json | jq -c '.hits.hits[]' > out.jsons

Executing this query returned a database of 33,744 raw data points in the following format:

{"_id": "Qre3bYTqSXiWLcCO2r6stQ", "_index": "streamwatchr", "_score": 1,
 "_source": {
   "created_at": "2014-10-30T12:31:59.863+01:00",
   "event_properties": {"show": true},
   "event_type": "toggle_video",
   "received_at": "2014-10-30T12:32:01.565354+01:00",
   "state": {
     "browser": "chrome",
     "browser_dimensions": {"height": 1308, "width": 1388},
     "browser_version": "38.0.2125.111",
     "current_location": "IP-address",
     "currently_playing": {
       "artist": "Ed Sheeran",
       "mbId": [{"mbId": "b8a7c51f-362c-4dcb-a259-bc6e0095f0a6",
                 "name": "Ed Sheeran"}],
       "song": "thinking out loud",
       "state": "playing",
       "video_id": "rp1DJL_SIys"},
     "ip_address": "IP-address",
     "language": null,
     "page": "home",
     "platform": "macos",
     "player_video_shown": false,
     "screen_resolution": {"height": 1440, "width": 2560},
     "user_agent_string": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) ...",
     "window_dimensions": {"height": 651, "width": 1388},
     "visitor_key": "0091f408-08cc-422f-9f00-a7849e695ed9"}},
 "_type": "event"}

Every raw data point is created during an event that occurred while a user was using Streamwatchr's website. Events like listening to a song, skipping to the next song, watching the included YouTube video and closing the browser are stored in the event_type parameter. Hence, every data point corresponds to an event, which means that there have been a total of 33,744 events since the existence of Streamwatchr (2 years). Of this total, only 1306 events were created while users were actually playing a song on the website. This means that there are not enough data points for training and developing an unsupervised learning algorithm within the provided project time. A different approach was therefore taken for qualifying Streamwatchr's radio function, in which all raw data points can be included: creating a total user-experience score for a user's behaviour on Streamwatchr's website.

The visitor_key is, like event_type, another parameter, as can be seen in the example data point. A visitor_key is assigned to every user that starts a new session on Streamwatchr's website. By chronologically sorting the data points (read: events) containing the same visitor_key, it is possible to create a timeline from the moment a user starts a session until the moment the user leaves it (closes the browser). This means that all data points are included, which could lead to a timeline like figure 1.

The entire user experience (the timeline of all of a user's events) indirectly indicates how well the music recommender performs. By rating every single event type, in combination with its parameters, from zero to one, a total user-experience score can be computed. See the following example.

Figure 1: Example of user-experience timeline from one unique user.

A Python program was written to compute the average over all the user-experience scores, given the entire raw data set (dataextraction.py, with out.txt as input). It is possible to adjust the event-type scores to give more or less weight to certain events, or to add and delete event scores.
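As an illustration, the core of such a computation could look like the sketch below. The event names and weights here are made up for the example; they are not the values used in dataextraction.py:

```python
from collections import defaultdict

# Hypothetical per-event weights in [0, 1]; the thesis picks such values intuitively.
EVENT_SCORES = {"play_song": 1.0, "toggle_video": 0.5, "skip_song": 0.2}

def user_experience_scores(events):
    """Group raw events by visitor_key, sort each session chronologically,
    and average the per-event scores into one user-experience score."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["visitor_key"]].append(event)
    scores = {}
    for visitor, session in sessions.items():
        session.sort(key=lambda e: e["created_at"])
        weights = [EVENT_SCORES.get(e["event_type"], 0.0) for e in session]
        scores[visitor] = sum(weights) / len(weights)
    return scores
```

The overall score reported for the data set would then be the mean of the per-visitor scores.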

The arbitrariness of this method is high; the given event scores are not trained values and are picked intuitively. Answering a question concerning how well a system works with an absolute answer will, however, always be arbitrary. One method for reducing the arbitrariness is to calibrate the event values given the outcome of the implicit and explicit comparison, which is discussed in section 3.2 of this paper.

3.2 Research Question 2

The following section focuses on a research question with a relative answer, whereas the first research question had an absolute answer: how well does the music recommender from Streamwatchr work in comparison to other music recommender systems like LastFM and YouTube? In the remainder of this paper, Streamwatchr, LastFM and YouTube will be referred to as the three music recommender systems, or the three systems.

The first intention was to compare the three systems by means of the unsupervised learning approach that was to have been developed while answering the first research question (how this would have been done is described below). Given the lack of data, the choice was made to approach the comparison with two different methods: explicit and implicit comparison.

3.2.1 Explicit

For the explicit comparison a separate website was set up where the three music recommender systems each return their follow-up song, given an initial song entered by the user. The best-working music recommender can be deduced by letting the users explicitly rate the obtained follow-up songs.

To achieve this it is necessary that all three systems have an API from which a recommended song can be obtained, given an initial input song. The next step is to create a playback function, so the user can listen to the recommended songs, which can then be rated. For each of the three systems a brief description of how the follow-up song can be retrieved is given below, followed by a section about playback and rating.

3.2.1.1 YouTube YouTube is a website where videos can be uploaded, shared and viewed among users. In recent years, however, it often serves as a website for listening to music. YouTube automatically provides a next video (read: song) on the right side of the website while a user is listening to a song.

This is YouTube's recommendation, given a listened-to song. For the explicit comparison this recommendation must be extracted, which can be done by sending a search request to the YouTube server in the following format[2]:

https://www.googleapis.com/youtube/v3/search?relatedToVideoId="+VideoID+"&type=video&part=snippet&key=APIkey

Figure 2: Picture of YouTube’s recommendation.

The result of this search query is a list of recommended videos (songs) with their additional video data. One of the query's parameters is the VideoID (every video on YouTube has a unique VideoID), which is needed to retrieve the associated recommended videos. It is therefore important that the VideoID corresponding to the initial song on YouTube is obtained. A YouTube search query can be executed to acquire such a VideoID:

https://www.googleapis.com/youtube/v3/search?part=snippet&q="+Query+"&type=video&key=APIkey

The ‘Query’ parameter consists of the search keywords (the song title and artist of the initial song) concatenated with the + sign.
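Assuming the two query formats above, the URLs could be built as in this sketch (the function names are hypothetical; the actual thesis code may differ):

```python
API = "https://www.googleapis.com/youtube/v3/search"

def youtube_search_url(artist: str, song: str, api_key: str) -> str:
    """Search query used to look up the VideoID of the initial song;
    spaces in the keywords are replaced by + signs."""
    query = "+".join((artist + " " + song).split())
    return f"{API}?part=snippet&q={query}&type=video&key={api_key}"

def youtube_related_url(video_id: str, api_key: str) -> str:
    """Query returning the videos YouTube recommends after the given video."""
    return f"{API}?relatedToVideoId={video_id}&type=video&part=snippet&key={api_key}"
```

Chaining the two calls (first the search, then the related-videos lookup with the returned VideoID) yields YouTube's follow-up recommendation for an initial song.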

3.2.1.2 LastFM LastFM is a website developed especially for giving music recommendations.

The information from LastFM's recommendation page (Figure 3) can also be acquired by sending a search query to the LastFM server. Given an input song, a top-X list of recommendations is returned, from which the first recommendation is extracted. The search query of LastFM has the following format and can be executed with a GET request[4]:

http://ws.audioscrobbler.com/2.0/?method=track.getsimilar&artist="+ArtistQuery+"&track="+SongQuery+"&limit=2&autocorrect=1&api_key=APIkey&format=json

The parameters for this search query are the artist and title of the initial song, an API key and the number of returned recommendations. Here, too, the parameters have to be concatenated with the + sign.
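A sketch of building this query and extracting the first recommendation from the JSON response (the helper names and the sample response shape are illustrative assumptions):

```python
def lastfm_similar_url(artist: str, song: str, api_key: str, limit: int = 2) -> str:
    """Build the track.getsimilar query, with spaces replaced by + signs."""
    artist_q = "+".join(artist.split())
    song_q = "+".join(song.split())
    return ("http://ws.audioscrobbler.com/2.0/?method=track.getsimilar"
            f"&artist={artist_q}&track={song_q}&limit={limit}"
            f"&autocorrect=1&api_key={api_key}&format=json")

def first_similar(response: dict):
    """Extract (artist, title) of the first recommendation from the JSON reply."""
    track = response["similartracks"]["track"][0]
    return track["artist"]["name"], track["name"]
```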


Figure 3: Picture of LastFM’s recommendation.

3.2.1.3 Streamwatchr Streamwatchr's recommendations are generated with the Google News ranking system, which is a domain-independent approach. Just as with YouTube and LastFM, it is possible to retrieve a list of song recommendations generated by Streamwatchr, given an input song:

http://streamwatchr.com/recommend-radio?song="+SongQuery+"&artists[0][mbId]="+artistid+"&artists[0][name]="+ArtistQuery

ArtistQuery and SongQuery are the artist and song name, once again concatenated with the + sign. The main difference compared to the LastFM and YouTube search queries is the artistid, the unique artist ID from an online music database, MusicBrainz[6]. To obtain music recommendations from Streamwatchr it is thus mandatory to acquire the corresponding artistid on MusicBrainz. Once again a search query has to be sent, this time to MusicBrainz, which returns the artistid for the artist of the input song.

http://musicbrainz.org/ws/2/artist/?query=artist:"+ArtistQuery

The only parameter in this query is the name of the artist, where the spaces arereplaced by the + sign.

By combining these two search queries, it is possible to extract the first recommendation of Streamwatchr, given an initial song.
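Combining the two queries might look like this sketch (hypothetical helper names; only the URL construction from the text is shown, not the HTTP calls or response parsing):

```python
def musicbrainz_artist_url(artist: str) -> str:
    """Query returning MusicBrainz data (including the artistid) for an artist;
    spaces in the artist name are replaced by + signs."""
    return ("http://musicbrainz.org/ws/2/artist/?query=artist:"
            + "+".join(artist.split()))

def streamwatchr_recommend_url(song: str, artist: str, artist_id: str) -> str:
    """Streamwatchr recommendation query, which needs the MusicBrainz
    artistid as a parameter in addition to the artist and song name."""
    return ("http://streamwatchr.com/recommend-radio"
            "?song=" + "+".join(song.split())
            + "&artists[0][mbId]=" + artist_id
            + "&artists[0][name]=" + "+".join(artist.split()))
```

The artistid returned by the first query is fed into the second, after which Streamwatchr's first recommendation can be read from the response.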

Figure 4: Picture of Streamwatchr’s recommendation.


3.2.1.4 Playing songs It is important that the recommended songs can be listened to on the website, so the user can hear clearly which song each of the three music recommender systems has returned, given the initial song. The songs can be played with an embedded YouTube player, which requires a VideoID for the video (read: song) that is to be played.

3.2.1.5 Rating songs The songs provided by the three music recommender systems can be rated with two different scores: one score for the quality of the recommended song (is it a good follow-up song given the initial song?) and one for its originality. The originality score is introduced because a music recommender system can return a song from the same artist and album, which is not suitable for a radio system. Scores can be given in the range from 1 to 5, from worst to best.

3.2.1.6 In Summary All the different aspects of the explicit comparison method are described above. This section summarizes how these aspects work together and lead to one final webpage where the experiment can be conducted.

Figure 5: Schematic overview of the back-end from the final webpage.

When an experimental unit participates in the experiment, it is first directed to Form.html, where it can fill in an input song (artist and song title separately). A random song is automatically picked when the song is misspelled or when one of the music recommender systems cannot find an appropriate follow-up song. The song the experimental unit provided (or the random song) is then sent to Python.php, which contains code to run Python programs initiated from PHP files, and which starts Recommendation.py with the artist and song title as input arguments.

Recommendation.py is the main file in which the follow-up songs from the three music recommender systems are collected and stored in an HTML code snippet, which is then written to an external file named total.html. Since a Python program cannot be shown directly in the browser, the program must write HTML code to the external total.html, which can then be shown in the browser. This file also contains the rating form, which initiates next.php when the form is submitted. Next.php stores the ratings in a MySQL database and initiates Recommendation.py with, as input, the song that scored highest in the previous rating. Again, when not every music recommender system provides a follow-up song for the input song, a random song is picked.

Figure 6: The final explicit rating webpage.

3.2.1.7 Design choices Some choices were made while designing the explicit comparison website. The beta version of the website used Spotify instead of YouTube for playback. However, due to the limitations of the Spotify data set and the syntactic difficulties regarding its search query, YouTube was chosen as the playback service.

The first webpage presented to the user during the experiment contains a form that asks for the artist and song title from which the recommended songs are to be retrieved. If this song is not known to one of the three music recommender systems, or if it is misspelled, a random song is automatically selected via the website http://www.randomlists.com/random-songs.

It is also possible that the music recommender system of Streamwatchr does not return a follow-up song because it cannot find one. The decision was made to choose a new random initial song until all three music recommender systems have a follow-up song. This failure to return a follow-up song happens often with Streamwatchr and is an item that needs to be improved in the future. For this experiment, it is essential that the focus is on the songs that are recommended, so that their originality and quality can be compared with the systems of LastFM and YouTube.

At least three types of videos can be found on YouTube: non-music, music (officially released work) and live/unreleased music footage. Completely excluding the unreleased and live footage is hard to achieve, but for the offline evaluation the recommendations from YouTube are filtered on syntactic phrases like 'live' and 'concert'. If the first recommendation of YouTube contains one of these words, the choice is made to continue with a random initial song instead of the second recommendation, since the second recommendations of LastFM and Streamwatchr are also not covered.
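Such a syntactic filter can be sketched as a simple substring check (the marker list contains only the two words mentioned above and is deliberately crude):

```python
LIVE_MARKERS = ("live", "concert")

def is_probably_live(title: str) -> bool:
    """Crude syntactic filter: flag video titles that look like live footage.
    Being a substring check, it will also match unrelated words such as
    'delivery'; the filtering described above is equally syntactic."""
    lowered = title.lower()
    return any(marker in lowered for marker in LIVE_MARKERS)
```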


The three featured songs are displayed in random order, so it is not clear to the users which song comes from which music recommender system. This avoids biases towards one particular system.

3.2.2 Implicit

Implicit comparison implies that the website users of Streamwatchr are not aware from which of the three music recommender systems a follow-up song is received. A commonly used framework for testing and comparing multiple systems in an online environment is the A/B testing framework.

With A/B testing the experimental units (the website users of Streamwatchr) are divided into, in this case, three equally sized groups (A/B/C testing). Each group is presented with a different music recommender when the site is entered, without knowing which one.

Three distinct data sets are created after running the A/B/C testing framework on Streamwatchr's website. An overview of how well the music recommenders work in comparison to each other can be derived by computing the user-experience score (see the method for research question 1) per data set and comparing the scores. Arbitrariness no longer plays a role here, since the computed user-experience score is equally arbitrary across the music recommenders.

Three variants of A/B/C testing have been implemented in the website's source code: the standard A/B/C testing framework as described above, where a user always receives a follow-up song from the same music recommender; a variant where the website user receives a follow-up song at random from one of the three recommenders, interleaved; and a last variant where the music recommender is, unlike in variants one and two, tied to the user's sessionID, so the user receives follow-up songs from one of the three music recommenders per visit to the website. A more detailed description is given below.
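The three variants could be dispatched as in the following sketch. The recommender names and data structures are illustrative assumptions; the actual implementation lives on the Flask server described further on:

```python
import random

RECOMMENDERS = ("streamwatchr", "lastfm", "youtube")
ip_links = {}       # variant 1: IP address -> fixed recommender
session_links = {}  # variant 3: session id -> fixed recommender

def pick_recommender(variant, client_ip=None, session_id=None):
    """Dispatch a request to one of the three A/B/C variants."""
    if variant == 1:   # fixed per IP address, linked at random on first visit
        if client_ip not in ip_links:
            ip_links[client_ip] = random.choice(RECOMMENDERS)
        return ip_links[client_ip]
    if variant == 2:   # interleaved: a random recommender on every request
        return random.choice(RECOMMENDERS)
    if variant == 3:   # fixed per session id
        if session_id not in session_links:
            session_links[session_id] = random.choice(RECOMMENDERS)
        return session_links[session_id]
    raise ValueError("variant must be 1, 2 or 3")
```

Variants 1 and 3 keep a user on one recommender (per IP address or per session), while variant 2 interleaves all three.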

The advantage of implementing three distinct variants is that the website users make use of the different music recommenders in three different ways. By comparing the user experience between the three variants for one user, it can be deduced how a single user responds to different recommenders. This excludes biases like specific user behaviour. With large numbers of users these biases are less significant, but Streamwatchr does not have sufficient users, which is why three variants were chosen.

In order not to adjust the current site directly, the A/B/C framework was first developed on a local node.js server. Node.js is a software platform on which a JavaScript application can be run and developed[7]. The entire front end of the Streamwatchr website has been set up on the node.js server, which communicates directly with the back end of the original website. In this way the A/B/C framework can be implemented and debugged without affecting the current website.


Playing music and retrieving the next song are initiated by the player.js file, which is thus the place where the A/B/C framework should be implemented. However, the A/B/C framework was not implemented directly in the player.js file, but on a remote Flask server that can communicate directly with the player.js file. A Flask server is a local server running a Python program. The follow-up songs from the three different music recommender systems can be obtained by sending a GET request to the local Flask server:

http://127.0.0.1:5000/rec/'+recartist+'+-+'+recsong+'/'+clientip+'/'+sessionid+'/'+variant

The URL consists of the location where the program is running, followed by the artist and title of the song for which a recommendation is to be obtained. The last three parameters are the client IP (the IP address of the user visiting the website), the session ID of the session initiated when the user arrives on the website, and finally an integer from 1 to 3 indicating which of the three A/B/C variants should be used: 1) the IP variant (via client IP), 2) the three music recommender systems interleaved, and 3) the session variant (via session ID).
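The request URL above can be assembled in Python as follows. This is a sketch based on the query format shown above; the helper name `build_rec_url` and the example values are hypothetical:

```python
def build_rec_url(artist, song, client_ip, session_id, variant):
    """Assemble the GET request URL for the local Flask recommender server.

    `variant` is 1 (IP variant), 2 (interleaved) or 3 (session variant).
    """
    base = "http://127.0.0.1:5000"
    return f"{base}/rec/{artist}+-+{song}/{client_ip}/{session_id}/{variant}"

# Hypothetical example request for the interleaved variant (2):
url = build_rec_url("Daft Punk", "Get Lucky", "145.18.12.34", "abc123", 2)
# The resulting URL could then be fetched, e.g. with urllib.request.urlopen(url).
```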

The Flask server returns one recommended song, depending on the variant sent in the incoming request. The three variants are each handled in a different manner.

If variant 1 is received, a link between the IP address and one of the three music recommender systems is created and stored in a Python dictionary, where the key corresponds to the IP address and the value to the music recommender system. When the IP address is already present in the dictionary, the corresponding recommender is retrieved from the dictionary instead of creating a new link. The dictionary thus determines which of the three recommender systems is used for retrieving a follow-up song. The original intent was to develop a hash function mapping an IP address to one of the three recommender systems, but due to the limited number of users of the website it cannot be assumed that the IP addresses are uniformly distributed; therefore the decision was made to link IP addresses at random to one of the three music recommendation systems.

When variant 2 is received, the Flask server returns a follow-up song from any of the three music recommender systems, so the user is not linked to one fixed recommender. At every incoming request, one of the recommender systems is chosen at random with Python's 'random' module.

The third and last variant works the same as variant 1, but the dictionary is keyed by session IDs instead of IP addresses.
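The three dispatch rules above can be sketched in a few lines of Python. This is an illustration of the mechanism, not the actual server code; the function and dictionary names are hypothetical:

```python
import random

RECOMMENDERS = ["streamwatchr", "youtube", "lastfm"]
ip_links = {}       # variant 1: IP address -> recommender
session_links = {}  # variant 3: session ID -> recommender

def pick_recommender(variant, client_ip, session_id):
    """Return the recommender system to use for this request, per A/B/C variant."""
    if variant == 1:
        # Link each IP address to a random recommender on first sight,
        # then keep returning that same recommender.
        return ip_links.setdefault(client_ip, random.choice(RECOMMENDERS))
    if variant == 2:
        # Interleaved: a fresh random choice on every request.
        return random.choice(RECOMMENDERS)
    if variant == 3:
        # Same as variant 1, but keyed by session ID.
        return session_links.setdefault(session_id, random.choice(RECOMMENDERS))
    raise ValueError("variant must be 1, 2 or 3")
```

Using `setdefault` gives the "create on first sight, reuse thereafter" behaviour of the dictionary-based variants in one expression.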

The frameworks and servers described above are fully implemented and can be run at any time.



Figure 7: Schematic overview of the back end of the implicit comparison, regarding song retrieval.

4 Results

This section focuses on the results gathered with the explicit comparison method, since the implicit method (the A/B/C testing framework) was, given the available time, only developed on a local server and was not conducted in a live situation.

The website used for the explicit comparison method received a total of one hundred submitted comparisons; in each comparison every music recommender system received two scores, making a total of 600 received scores. An average was computed per score and can be found in figure 8.

Figure 8: Bar chart containing the results of the explicit comparison.



5 Conclusion

Two separate research questions were stated at the beginning of this thesis, both regarding the music recommender system of Streamwatchr: 'How well does the music recommender of Streamwatchr work?' and 'How well does the music recommender of Streamwatchr work in comparison to the systems of LastFM and YouTube?'.

The first research question does not (yet) have a satisfactory conclusion, given the lack of users Streamwatchr has at this moment. However, an approach was developed to provide a method for rating Streamwatchr's music recommender system, which can be improved in the future by combining conclusions and results from the second research question.

The second question was addressed with two different approaches: implicit and explicit comparison. The implicit comparison was only developed on a local server and thus yielded no results from real Streamwatchr users. However, the framework for implicit comparison has been created and can be run and used in the future.

The explicit comparison was achieved by letting users explicitly rate follow-up songs provided by the three music recommender systems. Given the average scores (see figure 8), it can be concluded that LastFM provides the best follow-up songs, followed by YouTube and Streamwatchr respectively, though the differences are not significant.

A more significant difference can be found in the originality score. Here, Streamwatchr scores twice as high as the systems of LastFM and YouTube. This makes Streamwatchr's music recommender more suitable for the purpose it serves: a radio function.

6 Discussion and future work

This research has shown that the music recommender system of Streamwatchr is more suitable than the systems of LastFM and YouTube for creating a radio function. This result mainly answers the second research question. The first research question was addressed during this project but has not yielded the desired conclusion, mainly because there are currently not enough Streamwatchr users, certainly within the short time span of this project. In the future, the method created for scoring the user experience, in particular the assignment of scores to separate event types, can be calibrated on the basis of the outcome of comparing the explicit and implicit recommendation systems.

During the development of the offline comparison site, the syntactic part of retrieving follow-up songs took more time than expected. Typos, punctuation marks, and even capital letters were not accepted by every system when executing a GET request against their servers. Ultimately, the choice was made to proceed with a random initial song whenever one of the three music recommendation systems did not generate a follow-up song, since this was usually caused by syntactic problems rather than by the system not having a follow-up song available.
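One mitigation, not implemented in this project, would be to normalise artist and title strings before issuing the GET request. A sketch of such a normalisation is shown below; which variations each API actually tolerates would need to be verified per system:

```python
import re

def normalize_query(text):
    """Lowercase, strip punctuation, and collapse whitespace so that small
    spelling variations map to the same query string."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# Note: \w matches accented letters in Python 3, so names like "Beyoncé"
# keep their non-ASCII characters.
```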

All three music recommender systems have different databases available for retrieving their follow-up songs. This seems unfair given that the systems are compared with each other. YouTube has access to both popular and unpopular music, while the overlap with LastFM and Streamwatchr lies mainly in popular music. In theory this should not introduce a bias, since it is especially popular music that is shared by users on Twitter; in practice, therefore, most requests concern music that is available in the databases of all three music recommender systems.

Originally, in addition to the two research questions discussed in this paper, the intention was also to improve Streamwatchr's music recommender system during this project. The results show that there is still room for improvement, especially in recommending a track that fits well with the previous track (recommendation quality). Following the literature review, most progress and improvement can be achieved through the playlists and listening histories of unique users. By examining the overlap between users it can be derived which users have similar music tastes, in order to then exchange and recommend songs that do not occur in both playlists or listening histories.
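The overlap-based idea above can be made concrete with a Jaccard similarity over listening histories. This is a sketch of the suggested future direction, not part of the implemented system; all names and example data are hypothetical:

```python
def jaccard(history_a, history_b):
    """Similarity between two users' listening histories (sets of songs)."""
    a, b = set(history_a), set(history_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_from_neighbour(history_a, history_b):
    """Songs a similar user listened to that user A has not heard yet."""
    return set(history_b) - set(history_a)

u1 = {"song1", "song2", "song3"}
u2 = {"song2", "song3", "song4"}
# jaccard(u1, u2) == 2 / 4 == 0.5; "song4" would be recommended to user 1.
```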

7 Repository

All files described in this paper can be found in the following repository: https://drive.google.com/folderview?id=0B1SRB-TxVhf0aWhrSFNMSTdQdW8&usp=sharing

References

[1] Abhinandan S. Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, pages 271–280. ACM, 2007.

[2] Google Developers. YouTube Search API, 2015. [Online; accessed 02-May-2015].

[3] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, 2009.



[4] LastFM. Api - last.fm, 2015. [Online; accessed 04-May-2015].

[5] Kibeom Lee and Kyogu Lee. My head is your tail: applying link analysis on long-tailed music listening behavior for music recommendation. In Proceedings of the fifth ACM conference on Recommender systems, pages 213–220. ACM, 2011.

[6] MusicBrainz. Musicbrainz - the open music encyclopedia, 2015. [Online;accessed 27-April-2015].

[7] Wikipedia. Node.js — wikipedia, the free encyclopedia, 2015. [Online;accessed 06-June-2015].

[8] Wikipedia. Radio — wikipedia, the free encyclopedia, 2015. [Online; ac-cessed 17-June-2015].

[9] Diyi Yang, Tianqi Chen, Weinan Zhang, Qiuxia Lu, and Yong Yu. Local implicit feedback mining for music recommendation. In Proceedings of the sixth ACM conference on Recommender systems, pages 91–98. ACM, 2012.
